[jira] [Resolved] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22817.
--
Resolution: Fixed
  Assignee: Hyukjin Kwon

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in testthat 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22817:
-
Fix Version/s: 2.3.0
   2.2.2

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in testthat 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294009#comment-16294009
 ] 

Hyukjin Kwon commented on SPARK-22817:
--

Fixed in https://github.com/apache/spark/pull/20003.

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in testthat 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-16 Thread Prateek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prateek updated SPARK-22711:

Attachment: Jira_Spark_minimized_code.py

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown. It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in 

[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-16 Thread Prateek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prateek updated SPARK-22711:

Attachment: (was: Jira_Spark_minimized_code.py)

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown. It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, 

[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-16 Thread Prateek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293989#comment-16293989
 ] 

Prateek commented on SPARK-22711:
-

[~hyukjin.kwon] The minimized code is attached.

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown. It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File 

[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2017-12-16 Thread Prateek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prateek updated SPARK-22711:

Attachment: Jira_Spark_minimized_code.py

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown. It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in 

[jira] [Commented] (SPARK-20970) Deprecate TaskMetrics._updatedBlockStatuses

2017-12-16 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293965#comment-16293965
 ] 

Sergei Lebedev commented on SPARK-20970:


Not only does it use a lot of memory on the driver, but it also bloats the 
event logs; see SPARK-22805.

> Deprecate TaskMetrics._updatedBlockStatuses
> ---
>
> Key: SPARK-20970
> URL: https://issues.apache.org/jira/browse/SPARK-20970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>
> TaskMetrics._updatedBlockStatuses isn't used anywhere internally by Spark. It 
> could still be used by users, though, since it's exposed by SparkListenerTaskEnd. 
> In https://issues.apache.org/jira/browse/SPARK-20923 we made the tracking 
> configurable so it can be turned off, since it uses a lot of memory. That config 
> still defaults to true for backwards compatibility. We should flip it to false in 
> the next release and deprecate that API altogether.
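
A minimal PySpark sketch of opting out of the tracking today; the config key 
{{spark.taskMetrics.trackUpdatedBlockStatuses}} is assumed from SPARK-20923 and 
should be verified against your Spark version:

{code}
# Sketch only: turn off updated-block-status tracking from PySpark.
# The config key below is assumed from SPARK-20923 and may differ by version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("disable-updated-block-statuses")
    .config("spark.taskMetrics.trackUpdatedBlockStatuses", "false")
    .getOrCreate()
)

# With tracking off, SparkListenerTaskEnd.taskMetrics should no longer carry
# per-task updated block statuses, shrinking driver memory use and event logs.
{code}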



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-16 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293836#comment-16293836
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/16/17 11:17 PM:
---

I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|Decompressed|157M|
|Decompressed with patch|155M|
-
*Update*: turns out {{SparkTaskEndEvent}} carries the list of updated blocks 
twice (!): as part of the {{"Accumulables"}} and in {{"Task Metrics"}}. 
[~andrewor14], [~srowen] do you know if there is a reason for that? It looks 
like a bug to me.

*Update*: I'm recomputing the numbers for a fully updatedBlockStatuses-free log.

*Update*: the effect of SPARK-20923 is much more noticeable than I thought 
initially. Removing {{"internal.metrics.updatedBlockStatuses"}} from 
{{"Accumulables"}} and {{"Updated Blocks"}} from {{"Task Metrics"}} reduced the 
log size to 160M. The storage level compression now shaves off just a few MB (see 
the updated table).


was (Author: lebedev):
I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: turns out {{SparkTaskEndEvent}} carries the list of updated blocks 
twice (!): as part of the {{"Accumulables"}} and in {{"Task Metrics"}}. 
[~andrewor14], [~srowen] do you know if there is a reason for that? It looks 
like a bug to me.

*Update*: I'm recomputing the numbers for a fully updatedBlockStatuses-free log.

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore the list of 
> predefined levels is not extendable (by users).
> Fact 2: The format of event logs uses a redundant representation for storage 
> levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition carries a storage level field that is 60-70 
> bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.
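
A rough Python sketch of that quick win, mapping the verbose JSON form back to a 
predefined alias; the lookup table below is a hand-written assumption covering 
only a few common levels, not Spark's own JsonProtocol code:

{code}
# Illustration of the suggested alias encoding, not Spark's JsonProtocol code.
# The lookup table is an assumption covering a few common levels only; keys are
# (Use Disk, Use Memory, Deserialized, Replication) as they appear in event logs.
PREDEFINED_LEVELS = {
    (True, False, False, 1): "DISK_ONLY",
    (False, True, True, 1): "MEMORY_ONLY",
    (True, True, True, 1): "MEMORY_AND_DISK",
    (True, True, True, 2): "MEMORY_AND_DISK_2",
}

def storage_level_alias(level):
    """Return a short alias for a verbose storage-level dict, or None if unknown."""
    key = (level["Use Disk"], level["Use Memory"],
           level["Deserialized"], level["Replication"])
    return PREDEFINED_LEVELS.get(key)

verbose = {"Use Disk": True, "Use Memory": False,
           "Deserialized": False, "Replication": 1}
print(storage_level_alias(verbose))  # prints DISK_ONLY (9 characters instead of ~79)
{code}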



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-16 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293836#comment-16293836
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/16/17 10:54 PM:
---

I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: turns out {{SparkTaskEndEvent}} carries the list of updated blocks 
twice (!): as part of the {{"Accumulables"}} and in {{"Task Metrics"}}. 
[~andrewor14], [~srowen] do you know if there is a reason for that? It looks 
like a bug to me.

*Update*: I'm recomputing the numbers for a fully updatedBlockStatuses-free log.


was (Author: lebedev):
I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: turns out {{SparkTaskEndEvent}} carries the list of updated blocks 
twice (!): as part of the {{"Accumulables"}} and in {{"Task Metrics"}}. 
[~andrewor14], [~srowen] do you know if there is a reason for that? It looks 
like a bug to me.


> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore the list of 
> predefined levels is not extendable (by users).
> Fact 2: The format of event logs uses a redundant representation for storage 
> levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition carries a storage level field that is 60-70 
> bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22812) Failing cran-check on master

2017-12-16 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293940#comment-16293940
 ] 

Hossein Falaki commented on SPARK-22812:


Do you know what is being checked in that step? Is it trying to reach a CRAN 
server?

> Failing cran-check on master 
> -
>
> Key: SPARK-22812
> URL: https://issues.apache.org/jira/browse/SPARK-22812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hossein Falaki
>Priority: Minor
>
> When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
> failure message:
> {code}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 22] do not match the length of object [0]
> {code}
> cc [~felixcheung] have you experienced this error before?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22812) Failing cran-check on master

2017-12-16 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-22812:
---
Priority: Minor  (was: Major)

> Failing cran-check on master 
> -
>
> Key: SPARK-22812
> URL: https://issues.apache.org/jira/browse/SPARK-22812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hossein Falaki
>Priority: Minor
>
> When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
> failure message:
> {code}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 22] do not match the length of object [0]
> {code}
> cc [~felixcheung] have you experienced this error before?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-16 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293836#comment-16293836
 ] 

Sergei Lebedev edited comment on SPARK-22805 at 12/16/17 8:37 PM:
--

I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: turns out {{SparkTaskEndEvent}} carries the list of updated blocks 
twice (!): as part of the {{"Accumulables"}} and in {{"Task Metrics"}}. 
[~andrewor14], [~srowen] do you know if there is a reason for that? It looks 
like a bug to me.



was (Author: lebedev):
I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|


> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore the list of 
> predefined levels is not extendable (by users).
> Fact 2: The format of event logs uses a redundant representation for storage 
> levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition carries a storage level field that is 60-70 
> bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22817:


Assignee: Apache Spark

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in testthat 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22817:


Assignee: (was: Apache Spark)

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in testthat 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293859#comment-16293859
 ] 

Apache Spark commented on SPARK-22817:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20003

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in testthat 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2017-12-16 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-22817:


 Summary: Use fixed testthat version for SparkR tests in AppVeyor
 Key: SPARK-22817
 URL: https://issues.apache.org/jira/browse/SPARK-22817
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


We happened to access the internal {{run_tests}} - 
https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 

https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58

This seems to have been removed in testthat 2.0.0.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2017-12-16 Thread Sergei Lebedev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293836#comment-16293836
 ] 

Sergei Lebedev commented on SPARK-22805:


I've emulated the effect of SPARK-20923 by removing all 
{{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event 
log. The table below compares uncompressed/compressed sizes of this log with 
and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|


> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore the list of 
> predefined levels is not extendable (by users).
> Fact 2: The format of event logs uses a redundant representation for storage 
> levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition carries a storage level field that is 60-70 
> bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2017-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293835#comment-16293835
 ] 

Sean Owen commented on SPARK-22809:
---

What is the error? Shouldn't this fail on any dotted import if so, and can we 
just see that error?
I still don't see how this is a Spark issue rather than a Python interpreter 
issue, but even then I'm not sure why this style of import would be a problem.
I don't think it's an import thing.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Cricket Temple
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}
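
One workaround pattern commonly used for this kind of failure, shown as a sketch 
rather than a confirmed fix: import the submodule inside the function that runs 
on the executors, so the dotted lookup happens after the worker has imported 
scipy.interpolate itself. The helper name {{interp_at_pi}} is a placeholder, not 
part of the repro script:

{code}
# Workaround sketch: do the submodule import inside the function shipped to the
# executors instead of referencing scipy.interpolate through the driver-side
# module object. interp_at_pi is a placeholder helper, not from the repro script.
def interp_at_pi(rows):
    import numpy as np
    import pandas as pd
    from scipy import interpolate  # local import, resolved on the executor
    df = pd.DataFrame.from_records([row.asDict() for row in list(rows)])
    return float(interpolate.interp1d(df.x.values, df.y.values)(np.pi))

# Usage, reusing df_sk from the repro script above:
# result = df_sk.rdd.keyBy(lambda r: r.freq).groupByKey().mapValues(interp_at_pi).collect()
{code}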



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22465:


Assignee: (was: Apache Spark)

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
>
> While running my Spark pipeline, it failed with the following exception:
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging, I found that the issue lies in how Spark handles the cogroup 
> of two RDDs. Here is the relevant code from Apache Spark:
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs:
> RDD1: a small RDD with little data and few partitions
> RDD2: a huge RDD with lots of data and many partitions
> Now, if the code performs a cogroup
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this can trigger the SPARK-6235 bug: if RDD1 has a 
> partitioner when it enters the cogroup, the cogroup's partitions are decided by 
> that partitioner, which can lead to the huge RDD2 being shuffled into a small 
> number of partitions.
> One way to address this is probably to add a safety check that ignores the 
> partitioner if the numbers of partitions of the two RDDs differ greatly in 
> magnitude.
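
Until such a check exists, a caller-side mitigation can be sketched in PySpark by 
sizing the cogroup explicitly from the larger input; {{rdd_small}} and 
{{rdd_huge}} below are placeholder stand-ins for the two RDDs described above:

{code}
# Caller-side mitigation sketch with placeholder data: pass numPartitions to
# cogroup so a pre-existing partitioner on the small RDD cannot force the huge
# RDD into a handful of partitions.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd_small = sc.parallelize([(i, i) for i in range(10)], 2).partitionBy(2)
rdd_huge = sc.parallelize([(i % 10, i) for i in range(100000)], 200)

num_parts = max(rdd_small.getNumPartitions(), rdd_huge.getNumPartitions())
cogrouped = rdd_small.cogroup(rdd_huge, numPartitions=num_parts)
print(cogrouped.getNumPartitions())  # 200 rather than the small RDD's 2
{code}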



--
This message was sent by Atlassian 

[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293795#comment-16293795
 ] 

Apache Spark commented on SPARK-22465:
--

User 'sujithjay' has created a pull request for this issue:
https://github.com/apache/spark/pull/20002

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Priority: Critical
>
> While running my Spark pipeline, it failed with the following exception:
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging, I found that the issue lies in how Spark handles the cogroup 
> of two RDDs. Here is the relevant code from Apache Spark:
> {noformat}
>  /**
>* For each key k in `this` or `other`, return a resulting RDD that 
> contains a tuple with the
>* list of values for that key in `this` as well as `other`.
>*/
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = 
> self.withScope {
> cogroup(other, defaultPartitioner(self, other))
>   }
> /**
>* Choose a partitioner to use for a cogroup-like operation between a 
> number of RDDs.
>*
>* If any of the RDDs already has a partitioner, choose that one.
>*
>* Otherwise, we use a default HashPartitioner. For the number of 
> partitions, if
>* spark.default.parallelism is set, then we'll use the value from 
> SparkContext
>* defaultParallelism, otherwise we'll use the max number of upstream 
> partitions.
>*
>* Unless spark.default.parallelism is set, the number of partitions will 
> be the
>* same as the number of partitions in the largest upstream RDD, as this 
> should
>* be least likely to cause out-of-memory errors.
>*
>* We use two method parameters (rdd, others) to enforce callers passing at 
> least 1 RDD.
>*/
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
> val rdds = (Seq(rdd) ++ others)
> val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 
> 0))
> if (hasPartitioner.nonEmpty) {
>   hasPartitioner.maxBy(_.partitions.length).partitioner.get
> } else {
>   if (rdd.context.conf.contains("spark.default.parallelism")) {
> new HashPartitioner(rdd.context.defaultParallelism)
>   } else {
> new HashPartitioner(rdds.map(_.partitions.length).max)
>   }
> }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs:
> RDD1: a small RDD with little data and few partitions
> RDD2: a huge RDD with lots of data and many partitions
> Now, if the code performs a cogroup
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this can trigger the SPARK-6235 bug: if RDD1 has a 
> partitioner when it enters the cogroup, the cogroup's partitions are decided by 
> that partitioner, which can lead to the huge RDD2 being shuffled into a small 
> number of partitions.
> One way to address this is probably to add a safety check that ignores the 
> partitioner if the number of 

[jira] [Assigned] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22465:


Assignee: Apache Spark

> Cogroup of two disproportionate RDDs could lead into 2G limit BUG
> -
>
> Key: SPARK-22465
> URL: https://issues.apache.org/jira/browse/SPARK-22465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 
> 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 
> 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Amit Kumar
>Assignee: Apache Spark
>Priority: Critical
>
> While running my Spark pipeline, it failed with the following exception:
> {noformat}
> 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR 
> org.apache.spark.executor.Executor  - Exception in task 630.0 in stage 28.0 
> (TID 58670)
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> After debugging, I found that the issue lies in how Spark handles the cogroup 
> of two RDDs. Here is the relevant code from Apache Spark:
> {noformat}
>   /**
>    * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
>    * list of values for that key in `this` as well as `other`.
>    */
>   def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
>     cogroup(other, defaultPartitioner(self, other))
>   }
>
>   /**
>    * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
>    *
>    * If any of the RDDs already has a partitioner, choose that one.
>    *
>    * Otherwise, we use a default HashPartitioner. For the number of partitions, if
>    * spark.default.parallelism is set, then we'll use the value from SparkContext
>    * defaultParallelism, otherwise we'll use the max number of upstream partitions.
>    *
>    * Unless spark.default.parallelism is set, the number of partitions will be the
>    * same as the number of partitions in the largest upstream RDD, as this should
>    * be least likely to cause out-of-memory errors.
>    *
>    * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
>    */
>   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
>     val rdds = (Seq(rdd) ++ others)
>     val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
>     if (hasPartitioner.nonEmpty) {
>       hasPartitioner.maxBy(_.partitions.length).partitioner.get
>     } else {
>       if (rdd.context.conf.contains("spark.default.parallelism")) {
>         new HashPartitioner(rdd.context.defaultParallelism)
>       } else {
>         new HashPartitioner(rdds.map(_.partitions.length).max)
>       }
>     }
>   }
> {noformat}
> Given this, suppose we have two pair RDDs:
> RDD1: a small RDD with little data and few partitions
> RDD2: a huge RDD with lots of data and many partitions
> Now, if the code were to cogroup them
> {noformat}
> val RDD3 = RDD1.cogroup(RDD2)
> {noformat}
> there is a case where this can trigger the SPARK-6235 2G limit bug: if RDD1 
> already has a partitioner when it is passed into the cogroup, the cogroup's 
> partitions are decided by that partitioner, which can lead to the huge RDD2 
> being shuffled into a small number of partitions.
> One way to address this is probably to add a safety check here that would 
> ignore the partitioner if the numbers of partitions of the two RDDs are very 
> different in magnitude.
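> As an illustration only (not code from Spark itself; the RDD names, sizes and 
> partition counts below are made up, and a spark-shell SparkContext {{sc}} is 
> assumed), this is roughly the scenario, together with a possible workaround 
> of passing an explicit partitioner to cogroup:
> {noformat}
> import org.apache.spark.HashPartitioner
>
> // RDD1: small, and it carries a partitioner with only 4 partitions.
> val small = sc.parallelize(1 to 100).map(i => (i % 10, i))
>   .partitionBy(new HashPartitioner(4))
>
> // RDD2: huge, many partitions, no partitioner.
> val huge = sc.parallelize(1 to 100000000, 1000).map(i => (i % 10, i))
>
> // defaultPartitioner picks small's partitioner, so all of huge is
> // shuffled into only 4 partitions (the skew described above).
> val skewed = small.cogroup(huge)
>
> // Possible workaround today: size the partitioner for the larger RDD.
> val balanced = small.cogroup(huge, new HashPartitioner(huge.getNumPartitions))
> {noformat}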



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-22816) Basic tests for PromoteStrings and InConversion

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22816:


Assignee: Apache Spark

> Basic tests for PromoteStrings and InConversion
> ---
>
> Key: SPARK-22816
> URL: https://issues.apache.org/jira/browse/SPARK-22816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22816) Basic tests for PromoteStrings and InConversion

2017-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293793#comment-16293793
 ] 

Apache Spark commented on SPARK-22816:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/20001

> Basic tests for PromoteStrings and InConversion
> ---
>
> Key: SPARK-22816
> URL: https://issues.apache.org/jira/browse/SPARK-22816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22816) Basic tests for PromoteStrings and InConversion

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22816:


Assignee: (was: Apache Spark)

> Basic tests for PromoteStrings and InConversion
> ---
>
> Key: SPARK-22816
> URL: https://issues.apache.org/jira/browse/SPARK-22816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22816) Basic tests for PromoteStrings and InConversion

2017-12-16 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-22816:
---

 Summary: Basic tests for PromoteStrings and InConversion
 Key: SPARK-22816
 URL: https://issues.apache.org/jira/browse/SPARK-22816
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22815) Keep PromotePrecision in Optimized Plans

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22815:


Assignee: Apache Spark  (was: Xiao Li)

> Keep PromotePrecision in Optimized Plans
> 
>
> Key: SPARK-22815
> URL: https://issues.apache.org/jira/browse/SPARK-22815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> We could get incorrect results by running DecimalPrecision twice. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22815) Keep PromotePrecision in Optimized Plans

2017-12-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22815:


Assignee: Xiao Li  (was: Apache Spark)

> Keep PromotePrecision in Optimized Plans
> 
>
> Key: SPARK-22815
> URL: https://issues.apache.org/jira/browse/SPARK-22815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> We could get incorrect results by running DecimalPrecision twice. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22815) Keep PromotePrecision in Optimized Plans

2017-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293721#comment-16293721
 ] 

Apache Spark commented on SPARK-22815:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/2

> Keep PromotePrecision in Optimized Plans
> 
>
> Key: SPARK-22815
> URL: https://issues.apache.org/jira/browse/SPARK-22815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> We could get incorrect results by running DecimalPrecision twice. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22815) Keep PromotePrecision in Optimized Plans

2017-12-16 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22815:
---

 Summary: Keep PromotePrecision in Optimized Plans
 Key: SPARK-22815
 URL: https://issues.apache.org/jira/browse/SPARK-22815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Xiao Li
Assignee: Xiao Li


We could get incorrect results by running DecimalPrecision twice. 




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-16 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22806.
-
Resolution: Invalid

> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
> Attachments: WindowFunctionsWithGroupByError.scala
>
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered, "Window.partitionBy(column).orderBy(column)", and 
> when it is not ordered, "Window.partitionBy(column)".
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition")),
> sum("value").over(Window.partitionBy("partition")),
> stddev_pop("value").over(Window.partitionBy("partition"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
>   test("count, sum, stddev_pop functions over ordered by window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition").orderBy("key")),
> sum("value").over(Window.partitionBy("partition").orderBy("key")),
> 
> stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
> {code}
> The "count, sum, stddev_pop functions over ordered by window" fails with the 
> error:
> {noformat}
> == Results ==
> !== Correct Answer - 2 ==   == Spark Answer - 2 ==
> !struct<>   struct<key:string,count(value) OVER (PARTITION BY partition ORDER 
> BY key ASC NULLS FIRST unspecifiedframe$()):bigint,sum(value) OVER (PARTITION 
> BY partition ORDER BY key ASC NULLS FIRST unspecifiedframe$()):double,
> stddev_pop(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS FIRST 
> unspecifiedframe$()):double>
> ![a,2,300.0,50.0]   [a,1,100.0,0.0]
>  [b,2,300.0,50.0]   [b,2,300.0,50.0]
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293716#comment-16293716
 ] 

Marco Gaido commented on SPARK-22806:
-

This is the right behavior; PostgreSQL works the same way. If you specify an 
ORDER BY clause, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING 
AND CURRENT ROW, so each aggregate only covers the rows from the start of the 
partition up to the current row (and its ORDER BY peers).
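As a minimal sketch (reusing the DataFrame {{df}} from the second test above, 
and assuming {{spark.implicits._}} is in scope for the {{$}} syntax), making 
the frame explicit restores the whole-partition totals while keeping the 
ordering:

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, stddev_pop, sum}

// Explicitly widen the frame so each aggregate covers the whole partition,
// not just the rows up to the current one.
val w = Window.partitionBy("partition").orderBy("key")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select(
  $"key",
  count("value").over(w),
  sum("value").over(w),
  stddev_pop("value").over(w))
// expected to yield [a,2,300.0,50.0] and [b,2,300.0,50.0], matching the unordered case
{code}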

> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
> Attachments: WindowFunctionsWithGroupByError.scala
>
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered, "Window.partitionBy(column).orderBy(column)", and 
> when it is not ordered, "Window.partitionBy(column)".
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition")),
> sum("value").over(Window.partitionBy("partition")),
> stddev_pop("value").over(Window.partitionBy("partition"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
>   test("count, sum, stddev_pop functions over ordered by window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition").orderBy("key")),
> sum("value").over(Window.partitionBy("partition").orderBy("key")),
> 
> stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
> {code}
> The "count, sum, stddev_pop functions over ordered by window" fails with the 
> error:
> {noformat}
> == Results ==
> !== Correct Answer - 2 ==   == Spark Answer - 2 ==
> !struct<>   struct<key:string,count(value) OVER (PARTITION BY partition ORDER 
> BY key ASC NULLS FIRST unspecifiedframe$()):bigint,sum(value) OVER (PARTITION 
> BY partition ORDER BY key ASC NULLS FIRST unspecifiedframe$()):double,
> stddev_pop(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS FIRST 
> unspecifiedframe$()):double>
> ![a,2,300.0,50.0]   [a,1,100.0,0.0]
>  [b,2,300.0,50.0]   [b,2,300.0,50.0]
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org