[jira] [Assigned] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15616: Assignee: Apache Spark > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang >Assignee: Apache Spark > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes the partitions before > deciding on broadcast joins, based on the HDFS size of only the partitions > that are involved in the query. Spark SQL needs the same behavior, which can > improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
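Editor's note: as a rough illustration of the fallback SPARK-15616 asks for (and not Spark's actual implementation), the idea is to sum the on-disk size of only the partitions that survive filter pruning and compare that with the broadcast threshold when the metastore has no statistics. The directory paths, the helper names, and the use of a local-filesystem walk in place of the HDFS API are assumptions made for this sketch.
{code:python}
import os

# Illustrative stand-in for spark.sql.autoBroadcastJoinThreshold (default 10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024

def pruned_partitions_size(partition_dirs):
    # Sum the file sizes under the partition directories a Filter kept.
    # A local os.walk stands in for an HDFS content-summary call here.
    total = 0
    for d in partition_dirs:
        for root, _, files in os.walk(d):
            total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return total

def can_broadcast(pruned_partition_dirs):
    # Fall back to the measured size instead of the missing metastore statistics.
    return pruned_partitions_size(pruned_partition_dirs) <= BROADCAST_THRESHOLD_BYTES
{code}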
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305191#comment-15305191 ] Apache Spark commented on SPARK-15616: -- User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/13373 > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes the partitions before > deciding on broadcast joins, based on the HDFS size of only the partitions > that are involved in the query. Spark SQL needs the same behavior, which can > improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15616: Assignee: (was: Apache Spark) > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes the partitions before > deciding on broadcast joins, based on the HDFS size of only the partitions > that are involved in the query. Spark SQL needs the same behavior, which can > improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15585: Assignee: Apache Spark > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
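Editor's note: a minimal sketch of the Python-side change described above, assuming simplified option names; it is not the actual pyspark DataFrameReader code. The point is that the real defaults live in the function signature, so a caller-supplied None can travel through to the JVM side and mean "null" instead of being silently replaced by the default.
{code:python}
# Hypothetical helper: defaults are explicit in the signature, so None is
# no longer overloaded to mean "use the default value".
def csv_reader_options(sep=",", quote='"', escape="\\", header=False):
    # Values are forwarded as-is; an explicit None now unambiguously means
    # "no character configured" rather than triggering the default.
    return {"sep": sep, "quote": quote, "escape": escape, "header": header}

print(csv_reader_options())            # defaults come from the signature itself
print(csv_reader_options(quote=None))  # None survives and can map to null downstream
{code}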
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305185#comment-15305185 ] Apache Spark commented on SPARK-15585: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/13372 > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15585: Assignee: (was: Apache Spark) > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15639: Assignee: (was: Apache Spark) > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to > be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305180#comment-15305180 ] Apache Spark commented on SPARK-15639: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13371 > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to > be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15639: Assignee: Apache Spark > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to > be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
Liang-Chi Hsieh created SPARK-15639: --- Summary: Try to push down filter at RowGroups level for parquet reader Key: SPARK-15639 URL: https://issues.apache.org/jira/browse/SPARK-15639 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh When we use the vectorized parquet reader, although the base reader (i.e., SpecificParquetRecordReaderBase) will retrieve pushed-down filters for RowGroups-level filtering, we do not seem to actually set up the filters to be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
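Editor's note: for context, a small PySpark sketch of the code path this ticket targets; spark.sql.parquet.filterPushdown is the existing flag for Parquet filter pushdown, while the table path and the filter below are placeholders. Once the retrieved filters are actually set up at the RowGroups level, the reader should be able to skip whole row groups whose min/max statistics cannot match.
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rowgroup-pushdown-demo")
         .config("spark.sql.parquet.filterPushdown", "true")  # enable Parquet filter pushdown
         .getOrCreate())

# With RowGroups-level filtering wired up, row groups whose statistics
# exclude id < 100 could be skipped without decoding their pages.
df = spark.read.parquet("/path/to/parquet_table").where("id < 100")
df.explain()  # the scan node should list the pushed filters
{code}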
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305178#comment-15305178 ] Takeshi Yamamuro commented on SPARK-15585: -- okay, I'll push soon. > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15638: Description: See the attached pull request for details. > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > See the attached pull request for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15638: Assignee: Apache Spark (was: Reynold Xin) > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15638: Assignee: Reynold Xin (was: Apache Spark) > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305170#comment-15305170 ] Apache Spark commented on SPARK-15638: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13370 > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
Reynold Xin created SPARK-15638: --- Summary: Audit Dataset, SparkSession, and SQLContext functions and documentations Key: SPARK-15638 URL: https://issues.apache.org/jira/browse/SPARK-15638 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15611: Assignee: Apache Spark > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Assignee: Apache Spark >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
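Editor's note: the remedy the reporter describes can be shown in isolation. After os.fork() the child inherits the parent's Mersenne Twister state, so each child must re-seed before drawing numbers; random.seed() with no argument re-seeds from OS entropy. This is a standalone sketch of the behavior and of the fix, not the actual daemon.py patch.
{code:python}
import os
import random

def child_draw(reseed):
    pid = os.fork()
    if pid == 0:                     # child process
        if reseed:
            random.seed()            # re-seed from OS entropy, discarding the inherited state
        print("%d %.12f" % (os.getpid(), random.random()))
        os._exit(0)
    os.waitpid(pid, 0)

random.random()                      # advance the parent's generator once
for _ in range(3):
    child_draw(reseed=False)         # every child prints the same number
for _ in range(3):
    child_draw(reseed=True)          # children now print different numbers
{code}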
[jira] [Commented] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305169#comment-15305169 ] Apache Spark commented on SPARK-15611: -- User 'ThomasLau' has created a pull request for this issue: https://github.com/apache/spark/pull/13350 > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15611: Assignee: (was: Apache Spark) > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Summary: Got the same sequence random number in every forked worker. (was: Each forked worker in daemon.py keep the parent's random state) > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15553) Dataset.createTempView should use CreateViewCommand
[ https://issues.apache.org/jira/browse/SPARK-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15553. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.0.0 > Dataset.createTempView should use CreateViewCommand > --- > > Key: SPARK-15553 > URL: https://issues.apache.org/jira/browse/SPARK-15553 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Dataset.createTempView and Dataset.createOrReplaceTempView should use > CreateViewCommand, rather than calling SparkSession.createTempView. Once this > is done, we can also remove SparkSession.createTempView. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this {code:title=Output|borderStyle=solid} 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. {code} i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by shuffle.py which is imported by pyspark.worker, this worker, forked by *pid = os.fork()*, also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this {code:title=Output|borderStyle=solid} 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. {code} i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.paralleli
[jira] [Resolved] (SPARK-15597) Add SparkSession.emptyDataset
[ https://issues.apache.org/jira/browse/SPARK-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15597. - Resolution: Fixed Fix Version/s: 2.0.0 > Add SparkSession.emptyDataset > - > > Key: SPARK-15597 > URL: https://issues.apache.org/jira/browse/SPARK-15597 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > SparkSession currently has emptyDataFrame, but not emptyDataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15633) Make package name for Java tests consistent
[ https://issues.apache.org/jira/browse/SPARK-15633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15633. - Resolution: Fixed Fix Version/s: 2.0.0 > Make package name for Java tests consistent > --- > > Key: SPARK-15633 > URL: https://issues.apache.org/jira/browse/SPARK-15633 > Project: Spark > Issue Type: Sub-task > Components: Java API >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options
[ https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13184: Target Version/s: 2.1.0 > Support minPartitions parameter for JSON and CSV datasources as options > --- > > Key: SPARK-13184 > URL: https://issues.apache.org/jira/browse/SPARK-13184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > After looking through the pull requests below at Spark CSV datasources, > https://github.com/databricks/spark-csv/pull/256 > https://github.com/databricks/spark-csv/issues/141 > https://github.com/databricks/spark-csv/pull/186 > It looks Spark might need to be able to set {{minPartitions}}. > {{repartition()}} or {{coalesce()}} can be alternatives but it looks it needs > to shuffle the data for most cases. > Although I am still not sure if it needs this, I will open this ticket just > for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
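Editor's note: as a point of comparison for the alternatives mentioned above, the current workaround is to repartition (or coalesce) after the read, which generally costs a shuffle; a minPartitions option would let the datasource produce the desired split count up front. The path below is a placeholder and the snippet assumes a pyspark shell where spark is already defined.
{code:python}
# Workaround today: control parallelism after the read, accepting a shuffle.
df = spark.read.csv("/path/to/data.csv", header=True)
df = df.repartition(64)  # a minPartitions read option would avoid this extra shuffle
{code}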
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this {code:title=Output|borderStyle=solid} 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. {code} i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: {code:python} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305166#comment-15305166 ] Reynold Xin commented on SPARK-15585: - Feel free to create a pr with python changes and then we can iterate on the R part too. > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {code:python} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: {quote} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {quote} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:python} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > ```sh > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > ``` > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by the shuffle.py which is imported by pyspark.worker, > this
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {quote} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {quote} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: ```py from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) ``` once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {quote} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {quote} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > ```sh > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > ``` > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by the shuffle.py which is imported by pyspark.worker, > this worker, forked by
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Summary: Each forked worker in daemon.py keep the parent's random state (was: each forked worker in daemon.py keep the parent's random state) > Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > ```py > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > ``` > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > ```sh > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > ``` > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by the shuffle.py which is imported by pyspark.worker, > this worker, forked by "pid = os.fork()", also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305152#comment-15305152 ] Takeshi Yamamuro commented on SPARK-15585: -- okay > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15528) conv function returns inconsistent result for the same data
[ https://issues.apache.org/jira/browse/SPARK-15528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305150#comment-15305150 ] Takeshi Yamamuro commented on SPARK-15528: -- I tried this in master and I could reproduce; {code} import org.apache.spark.sql.functions._ val df = Seq(("", 0), ("", 1)).toDF("a", "b") (0 until 10).map(_ => df.select(countDistinct(conv(df("a"), 16, 10))).show) +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 2| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ {code} Sometimes, we could weirdly get not '1' but '2'. The explain is below; {code} == Physical Plan == *TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Final,isDistinct=true)], output=[count(DISTINCT conv(a, 16, 10))#15L]) +- Exchange SinglePartition, None +- *TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Partial,isDistinct=true)], output=[count#22L]) +- *TungstenAggregate(key=[conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19]) +- Exchange hashpartitioning(conv(a#5, 16, 10)#19, 200), None +- *TungstenAggregate(key=[conv(a#5, 16, 10) AS conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19]) +- LocalTableScan [a#5], [[],[]] {code} > conv function returns inconsistent result for the same data > --- > > Key: SPARK-15528 > URL: https://issues.apache.org/jira/browse/SPARK-15528 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Lior Regev > > When using F.conv to convert a column from a hexadecimal string to an > integer, the results are inconsistent > val col = F.conv(df("some_col"), 16, 10) > val a = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect() > val b = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect() > returns: > a: Array[org.apache.spark.sql.Row] = Array([59776,1941936]) > b: Array[org.apache.spark.sql.Row] = Array([59776,1965154]) > P.S. > "some_col" is a md5 hash of some string column calculated using F.md5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305143#comment-15305143 ] Dilip Biswal commented on SPARK-15634: -- I would like to work on this issue. > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar) > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at org.apache.spark.sql.Dataset.(Dataset.scala:187) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:532) > at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw
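A practical mitigation while the bug stands is to verify that the jar actually exists before issuing the DDL, so a bad path never reaches Hive's SessionState.add_resources and the session is never poisoned. The sketch below only illustrates that idea; the helper name, path, and class name are hypothetical and a local file: URI is assumed.
{code}
import java.net.URI
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Hypothetical guard (not taken from the report): only issue the DDL when the jar exists.
def registerIfJarExists(spark: SparkSession, name: String, className: String, jarUri: String): Unit = {
  if (Files.exists(Paths.get(new URI(jarUri)))) {
    spark.sql(s"""CREATE TEMPORARY FUNCTION $name AS "$className" USING JAR "$jarUri" """)
  } else {
    sys.error(s"Refusing to register $name: $jarUri does not exist")
  }
}
{code}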
[jira] [Resolved] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15610. --- Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/13356 > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
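The gist of the fix is simply that the error message and the check agree on whether the upper bound for {{k}} is inclusive. A minimal sketch of such a check, with {{k}} and {{numFeatures}} as assumed local values rather than the actual PCA code:
{code}
// Illustrative only: the message states the same inclusive bound the check enforces.
require(k >= 1 && k <= numFeatures,
  s"k = $k is out of range: k must be between 1 and numFeatures = $numFeatures inclusive")
{code}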
[jira] [Commented] (SPARK-12550) sbt-launch-lib.bash: line 72: 2404 Killed "$@"
[ https://issues.apache.org/jira/browse/SPARK-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305119#comment-15305119 ] Sean Owen commented on SPARK-12550: --- This is not from the Spark project. I mean, what docs _from the project_ lead to this error? > sbt-launch-lib.bash: line 72: 2404 Killed "$@" > --- > > Key: SPARK-12550 > URL: https://issues.apache.org/jira/browse/SPARK-12550 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04.3 LTS > Scala version 2.10.4 > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) >Reporter: ibrahim yilmaz > > sbt-launch-lib.bash: line 72: 2404 Killed "$@" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305117#comment-15305117 ] Sean Owen commented on SPARK-15619: --- Interesting, looks like it's related to the lz4 library, and I see a similar issue reported for Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-7712 It does create this temp library: https://github.com/jpountz/lz4-java/blob/b69d5676f74344bf04068594644fa5ecc2bb6a67/src/java/net/jpountz/util/Native.java#L81 but seems to do a pretty comprehensive job of trying to clean it up at shutdown. It might be left around after hard JVM failures / exits, in which case it may unfortunately be a side effect of testing failure conditions. I don't see anything in Spark that tries to manage it, and not sure it could. > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15562) Temp directory is not deleted after program exit in DataFrameExample
[ https://issues.apache.org/jira/browse/SPARK-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15562: -- Assignee: ding > Temp directory is not deleted after program exit in DataFrameExample > > > Key: SPARK-15562 > URL: https://issues.apache.org/jira/browse/SPARK-15562 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 >Reporter: ding >Assignee: ding >Priority: Minor > Fix For: 2.0.0 > > > The temp directory used to save records is not deleted after program exit in > DataFrameExample. Although deleteOnExit is called, it does not work because the > directory is not empty. Something similar happens in ContextCleanerSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15562) Temp directory is not deleted after program exit in DataFrameExample
[ https://issues.apache.org/jira/browse/SPARK-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15562. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13328 [https://github.com/apache/spark/pull/13328] > Temp directory is not deleted after program exit in DataFrameExample > > > Key: SPARK-15562 > URL: https://issues.apache.org/jira/browse/SPARK-15562 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 >Reporter: ding >Priority: Minor > Fix For: 2.0.0 > > > The temp directory used to save records is not deleted after program exit in > DataFrameExample. Although deleteOnExit is called, it does not work because the > directory is not empty. Something similar happens in ContextCleanerSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
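For context, java.io.File.deleteOnExit only removes a directory that is empty at JVM shutdown, so a temp directory that still holds output files survives. A hedged sketch of the usual remedy (illustrative only, not the code of pull request 13328) is to delete recursively from a shutdown hook:
{code}
import java.io.File

// File.delete (and therefore deleteOnExit) fails on non-empty directories, so recurse first.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// Assumed example path; register recursive cleanup instead of relying on deleteOnExit.
val tempDir = new File(System.getProperty("java.io.tmpdir"), "dataframe-example")
sys.addShutdownHook(deleteRecursively(tempDir))
{code}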
[jira] [Updated] (SPARK-15449) MLlib NaiveBayes example in Java uses wrong data format
[ https://issues.apache.org/jira/browse/SPARK-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15449: -- Assignee: Miao Wang > MLlib NaiveBayes example in Java uses wrong data format > --- > > Key: SPARK-15449 > URL: https://issues.apache.org/jira/browse/SPARK-15449 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 1.6.1 >Reporter: Kiran Biradarpatil >Assignee: Miao Wang >Priority: Minor > Fix For: 2.0.0 > > > The Java example given for MLlib NaiveBayes at > http://spark.apache.org/docs/latest/mllib-naive-bayes.html expects the data > in LibSVM format. But the example data file in MLlib, > data/mllib/sample_naive_bayes_data.txt, is not in the right format. > So please fix either the sample data file or the implementation example. > Thanks! > Kiran -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15449) MLlib NaiveBayes example in Java uses wrong data format
[ https://issues.apache.org/jira/browse/SPARK-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15449. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13301 [https://github.com/apache/spark/pull/13301] > MLlib NaiveBayes example in Java uses wrong data format > --- > > Key: SPARK-15449 > URL: https://issues.apache.org/jira/browse/SPARK-15449 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 1.6.1 >Reporter: Kiran Biradarpatil >Priority: Minor > Fix For: 2.0.0 > > > The Java example given for MLlib NaiveBayes at > http://spark.apache.org/docs/latest/mllib-naive-bayes.html expects the data > in LibSVM format. But the example data file in MLlib, > data/mllib/sample_naive_bayes_data.txt, is not in the right format. > So please fix either the sample data file or the implementation example. > Thanks! > Kiran -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
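As the report notes, the documented example expects LibSVM input. MLUtils.loadLibSVMFile is the standard reader for that format, where each line is a label followed by sorted 1-based index:value pairs. A small illustration, assuming an existing SparkContext and using the LibSVM sample that ships with Spark rather than the NaiveBayes file the report complains about:
{code}
import org.apache.spark.mllib.util.MLUtils

// A LibSVM line looks like: "1 2:0.5 7:1.0" (label, then 1-based index:value pairs).
// `sc` is an existing SparkContext; the path is the sample bundled in the Spark source tree.
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
examples.take(1).foreach(println)
{code}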
[jira] [Reopened] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-15607: --- > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15610: -- Priority: Trivial (was: Minor) Component/s: Documentation > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Reporter: zhengruifeng >Priority: Trivial > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15607. --- Resolution: Not A Problem > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15549) Disable bucketing when the output doesn't contain all bucketing columns
[ https://issues.apache.org/jira/browse/SPARK-15549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi updated SPARK-15549: -- Summary: Disable bucketing when the output doesn't contain all bucketing columns (was: Bucket column only need to be found in the output of relation when use bucketed table) > Disable bucketing when the output doesn't contain all bucketing columns > --- > > Key: SPARK-15549 > URL: https://issues.apache.org/jira/browse/SPARK-15549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yadong Qi > > I create a bucketed table test(i int, j int, k int) with bucket column i, > {code:java} > case class Data(i: Int, j: Int, k: Int) > sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, > x._3)).toDF.write.bucketBy(2, "i").saveAsTable("test") > {code} > and I run the following SQL: > {code:sql} > SELECT j FROM test; > Error in query: bucket column i not found in existing columns (j); > SELECT j, MAX(k) FROM test GROUP BY j; > Error in query: bucket column i not found in existing columns (j, k); > {code} > I think the bucket column only needs to be found in the output of the relation, so > the two SQL statements shown here should execute correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12550) sbt-launch-lib.bash: line 72: 2404 Killed "$@"
[ https://issues.apache.org/jira/browse/SPARK-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305095#comment-15305095 ] Greg Silverman edited comment on SPARK-12550 at 5/28/16 1:45 AM: - I am having the same exact issue on Debian 7.10 wheezy. I'm following these directions: http://www.mfactorengineering.com/blog/2015/spark/ . was (Author: horcle_buzz): I am having the same exact issue on Debian 7.10 wheezy. I'm following these directions: http://www.mfactorengineering.com/blog/2015/spark/ Scala is version 2.11.8, if it matters... > sbt-launch-lib.bash: line 72: 2404 Killed "$@" > --- > > Key: SPARK-12550 > URL: https://issues.apache.org/jira/browse/SPARK-12550 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04.3 LTS > Scala version 2.10.4 > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) >Reporter: ibrahim yilmaz > > sbt-launch-lib.bash: line 72: 2404 Killed "$@" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12550) sbt-launch-lib.bash: line 72: 2404 Killed "$@"
[ https://issues.apache.org/jira/browse/SPARK-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305095#comment-15305095 ] Greg Silverman commented on SPARK-12550: I am having the same exact issue on Debian 7.10 wheezy. I'm following these directions: http://www.mfactorengineering.com/blog/2015/spark/ Scala is version 2.11.8, if it matters... > sbt-launch-lib.bash: line 72: 2404 Killed "$@" > --- > > Key: SPARK-12550 > URL: https://issues.apache.org/jira/browse/SPARK-12550 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04.3 LTS > Scala version 2.10.4 > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) >Reporter: ibrahim yilmaz > > sbt-launch-lib.bash: line 72: 2404 Killed "$@" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305089#comment-15305089 ] zhengruifeng commented on SPARK-15617: -- I can work on this > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305086#comment-15305086 ] zhengruifeng commented on SPARK-15617: -- Revolutions(http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html#micro) also call it `Micro-averaged Metrics` > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
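For reference, "micro" averaging pools the per-class confusion counts over all classes before computing the metric, which for single-label multiclass data makes micro precision, micro recall, and micro F1 all coincide with overall accuracy. A standard formulation (not quoted from the Spark docs):
{code}
P_{\text{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad
R_{\text{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}, \qquad
F1_{\text{micro}} = \frac{2\, P_{\text{micro}}\, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
{code}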
[jira] [Commented] (SPARK-15637) SparkR tests failing on R 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305077#comment-15305077 ] Apache Spark commented on SPARK-15637: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13369 > SparkR tests failing on R 3.2.2 > --- > > Key: SPARK-15637 > URL: https://issues.apache.org/jira/browse/SPARK-15637 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > As discussed in SPARK-15439 > I think we have an issue here - I"m running R 3.2.2 and the mask tests are > failing because: > > R.version$minor > [1] "2.2" > And this is not strict enough? > if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) > { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) > namesOfMaskedCompletely <- c("endsWith", "startsWith", > namesOfMaskedCompletely) } > 1. Failure: Check masked functions (@test_context.R#35) > > length(maskedBySparkR) not equal to length(namesOfMasked). > 1/1 mismatches > [1] 20 - 22 == -2 > 2. Failure: Check masked functions (@test_context.R#36) > > sort(maskedBySparkR) not equal to sort(namesOfMasked). > Lengths differ: 20 vs 22 > 3. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 4. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15637) SparkR tests failing on R 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15637: Assignee: (was: Apache Spark) > SparkR tests failing on R 3.2.2 > --- > > Key: SPARK-15637 > URL: https://issues.apache.org/jira/browse/SPARK-15637 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > As discussed in SPARK-15439 > I think we have an issue here - I"m running R 3.2.2 and the mask tests are > failing because: > > R.version$minor > [1] "2.2" > And this is not strict enough? > if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) > { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) > namesOfMaskedCompletely <- c("endsWith", "startsWith", > namesOfMaskedCompletely) } > 1. Failure: Check masked functions (@test_context.R#35) > > length(maskedBySparkR) not equal to length(namesOfMasked). > 1/1 mismatches > [1] 20 - 22 == -2 > 2. Failure: Check masked functions (@test_context.R#36) > > sort(maskedBySparkR) not equal to sort(namesOfMasked). > Lengths differ: 20 vs 22 > 3. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 4. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15637) SparkR tests failing on R 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15637: Assignee: Apache Spark > SparkR tests failing on R 3.2.2 > --- > > Key: SPARK-15637 > URL: https://issues.apache.org/jira/browse/SPARK-15637 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Assignee: Apache Spark > > As discussed in SPARK-15439 > I think we have an issue here - I"m running R 3.2.2 and the mask tests are > failing because: > > R.version$minor > [1] "2.2" > And this is not strict enough? > if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) > { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) > namesOfMaskedCompletely <- c("endsWith", "startsWith", > namesOfMaskedCompletely) } > 1. Failure: Check masked functions (@test_context.R#35) > > length(maskedBySparkR) not equal to length(namesOfMasked). > 1/1 mismatches > [1] 20 - 22 == -2 > 2. Failure: Check masked functions (@test_context.R#36) > > sort(maskedBySparkR) not equal to sort(namesOfMasked). > Lengths differ: 20 vs 22 > 3. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 4. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15637) SparkR tests failing on R 3.2.2
Felix Cheung created SPARK-15637: Summary: SparkR tests failing on R 3.2.2 Key: SPARK-15637 URL: https://issues.apache.org/jira/browse/SPARK-15637 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Felix Cheung As discussed in SPARK-15439 I think we have an issue here - I"m running R 3.2.2 and the mask tests are failing because: > R.version$minor [1] "2.2" And this is not strict enough? if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) namesOfMaskedCompletely <- c("endsWith", "startsWith", namesOfMaskedCompletely) } 1. Failure: Check masked functions (@test_context.R#35) length(maskedBySparkR) not equal to length(namesOfMasked). 1/1 mismatches [1] 20 - 22 == -2 2. Failure: Check masked functions (@test_context.R#36) sort(maskedBySparkR) not equal to sort(namesOfMasked). Lengths differ: 20 vs 22 3. Failure: Check masked functions (@test_context.R#44) length(maskedCompletely) not equal to length(namesOfMaskedCompletely). 1/1 mismatches [1] 3 - 5 == -2 4. Failure: Check masked functions (@test_context.R#45) sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15557: - Target Version/s: 2.0.0 Description: expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK I find that maybe it will be null if the result is more than 100 was: expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK I find that maybe it will be null if the result is more than 100 > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
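A self-contained way to check the report, assuming a 2.0 spark-shell where the {{spark}} session is predefined (on 1.6 the same statements go through {{sqlContext.sql}}); the results are the reported behavior, not asserted here:
{code}
// Reproduction sketch for the two expressions from the report.
spark.sql("select (cast(99 as decimal(19,6)) + '3') * '2.3'").show()  // reported to return null
spark.sql("select (cast(40 as decimal(19,6)) + '3') * '2.3'").show()  // reported to work
{code}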
[jira] [Resolved] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
[ https://issues.apache.org/jira/browse/SPARK-15594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15594. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13343 [https://github.com/apache/spark/pull/13343] > ALTER TABLE ... SERDEPROPERTIES does not respect partition spec > --- > > Key: SPARK-15594 > URL: https://issues.apache.org/jira/browse/SPARK-15594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > {code} > case class AlterTableSerDePropertiesCommand( > tableName: TableIdentifier, > serdeClassName: Option[String], > serdeProperties: Option[Map[String, String]], > partition: Option[Map[String, String]]) > extends RunnableCommand { > {code} > The `partition` flag is not read anywhere! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
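To make the gap concrete: a statement of the following shape carries a partition spec that the command should honor, but before the fix the parsed `partition` argument was never used. Table, partition, and property names below are made up for illustration.
{code}
// Hypothetical example of the affected DDL form.
spark.sql("""
  ALTER TABLE logs PARTITION (dt = '2016-05-27')
  SET SERDEPROPERTIES ('field.delim' = ',')
""")
{code}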
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14343: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-15631 > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS >Reporter: Jurriaan Pruis >Priority: Blocker > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh} > #!/bin/sh > mkdir -p dataset/year=2014 > mkdir -p dataset/year=2015 > echo "data from 2014" > dataset/year=2014/part01.txt > echo "data from 2015" > dataset/year=2015/part01.txt > {code} > {code:title=repro2.sh} > #!/bin/sh > mkdir -p dataset2/month=june > mkdir -p dataset2/month=july > echo "data from june" > dataset2/month=june/part01.txt > echo "data from july" > dataset2/month=july/part01.txt > {code} > h3. using first dataset > {code:none} > >>> df = sqlContext.read.text('dataset') > ... > >>> df > DataFrame[value: string, year: int] > >>> df.show() > +--++ > | value|year| > +--++ > |data from 2014|2014| > |data from 2015|2015| > +--++ > >>> df.select('year').show() > ++ > |year| > ++ > | 14| > | 14| > ++ > {code} > This is clearly wrong. Seems like it returns the length of the value column? > h3. using second dataset > With another dataset it looks like this: > {code:none} > >>> df = sqlContext.read.text('dataset2') > >>> df > DataFrame[value: string, month: string] > >>> df.show() > +--+-+ > | value|month| > +--+-+ > |data from june| june| > |data from july| july| > +--+-+ > >>> df.select('month').show() > +--+ > | month| > +--+ > |data from june| > |data from july| > +--+ > {code} > Here it returns the value of the value column instead of the month partition. > h3. Workaround > When I convert the dataframe to an RDD and back to a DataFrame I get the > following result (which is the expected behaviour): > {code:none} > >>> df.rdd.toDF().select('month').show() > +-+ > |month| > +-+ > | june| > | july| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Description: error message for {{k}} should match the bound (was: Vector size must be greater than {{k}}, but now it support {{k == vector.size}}) > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) PCA should not support k == numFeatures
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Priority: Minor (was: Major) > PCA should not support k == numFeatures > --- > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Vector size must be greater than {{k}}, but now it supports {{k == > vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Summary: update error message for k in pca (was: PCA should not support k == numFeatures) > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Vector size must be greater than {{k}}, but now it supports {{k == > vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15632: --- Description: Filter operations should never change query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. # Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. was: Filter operations should never changes query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. 
# Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Filter operations should never change query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', '
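The wrapping described in that issue is visible directly in the query plan. A rough way to see it, with {{ds1}} being the Dataset[A] from the description (plan output abbreviated and only indicative of shape, not captured from a real run):
{code}
// The typed filter adds the object round-trip around the Filter operator.
val ds2 = ds1.filter(_.b > 1)
ds2.explain(true)
// Expected shape (approximate):
// SerializeFromObject [... b, a ...]
// +- Filter <function>.apply
//    +- DeserializeToObject newInstance(class A)
//       +- ... original plan with columns a, b, c ...
{code}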
[jira] [Updated] (SPARK-9876) Upgrade parquet-mr to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9876: -- Assignee: Ryan Blue > Upgrade parquet-mr to 1.8.1 > --- > > Key: SPARK-9876 > URL: https://issues.apache.org/jira/browse/SPARK-9876 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Ryan Blue > Fix For: 2.0.0 > > > {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example > PARQUET-201 (SPARK-9407). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9876) Upgrade parquet-mr to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9876. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13280 [https://github.com/apache/spark/pull/13280] > Upgrade parquet-mr to 1.8.1 > --- > > Key: SPARK-9876 > URL: https://issues.apache.org/jira/browse/SPARK-9876 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Ryan Blue > Fix For: 2.0.0 > > > {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example > PARQUET-201 (SPARK-9407). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15607. Resolution: Won't Fix > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15291) Remove redundant codes in SVD++
[ https://issues.apache.org/jira/browse/SPARK-15291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15291. Resolution: Won't Fix > Remove redundant codes in SVD++ > --- > > Key: SPARK-15291 > URL: https://issues.apache.org/jira/browse/SPARK-15291 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: zhengruifeng >Priority: Minor > > {code} > val newVertices = g.vertices.mapValues(v => (v._1.toArray, v._2.toArray, > v._3, v._4)) > (Graph(newVertices, g.edges), u) > {code} > is just the same as > {code} > (g, u) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15557: Assignee: (was: Apache Spark) > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304994#comment-15304994 ] Apache Spark commented on SPARK-15557: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/13368 > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15557: Assignee: Apache Spark > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai >Assignee: Apache Spark > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1)
[ https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9576: --- Target Version/s: 2.1.0 (was: 2.0.0) > DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1) > - > > Key: SPARK-9576 > URL: https://issues.apache.org/jira/browse/SPARK-9576 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1)
[ https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9576: --- Summary: DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1) (was: DataFrame API improvement umbrella ticket (Spark 2.0)) > DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1) > - > > Key: SPARK-9576 > URL: https://issues.apache.org/jira/browse/SPARK-9576 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15636: Description: Aggregate expressions have very long string representations in explain outputs. For more information, see the description here: https://github.com/apache/spark/pull/13367 was: Aggregate expressions have very long string representations in explain outputs. [I will fill in more details in a bit] > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Aggregate expressions have very long string representations in explain > outputs. For more information, see the description here: > https://github.com/apache/spark/pull/13367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304934#comment-15304934 ] Apache Spark commented on SPARK-15636: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13367 > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Aggregate expressions have very long string representations in explain > outputs. > [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15636: Assignee: Reynold Xin (was: Apache Spark) > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Aggregate expressions have very long string representations in explain > outputs. > [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15636: Assignee: Apache Spark (was: Reynold Xin) > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Aggregate expressions have very long string representations in explain > outputs. > [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15636) Make aggregate expressions more concise in explain
Reynold Xin created SPARK-15636: --- Summary: Make aggregate expressions more concise in explain Key: SPARK-15636 URL: https://issues.apache.org/jira/browse/SPARK-15636 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Aggregate expressions have very long string representations in explain outputs. [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15635) ALTER TABLE RENAME doesn't work for datasource tables
Andrew Or created SPARK-15635: - Summary: ALTER TABLE RENAME doesn't work for datasource tables Key: SPARK-15635 URL: https://issues.apache.org/jira/browse/SPARK-15635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or {code} scala> sql("CREATE TABLE students (age INT, name STRING) USING parquet") scala> sql("ALTER TABLE students RENAME TO teachers") scala> spark.table("teachers").show() com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:67) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:583) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:579) ... 48 elided Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:351) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:340) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
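The stack trace suggests the renamed catalog entry still carries the old data path, so the lookup of the new name resolves to the pre-rename location. An illustrative way to confirm that from the shell, reusing the table name from the repro above:
{code}
// After the failing rename, inspect what location the catalog recorded;
// before the fix it is expected to still point at the old .../students directory.
spark.sql("DESCRIBE FORMATTED teachers").show(100, false)
{code}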
[jira] [Comment Edited] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304911#comment-15304911 ] shane knapp edited comment on SPARK-15619 at 5/27/16 10:33 PM: --- next time we have a maintenance, i will wipe /tmp completely so that we can at least try and see what's creating what... right now it's such a mess that it's hard to attribute anything to anything. i did watch as a spark build (spark-master-test-maven-hadoop-2.7, IIRC) dump a bunch of the liblz4-java3891256912513794605.so files in /tmp this morning on worker-08 (the number string changes for each file). and, so far today, on just one worker, we've had 2628 of these files left in /tmp: [root@amp-jenkins-worker-08 tmp]# ls -lt | grep liblz4 |grep "May 27" | wc -l 2628 i'm not worried about us running out of disk, and this is something i can manage on the system-level, but it'd still be nice to have well behaved tests. :) was (Author: shaneknapp): next time we have a maintenance, i will wipe /tmp completely so that we can at least try and see what's creating what... right now it's such a mess that it's hard to attribute anything to anything. i did watch as a spark build (spark-master-test-maven-hadoop-2.7, IIRC) dump a bunch of the liblz4-java3891256912513794605.so files in /tmp this morning on worker-08 (the number string changes for each file). and, so far today, we've had 2628 of these files left in /tmp: [root@amp-jenkins-worker-08 tmp]# ls -lt | grep liblz4 |grep "May 27" | wc -l 2628 i'm not worried about us running out of disk, and this is something i can manage on the system-level, but it'd still be nice to have well behaved tests. :) > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15450. --- Resolution: Fixed Fix Version/s: 2.0.0 > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15534) TRUNCATE TABLE should throw exceptions, not logError
[ https://issues.apache.org/jira/browse/SPARK-15534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15534. --- Resolution: Fixed Fix Version/s: 2.0.0 > TRUNCATE TABLE should throw exceptions, not logError > > > Key: SPARK-15534 > URL: https://issues.apache.org/jira/browse/SPARK-15534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > If the table to truncate doesn't exist, throw an exception! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15535) Remove code for TRUNCATE TABLE ... COLUMN
[ https://issues.apache.org/jira/browse/SPARK-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15535. --- Resolution: Fixed Fix Version/s: 2.0.0 > Remove code for TRUNCATE TABLE ... COLUMN > - > > Key: SPARK-15535 > URL: https://issues.apache.org/jira/browse/SPARK-15535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > This was never supported in the first place. Also Hive doesn't support it: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15450: -- Assignee: Eric Liang (was: Andrew Or) > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304911#comment-15304911 ] shane knapp commented on SPARK-15619: - next time we have a maintenance, i will wipe /tmp completely so that we can at least try and see what's creating what... right now it's such a mess that it's hard to attribute anything to anything. i did watch as a spark build (spark-master-test-maven-hadoop-2.7, IIRC) dump a bunch of the liblz4-java3891256912513794605.so files in /tmp this morning on worker-08 (the number string changes for each file). and, so far today, we've had 2628 of these files left in /tmp: [root@amp-jenkins-worker-08 tmp]# ls -lt | grep liblz4 |grep "May 27" | wc -l 2628 i'm not worried about us running out of disk, and this is something i can manage on the system-level, but it'd still be nice to have well behaved tests. :) > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
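The liblz4-java*.so files mentioned above are the native library that lz4-java extracts into java.io.tmpdir when it loads. One way a build can keep that out of the shared /tmp (a sketch, not necessarily how the Spark build is configured) is to point the forked test JVMs at a directory inside the build workspace, e.g. in sbt:
{code}
// build.sbt sketch: route temp files, including extracted native libs, to target/tmp
fork in Test := true
javaOptions in Test += s"-Djava.io.tmpdir=${(target.value / "tmp").getAbsolutePath}"
// note: the directory must exist before the tests start, e.g. created by a setup step
{code}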
[jira] [Commented] (SPARK-15622) Janino's classloader has an unexpected behavior when its parent classloader throws a ClassNotFoundException with a cause set
[ https://issues.apache.org/jira/browse/SPARK-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304905#comment-15304905 ] Apache Spark commented on SPARK-15622: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13366 > Janino's classloader has an unexpected behavior when its parent classloader > throws an ClassNotFoundException with a cause set > - > > Key: SPARK-15622 > URL: https://issues.apache.org/jira/browse/SPARK-15622 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > At > https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, > Janino's classloader throws the exception when its parent throws a > ClassNotFoundException with a cause set. However, it does not throw the > exception when there is no cause set. Seems we need to create a special > ClassLoader to wrap the actual parent classloader set to Janino handle this > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15622) Janino's classloader has an unexpected behavior when its parent classloader throws a ClassNotFoundException with a cause set
[ https://issues.apache.org/jira/browse/SPARK-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15622: Assignee: Apache Spark (was: Yin Huai) > Janino's classloader has an unexpected behavior when its parent classloader > throws an ClassNotFoundException with a cause set > - > > Key: SPARK-15622 > URL: https://issues.apache.org/jira/browse/SPARK-15622 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > At > https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, > Janino's classloader throws the exception when its parent throws a > ClassNotFoundException with a cause set. However, it does not throw the > exception when there is no cause set. Seems we need to create a special > ClassLoader to wrap the actual parent classloader set to Janino handle this > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15622) Janino's classloader has an unexpected behavior when its parent classloader throws a ClassNotFoundException with a cause set
[ https://issues.apache.org/jira/browse/SPARK-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15622: Assignee: Yin Huai (was: Apache Spark) > Janino's classloader has an unexpected behavior when its parent classloader > throws an ClassNotFoundException with a cause set > - > > Key: SPARK-15622 > URL: https://issues.apache.org/jira/browse/SPARK-15622 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > At > https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, > Janino's classloader throws the exception when its parent throws a > ClassNotFoundException with a cause set. However, it does not throw the > exception when there is no cause set. Seems we need to create a special > ClassLoader to wrap the actual parent classloader set to Janino handle this > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
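The "special ClassLoader" idea from the description can be illustrated with a small Scala sketch; this only shows the approach and is not the code in the linked pull request:
{code}
// Sketch: strip the cause so Janino's ClassLoaderIClassLoader takes its
// normal "class not found" path instead of propagating the exception.
class CauseFreeClassLoader(parent: ClassLoader) extends ClassLoader(parent) {
  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    try super.loadClass(name, resolve)
    catch {
      case _: ClassNotFoundException =>
        // Re-throw without a cause, which Janino treats as "not found".
        throw new ClassNotFoundException(name)
    }
  }
}
{code}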
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated SPARK-15489: -- Description: When setting a custom "spark.kryo.registrator" (or any other configuration for that matter) through the API, this configuration will not propagate to the encoder that uses a KryoSerializer since it instantiates with "new SparkConf()". See: https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 This could be hacked by providing those configurations as System properties, but this probably should be passed to the encoder and set in the SerializerInstance after creation. Example: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. was: When setting a custom "spark.kryo.registrator" (or any other configuration for that matter) through the API, this configuration will not propagate to the encoder that uses a KryoSerializer since it instantiates with "new SparkConf()". See: Example: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated SPARK-15489: -- Description: When setting a custom "spark.kryo.registrator" (or any other configuration for that matter) through the API, this configuration will not propagate to the encoder that uses a KryoSerializer since it instantiates with "new SparkConf()". See: Example: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. was: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304898#comment-15304898 ] Bryan Cutler edited comment on SPARK-15623 at 5/27/16 10:11 PM: I was only able to quickly go though the user guide and api docs, but did not see any breaking api changes and just a couple missing params that were being worked on currently. It might be good if someone else is able to take a look, otherwise I can give it a more thorough pass when I get a chance. The discrepancies I found in the docs went in PR [#13159|https://github.com/apache/spark/pull/13159] was (Author: bryanc): I was only able to quickly go though the user guide and api docs, but did not see any breaking api changes and just a couple missing params that were being worked on currently. It might be good if someone else is able to take a look, otherwise I can give it a more thorough pass when I get a chance. The discrepancies I found in the docs went in PR #13159 > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304898#comment-15304898 ] Bryan Cutler commented on SPARK-15623: -- I was only able to quickly go though the user guide and api docs, but did not see any breaking api changes and just a couple missing params that were being worked on currently. It might be good if someone else is able to take a look, otherwise I can give it a more thorough pass when I get a chance. The discrepancies I found in the docs went in PR #13159 > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304897#comment-15304897 ] Wenchen Fan commented on SPARK-15632: - good catch! we should not implement typed filter in this way, but always embed deserializer in filter condition. We can create a `TypedFilter` and a `TypedFilterWithObject` operator, and optimize it case by case like we did to `AppendColumns` and `AppendColumnWithObject`. In the planner we can just plan typed filter with normal filter physical operator. > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. reordered `a` and `b`, and > // |-- b: double (nullable = true)2. dropped `c`, and > // |-- a: string (nullable = true)3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > This is becase we wraps the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair. > {{DeserializeToObject}} does a bunch of magic tricks here: > # Field order change > #- {{DeserializeToObject}} resolves the encoder deserializer expression by > **name**. Thus field order in input query plan doesn't matter. > # Field number change > #- Same as above, fields not referred by the encoder are silently dropped > while resolving deserializer expressions by name. > # Field data type change > #- When generating deserializer expressions, we allows "sane" implicit > coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} > operators. Thus actual field data types in input query plan don't matter > either as long as there are valid implicit coercions. > Actually, even field names may change once [PR > #13269|https://github.com/apache/spark/pull/13269] gets merged, because it > introduces case-insensitive encoder resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
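Until such a change lands, the schema difference is easy to reproduce side by side, and an untyped filter can be used when only the filtering (not the object view) is needed, since it does not insert the serialize/deserialize pair. A sketch reusing df1 and A from the report, assuming spark.implicits._ is in scope as in the shell:
{code}
// Untyped filter: the plan keeps the original schema (a, b, c).
val kept = df1.filter($"b" > 1)
kept.printSchema()

// Typed filter: goes through the encoder and re-derives the schema from A.
val changed = df1.as[A].filter(_.b > 1)
changed.printSchema()
{code}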
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated SPARK-15489: -- Summary: Dataset kryo encoder won't load custom user settings (was: Dataset kryo encoder fails on Collections$UnmodifiableCollection) > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304893#comment-15304893 ] Amit Sela commented on SPARK-15489: --- The issue here is the fact that setting the SparkConf does not propagate to the KryoSerializer used by the encoder. I managed to make this work by using Java System properties instead of SparkConf#set since the SparkConf constructor will take them into account, but it's a hack... For now I think I'll change the description of the issue, and propose this as a temporary solution. > Dataset kryo encoder fails on Collections$UnmodifiableCollection > > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
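A sketch of the temporary workaround described in the comment above: because the encoder's serializer is built from new SparkConf(), which reads spark.* JVM system properties, the settings can be injected that way before the encoder is used. The registrator class name here is a placeholder:
{code}
// Workaround sketch: these properties are picked up by `new SparkConf()`
// inside the kryo encoder's serializer, unlike values set only via SparkConf#set.
System.setProperty("spark.kryo.registrator", "com.example.MyRegistrator")  // placeholder class
System.setProperty("spark.kryo.registrationRequired", "false")

val enc = org.apache.spark.sql.Encoders.kryo[AnyRef]  // roughly Encoders.kryo(Object.class)
{code}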
[jira] [Assigned] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15618: Assignee: Apache Spark (was: Dongjoon Hyun) > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304891#comment-15304891 ] Apache Spark commented on SPARK-15618: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13365 > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15618: Assignee: Dongjoon Hyun (was: Apache Spark) > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
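A sketch of the test-side pattern the issue asks for; Builder.sparkContext(...) is a Spark-internal hook (private[spark]) rather than a public API, so this applies to Spark's own test suites:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// `sc` stands in for the suite's existing SparkContext.
def sessionFor(sc: SparkContext): SparkSession =
  SparkSession.builder()
    .sparkContext(sc)   // be explicit about the backing context
    .getOrCreate()      // instead of relying on whatever getOrCreate() would find
{code}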
[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304886#comment-15304886 ] Amit Sela commented on SPARK-15489: --- Got it! So I wasn't using the custom registrator correctly, it works better like this: public class ImmutablesRegistrator implements KryoRegistrator { @Override public void registerClasses(Kryo kryo) { UnmodifiableCollectionsSerializer.registerSerializers(kryo); // Guava ImmutableListSerializer.registerSerializers(kryo); ImmutableSetSerializer.registerSerializers(kryo); ImmutableMapSerializer.registerSerializers(kryo); ImmutableMultimapSerializer.registerSerializers(kryo); } } > Dataset kryo encoder fails on Collections$UnmodifiableCollection > > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304884#comment-15304884 ] Seth Hendrickson commented on SPARK-15581: -- [~BenFradet] See [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] for multinomial logistic regression. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. 
> Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity > with the
[jira] [Commented] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304874#comment-15304874 ] Sean Owen commented on SPARK-15619: --- Although I think we've cleaned up this over time, and even seen a few fixes for this recently, I imagine it will not be perfect. Yes, a cron job sounds like a good solution. If you can post the names of the dirs, maybe that would narrow it down. > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304872#comment-15304872 ] Eric Liang edited comment on SPARK-15634 at 5/27/16 9:57 PM: - Note that adding jars in the repl also doesn't work currently, so this issue may be minor (see linked issue). cc [~yhuai] was (Author: ekhliang): Note that adding jars in the repl also doesn't work currently, so this issue may be minor (see linked issue). > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar) > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at org.apache.spark.sql.Datas
[jira] [Commented] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304872#comment-15304872 ] Eric Liang commented on SPARK-15634: Note that adding jars in the repl also doesn't work currently, so this issue may be minor (see linked issue). > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar) > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at org.apache.spark.sql.Dataset.(Dataset.scala:187) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) > at org.apache.spark.sql.SparkSession.sql(S
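Until the catalog state is handled more gracefully, a defensive sketch on the user side is to verify that the jar exists before issuing the statement; the path and class name below are the same placeholders used in the report:
{code}
import java.nio.file.{Files, Paths}

val jar = "/path/to/example.jar"
if (Files.exists(Paths.get(jar))) {
  spark.sql(s"CREATE TEMPORARY FUNCTION x AS 'com.example.functions.Function' USING JAR 'file://$jar'")
} else {
  sys.error(s"refusing to register function: $jar does not exist")
}
{code}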
[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304873#comment-15304873 ] Cheng Lian commented on SPARK-15632: cc [~cloud_fan] [~marmbrus] > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Dataset typed filter operation changes query plan schema > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. reordered `a` and `b`, and > // |-- b: double (nullable = true)2. dropped `c`, and > // |-- a: string (nullable = true)3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > This is becase we wraps the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair. > {{DeserializeToObject}} does a bunch of magic tricks here: > # Field order change > #- {{DeserializeToObject}} resolves the encoder deserializer expression by > **name**. Thus field order in input query plan doesn't matter. > # Field number change > #- Same as above, fields not referred by the encoder are silently dropped > while resolving deserializer expressions by name. > # Field data type change > #- When generating deserializer expressions, we allows "sane" implicit > coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} > operators. Thus actual field data types in input query plan don't matter > either as long as there are valid implicit coercions. > Actually, even field names may change once [PR > #13269|https://github.com/apache/spark/pull/13269] gets merged, because it > introduces case-insensitive encoder resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15632: --- Description: Filter operations should never changes query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. # Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. was: Dataset typed filter operation changes query plan schema Filter operations should never changes query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. 
# Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( >