[jira] [Assigned] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15616: Assignee: Apache Spark > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang >Assignee: Apache Spark > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes the partitions before > deciding on broadcast joins, based on the HDFS size of only the partitions > that are involved in the query. Spark SQL needs the same behavior, which can > improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
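Editor's note: as a rough illustration of the fallback SPARK-15616 asks for (and not Spark's actual implementation), the idea is to sum the on-disk size of only the partitions that survive filter pruning and compare that with the broadcast threshold when the metastore has no statistics. The directory paths, the helper names, and the use of a local-filesystem walk in place of the HDFS API are assumptions made for this sketch.
{code:python}
import os

# Illustrative stand-in for spark.sql.autoBroadcastJoinThreshold (default 10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024

def pruned_partitions_size(partition_dirs):
    # Sum the file sizes under the partition directories a Filter kept.
    # A local os.walk stands in for an HDFS content-summary call here.
    total = 0
    for d in partition_dirs:
        for root, _, files in os.walk(d):
            total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return total

def can_broadcast(pruned_partition_dirs):
    # Fall back to the measured size instead of the missing metastore statistics.
    return pruned_partitions_size(pruned_partition_dirs) <= BROADCAST_THRESHOLD_BYTES
{code}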
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305191#comment-15305191 ] Apache Spark commented on SPARK-15616: -- User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/13373 > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes the partitions before > deciding on broadcast joins, based on the HDFS size of only the partitions > that are involved in the query. Spark SQL needs the same behavior, which can > improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15616: Assignee: (was: Apache Spark) > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide > whether we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes the partitions before > deciding on broadcast joins, based on the HDFS size of only the partitions > that are involved in the query. Spark SQL needs the same behavior, which can > improve join performance for partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15585: Assignee: Apache Spark > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
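Editor's note: a minimal sketch of the Python-side change described above, assuming simplified option names; it is not the actual pyspark DataFrameReader code. The point is that the real defaults live in the function signature, so a caller-supplied None can travel through to the JVM side and mean "null" instead of being silently replaced by the default.
{code:python}
# Hypothetical helper: defaults are explicit in the signature, so None is
# no longer overloaded to mean "use the default value".
def csv_reader_options(sep=",", quote='"', escape="\\", header=False):
    # Values are forwarded as-is; an explicit None now unambiguously means
    # "no character configured" rather than triggering the default.
    return {"sep": sep, "quote": quote, "escape": escape, "header": header}

print(csv_reader_options())            # defaults come from the signature itself
print(csv_reader_options(quote=None))  # None survives and can map to null downstream
{code}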
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305185#comment-15305185 ] Apache Spark commented on SPARK-15585: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/13372 > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15585: Assignee: (was: Apache Spark) > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15639: Assignee: (was: Apache Spark) > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to > be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305180#comment-15305180 ] Apache Spark commented on SPARK-15639: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13371 > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to > be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15639: Assignee: Apache Spark > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to > be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
Liang-Chi Hsieh created SPARK-15639: --- Summary: Try to push down filter at RowGroups level for parquet reader Key: SPARK-15639 URL: https://issues.apache.org/jira/browse/SPARK-15639 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh When we use the vectorized parquet reader, although the base reader (i.e., SpecificParquetRecordReaderBase) will retrieve pushed-down filters for RowGroups-level filtering, we do not seem to actually set up the filters to be pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
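Editor's note: for context, a small PySpark sketch of the code path this ticket targets; spark.sql.parquet.filterPushdown is the existing flag for Parquet filter pushdown, while the table path and the filter below are placeholders. Once the retrieved filters are actually set up at the RowGroups level, the reader should be able to skip whole row groups whose min/max statistics cannot match.
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rowgroup-pushdown-demo")
         .config("spark.sql.parquet.filterPushdown", "true")  # enable Parquet filter pushdown
         .getOrCreate())

# With RowGroups-level filtering wired up, row groups whose statistics
# exclude id < 100 could be skipped without decoding their pages.
df = spark.read.parquet("/path/to/parquet_table").where("id < 100")
df.explain()  # the scan node should list the pushed filters
{code}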
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305178#comment-15305178 ] Takeshi Yamamuro commented on SPARK-15585: -- okay, I'll push soon. > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15638: Description: See the attached pull request for details. > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > See the attached pull request for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15638: Assignee: Apache Spark (was: Reynold Xin) > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15638: Assignee: Reynold Xin (was: Apache Spark) > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
[ https://issues.apache.org/jira/browse/SPARK-15638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305170#comment-15305170 ] Apache Spark commented on SPARK-15638: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13370 > Audit Dataset, SparkSession, and SQLContext functions and documentations > > > Key: SPARK-15638 > URL: https://issues.apache.org/jira/browse/SPARK-15638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15638) Audit Dataset, SparkSession, and SQLContext functions and documentations
Reynold Xin created SPARK-15638: --- Summary: Audit Dataset, SparkSession, and SQLContext functions and documentations Key: SPARK-15638 URL: https://issues.apache.org/jira/browse/SPARK-15638 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15611: Assignee: Apache Spark > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Assignee: Apache Spark >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
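Editor's note: the remedy the reporter describes can be shown in isolation. After os.fork() the child inherits the parent's Mersenne Twister state, so each child must re-seed before drawing numbers; random.seed() with no argument re-seeds from OS entropy. This is a standalone sketch of the behavior and of the fix, not the actual daemon.py patch.
{code:python}
import os
import random

def child_draw(reseed):
    pid = os.fork()
    if pid == 0:                     # child process
        if reseed:
            random.seed()            # re-seed from OS entropy, discarding the inherited state
        print("%d %.12f" % (os.getpid(), random.random()))
        os._exit(0)
    os.waitpid(pid, 0)

random.random()                      # advance the parent's generator once
for _ in range(3):
    child_draw(reseed=False)         # every child prints the same number
for _ in range(3):
    child_draw(reseed=True)          # children now print different numbers
{code}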
[jira] [Commented] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305169#comment-15305169 ] Apache Spark commented on SPARK-15611: -- User 'ThomasLau' has created a pull request for this issue: https://github.com/apache/spark/pull/13350 > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15611: Assignee: (was: Apache Spark) > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15611) Got the same sequence random number in every forked worker.
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Summary: Got the same sequence random number in every forked worker. (was: Each forked worker in daemon.py keep the parent's random state) > Got the same sequence random number in every forked worker. > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > {code} > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by shuffle.py which is imported by pyspark.worker, > this worker, forked by *pid = os.fork()*, also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15553) Dataset.createTempView should use CreateViewCommand
[ https://issues.apache.org/jira/browse/SPARK-15553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15553. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.0.0 > Dataset.createTempView should use CreateViewCommand > --- > > Key: SPARK-15553 > URL: https://issues.apache.org/jira/browse/SPARK-15553 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Dataset.createTempView and Dataset.createOrReplaceTempView should use > CreateViewCommand, rather than calling SparkSession.createTempView. Once this > is done, we can also remove SparkSession.createTempView. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this {code:title=Output|borderStyle=solid} 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. {code} i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by shuffle.py which is imported by pyspark.worker, this worker, forked by *pid = os.fork()*, also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this {code:title=Output|borderStyle=solid} 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. {code} i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.paralleli
[jira] [Resolved] (SPARK-15597) Add SparkSession.emptyDataset
[ https://issues.apache.org/jira/browse/SPARK-15597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15597. - Resolution: Fixed Fix Version/s: 2.0.0 > Add SparkSession.emptyDataset > - > > Key: SPARK-15597 > URL: https://issues.apache.org/jira/browse/SPARK-15597 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > SparkSession currently has emptyDataFrame, but not emptyDataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15633) Make package name for Java tests consistent
[ https://issues.apache.org/jira/browse/SPARK-15633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15633. - Resolution: Fixed Fix Version/s: 2.0.0 > Make package name for Java tests consistent > --- > > Key: SPARK-15633 > URL: https://issues.apache.org/jira/browse/SPARK-15633 > Project: Spark > Issue Type: Sub-task > Components: Java API >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options
[ https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13184: Target Version/s: 2.1.0 > Support minPartitions parameter for JSON and CSV datasources as options > --- > > Key: SPARK-13184 > URL: https://issues.apache.org/jira/browse/SPARK-13184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > After looking through the pull requests below at Spark CSV datasources, > https://github.com/databricks/spark-csv/pull/256 > https://github.com/databricks/spark-csv/issues/141 > https://github.com/databricks/spark-csv/pull/186 > It looks Spark might need to be able to set {{minPartitions}}. > {{repartition()}} or {{coalesce()}} can be alternatives but it looks it needs > to shuffle the data for most cases. > Although I am still not sure if it needs this, I will open this ticket just > for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
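Editor's note: as a point of comparison for the alternatives mentioned above, the current workaround is to repartition (or coalesce) after the read, which generally costs a shuffle; a minPartitions option would let the datasource produce the desired split count up front. The path below is a placeholder and the snippet assumes a pyspark shell where spark is already defined.
{code:python}
# Workaround today: control parallelism after the read, accepting a shuffle.
df = spark.read.csv("/path/to/data.csv", header=True)
df = df.repartition(64)  # a minPartitions read option would avoid this extra shuffle
{code}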
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this {code:title=Output|borderStyle=solid} 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. {code} i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: {code:python} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:java|title=marlkov.py|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > {code:title=Output|borderStyle=solid} > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305166#comment-15305166 ] Reynold Xin commented on SPARK-15585: - Feel free to create a pr with python changes and then we can iterate on the R part too. > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {code:python} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {code} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: {quote} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {quote} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {code:python} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {code} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > ```sh > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > ``` > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by the shuffle.py which is imported by pyspark.worker, > this
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Description: hi, i'm writing some code as below: {quote} from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) {quote} once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. was: hi, i'm writing some code as below: ```py from random import random from operator import add def funcx( x ): print x[0],x[1] return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 def genRnd(ind): x=random() * 2 - 1 y=random() * 2 - 1 return (x,y) def runsp(total): ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, y: x + y)/float(total) * 4 print ret runsp(3) ``` once started the pyspark shell, no matter how many times i run "runsp(N)" , this code always get a same sequece of random numbers, like this ```sh 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) >>> * 4 0.896083541418 -0.635625854075 -0.0423532645466 -0.526910255885 0.498518696049 -0.872983895832 1. ``` i think this is because when we import pyspark.worker in the daemon.py, we alse import a random by the shuffle.py which is imported by pyspark.worker, this worker, forked by "pid = os.fork()", also remains the state of the parent's random, thus every forked worker get the same random.next(). we need to re-random the random by random.seed, which will solve the problem, but i think this PR. may not be the proper fix. ths. 
> Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > {quote} > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > {quote} > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > ```sh > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > ``` > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by the shuffle.py which is imported by pyspark.worker, > this worker, forked by
[jira] [Updated] (SPARK-15611) Each forked worker in daemon.py keep the parent's random state
[ https://issues.apache.org/jira/browse/SPARK-15611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Lau updated SPARK-15611: --- Summary: Each forked worker in daemon.py keep the parent's random state (was: each forked worker in daemon.py keep the parent's random state) > Each forked worker in daemon.py keep the parent's random state > --- > > Key: SPARK-15611 > URL: https://issues.apache.org/jira/browse/SPARK-15611 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Thomas Lau >Priority: Minor > > hi, i'm writing some code as below: > ```py > from random import random > from operator import add > def funcx( x ): > print x[0],x[1] > return 1 if x[0]** 2 + x[1]** 2 < 1 else 0 > def genRnd(ind): > x=random() * 2 - 1 > y=random() * 2 - 1 > return (x,y) > def runsp(total): > ret=sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(lambda x, > y: x + y)/float(total) * 4 > print ret > runsp(3) > ``` > once started the pyspark shell, no matter how many times i run "runsp(N)" , > this code always get a same sequece of random numbers, like this > ```sh > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) > >>> * 4 > 0.896083541418 -0.635625854075 > -0.0423532645466 -0.526910255885 > 0.498518696049 -0.872983895832 > 1. > ``` > i think this is because when we import pyspark.worker in the daemon.py, we > alse import a random by the shuffle.py which is imported by pyspark.worker, > this worker, forked by "pid = os.fork()", also remains the state of the > parent's random, thus every forked worker get the same random.next(). > we need to re-random the random by random.seed, which will solve the problem, > but i think this PR. may not be the proper fix. > ths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305152#comment-15305152 ] Takeshi Yamamuro commented on SPARK-15585: -- okay > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15528) conv function returns inconsistent result for the same data
[ https://issues.apache.org/jira/browse/SPARK-15528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305150#comment-15305150 ] Takeshi Yamamuro commented on SPARK-15528: -- I tried this in master and I could reproduce; {code} import org.apache.spark.sql.functions._ val df = Seq(("", 0), ("", 1)).toDF("a", "b") (0 until 10).map(_ => df.select(countDistinct(conv(df("a"), 16, 10))).show) +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 2| +---+ +---+ |count(DISTINCT conv(a, 16, 10))| +---+ | 1| +---+ {code} Sometimes, we could weirdly get not '1' but '2'. The explain is below; {code} == Physical Plan == *TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Final,isDistinct=true)], output=[count(DISTINCT conv(a, 16, 10))#15L]) +- Exchange SinglePartition, None +- *TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Partial,isDistinct=true)], output=[count#22L]) +- *TungstenAggregate(key=[conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19]) +- Exchange hashpartitioning(conv(a#5, 16, 10)#19, 200), None +- *TungstenAggregate(key=[conv(a#5, 16, 10) AS conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19]) +- LocalTableScan [a#5], [[],[]] {code} > conv function returns inconsistent result for the same data > --- > > Key: SPARK-15528 > URL: https://issues.apache.org/jira/browse/SPARK-15528 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Lior Regev > > When using F.conv to convert a column from a hexadecimal string to an > integer, the results are inconsistent > val col = F.conv(df("some_col"), 16, 10) > val a = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect() > val b = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect() > returns: > a: Array[org.apache.spark.sql.Row] = Array([59776,1941936]) > b: Array[org.apache.spark.sql.Row] = Array([59776,1965154]) > P.S. > "some_col" is a md5 hash of some string column calculated using F.md5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305143#comment-15305143 ] Dilip Biswal commented on SPARK-15634: -- I would like to work on this issue. > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar) > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at org.apache.spark.sql.Dataset.(Dataset.scala:187) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:532) > at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw
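A practical mitigation while the bug stands is to verify that the jar actually exists before issuing the DDL, so a bad path never reaches Hive's SessionState.add_resources and the session is never poisoned. The sketch below only illustrates that idea; the helper name, path, and class name are hypothetical and a local file: URI is assumed.
{code}
import java.net.URI
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Hypothetical guard (not taken from the report): only issue the DDL when the jar exists.
def registerIfJarExists(spark: SparkSession, name: String, className: String, jarUri: String): Unit = {
  if (Files.exists(Paths.get(new URI(jarUri)))) {
    spark.sql(s"""CREATE TEMPORARY FUNCTION $name AS "$className" USING JAR "$jarUri" """)
  } else {
    sys.error(s"Refusing to register $name: $jarUri does not exist")
  }
}
{code}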
[jira] [Resolved] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15610. --- Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/13356 > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
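The gist of the fix is simply that the error message and the check agree on whether the upper bound for {{k}} is inclusive. A minimal sketch of such a check, with {{k}} and {{numFeatures}} as assumed local values rather than the actual PCA code:
{code}
// Illustrative only: the message states the same inclusive bound the check enforces.
require(k >= 1 && k <= numFeatures,
  s"k = $k is out of range: k must be between 1 and numFeatures = $numFeatures inclusive")
{code}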
[jira] [Commented] (SPARK-12550) sbt-launch-lib.bash: line 72: 2404 Killed "$@"
[ https://issues.apache.org/jira/browse/SPARK-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305119#comment-15305119 ] Sean Owen commented on SPARK-12550: --- This is not from the Spark project. I mean, what docs _from the project_ lead to this error? > sbt-launch-lib.bash: line 72: 2404 Killed "$@" > --- > > Key: SPARK-12550 > URL: https://issues.apache.org/jira/browse/SPARK-12550 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04.3 LTS > Scala version 2.10.4 > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) >Reporter: ibrahim yilmaz > > sbt-launch-lib.bash: line 72: 2404 Killed "$@" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305117#comment-15305117 ] Sean Owen commented on SPARK-15619: --- Interesting, looks like it's related to the lz4 library, and I see a similar issue reported for Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-7712 It does create this temp library: https://github.com/jpountz/lz4-java/blob/b69d5676f74344bf04068594644fa5ecc2bb6a67/src/java/net/jpountz/util/Native.java#L81 but seems to do a pretty comprehensive job of trying to clean it up at shutdown. It might be left around after hard JVM failures / exits, in which case it may unfortunately be a side effect of testing failure conditions. I don't see anything in Spark that tries to manage it, and not sure it could. > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15562) Temp directory is not deleted after program exit in DataFrameExample
[ https://issues.apache.org/jira/browse/SPARK-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15562: -- Assignee: ding > Temp directory is not deleted after program exit in DataFrameExample > > > Key: SPARK-15562 > URL: https://issues.apache.org/jira/browse/SPARK-15562 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 >Reporter: ding >Assignee: ding >Priority: Minor > Fix For: 2.0.0 > > > The temp directory used to save records is not deleted after program exit in > DataFrameExample. Although deleteOnExit is called, it does not work because the > directory is not empty. Something similar happens in ContextCleanerSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15562) Temp directory is not deleted after program exit in DataFrameExample
[ https://issues.apache.org/jira/browse/SPARK-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15562. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13328 [https://github.com/apache/spark/pull/13328] > Temp directory is not deleted after program exit in DataFrameExample > > > Key: SPARK-15562 > URL: https://issues.apache.org/jira/browse/SPARK-15562 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 >Reporter: ding >Priority: Minor > Fix For: 2.0.0 > > > The temp directory used to save records is not deleted after program exit in > DataFrameExample. Although deleteOnExit is called, it does not work because the > directory is not empty. Something similar happens in ContextCleanerSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
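For context, java.io.File.deleteOnExit only removes a directory that is empty at JVM shutdown, so a temp directory that still holds output files survives. A hedged sketch of the usual remedy (illustrative only, not the code of pull request 13328) is to delete recursively from a shutdown hook:
{code}
import java.io.File

// File.delete (and therefore deleteOnExit) fails on non-empty directories, so recurse first.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// Assumed example path; register recursive cleanup instead of relying on deleteOnExit.
val tempDir = new File(System.getProperty("java.io.tmpdir"), "dataframe-example")
sys.addShutdownHook(deleteRecursively(tempDir))
{code}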
[jira] [Updated] (SPARK-15449) MLlib NaiveBayes example in Java uses wrong data format
[ https://issues.apache.org/jira/browse/SPARK-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15449: -- Assignee: Miao Wang > MLlib NaiveBayes example in Java uses wrong data format > --- > > Key: SPARK-15449 > URL: https://issues.apache.org/jira/browse/SPARK-15449 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 1.6.1 >Reporter: Kiran Biradarpatil >Assignee: Miao Wang >Priority: Minor > Fix For: 2.0.0 > > > The Java example given for MLlib NaiveBayes at > http://spark.apache.org/docs/latest/mllib-naive-bayes.html expects the data > in LibSVM format. But the example data file in MLlib, > data/mllib/sample_naive_bayes_data.txt, is not in the right format. > So please fix either the sample data file or the implementation example. > Thanks! > Kiran -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15449) MLlib NaiveBayes example in Java uses wrong data format
[ https://issues.apache.org/jira/browse/SPARK-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15449. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13301 [https://github.com/apache/spark/pull/13301] > MLlib NaiveBayes example in Java uses wrong data format > --- > > Key: SPARK-15449 > URL: https://issues.apache.org/jira/browse/SPARK-15449 > Project: Spark > Issue Type: Documentation > Components: Examples >Affects Versions: 1.6.1 >Reporter: Kiran Biradarpatil >Priority: Minor > Fix For: 2.0.0 > > > The Java example given for MLlib NaiveBayes at > http://spark.apache.org/docs/latest/mllib-naive-bayes.html expects the data > in LibSVM format. But the example data file in MLlib, > data/mllib/sample_naive_bayes_data.txt, is not in the right format. > So please fix either the sample data file or the implementation example. > Thanks! > Kiran -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
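As the report notes, the documented example expects LibSVM input. MLUtils.loadLibSVMFile is the standard reader for that format, where each line is a label followed by sorted 1-based index:value pairs. A small illustration, assuming an existing SparkContext and using the LibSVM sample that ships with Spark rather than the NaiveBayes file the report complains about:
{code}
import org.apache.spark.mllib.util.MLUtils

// A LibSVM line looks like: "1 2:0.5 7:1.0" (label, then 1-based index:value pairs).
// `sc` is an existing SparkContext; the path is the sample bundled in the Spark source tree.
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
examples.take(1).foreach(println)
{code}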
[jira] [Reopened] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-15607: --- > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15610: -- Priority: Trivial (was: Minor) Component/s: Documentation > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Reporter: zhengruifeng >Priority: Trivial > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15607. --- Resolution: Not A Problem > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15549) Disable bucketing when the output doesn't contain all bucketing columns
[ https://issues.apache.org/jira/browse/SPARK-15549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi updated SPARK-15549: -- Summary: Disable bucketing when the output doesn't contain all bucketing columns (was: Bucket column only need to be found in the output of relation when use bucketed table) > Disable bucketing when the output doesn't contain all bucketing columns > --- > > Key: SPARK-15549 > URL: https://issues.apache.org/jira/browse/SPARK-15549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yadong Qi > > I create a bucketed table test(i int, j int, k int) with bucket column i, > {code:java} > case class Data(i: Int, j: Int, k: Int) > sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, > x._3)).toDF.write.bucketBy(2, "i").saveAsTable("test") > {code} > and I run the following SQL: > {code:sql} > SELECT j FROM test; > Error in query: bucket column i not found in existing columns (j); > SELECT j, MAX(k) FROM test GROUP BY j; > Error in query: bucket column i not found in existing columns (j, k); > {code} > I think the bucket column only needs to be found in the output of the relation, so > the two SQL statements shown here should execute correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12550) sbt-launch-lib.bash: line 72: 2404 Killed "$@"
[ https://issues.apache.org/jira/browse/SPARK-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305095#comment-15305095 ] Greg Silverman edited comment on SPARK-12550 at 5/28/16 1:45 AM: - I am having the same exact issue on Debian 7.10 wheezy. I'm following these directions: http://www.mfactorengineering.com/blog/2015/spark/ . was (Author: horcle_buzz): I am having the same exact issue on Debian 7.10 wheezy. I'm following these directions: http://www.mfactorengineering.com/blog/2015/spark/ Scala is version 2.11.8, if it matters... > sbt-launch-lib.bash: line 72: 2404 Killed "$@" > --- > > Key: SPARK-12550 > URL: https://issues.apache.org/jira/browse/SPARK-12550 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04.3 LTS > Scala version 2.10.4 > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) >Reporter: ibrahim yilmaz > > sbt-launch-lib.bash: line 72: 2404 Killed "$@" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12550) sbt-launch-lib.bash: line 72: 2404 Killed "$@"
[ https://issues.apache.org/jira/browse/SPARK-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305095#comment-15305095 ] Greg Silverman commented on SPARK-12550: I am having the same exact issue on Debian 7.10 wheezy. I'm following these directions: http://www.mfactorengineering.com/blog/2015/spark/ Scala is version 2.11.8, if it matters... > sbt-launch-lib.bash: line 72: 2404 Killed "$@" > --- > > Key: SPARK-12550 > URL: https://issues.apache.org/jira/browse/SPARK-12550 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04.3 LTS > Scala version 2.10.4 > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) >Reporter: ibrahim yilmaz > > sbt-launch-lib.bash: line 72: 2404 Killed "$@" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305089#comment-15305089 ] zhengruifeng commented on SPARK-15617: -- I can work on this > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15617) Clarify that fMeasure in MulticlassMetrics and MulticlassClassificationEvaluator is "micro" f1_score
[ https://issues.apache.org/jira/browse/SPARK-15617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305086#comment-15305086 ] zhengruifeng commented on SPARK-15617: -- Revolutions(http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html#micro) also call it `Micro-averaged Metrics` > Clarify that fMeasure in MulticlassMetrics and > MulticlassClassificationEvaluator is "micro" f1_score > > > Key: SPARK-15617 > URL: https://issues.apache.org/jira/browse/SPARK-15617 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > See description in sklearn docs: > [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html] > I believe we are calculating the "micro" average for {{val fMeasure: > Double}}. We should clarify this in the docs. > I'm not sure if "micro" is a common term, so we should check other libraries > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
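For reference, "micro" averaging pools the per-class confusion counts over all classes before computing the metric, which for single-label multiclass data makes micro precision, micro recall, and micro F1 all coincide with overall accuracy. A standard formulation (not quoted from the Spark docs):
{code}
P_{\text{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad
R_{\text{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}, \qquad
F1_{\text{micro}} = \frac{2\, P_{\text{micro}}\, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
{code}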
[jira] [Commented] (SPARK-15637) SparkR tests failing on R 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305077#comment-15305077 ] Apache Spark commented on SPARK-15637: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13369 > SparkR tests failing on R 3.2.2 > --- > > Key: SPARK-15637 > URL: https://issues.apache.org/jira/browse/SPARK-15637 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > As discussed in SPARK-15439 > I think we have an issue here - I"m running R 3.2.2 and the mask tests are > failing because: > > R.version$minor > [1] "2.2" > And this is not strict enough? > if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) > { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) > namesOfMaskedCompletely <- c("endsWith", "startsWith", > namesOfMaskedCompletely) } > 1. Failure: Check masked functions (@test_context.R#35) > > length(maskedBySparkR) not equal to length(namesOfMasked). > 1/1 mismatches > [1] 20 - 22 == -2 > 2. Failure: Check masked functions (@test_context.R#36) > > sort(maskedBySparkR) not equal to sort(namesOfMasked). > Lengths differ: 20 vs 22 > 3. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 4. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15637) SparkR tests failing on R 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15637: Assignee: (was: Apache Spark) > SparkR tests failing on R 3.2.2 > --- > > Key: SPARK-15637 > URL: https://issues.apache.org/jira/browse/SPARK-15637 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > > As discussed in SPARK-15439 > I think we have an issue here - I"m running R 3.2.2 and the mask tests are > failing because: > > R.version$minor > [1] "2.2" > And this is not strict enough? > if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) > { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) > namesOfMaskedCompletely <- c("endsWith", "startsWith", > namesOfMaskedCompletely) } > 1. Failure: Check masked functions (@test_context.R#35) > > length(maskedBySparkR) not equal to length(namesOfMasked). > 1/1 mismatches > [1] 20 - 22 == -2 > 2. Failure: Check masked functions (@test_context.R#36) > > sort(maskedBySparkR) not equal to sort(namesOfMasked). > Lengths differ: 20 vs 22 > 3. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 4. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15637) SparkR tests failing on R 3.2.2
[ https://issues.apache.org/jira/browse/SPARK-15637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15637: Assignee: Apache Spark > SparkR tests failing on R 3.2.2 > --- > > Key: SPARK-15637 > URL: https://issues.apache.org/jira/browse/SPARK-15637 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Assignee: Apache Spark > > As discussed in SPARK-15439 > I think we have an issue here - I"m running R 3.2.2 and the mask tests are > failing because: > > R.version$minor > [1] "2.2" > And this is not strict enough? > if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) > { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) > namesOfMaskedCompletely <- c("endsWith", "startsWith", > namesOfMaskedCompletely) } > 1. Failure: Check masked functions (@test_context.R#35) > > length(maskedBySparkR) not equal to length(namesOfMasked). > 1/1 mismatches > [1] 20 - 22 == -2 > 2. Failure: Check masked functions (@test_context.R#36) > > sort(maskedBySparkR) not equal to sort(namesOfMasked). > Lengths differ: 20 vs 22 > 3. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 4. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15637) SparkR tests failing on R 3.2.2
Felix Cheung created SPARK-15637: Summary: SparkR tests failing on R 3.2.2 Key: SPARK-15637 URL: https://issues.apache.org/jira/browse/SPARK-15637 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Felix Cheung As discussed in SPARK-15439 I think we have an issue here - I"m running R 3.2.2 and the mask tests are failing because: > R.version$minor [1] "2.2" And this is not strict enough? if (as.numeric(R.version$major) == 3 && as.numeric(R.version$minor) > 2) { namesOfMasked <- c("endsWith", "startsWith", namesOfMasked) namesOfMaskedCompletely <- c("endsWith", "startsWith", namesOfMaskedCompletely) } 1. Failure: Check masked functions (@test_context.R#35) length(maskedBySparkR) not equal to length(namesOfMasked). 1/1 mismatches [1] 20 - 22 == -2 2. Failure: Check masked functions (@test_context.R#36) sort(maskedBySparkR) not equal to sort(namesOfMasked). Lengths differ: 20 vs 22 3. Failure: Check masked functions (@test_context.R#44) length(maskedCompletely) not equal to length(namesOfMaskedCompletely). 1/1 mismatches [1] 3 - 5 == -2 4. Failure: Check masked functions (@test_context.R#45) sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). Lengths differ: 3 vs 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15557: - Target Version/s: 2.0.0 Description: expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK I find that maybe it will be null if the result is more than 100 was: expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK I find that maybe it will be null if the result is more than 100 > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
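A self-contained way to check the report, assuming a 2.0 spark-shell where the {{spark}} session is predefined (on 1.6 the same statements go through {{sqlContext.sql}}); the results are the reported behavior, not asserted here:
{code}
// Reproduction sketch for the two expressions from the report.
spark.sql("select (cast(99 as decimal(19,6)) + '3') * '2.3'").show()  // reported to return null
spark.sql("select (cast(40 as decimal(19,6)) + '3') * '2.3'").show()  // reported to work
{code}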
[jira] [Resolved] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
[ https://issues.apache.org/jira/browse/SPARK-15594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15594. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13343 [https://github.com/apache/spark/pull/13343] > ALTER TABLE ... SERDEPROPERTIES does not respect partition spec > --- > > Key: SPARK-15594 > URL: https://issues.apache.org/jira/browse/SPARK-15594 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > {code} > case class AlterTableSerDePropertiesCommand( > tableName: TableIdentifier, > serdeClassName: Option[String], > serdeProperties: Option[Map[String, String]], > partition: Option[Map[String, String]]) > extends RunnableCommand { > {code} > The `partition` flag is not read anywhere! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
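To make the gap concrete: a statement of the following shape carries a partition spec that the command should honor, but before the fix the parsed `partition` argument was never used. Table, partition, and property names below are made up for illustration.
{code}
// Hypothetical example of the affected DDL form.
spark.sql("""
  ALTER TABLE logs PARTITION (dt = '2016-05-27')
  SET SERDEPROPERTIES ('field.delim' = ',')
""")
{code}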
[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14343: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-15631 > Dataframe operations on a partitioned dataset (using partition discovery) > return invalid results > > > Key: SPARK-14343 > URL: https://issues.apache.org/jira/browse/SPARK-14343 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS >Reporter: Jurriaan Pruis >Priority: Blocker > > When reading a dataset using {{sqlContext.read.text()}} queries on the > partitioned column return invalid results. > h2. How to reproduce: > h3. Generate datasets > {code:title=repro.sh} > #!/bin/sh > mkdir -p dataset/year=2014 > mkdir -p dataset/year=2015 > echo "data from 2014" > dataset/year=2014/part01.txt > echo "data from 2015" > dataset/year=2015/part01.txt > {code} > {code:title=repro2.sh} > #!/bin/sh > mkdir -p dataset2/month=june > mkdir -p dataset2/month=july > echo "data from june" > dataset2/month=june/part01.txt > echo "data from july" > dataset2/month=july/part01.txt > {code} > h3. using first dataset > {code:none} > >>> df = sqlContext.read.text('dataset') > ... > >>> df > DataFrame[value: string, year: int] > >>> df.show() > +--++ > | value|year| > +--++ > |data from 2014|2014| > |data from 2015|2015| > +--++ > >>> df.select('year').show() > ++ > |year| > ++ > | 14| > | 14| > ++ > {code} > This is clearly wrong. Seems like it returns the length of the value column? > h3. using second dataset > With another dataset it looks like this: > {code:none} > >>> df = sqlContext.read.text('dataset2') > >>> df > DataFrame[value: string, month: string] > >>> df.show() > +--+-+ > | value|month| > +--+-+ > |data from june| june| > |data from july| july| > +--+-+ > >>> df.select('month').show() > +--+ > | month| > +--+ > |data from june| > |data from july| > +--+ > {code} > Here it returns the value of the value column instead of the month partition. > h3. Workaround > When I convert the dataframe to an RDD and back to a DataFrame I get the > following result (which is the expected behaviour): > {code:none} > >>> df.rdd.toDF().select('month').show() > +-+ > |month| > +-+ > | june| > | july| > +-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Description: error message for {{k}} should match the bound (was: Vector size must be greater than {{k}}, but now it support {{k == vector.size}}) > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > error message for {{k}} should match the bound -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) PCA should not support k == numFeatures
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Priority: Minor (was: Major) > PCA should not support k == numFeatures > --- > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Vector size must be greater than {{k}}, but now it supports {{k == > vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15610) update error message for k in pca
[ https://issues.apache.org/jira/browse/SPARK-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-15610: - Summary: update error message for k in pca (was: PCA should not support k == numFeatures) > update error message for k in pca > - > > Key: SPARK-15610 > URL: https://issues.apache.org/jira/browse/SPARK-15610 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: zhengruifeng >Priority: Minor > > Vector size must be greater than {{k}}, but now it supports {{k == > vector.size}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15632: --- Description: Filter operations should never change query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. # Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. was: Filter operations should never changes query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. 
# Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Filter operations should never change query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', '
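The wrapping described in that issue is visible directly in the query plan. A rough way to see it, with {{ds1}} being the Dataset[A] from the description (plan output abbreviated and only indicative of shape, not captured from a real run):
{code}
// The typed filter adds the object round-trip around the Filter operator.
val ds2 = ds1.filter(_.b > 1)
ds2.explain(true)
// Expected shape (approximate):
// SerializeFromObject [... b, a ...]
// +- Filter <function>.apply
//    +- DeserializeToObject newInstance(class A)
//       +- ... original plan with columns a, b, c ...
{code}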
[jira] [Updated] (SPARK-9876) Upgrade parquet-mr to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9876: -- Assignee: Ryan Blue > Upgrade parquet-mr to 1.8.1 > --- > > Key: SPARK-9876 > URL: https://issues.apache.org/jira/browse/SPARK-9876 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Ryan Blue > Fix For: 2.0.0 > > > {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example > PARQUET-201 (SPARK-9407). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9876) Upgrade parquet-mr to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9876. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13280 [https://github.com/apache/spark/pull/13280] > Upgrade parquet-mr to 1.8.1 > --- > > Key: SPARK-9876 > URL: https://issues.apache.org/jira/browse/SPARK-9876 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Ryan Blue > Fix For: 2.0.0 > > > {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example > PARQUET-201 (SPARK-9407). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15607) Remove redundant toArray in ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15607. Resolution: Won't Fix > Remove redundant toArray in ml.linalg > - > > Key: SPARK-15607 > URL: https://issues.apache.org/jira/browse/SPARK-15607 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {{sliceInds, sliceVals}} are already of type {{Array}}, so remove {{toArray}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15291) Remove redundant codes in SVD++
[ https://issues.apache.org/jira/browse/SPARK-15291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-15291. Resolution: Won't Fix > Remove redundant codes in SVD++ > --- > > Key: SPARK-15291 > URL: https://issues.apache.org/jira/browse/SPARK-15291 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: zhengruifeng >Priority: Minor > > {code} > val newVertices = g.vertices.mapValues(v => (v._1.toArray, v._2.toArray, > v._3, v._4)) > (Graph(newVertices, g.edges), u) > {code} > is just the same as > {code} > (g, u) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15557: Assignee: (was: Apache Spark) > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304994#comment-15304994 ] Apache Spark commented on SPARK-15557: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/13368 > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15557: Assignee: Apache Spark > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai >Assignee: Apache Spark > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1)
[ https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9576: --- Target Version/s: 2.1.0 (was: 2.0.0) > DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1) > - > > Key: SPARK-9576 > URL: https://issues.apache.org/jira/browse/SPARK-9576 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1)
[ https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9576: --- Summary: DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1) (was: DataFrame API improvement umbrella ticket (Spark 2.0)) > DataFrame API improvement umbrella ticket (Spark 2.0 and 2.1) > - > > Key: SPARK-9576 > URL: https://issues.apache.org/jira/browse/SPARK-9576 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15636: Description: Aggregate expressions have very long string representations in explain outputs. For more information, see the description here: https://github.com/apache/spark/pull/13367 was: Aggregate expressions have very long string representations in explain outputs. [I will fill in more details in a bit] > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Aggregate expressions have very long string representations in explain > outputs. For more information, see the description here: > https://github.com/apache/spark/pull/13367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304934#comment-15304934 ] Apache Spark commented on SPARK-15636: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13367 > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Aggregate expressions have very long string representations in explain > outputs. > [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15636: Assignee: Reynold Xin (was: Apache Spark) > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Aggregate expressions have very long string representations in explain > outputs. > [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15636) Make aggregate expressions more concise in explain
[ https://issues.apache.org/jira/browse/SPARK-15636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15636: Assignee: Apache Spark (was: Reynold Xin) > Make aggregate expressions more concise in explain > -- > > Key: SPARK-15636 > URL: https://issues.apache.org/jira/browse/SPARK-15636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Aggregate expressions have very long string representations in explain > outputs. > [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15636) Make aggregate expressions more concise in explain
Reynold Xin created SPARK-15636: --- Summary: Make aggregate expressions more concise in explain Key: SPARK-15636 URL: https://issues.apache.org/jira/browse/SPARK-15636 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Aggregate expressions have very long string representations in explain outputs. [I will fill in more details in a bit] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15635) ALTER TABLE RENAME doesn't work for datasource tables
Andrew Or created SPARK-15635: - Summary: ALTER TABLE RENAME doesn't work for datasource tables Key: SPARK-15635 URL: https://issues.apache.org/jira/browse/SPARK-15635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or {code} scala> sql("CREATE TABLE students (age INT, name STRING) USING parquet") scala> sql("ALTER TABLE students RENAME TO teachers") scala> spark.table("teachers").show() com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:67) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:583) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:579) ... 48 elided Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:351) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:340) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
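The stack trace suggests the renamed catalog entry still carries the old data path, so the lookup of the new name resolves to the pre-rename location. An illustrative way to confirm that from the shell, reusing the table name from the repro above:
{code}
// After the failing rename, inspect what location the catalog recorded;
// before the fix it is expected to still point at the old .../students directory.
spark.sql("DESCRIBE FORMATTED teachers").show(100, false)
{code}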
[jira] [Comment Edited] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304911#comment-15304911 ] shane knapp edited comment on SPARK-15619 at 5/27/16 10:33 PM: --- next time we have a maintenance, i will wipe /tmp completely so that we can at least try and see what's creating what... right now it's such a mess that it's hard to attribute anything to anything. i did watch as a spark build (spark-master-test-maven-hadoop-2.7, IIRC) dump a bunch of the liblz4-java3891256912513794605.so files in /tmp this morning on worker-08 (the number string changes for each file). and, so far today, on just one worker, we've had 2628 of these files left in /tmp: [root@amp-jenkins-worker-08 tmp]# ls -lt | grep liblz4 |grep "May 27" | wc -l 2628 i'm not worried about us running out of disk, and this is something i can manage on the system-level, but it'd still be nice to have well behaved tests. :) was (Author: shaneknapp): next time we have a maintenance, i will wipe /tmp completely so that we can at least try and see what's creating what... right now it's such a mess that it's hard to attribute anything to anything. i did watch as a spark build (spark-master-test-maven-hadoop-2.7, IIRC) dump a bunch of the liblz4-java3891256912513794605.so files in /tmp this morning on worker-08 (the number string changes for each file). and, so far today, we've had 2628 of these files left in /tmp: [root@amp-jenkins-worker-08 tmp]# ls -lt | grep liblz4 |grep "May 27" | wc -l 2628 i'm not worried about us running out of disk, and this is something i can manage on the system-level, but it'd still be nice to have well behaved tests. :) > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15450. --- Resolution: Fixed Fix Version/s: 2.0.0 > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15534) TRUNCATE TABLE should throw exceptions, not logError
[ https://issues.apache.org/jira/browse/SPARK-15534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15534. --- Resolution: Fixed Fix Version/s: 2.0.0 > TRUNCATE TABLE should throw exceptions, not logError > > > Key: SPARK-15534 > URL: https://issues.apache.org/jira/browse/SPARK-15534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > If the table to truncate doesn't exist, throw an exception! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15535) Remove code for TRUNCATE TABLE ... COLUMN
[ https://issues.apache.org/jira/browse/SPARK-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15535. --- Resolution: Fixed Fix Version/s: 2.0.0 > Remove code for TRUNCATE TABLE ... COLUMN > - > > Key: SPARK-15535 > URL: https://issues.apache.org/jira/browse/SPARK-15535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > This was never supported in the first place. Also Hive doesn't support it: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15450: -- Assignee: Eric Liang (was: Andrew Or) > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304911#comment-15304911 ] shane knapp commented on SPARK-15619: - next time we have a maintenance, i will wipe /tmp completely so that we can at least try and see what's creating what... right now it's such a mess that it's hard to attribute anything to anything. i did watch as a spark build (spark-master-test-maven-hadoop-2.7, IIRC) dump a bunch of the liblz4-java3891256912513794605.so files in /tmp this morning on worker-08 (the number string changes for each file). and, so far today, we've had 2628 of these files left in /tmp: [root@amp-jenkins-worker-08 tmp]# ls -lt | grep liblz4 |grep "May 27" | wc -l 2628 i'm not worried about us running out of disk, and this is something i can manage on the system-level, but it'd still be nice to have well behaved tests. :) > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
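The liblz4-java*.so files mentioned above are the native library that lz4-java extracts into java.io.tmpdir when it loads. One way a build can keep that out of the shared /tmp (a sketch, not necessarily how the Spark build is configured) is to point the forked test JVMs at a directory inside the build workspace, e.g. in sbt:
{code}
// build.sbt sketch: route temp files, including extracted native libs, to target/tmp
fork in Test := true
javaOptions in Test += s"-Djava.io.tmpdir=${(target.value / "tmp").getAbsolutePath}"
// note: the directory must exist before the tests start, e.g. created by a setup step
{code}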
[jira] [Commented] (SPARK-15622) Janino's classloader has an unexpected behavior when its parent classloader throws a ClassNotFoundException with a cause set
[ https://issues.apache.org/jira/browse/SPARK-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304905#comment-15304905 ] Apache Spark commented on SPARK-15622: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13366 > Janino's classloader has an unexpected behavior when its parent classloader > throws an ClassNotFoundException with a cause set > - > > Key: SPARK-15622 > URL: https://issues.apache.org/jira/browse/SPARK-15622 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > At > https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, > Janino's classloader throws the exception when its parent throws a > ClassNotFoundException with a cause set. However, it does not throw the > exception when there is no cause set. Seems we need to create a special > ClassLoader to wrap the actual parent classloader set to Janino handle this > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15622) Janino's classloader has an unexpected behavior when its parent classloader throws a ClassNotFoundException with a cause set
[ https://issues.apache.org/jira/browse/SPARK-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15622: Assignee: Apache Spark (was: Yin Huai) > Janino's classloader has an unexpected behavior when its parent classloader > throws an ClassNotFoundException with a cause set > - > > Key: SPARK-15622 > URL: https://issues.apache.org/jira/browse/SPARK-15622 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > At > https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, > Janino's classloader throws the exception when its parent throws a > ClassNotFoundException with a cause set. However, it does not throw the > exception when there is no cause set. Seems we need to create a special > ClassLoader to wrap the actual parent classloader set to Janino handle this > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15622) Janino's classloader has an unexpected behavior when its parent classloader throws a ClassNotFoundException with a cause set
[ https://issues.apache.org/jira/browse/SPARK-15622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15622: Assignee: Yin Huai (was: Apache Spark) > Janino's classloader has an unexpected behavior when its parent classloader > throws an ClassNotFoundException with a cause set > - > > Key: SPARK-15622 > URL: https://issues.apache.org/jira/browse/SPARK-15622 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > At > https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, > Janino's classloader throws the exception when its parent throws a > ClassNotFoundException with a cause set. However, it does not throw the > exception when there is no cause set. Seems we need to create a special > ClassLoader to wrap the actual parent classloader set to Janino handle this > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
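The "special ClassLoader" idea from the description can be illustrated with a small Scala sketch; this only shows the approach and is not the code in the linked pull request:
{code}
// Sketch: strip the cause so Janino's ClassLoaderIClassLoader takes its
// normal "class not found" path instead of propagating the exception.
class CauseFreeClassLoader(parent: ClassLoader) extends ClassLoader(parent) {
  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    try super.loadClass(name, resolve)
    catch {
      case _: ClassNotFoundException =>
        // Re-throw without a cause, which Janino treats as "not found".
        throw new ClassNotFoundException(name)
    }
  }
}
{code}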
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated SPARK-15489: -- Description: When setting a custom "spark.kryo.registrator" (or any other configuration for that matter) through the API, this configuration will not propagate to the encoder that uses a KryoSerializer since it instantiates with "new SparkConf()". See: https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 This could be hacked by providing those configurations as System properties, but this probably should be passed to the encoder and set in the SerializerInstance after creation. Example: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. was: When setting a custom "spark.kryo.registrator" (or any other configuration for that matter) through the API, this configuration will not propagate to the encoder that uses a KryoSerializer since it instantiates with "new SparkConf()". See: Example: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated SPARK-15489: -- Description: When setting a custom "spark.kryo.registrator" (or any other configuration for that matter) through the API, this configuration will not propagate to the encoder that uses a KryoSerializer since it instantiates with "new SparkConf()". See: Example: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. was: When using Encoders with kryo to encode generically typed Objects in the following manner: public static Encoder encoder() { return Encoders.kryo((Class) Object.class); } I get a decoding exception when trying to decode `java.util.Collections$UnmodifiableCollection`, which probably comes from Guava's `ImmutableList`. This happens when running with master = local[1]. Same code had no problems with RDD api. > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304898#comment-15304898 ] Bryan Cutler edited comment on SPARK-15623 at 5/27/16 10:11 PM: I was only able to quickly go though the user guide and api docs, but did not see any breaking api changes and just a couple missing params that were being worked on currently. It might be good if someone else is able to take a look, otherwise I can give it a more thorough pass when I get a chance. The discrepancies I found in the docs went in PR [#13159|https://github.com/apache/spark/pull/13159] was (Author: bryanc): I was only able to quickly go though the user guide and api docs, but did not see any breaking api changes and just a couple missing params that were being worked on currently. It might be good if someone else is able to take a look, otherwise I can give it a more thorough pass when I get a chance. The discrepancies I found in the docs went in PR #13159 > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304898#comment-15304898 ] Bryan Cutler commented on SPARK-15623: -- I was only able to quickly go though the user guide and api docs, but did not see any breaking api changes and just a couple missing params that were being worked on currently. It might be good if someone else is able to take a look, otherwise I can give it a more thorough pass when I get a chance. The discrepancies I found in the docs went in PR #13159 > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304897#comment-15304897 ] Wenchen Fan commented on SPARK-15632: - good catch! we should not implement typed filter in this way, but always embed deserializer in filter condition. We can create a `TypedFilter` and a `TypedFilterWithObject` operator, and optimize it case by case like we did to `AppendColumns` and `AppendColumnWithObject`. In the planner we can just plan typed filter with normal filter physical operator. > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. reordered `a` and `b`, and > // |-- b: double (nullable = true)2. dropped `c`, and > // |-- a: string (nullable = true)3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > This is becase we wraps the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair. > {{DeserializeToObject}} does a bunch of magic tricks here: > # Field order change > #- {{DeserializeToObject}} resolves the encoder deserializer expression by > **name**. Thus field order in input query plan doesn't matter. > # Field number change > #- Same as above, fields not referred by the encoder are silently dropped > while resolving deserializer expressions by name. > # Field data type change > #- When generating deserializer expressions, we allows "sane" implicit > coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} > operators. Thus actual field data types in input query plan don't matter > either as long as there are valid implicit coercions. > Actually, even field names may change once [PR > #13269|https://github.com/apache/spark/pull/13269] gets merged, because it > introduces case-insensitive encoder resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
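Until such a change lands, the schema difference is easy to reproduce side by side, and an untyped filter can be used when only the filtering (not the object view) is needed, since it does not insert the serialize/deserialize pair. A sketch reusing df1 and A from the report, assuming spark.implicits._ is in scope as in the shell:
{code}
// Untyped filter: the plan keeps the original schema (a, b, c).
val kept = df1.filter($"b" > 1)
kept.printSchema()

// Typed filter: goes through the encoder and re-derives the schema from A.
val changed = df1.as[A].filter(_.b > 1)
changed.printSchema()
{code}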
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated SPARK-15489: -- Summary: Dataset kryo encoder won't load custom user settings (was: Dataset kryo encoder fails on Collections$UnmodifiableCollection) > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304893#comment-15304893 ] Amit Sela commented on SPARK-15489: --- The issue here is the fact that setting the SparkConf does not propagate to the KryoSerializer used by the encoder. I managed to make this work by using Java System properties instead of SparkConf#set since the SparkConf constructor will take them into account, but it's a hack... For now I think I'll change the description of the issue, and propose this as a temporary solution. > Dataset kryo encoder fails on Collections$UnmodifiableCollection > > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
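A sketch of the temporary workaround described in the comment above: because the encoder's serializer is built from new SparkConf(), which reads spark.* JVM system properties, the settings can be injected that way before the encoder is used. The registrator class name here is a placeholder:
{code}
// Workaround sketch: these properties are picked up by `new SparkConf()`
// inside the kryo encoder's serializer, unlike values set only via SparkConf#set.
System.setProperty("spark.kryo.registrator", "com.example.MyRegistrator")  // placeholder class
System.setProperty("spark.kryo.registrationRequired", "false")

val enc = org.apache.spark.sql.Encoders.kryo[AnyRef]  // roughly Encoders.kryo(Object.class)
{code}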
[jira] [Assigned] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15618: Assignee: Apache Spark (was: Dongjoon Hyun) > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304891#comment-15304891 ] Apache Spark commented on SPARK-15618: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13365 > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15618: Assignee: Dongjoon Hyun (was: Apache Spark) > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
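A sketch of the test-side pattern the issue asks for; Builder.sparkContext(...) is a Spark-internal hook (private[spark]) rather than a public API, so this applies to Spark's own test suites:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// `sc` stands in for the suite's existing SparkContext.
def sessionFor(sc: SparkContext): SparkSession =
  SparkSession.builder()
    .sparkContext(sc)   // be explicit about the backing context
    .getOrCreate()      // instead of relying on whatever getOrCreate() would find
{code}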
[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304886#comment-15304886 ] Amit Sela commented on SPARK-15489: --- Got it! So I wasn't using the custom registrator correctly, it works better like this: public class ImmutablesRegistrator implements KryoRegistrator { @Override public void registerClasses(Kryo kryo) { UnmodifiableCollectionsSerializer.registerSerializers(kryo); // Guava ImmutableListSerializer.registerSerializers(kryo); ImmutableSetSerializer.registerSerializers(kryo); ImmutableMapSerializer.registerSerializers(kryo); ImmutableMultimapSerializer.registerSerializers(kryo); } } > Dataset kryo encoder fails on Collections$UnmodifiableCollection > > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela > > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304884#comment-15304884 ] Seth Hendrickson commented on SPARK-15581: -- [~BenFradet] See [SPARK-7159|https://issues.apache.org/jira/browse/SPARK-7159] for multinomial logistic regression. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. 
> Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality (SPARK-15571) > * Improve Python API handling of Params, persistence (SPARK-14771) > (SPARK-14706) > ** Note: The linked JIRAs for this are incomplete. More to be created... > ** Related: Implement Python meta-algorithms in Scala (to simplify > persistence) (SPARK-15574) > * Feature parity: The main goal of the Python API is to have feature parity > with the
[jira] [Commented] (SPARK-15619) spark builds filling up /tmp
[ https://issues.apache.org/jira/browse/SPARK-15619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304874#comment-15304874 ] Sean Owen commented on SPARK-15619: --- Although I think we've cleaned up this over time, and even seen a few fixes for this recently, I imagine it will not be perfect. Yes, a cron job sounds like a good solution. If you can post the names of the dirs, maybe that would narrow it down. > spark builds filling up /tmp > > > Key: SPARK-15619 > URL: https://issues.apache.org/jira/browse/SPARK-15619 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: shane knapp >Priority: Minor > > spark builds aren't cleaning up /tmp after they run... it's hard to pinpoint > EXACTLY what is left there by the spark builds (as other builds are also > guilty of doing this), but a quick perusal of the /tmp directory during some > spark builds show that there are myriad empty directories being created and a > massive pile of shared object libraries being dumped there. > $ for x in $(cat jenkins_workers.txt ); do echo $x; ssh $x "ls -l /tmp/*.so | > wc -l"; done > amp-jenkins-worker-01 > 0 > ls: cannot access /tmp/*.so: No such file or directory > amp-jenkins-worker-02 > 22312 > amp-jenkins-worker-03 > 39673 > amp-jenkins-worker-04 > 39548 > amp-jenkins-worker-05 > 39577 > amp-jenkins-worker-06 > 39299 > amp-jenkins-worker-07 > 39315 > amp-jenkins-worker-08 > 38529 > to help combat this, i set up a cron job on each worker that runs tmpwatch > during system downtime on sundays to clean up files older than a week. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304872#comment-15304872 ] Eric Liang edited comment on SPARK-15634 at 5/27/16 9:57 PM: - Note that adding jars in the repl also doesn't work currently, so this issue may be minor (see linked issue). cc [~yhuai] was (Author: ekhliang): Note that adding jars in the repl also doesn't work currently, so this issue may be minor (see linked issue). > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar) > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at org.apache.spark.sql.Datas
[jira] [Commented] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304872#comment-15304872 ] Eric Liang commented on SPARK-15634: Note that adding jars in the repl also doesn't work currently, so this issue may be minor (see linked issue). > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar) > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at org.apache.spark.sql.Dataset.(Dataset.scala:187) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) > at org.apache.spark.sql.SparkSession.sql(S
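Until the catalog state is handled more gracefully, a defensive sketch on the user side is to verify that the jar exists before issuing the statement; the path and class name below are the same placeholders used in the report:
{code}
import java.nio.file.{Files, Paths}

val jar = "/path/to/example.jar"
if (Files.exists(Paths.get(jar))) {
  spark.sql(s"CREATE TEMPORARY FUNCTION x AS 'com.example.functions.Function' USING JAR 'file://$jar'")
} else {
  sys.error(s"refusing to register function: $jar does not exist")
}
{code}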
[jira] [Commented] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304873#comment-15304873 ] Cheng Lian commented on SPARK-15632: cc [~cloud_fan] [~marmbrus] > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Dataset typed filter operation changes query plan schema > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( > "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", > "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", > "{ 'a': 'bar', 'c': 'extra' }" > ) > val df1 = spark.read.json(sc.parallelize(data)) > df1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds1 = df1.as[A] > ds1.printSchema() > // root > // |-- a: string (nullable = true) > // |-- b: long (nullable = true) > // |-- c: string (nullable = true) > val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker > ds2.printSchema() > // root <- 1. reordered `a` and `b`, and > // |-- b: double (nullable = true)2. dropped `c`, and > // |-- a: string (nullable = true)3. up-casted `b` from long to double > val df2 = ds2.toDF() > df2.printSchema() > // root <- (Same as above) > // |-- b: double (nullable = true) > // |-- a: string (nullable = true) > {code} > This is becase we wraps the actual {{Filter}} operator with a > {{SerializeFromObject}}/{{DeserializeToObject}} pair. > {{DeserializeToObject}} does a bunch of magic tricks here: > # Field order change > #- {{DeserializeToObject}} resolves the encoder deserializer expression by > **name**. Thus field order in input query plan doesn't matter. > # Field number change > #- Same as above, fields not referred by the encoder are silently dropped > while resolving deserializer expressions by name. > # Field data type change > #- When generating deserializer expressions, we allows "sane" implicit > coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} > operators. Thus actual field data types in input query plan don't matter > either as long as there are valid implicit coercions. > Actually, even field names may change once [PR > #13269|https://github.com/apache/spark/pull/13269] gets merged, because it > introduces case-insensitive encoder resolution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15632) Dataset typed filter operation changes query plan schema
[ https://issues.apache.org/jira/browse/SPARK-15632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15632: --- Description: Filter operations should never changes query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. # Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. was: Dataset typed filter operation changes query plan schema Filter operations should never changes query plan schema. However, Dataset typed filter operation does introduce schema change: {code} case class A(b: Double, a: String) val data = Seq( "{ 'a': 'foo', 'b': 1, 'c': 'extra' }", "{ 'a': 'bar', 'b': 2, 'c': 'extra' }", "{ 'a': 'bar', 'c': 'extra' }" ) val df1 = spark.read.json(sc.parallelize(data)) df1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds1 = df1.as[A] ds1.printSchema() // root // |-- a: string (nullable = true) // |-- b: long (nullable = true) // |-- c: string (nullable = true) val ds2 = ds1.filter(_.b > 1)// <- Here comes the trouble maker ds2.printSchema() // root <- 1. reordered `a` and `b`, and // |-- b: double (nullable = true)2. dropped `c`, and // |-- a: string (nullable = true)3. up-casted `b` from long to double val df2 = ds2.toDF() df2.printSchema() // root <- (Same as above) // |-- b: double (nullable = true) // |-- a: string (nullable = true) {code} This is becase we wraps the actual {{Filter}} operator with a {{SerializeFromObject}}/{{DeserializeToObject}} pair. {{DeserializeToObject}} does a bunch of magic tricks here: # Field order change #- {{DeserializeToObject}} resolves the encoder deserializer expression by **name**. Thus field order in input query plan doesn't matter. 
# Field number change #- Same as above, fields not referred by the encoder are silently dropped while resolving deserializer expressions by name. # Field data type change #- When generating deserializer expressions, we allows "sane" implicit coercions (e.g. integer to long, and long to double) by inserting {{UpCast}} operators. Thus actual field data types in input query plan don't matter either as long as there are valid implicit coercions. Actually, even field names may change once [PR #13269|https://github.com/apache/spark/pull/13269] gets merged, because it introduces case-insensitive encoder resolution. > Dataset typed filter operation changes query plan schema > > > Key: SPARK-15632 > URL: https://issues.apache.org/jira/browse/SPARK-15632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Filter operations should never changes query plan schema. However, Dataset > typed filter operation does introduce schema change: > {code} > case class A(b: Double, a: String) > val data = Seq( >