[jira] [Updated] (SPARK-25427) Add BloomFilter creation test cases
[ https://issues.apache.org/jira/browse/SPARK-25427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25427: -- Component/s: Tests > Add BloomFilter creation test cases > --- > > Key: SPARK-25427 > URL: https://issues.apache.org/jira/browse/SPARK-25427 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Spark supports BloomFilter creation for ORC files. This issue aims to add > test coverage to prevent regressions like SPARK-12417 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
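For readers unfamiliar with the feature under test: a Bloom filter answers "might this value be present?" with possible false positives but never false negatives, which is what makes it safe for skipping ORC row groups. A minimal pure-Python sketch (illustrative only, not Spark's or ORC's implementation; the size and hash scheme are arbitrary):

```python
# Illustrative pure-Python Bloom filter (NOT Spark's or ORC's implementation;
# size and hash scheme are arbitrary). The key property: membership tests may
# return false positives, but never false negatives, so a reader can safely
# skip a row group when might_contain() is False.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for value in ["alice", "bob"]:
    bf.add(value)
print(bf.might_contain("alice"))  # True: added values are always reported
```

On the writer side, Spark's ORC data source accepts writer options such as `orc.bloom.filter.columns` and `orc.bloom.filter.fpp`; tests like those proposed here would guard that path against regressions like SPARK-12417.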
[jira] [Resolved] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.
[ https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25418. - Resolution: Fixed Fix Version/s: 3.0.0 > The metadata of DataSource table should not include Hive-generated storage > properties. > -- > > Key: SPARK-25418 > URL: https://issues.apache.org/jira/browse/SPARK-25418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > When Hive support is enabled, the Hive catalog puts extra storage properties into > table metadata even for DataSource tables, but we should not have them.
[jira] [Assigned] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.
[ https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25418: --- Assignee: Takuya Ueshin > The metadata of DataSource table should not include Hive-generated storage > properties. > -- > > Key: SPARK-25418 > URL: https://issues.apache.org/jira/browse/SPARK-25418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > When Hive support is enabled, the Hive catalog puts extra storage properties into > table metadata even for DataSource tables, but we should not have them.
[jira] [Created] (SPARK-25427) Add BloomFilter creation test cases
Dongjoon Hyun created SPARK-25427: - Summary: Add BloomFilter creation test cases Key: SPARK-25427 URL: https://issues.apache.org/jira/browse/SPARK-25427 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.4.0 Reporter: Dongjoon Hyun Spark supports BloomFilter creation for ORC files. This issue aims to add test coverage to prevent regressions like SPARK-12417
[jira] [Updated] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24498: Target Version/s: 3.0.0 > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
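As background on what "runtime codegen" means here: Spark generates query-stage source at runtime and compiles it before execution. A conceptual sketch in plain Python, used only because it is self-contained (Spark actually emits Java and compiles it with Janino; this issue proposes the JDK compiler as a second backend):

```python
# Conceptual generate -> compile -> execute cycle. This is a Python analogy,
# not Spark's mechanism: Spark emits Java source, and the compile step below
# is where the Janino-vs-JDK trade-off (bytecode size, compile time) lives.
generated_src = """
def add_one(x):
    return x + 1
"""

namespace = {}
exec(compile(generated_src, "<generated>", "exec"), namespace)

print(namespace["add_one"](41))  # 42
```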
[jira] [Commented] (SPARK-23906) Add UDF trunc(numeric)
[ https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614307#comment-16614307 ] Yuming Wang commented on SPARK-23906: - cc [~dongjoon] It's difficult to reuse {{trunc}} for truncating numbers. How about introducing a new name: {{truncate}}? {code:sql} mysql> SELECT TRUNCATE(1.223,1); -> 1.2 mysql> SELECT TRUNCATE(1.999,1); -> 1.9 mysql> SELECT TRUNCATE(1.999,0); -> 1 mysql> SELECT TRUNCATE(-1.999,1); -> -1.9 mysql> SELECT TRUNCATE(122,-2); -> 100 mysql> SELECT TRUNCATE(10.28*100,0); -> 1028 {code} [https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_truncate] > Add UDF trunc(numeric) > -- > > Key: SPARK-23906 > URL: https://issues.apache.org/jira/browse/SPARK-23906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-14582 > We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we > should introduce a new name or reuse {{trunc}} for truncating numbers.
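A hypothetical model of the semantics under discussion: truncation toward zero to {{d}} decimal places, matching the MySQL examples above. The name and float-based arithmetic are illustrative only; a real SQL implementation would likely operate on Decimal values.

```python
# Sketch of truncate(x, d): drop digits past position d, truncating toward
# zero (unlike floor, which rounds toward negative infinity).
import math

def truncate(x, d):
    factor = 10.0 ** d
    return math.trunc(x * factor) / factor

print(truncate(1.223, 1))   # 1.2
print(truncate(-1.999, 1))  # -1.9  (toward zero, so not -2.0)
print(truncate(122, -2))    # 100.0 (negative d truncates whole digits)
```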
[jira] [Created] (SPARK-25426) Handles subexpression elimination config inside CodeGeneratorWithInterpretedFallback
Takeshi Yamamuro created SPARK-25426: Summary: Handles subexpression elimination config inside CodeGeneratorWithInterpretedFallback Key: SPARK-25426 URL: https://issues.apache.org/jira/browse/SPARK-25426 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Takeshi Yamamuro
[jira] [Updated] (SPARK-25414) make it clear that the numRows metrics should be counted for each scan of the source
[ https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25414: Summary: make it clear that the numRows metrics should be counted for each scan of the source (was: The numInputRows metrics can be incorrect for streaming self-join) > make it clear that the numRows metrics should be counted for each scan of the > source > > > Key: SPARK-25414 > URL: https://issues.apache.org/jira/browse/SPARK-25414 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Updated] (SPARK-25414) make it clear that the numRows metrics should be counted for each scan of the source
[ https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25414: Issue Type: Test (was: Bug) > make it clear that the numRows metrics should be counted for each scan of the > source > > > Key: SPARK-25414 > URL: https://issues.apache.org/jira/browse/SPARK-25414 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Comment Edited] (SPARK-25293) Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx instead of directly saving in outputDir
[ https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614245#comment-16614245 ] omkar puttagunta edited comment on SPARK-25293 at 9/14/18 2:03 AM: --- [~hyukjin.kwon] tested with 2.1.3, got the same issue. My stack overflow question got answers saying that this is due to lack of "shared file system". Is it the real reason? I am running spark in standalone mode, no HDFS or any other distributed file system. If I use the FileOutputCommitter Version 2, will I get the desired result? [https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu] was (Author: omkar999): [~hyukjin.kwon] tested with 2.1.3, got the same issue. My stack overflow question got answers saying that this is due to lack of "shared file system". Is it the real reason? If I use the FileOutputCommitter Version 2, will I get the desired result? [https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu] > Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx > instead of directly saving in outputDir > -- > > Key: SPARK-25293 > URL: https://issues.apache.org/jira/browse/SPARK-25293 > Project: Spark > Issue Type: Bug > Components: EC2, Java API, Spark Shell, Spark Submit >Affects Versions: 2.0.2, 2.1.3 >Reporter: omkar puttagunta >Priority: Major > > [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating] > {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master > node on AWS EC2 > {quote} > Simple Test; reading pipe delimited file and writing data to csv. 
Commands > below are executed in spark-shell with master-url set > {{val df = > spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/") > val emailDf=df.filter("_c3='EML'") > emailDf.repartition(100).write.csv("/opt/outputFile/")}} > After executing the commands above in spark-shell with master url set. > {quote}In {{worker1}} -> Each part file is created > in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}} > In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated > directly under outputDirectory specified during write. > {quote} > *Same thing happens with coalesce(100) or without specifying > repartition/coalesce!!! Tried with Java also!* > *_Question_* > 1) why {{worker1}} {{/opt/outputFile/}} output directory doesn't have > {{part-}} files just like in {{worker2}}? why is the {{_temporary}} directory > created and {{part-xxx-xx}} files reside in the {{task-xxx}} directories?
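On the FileOutputCommitter question above: algorithm version 2 moves task output into the destination directory at task commit rather than at job commit. A hedged sketch of how it would be enabled (the application jar name is a placeholder):

```shell
# Hedged sketch: enable FileOutputCommitter algorithm version 2 via a
# Hadoop configuration passed through Spark. Note this still assumes a
# filesystem that the driver and all workers can see, so it may not change
# the standalone-mode behavior reported here without shared storage.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your-app.jar
```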
[jira] [Commented] (SPARK-25293) Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx instead of directly saving in outputDir
[ https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614245#comment-16614245 ] omkar puttagunta commented on SPARK-25293: -- [~hyukjin.kwon] tested with 2.1.3, got the same issue. My stack overflow question got answers saying that this is due to lack of "shared file system". Is it the real reason? If I use the FileOutputCommitter Version 2, will I get the desired result? [https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu] > Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx > instead of directly saving in outputDir > -- > > Key: SPARK-25293 > URL: https://issues.apache.org/jira/browse/SPARK-25293 > Project: Spark > Issue Type: Bug > Components: EC2, Java API, Spark Shell, Spark Submit >Affects Versions: 2.0.2, 2.1.3 >Reporter: omkar puttagunta >Priority: Major > > [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating] > {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master > node on AWS EC2 > {quote} > Simple Test; reading pipe delimited file and writing data to csv. Commands > below are executed in spark-shell with master-url set > {{val df = > spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/") > val emailDf=df.filter("_c3='EML'") > emailDf.repartition(100).write.csv("/opt/outputFile/")}} > After executing the commands above in spark-shell with master url set. > {quote}In {{worker1}} -> Each part file is created > in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}} > In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated > directly under outputDirectory specified during write. > {quote} > *Same thing happens with coalesce(100) or without specifying > repartition/coalesce!!! 
Tried with Java also!* > *_Question_* > 1) why {{worker1}} {{/opt/outputFile/}} output directory doesn't have > {{part-}} files just like in {{worker2}}? why is the {{_temporary}} directory > created and {{part-xxx-xx}} files reside in the {{task-xxx}} directories?
[jira] [Updated] (SPARK-25293) Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx instead of directly saving in outputDir
[ https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] omkar puttagunta updated SPARK-25293: - Affects Version/s: 2.1.3 > Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx > instead of directly saving in outputDir > -- > > Key: SPARK-25293 > URL: https://issues.apache.org/jira/browse/SPARK-25293 > Project: Spark > Issue Type: Bug > Components: EC2, Java API, Spark Shell, Spark Submit >Affects Versions: 2.0.2, 2.1.3 >Reporter: omkar puttagunta >Priority: Major > > [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating] > {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master > node on AWS EC2 > {quote} > Simple Test; reading pipe delimited file and writing data to csv. Commands > below are executed in spark-shell with master-url set > {{val df = > spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/") > val emailDf=df.filter("_c3='EML'") > emailDf.repartition(100).write.csv("/opt/outputFile/")}} > After executing the commands above in spark-shell with master url set. > {quote}In {{worker1}} -> Each part file is created > in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}} > In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated > directly under outputDirectory specified during write. > {quote} > *Same thing happens with coalesce(100) or without specifying > repartition/coalesce!!! Tried with Java also!* > *_Question_* > 1) why {{worker1}} {{/opt/outputFile/}} output directory doesn't have > {{part-}} files just like in {{worker2}}? why is the {{_temporary}} directory > created and {{part-xxx-xx}} files reside in the {{task-xxx}} directories?
[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208 ] Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:07 AM: -- [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial. A few properties need to be restored from the checkpoint, and of course it needs testing. I can do the testing if we can get it in 2.4 soon. [~foxish] thoughts? was (Author: skonto): [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial. A few properties need to be restored from the checkpoint, and of course it needs testing. I can do the testing if we can get it in 2.4 soon. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208 ] Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:04 AM: -- [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial. A few properties need to be restored from the checkpoint, and of course it needs testing. I can do the testing if we can get it in 2.4 soon. was (Author: skonto): [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial. A few properties need to be restored from the checkpoint, and of course it needs testing. I can do it if we can get it in 2.4 soon. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208 ] Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:03 AM: -- [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial. A few properties need to be restored from the checkpoint, and of course it needs testing. I can do it if we can get it in 2.4 soon. was (Author: skonto): [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial s few properties need to be restored from the checkpoint, and of course it needs testing. I can do it if we can get it in 2.4 soon. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208 ] Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:03 AM: -- [~cloud_fan]I think this should go in 2.4 even though its a bit late, the remaining PR is trivial s few properties need to be restored from the checkpoint, and of course it needs testing. I can do it if we can get it in 2.4 soon. was (Author: skonto): [~cloud_fan]I think this should go in 2.4 even though its a bit late. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208 ] Stavros Kontopoulos commented on SPARK-23200: - [~cloud_fan]I think this should go in 2.4 even though its a bit late. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Updated] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23200: Issue Type: Bug (was: Improvement) > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614200#comment-16614200 ] Liang-Chi Hsieh commented on SPARK-25378: - The fix looks like: https://github.com/apache/spark/compare/master...viirya:SPARK-25378?expand=1 If this looks ok, I can submit a PR with it. cc [~mengxr] [~cloud_fan] [~hyukjin.kwon]. Please let me know. Thanks. > ArrayData.toArray(StringType) assume UTF8String in 2.4 > -- > > Key: SPARK-25378 > URL: https://issues.apache.org/jira/browse/SPARK-25378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Critical > > The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT: > {code} > import org.apache.spark.sql.catalyst.util._ > import org.apache.spark.sql.types.StringType > ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType) > res0: Array[String] = Array(a, b) > {code} > In 2.4.0-SNAPSHOT, the error is > {code}java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178) > ... 51 elided > {code} > cc: [~cloud_fan] [~yogeshg]
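A loose Python analogy of the regression above (names are assumptions for illustration; Spark's real code is Scala): an accessor that assumes the internal UTF8String representation (modeled here as bytes) fails when the array was built from plain strings, which mirrors the ClassCastException in the report.

```python
# Analogy only: GenericArrayData stores values as given; its string accessor
# assumes they are already in the internal encoded form.
class GenericArrayData:
    def __init__(self, values):
        self.values = values  # stored as-is, like Array("a", "b")

    def get_utf8_string(self, i):
        value = self.values[i]
        if not isinstance(value, bytes):
            # mirrors: java.lang.String cannot be cast to UTF8String
            raise TypeError(f"{type(value).__name__} cannot be cast to bytes")
        return value.decode("utf-8")

raw = GenericArrayData(["a", "b"])        # plain strings, as in the repro
try:
    raw.get_utf8_string(0)
except TypeError as exc:
    print("fails:", exc)

encoded = GenericArrayData([b"a", b"b"])  # converted on ingest: accessor works
print(encoded.get_utf8_string(0))         # a
```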
[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614196#comment-16614196 ] Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 12:45 AM: --- This is important and should have been a bug not an improvement. Checkpointing needs to work as it is used by default in many cases in production. We should have an integration test for it. was (Author: skonto): This is important and should have been a bug not an improvement. Checkpointing needs to work as it is used by default in many cases in production. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614196#comment-16614196 ] Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 12:44 AM: --- This is important and should have been a bug not an improvement. Checkpointing needs to work as it is used by default in many cases in production. was (Author: skonto): This is important and should have been a bug not an improvement. Checkpointing needs to work. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614196#comment-16614196 ] Stavros Kontopoulos commented on SPARK-23200: - This is important and should have been a bug not an improvement. Checkpointing needs to work. > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614176#comment-16614176 ] Bryan Cutler commented on SPARK-25344: -- From the mailing list, I think we should agree on a few things first: 1. When to create a separate test file - for each module? And how to name it, e.g. "test_rdd.py"? 2. Where to put the test files - in the same dir as the source, or in a subdir named "tests"? 3. Start splitting tests immediately as new tests are written, or incrementally as subtasks in this JIRA? > Break large tests.py files into smaller files > - > > Key: SPARK-25344 > URL: https://issues.apache.org/jira/browse/SPARK-25344 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > We've got a ton of tests in one humongous tests.py file, rather than breaking > it out into smaller files. > Having one huge file doesn't seem great for code organization, and it also > makes the test parallelization in run-tests.py not work as well. On my > laptop, tests.py takes 150s, and the next longest test file takes only 20s. > There are similarly large files in other pyspark modules, eg. sql/tests.py, > ml/tests.py, mllib/tests.py, streaming/tests.py. > It seems that at least for some of these files, it's already broken into > independent test classes, so it shouldn't be too hard to just move them into > their own files.
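To make point 1 concrete, here is a small runnable sketch showing that split-out files named `test_*.py` are picked up by stdlib unittest discovery. The file and class names are assumptions for illustration only, not the actual PySpark layout (PySpark drives its tests through run-tests.py):

```python
# Create a temporary tests/ directory with one split-out test module and
# run unittest discovery over it with the test_*.py naming convention.
import os
import tempfile
import textwrap
import unittest

with tempfile.TemporaryDirectory() as root:
    tests_dir = os.path.join(root, "tests")
    os.makedirs(tests_dir)
    with open(os.path.join(tests_dir, "test_rdd.py"), "w") as f:
        f.write(textwrap.dedent("""
            import unittest

            class RDDTests(unittest.TestCase):
                def test_trivial(self):
                    self.assertEqual(1 + 1, 2)
        """))
    suite = unittest.defaultTestLoader.discover(tests_dir, pattern="test_*.py")
    result = unittest.TextTestRunner(verbosity=0).run(suite)

print(result.testsRun)  # 1
```

Smaller files also let a runner schedule the discovered modules in parallel, which is the run-tests.py parallelization point raised in the issue.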
[jira] [Commented] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed
[ https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614111#comment-16614111 ] Stavros Kontopoulos commented on SPARK-25053: - This is going to be covered by the pod template PR, checked it already with jprofiler and works as expected. > Allow additional port forwarding on Spark on K8S as needed > -- > > Key: SPARK-25053 > URL: https://issues.apache.org/jira/browse/SPARK-25053 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > In some cases, like setting up remote debuggers, adding additional ports to > be forwarded would be useful.
[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25423: Labels: starter (was: ) > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Trivial > Labels: starter >
[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614020#comment-16614020 ] Ruslan Dautkhanov commented on SPARK-25164: --- Hi [~bersprockets] Thanks a lot for the detailed response. I see what you're saying. It's interesting that Spark realizes all rows even though the where filter has a predicate on just one column. I am wondering if it's feasible to lazily realize the list of columns in the select clause only after filtering is complete? It seems it could be a huge performance improvement for wide tables like this. In other words, Spark would realize the list of columns specified in the where clause first, and only after filtering realize the rest of the columns needed for the select clause. Thoughts? Thank you! Ruslan > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List<ColumnDescriptor> getColumns() { > List<String[]> paths = this.getPaths(0); > List<ColumnDescriptor> columns = new > ArrayList<ColumnDescriptor>(paths.size()); > for (String[] path : paths) { > // TODO: optimize this > > PrimitiveType primitiveType = getType(path).asPrimitiveType(); > columns.add(new ColumnDescriptor( > path, > primitiveType, > getMaxRepetitionLevel(path), > getMaxDefinitionLevel(path))); > } > return columns; > } > {noformat} > This means that for each parquet file, this routine indirectly iterates > colCount*colCount times. 
> This is actually not particularly noticeable unless you have: > - many parquet files > - many columns > To verify that this is an issue, I created a 1 million record parquet table > with 6000 columns of type double and 67 files (so initializeInternal is > called 67 times). I ran the following query: > {noformat} > sql("select * from 6000_1m_double where id1 = 1").collect > {noformat} > I used Spark from the master branch. I had 8 executor threads. The filter > returns only a few thousand records. The query ran (on average) for 6.4 > minutes. > Then I cached the column list at the top of {{initializeInternal}} as follows: > {noformat} > List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); > {noformat} > Then I changed {{initializeInternal}} to use {{columnCache}} rather than > {{requestedSchema.getColumns()}}. > With the column cache variable, the same query runs in 5 minutes. So with my > simple query, you save 22% of time by not rebuilding the column list for each > column. > You get additional savings with a paths cache variable, now saving 34% in > total on the above query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
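The fix described in the ticket is loop-invariant hoisting: compute the column list once and reuse it, instead of rebuilding it for every column. A minimal Python sketch of the pattern (class and method names are illustrative; the real code is Java in parquet-mr's MessageType and Spark's VectorizedParquetRecordReader):

```python
class SchemaLike:
    """Mimics MessageType: get_columns() rebuilds the full list on every call."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.build_count = 0  # how many times the column list was rebuilt

    def get_columns(self):
        self.build_count += 1
        return [{"path": p} for p in self.paths]


def init_slow(schema):
    # Before: one get_columns() call per column -> O(colCount^2) work per file.
    return [schema.get_columns()[i] for i in range(len(schema.get_columns()))]


def init_fast(schema):
    # After: hoist the invariant call out of the loop, as with columnCache.
    column_cache = schema.get_columns()
    return [column_cache[i] for i in range(len(column_cache))]
```

With 6000 columns per file and 67 files, the slow variant rebuilds the list thousands of times per file, which matches the reported 22% (and with paths cached, 34%) query-time savings.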
[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614020#comment-16614020 ] Ruslan Dautkhanov edited comment on SPARK-25164 at 9/13/18 8:19 PM: Hi [~bersprockets] Thanks a lot for the detailed response. I totally see with what you're saying. That's interesting that Spark realizing all rows even though where filter has a predicate for just one column. I am thinking if it's feasible to lazily realize list of columns in select-clause only after filtering is complete? It seems could be a huge performance improvement for wider tables like this. In other words, if Spark would realize list of columns specified in where clause first, and only after filtering realize rest of columns needed for select-clause. Thoughts? Thank you! Ruslan was (Author: tagar): Hi [~bersprockets] Thanks a lot for the detailed response. I totally see with what you're saying. That's interesting that Spark realizing all rows even though where filter has a predicate for just one column. I am thinking if it's feasible to lazily realize list of columns in select-clause only after filtering is complete? It seems could be a huge performance improvement for wider tables like this. In other words, if Spark would realize list of columns specified in where clause first, and only after filtering realize rest of columns needed for select-clause. Thoughts? Thank you! 
Ruslan > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List<ColumnDescriptor> getColumns() { > List<String[]> paths = this.getPaths(0); > List<ColumnDescriptor> columns = new > ArrayList<ColumnDescriptor>(paths.size()); > for (String[] path : paths) { > // TODO: optimize this > > PrimitiveType primitiveType = getType(path).asPrimitiveType(); > columns.add(new ColumnDescriptor( > path, > primitiveType, > getMaxRepetitionLevel(path), > getMaxDefinitionLevel(path))); > } > return columns; > } > {noformat} > This means that for each parquet file, this routine indirectly iterates > colCount*colCount times. > This is actually not particularly noticeable unless you have: > - many parquet files > - many columns > To verify that this is an issue, I created a 1 million record parquet table > with 6000 columns of type double and 67 files (so initializeInternal is > called 67 times). I ran the following query: > {noformat} > sql("select * from 6000_1m_double where id1 = 1").collect > {noformat} > I used Spark from the master branch. I had 8 executor threads. The filter > returns only a few thousand records. The query ran (on average) for 6.4 > minutes. > Then I cached the column list at the top of {{initializeInternal}} as follows: > {noformat} > List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); > {noformat} > Then I changed {{initializeInternal}} to use {{columnCache}} rather than > {{requestedSchema.getColumns()}}. 
> With the column cache variable, the same query runs in 5 minutes. So with my > simple query, you save 22% of time by not rebuilding the column list for each > column. > You get additional savings with a paths cache variable, now saving 34% in > total on the above query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25425) Extra options must overwrite session options
Maxim Gekk created SPARK-25425: -- Summary: Extra options must overwrite session options Key: SPARK-25425 URL: https://issues.apache.org/jira/browse/SPARK-25425 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk In the load() and save() methods of DataSource V2, extra options are overwritten by session options: * https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245 * https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205 but the implementation must be the opposite: more specific extra options set via *.option(...)* must overwrite the more general session options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
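The intended precedence can be sketched as a simple map merge where the per-operation options are applied last. This is a hypothetical helper illustrating the rule, not Spark's actual implementation:

```python
def effective_options(session_options, extra_options):
    """Merge option maps so the more specific ones win.

    Per-operation options set via .option(...) should override
    session-level defaults, so they are applied last in the merge.
    """
    merged = dict(session_options)   # start from the session defaults
    merged.update(extra_options)     # extra (more specific) options overwrite
    return merged
```

The bug report says the current code effectively merges in the other order, letting session options clobber what the user passed explicitly.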
[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API
[ https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613994#comment-16613994 ] Felix Cheung commented on SPARK-21291: -- No, you wouldn’t return a writer in R. I will reply with more details in a few days > R bucketBy partitionBy API > -- > > Key: SPARK-21291 > URL: https://issues.apache.org/jira/browse/SPARK-21291 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Priority: Major > > partitionBy exists but it's for windowspec only -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25400: - Assignee: Imran Rashid > Increase timeouts in schedulerIntegrationSuite > -- > > Key: SPARK-25400 > URL: https://issues.apache.org/jira/browse/SPARK-25400 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887 > it seems the timeout really is too short: > {noformat} > 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task > 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, > PROCESS_LOCAL, 7677 bytes) > 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task > 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10) > 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task > 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10) > 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task > 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, > PROCESS_LOCAL, 7677 bytes) > 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task > 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10) > 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task > 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, > PROCESS_LOCAL, 7677 bytes) > 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed > broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 > B, free: 1638.6 MB) > {noformat} > you'll see that the "mock backend thread" does keep making progress, but for > whatever reason there is over a 
one second delay in the middle. That's > already going over the existing timeouts. > It's possible there is something else going on here, but for now just > increasing the timeouts seems like the best next step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25400. --- Resolution: Fixed Fix Version/s: 2.3.2 2.4.0 Issue resolved by pull request 22385 [https://github.com/apache/spark/pull/22385] > Increase timeouts in schedulerIntegrationSuite > -- > > Key: SPARK-25400 > URL: https://issues.apache.org/jira/browse/SPARK-25400 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0, 2.3.2 > > > I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887 > it seems the timeout really is too short: > {noformat} > 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task > 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, > PROCESS_LOCAL, 7677 bytes) > 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task > 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10) > 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task > 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10) > 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task > 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, > PROCESS_LOCAL, 7677 bytes) > 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task > 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10) > 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task > 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, > PROCESS_LOCAL, 7677 bytes) > 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed > broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 > B, free: 1638.6 MB) > {noformat} > you'll 
see that the "mock backend thread" does keep making progress, but for > whatever reason there is over a one second delay in the middle. Thats > already going over the existing timeouts. > Its possible there is something else going on here, but for now just > increasing the timeouts seems like the best next step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method
[ https://issues.apache.org/jira/browse/SPARK-25338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25338. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22337 [https://github.com/apache/spark/pull/22337] > Several tests miss calling super.afterAll() in their afterAll() method > -- > > Key: SPARK-25338 > URL: https://issues.apache.org/jira/browse/SPARK-25338 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 3.0.0 > > > The following tests under {{external}} may not call {{super.afterAll()}} in > their {{afterAll()}} method. > {code} > external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala > external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala > external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala > external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala > external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala > external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala > external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
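The pattern the fix enforces can be sketched as follows. These are illustrative Python stand-ins, not real ScalaTest classes: every afterAll() override must forward to the parent, ideally in a finally block, so base-class cleanup runs even if the suite's own teardown throws:

```python
class BaseSuite:
    """Stand-in for the shared test base class (e.g. SparkFunSuite)."""

    def __init__(self):
        self.base_cleaned = False

    def after_all(self):
        self.base_cleaned = True  # shared teardown: stop contexts, etc.


class GoodSuite(BaseSuite):
    def __init__(self):
        super().__init__()
        self.suite_cleaned = False

    def after_all(self):
        try:
            self.suite_cleaned = True  # suite-specific teardown
        finally:
            super().after_all()        # always forward to the base class


class BadSuite(BaseSuite):
    def after_all(self):
        pass  # forgets super().after_all(): shared teardown never runs
```

The listed Kafka/Flume/Kinesis suites were variants of BadSuite, silently skipping whatever cleanup the base class performs.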
[jira] [Assigned] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method
[ https://issues.apache.org/jira/browse/SPARK-25338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25338: - Assignee: Kazuaki Ishizaki > Several tests miss calling super.afterAll() in their afterAll() method > -- > > Key: SPARK-25338 > URL: https://issues.apache.org/jira/browse/SPARK-25338 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 3.0.0 > > > The following tests under {{external}} may not call {{super.afterAll()}} in > their {{afterAll()}} method. > {code} > external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala > external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala > external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala > external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala > external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala > external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala > external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala > external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghav Kumar Gautam updated SPARK-25424: Fix Version/s: (was: 2.3.2) 2.4.0 > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.4.0 > > > In the TimeWindow class, window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour is enforced by catalyst. It can be enforced by the > constructor of TimeWindow, allowing it to fail fast. > For example, the code below throws the following error. Note that the error is > produced at the time of the count() call instead of the window() call. > {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > 
org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at >
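The proposed fail-fast behaviour amounts to validating durations in the constructor. A hypothetical sketch of the idea (TimeWindowSketch is illustrative, not Spark's actual TimeWindow, which stores durations in microseconds):

```python
class TimeWindowSketch:
    """Illustrative stand-in: validate durations at construction time so a
    negative window fails when window() is built, not later at count()."""

    def __init__(self, window_ms, slide_ms):
        if window_ms <= 0:
            raise ValueError(
                f"The window duration ({window_ms} ms) must be greater than 0.")
        if slide_ms <= 0:
            raise ValueError(
                f"The slide duration ({slide_ms} ms) must be greater than 0.")
        self.window_ms = window_ms
        self.slide_ms = slide_ms
```

Catalyst's analysis-time check would remain as a backstop; the constructor check just moves the error to the earliest possible point, as the ticket requests.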
[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghav Kumar Gautam updated SPARK-25424: Description: In the TimeWindow class, window duration and slide duration should not be allowed to take negative values. Currently this behaviour is enforced by catalyst. It can be enforced by the constructor of TimeWindow, allowing it to fail fast. For example, the code below throws the following error. Note that the error is produced at the time of the count() call instead of the window() call. {code:java} val df = spark.readStream .format("rate") .option("numPartitions", "2") .option("rowsPerSecond", "10") .load() .filter("value % 20 == 0") .withWatermark("timestamp", "10 seconds") .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) .count() {code} Error: {code:java} cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data type mismatch: The window duration (-1000) must be greater than 0.;; 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, count(1) AS count#57L] +- AnalysisBarrier +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) +- StreamingRelationV2 org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data type mismatch: The window duration (-1000) must be greater than 0.;; 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, count(1) AS count#57L] +- AnalysisBarrier +- EventTimeWatermark timestamp#47: timestamp, interval 10 
seconds +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) +- StreamingRelationV2 org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at
[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API
[ https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613832#comment-16613832 ] Huaxin Gao commented on SPARK-21291: [~felixcheung] I am working on this, but not sure if my approach is correct. I am thinking of having the following code: {code:java} setMethod("write.partitionBy", signature(x = "SparkDataFrame"), function(x, ...) { jcols <- lapply(list(...), function(arg) { stopifnot(class(arg) == "character") arg }) write <- callJMethod(x@sdf, "write") invisible(handledCallJMethod(write, "partitionBy", jcols)) }) {code} The method returns a DataFrameWriter, but it seems that the DataFrameWriter can't be used directly in R. The DataFrameWriter methods, for example, text(path: String), is implemented in R as write.text in DataFrame.R, so I am not sure if it's correct for me to return a DataFrameWriter for partitionBy. > R bucketBy partitionBy API > -- > > Key: SPARK-21291 > URL: https://issues.apache.org/jira/browse/SPARK-21291 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Priority: Major > > partitionBy exists but it's for windowspec only -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613830#comment-16613830 ] Raghav Kumar Gautam commented on SPARK-25424: - I have a patch for this. Can someone assign this issue to me? > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.3.2 > > > In the TimeWindow class, window duration and slide duration should not be allowed to > take negative values. > Currently this is enforced by catalyst. It can be enforced by the constructor of > TimeWindow, allowing it to fail fast. > For example, the code below throws the following error. Note that the error is > produced at the time of the count() call instead of the window() call. > {code} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 
2),None), rate, [timestamp#45, value#46L] > org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at >
[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raghav Kumar Gautam updated SPARK-25424: Target Version/s: 2.4.0 (was: 2.3.2) > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.3.2 > > > In the TimeWindow class, window duration and slide duration should not be > allowed to take negative values. > Currently this is enforced by Catalyst; it could instead be enforced by the > constructor of TimeWindow, allowing it to fail fast. > For example, the code below throws the following error. Note that the error > is produced at the time of the count() call instead of the window() call. > {code} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > 
org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122) > at >
[jira] [Created] (SPARK-25424) Window duration and slide duration with negative values should fail fast
Raghav Kumar Gautam created SPARK-25424: --- Summary: Window duration and slide duration with negative values should fail fast Key: SPARK-25424 URL: https://issues.apache.org/jira/browse/SPARK-25424 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Raghav Kumar Gautam Fix For: 2.3.2 In the TimeWindow class, window duration and slide duration should not be allowed to take negative values. Currently this is enforced by Catalyst; it could instead be enforced by the constructor of TimeWindow, allowing it to fail fast. For example, the code below throws the following error. Note that the error is produced at the time of the count() call instead of the window() call. {code} val df = spark.readStream .format("rate") .option("numPartitions", "2") .option("rowsPerSecond", "10") .load() .filter("value % 20 == 0") .withWatermark("timestamp", "10 seconds") .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) .count() {code} Error: {code} cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data type mismatch: The window duration (-1000) must be greater than 0.;; 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, count(1) AS count#57L] +- AnalysisBarrier +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) +- StreamingRelationV2 org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data type mismatch: The window duration (-1000) must be greater than 0.;; 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], [timewindow(timestamp#47-T1ms, -1000, 500, 
0) AS window#53, count(1) AS count#57L] +- AnalysisBarrier +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) +- StreamingRelationV2 org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at
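The fail-fast change proposed in SPARK-25424 amounts to moving the duration checks from Catalyst's analysis phase into the constructor, so the error surfaces at window() time rather than at count() time. A minimal sketch of that idea (a hypothetical simplified class, not the real org.apache.spark.sql.catalyst.expressions.TimeWindow):

```scala
// Hypothetical simplified sketch of constructor-time validation; the real
// TimeWindow expression performs these checks during Catalyst analysis.
case class SimpleTimeWindow(windowDurationMs: Long, slideDurationMs: Long) {
  // require throws IllegalArgumentException immediately at construction,
  // i.e. at window() call time rather than at the later count() call.
  require(windowDurationMs > 0,
    s"The window duration ($windowDurationMs) must be greater than 0.")
  require(slideDurationMs > 0,
    s"The slide duration ($slideDurationMs) must be greater than 0.")
  require(slideDurationMs <= windowDurationMs,
    s"The slide duration ($slideDurationMs) must be less than or equal to " +
      s"the window duration ($windowDurationMs).")
}
```

With this shape, a call equivalent to window($"timestamp", "-10 seconds", "5 seconds") would fail as soon as the window expression is built, instead of when the query is analyzed.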
[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-25423: Summary: Output "dataFilters" in DataSourceScanExec.metadata (was: Output "dataFilters" in DataSourceScanExec.toString) > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests
[ https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25406. - Resolution: Fixed Fix Version/s: 2.4.0 3.0.0 Issue resolved by pull request 22394 [https://github.com/apache/spark/pull/22394] > Incorrect usage of withSQLConf method in Parquet schema pruning test suite > masks failing tests > -- > > Key: SPARK-25406 > URL: https://issues.apache.org/jira/browse/SPARK-25406 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Michael Allman >Assignee: Michael Allman >Priority: Major > Fix For: 3.0.0, 2.4.0 > > > In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} > to set configuration settings within the scope of a test. However, the way we > use that method is incorrect, and as a result the desired configuration > settings are not propagated to the test cases. This masks some test case > failures. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests
[ https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25406: --- Assignee: Michael Allman > Incorrect usage of withSQLConf method in Parquet schema pruning test suite > masks failing tests > -- > > Key: SPARK-25406 > URL: https://issues.apache.org/jira/browse/SPARK-25406 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Michael Allman >Assignee: Michael Allman >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} > to set configuration settings within the scope of a test. However, the way we > use that method is incorrect, and as a result the desired configuration > settings are not propagated to the test cases. This masks some test case > failures. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
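The pitfall behind SPARK-25406 can be modeled without Spark at all (the helper and config key below are illustrative, not the actual ScalaTest/Spark test APIs): wrapping test *registration* in a config-scoping helper sets and restores the config at registration time, while the test *body* only runs later and therefore sees the default value.

```scala
import scala.collection.mutable

// Minimal model of the pitfall: a config map scoped by a helper, and tests
// whose bodies run only after registration has completed.
var conf: Map[String, String] = Map.empty
val registeredTests = mutable.Buffer.empty[() => Boolean]

def withConf(key: String, value: String)(body: => Unit): Unit = {
  val saved = conf
  conf = conf + (key -> value)
  try body finally conf = saved
}

def registerTest(body: => Boolean): Unit = registeredTests += (() => body)

// Wrong: withConf wraps the *registration*; by the time the body runs,
// the config has been restored, so the check quietly passes over defaults.
withConf("some.conf.key", "true") {
  registerTest(conf.get("some.conf.key").contains("true"))
}

// Right: scope the config inside the registered body itself.
registerTest {
  var seen = false
  withConf("some.conf.key", "true") {
    seen = conf.get("some.conf.key").contains("true")
  }
  seen
}
```

Running the two registered bodies shows the difference: the first sees the config unset, the second sees it set, which is how the incorrect usage can mask failing tests.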
[jira] [Created] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.toString
Maryann Xue created SPARK-25423: --- Summary: Output "dataFilters" in DataSourceScanExec.toString Key: SPARK-25423 URL: https://issues.apache.org/jira/browse/SPARK-25423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maryann Xue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25170) Add Task Metrics description to the documentation
[ https://issues.apache.org/jira/browse/SPARK-25170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25170: - Assignee: Luca Canali > Add Task Metrics description to the documentation > - > > Key: SPARK-25170 > URL: https://issues.apache.org/jira/browse/SPARK-25170 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Labels: documentation > Fix For: 2.4.0 > > > The REST API, as well as other instrumentation and tools based on the Spark > ListenerBus expose the values of the Executor Task Metrics, which are quite > useful for workload/ performance troubleshooting. I propose to add the list > of the available Task Metrics with a short description to the monitoring > documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25170) Add Task Metrics description to the documentation
[ https://issues.apache.org/jira/browse/SPARK-25170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25170. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22397 [https://github.com/apache/spark/pull/22397] > Add Task Metrics description to the documentation > - > > Key: SPARK-25170 > URL: https://issues.apache.org/jira/browse/SPARK-25170 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Labels: documentation > Fix For: 2.4.0 > > > The REST API, as well as other instrumentation and tools based on the Spark > ListenerBus expose the values of the Executor Task Metrics, which are quite > useful for workload/ performance troubleshooting. I propose to add the list > of the available Task Metrics with a short description to the monitoring > documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging
[ https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-25407: --- Description: Spark supports merging schemata across table partitions in which one partition is missing a subfield that's present in another. However, attempting to select that missing field with a query that includes a partition pruning predicate that filters out the partitions that include that field results in a `ParquetDecodingException` when attempting to get the query results. This bug is specifically exercised by the failing (but ignored) test case [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131]. was: Spark supports merging schemata across table partitions in which one partition is missing a subfield that's present in another. However, attempting to select that missing field with a query that includes a partition pruning predicate the filters out the partitions that include that field results in a `ParquetDecodingException` when attempting to get the query results. This bug is specifically exercised by the failing (but ignored) test case https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131. > Spark throws a `ParquetDecodingException` when attempting to read a field > from a complex type in certain cases of schema merging > > > Key: SPARK-25407 > URL: https://issues.apache.org/jira/browse/SPARK-25407 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Michael Allman >Priority: Major > > Spark supports merging schemata across table partitions in which one > partition is missing a subfield that's present in another. 
However, > attempting to select that missing field with a query that includes a > partition pruning predicate that filters out the partitions that include that > field results in a `ParquetDecodingException` when attempting to get the > query results. > This bug is specifically exercised by the failing (but ignored) test case > [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
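As a sketch of the merging behavior described above (schemas modeled here as plain field-to-type maps, not Spark's StructType, and ignoring nesting): the merged schema is the union of the per-partition schemas, so a subfield present in only one partition is still selectable, and reading it from a partition that lacks it must produce nulls. The read path exercised by the ignored test is exactly that case.

```scala
// Toy model of partition schema merging (field name -> type name). Spark's
// real schema merging works on StructType, handles nested fields, and is
// where the reported ParquetDecodingException surfaces.
def mergeSchemas(a: Map[String, String],
                 b: Map[String, String]): Map[String, String] =
  (a.keySet ++ b.keySet).map { field =>
    (a.get(field), b.get(field)) match {
      case (Some(t1), Some(t2)) if t1 != t2 =>
        // Conflicting types cannot be merged.
        throw new IllegalArgumentException(
          s"Failed to merge fields: conflicting types $t1 and $t2 for '$field'")
      case (t1, t2) => field -> t1.orElse(t2).get
    }
  }.toMap
```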
[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
[ https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613575#comment-16613575 ] Wenchen Fan commented on SPARK-25422: - cc [~squito] is it related with the 2GB limitation change? > flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated > (encryption = on) (with replication as stream) > > > Key: SPARK-25422 > URL: https://issues.apache.org/jira/browse/SPARK-25422 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > stacktrace > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 7, localhost, executor 1): java.io.IOException: > org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of > broadcast_0: 1651574976 != 1165629262 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: corrupt remote block > broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313) > ... 13 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
Wenchen Fan created SPARK-25422: --- Summary: flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream) Key: SPARK-25422 URL: https://issues.apache.org/jira/browse/SPARK-25422 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 2.4.0 Reporter: Wenchen Fan stacktrace {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, localhost, executor 1): java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167) 
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313) ... 13 more {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
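The "corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262" message comes from a checksum comparison: the receiver recomputes a checksum over the fetched block and compares it with the sender's. A sketch of that check (to my understanding TorrentBroadcast uses Adler32 for this; the helpers below are illustrative, not the actual Spark implementation):

```scala
import java.io.IOException
import java.util.zip.Adler32

// Illustrative block-integrity check: recompute a checksum over the received
// bytes and fail with an "actual != expected"-style message on mismatch.
def calcChecksum(block: Array[Byte]): Int = {
  val adler = new Adler32
  adler.update(block, 0, block.length)
  adler.getValue.toInt
}

def verifyBlock(blockId: String, expected: Int, block: Array[Byte]): Unit = {
  val actual = calcChecksum(block)
  if (actual != expected)
    throw new IOException(s"corrupt remote block $blockId: $actual != $expected")
}
```

A mismatch like the one in the stack trace means the bytes that arrived differ from the bytes that were sent, which is why the question about the replication-as-stream (2GB limit) change is a plausible lead.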
[jira] [Updated] (SPARK-25402) Null handling in BooleanSimplification
[ https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25402: -- Fix Version/s: 2.2.3 > Null handling in BooleanSimplification > -- > > Key: SPARK-25402 > URL: https://issues.apache.org/jira/browse/SPARK-25402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.2.3, 2.3.2, 2.4.0 > > > SPARK-20350 introduced a bug BooleanSimplification for null handling. For > example, the following case returns a wrong answer. > {code} > val schema = StructType.fromDDL("a boolean, b int") > val rows = Seq(Row(null, 1)) > val rdd = sparkContext.parallelize(rows) > val df = spark.createDataFrame(rdd, schema) > checkAnswer(df.where("(NOT a) OR a"), Seq.empty) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
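Why can't `(NOT a) OR a` be simplified to `true`? Under SQL's three-valued logic, when a is NULL both NOT a and the OR evaluate to NULL, and WHERE keeps only rows whose predicate is exactly TRUE, so the row must be dropped; simplifying to TRUE would wrongly keep it. A small sketch of those semantics, with Option[Boolean] standing in for a nullable BOOLEAN:

```scala
// SQL three-valued logic, with None playing the role of NULL.
def not3(a: Option[Boolean]): Option[Boolean] = a.map(!_)

def or3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
  case (Some(true), _) | (_, Some(true)) => Some(true)   // TRUE dominates OR
  case (Some(false), Some(false))        => Some(false)
  case _                                 => None         // NULL otherwise
}

// WHERE keeps a row only when the predicate is exactly TRUE.
def whereKeeps(pred: Option[Boolean]): Boolean = pred.contains(true)
```

For a = NULL, or3(not3(None), None) is None, so whereKeeps returns false and the row is filtered out, matching the expected empty answer in the checkAnswer above.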
[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25420: -- Priority: Trivial (was: Major) > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Trivial > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25420: -- Priority: Major (was: Trivial) > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai updated SPARK-25420: -- Issue Type: Question (was: Bug) > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25404) Staging path may not on the expected place when table path contains the stagingDir string
[ https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613369#comment-16613369 ] Apache Spark commented on SPARK-25404: -- User 'fjh100456' has created a pull request for this issue: https://github.com/apache/spark/pull/22412 > Staging path may not on the expected place when table path contains the > stagingDir string > - > > Key: SPARK-25404 > URL: https://issues.apache.org/jira/browse/SPARK-25404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jinhua Fu >Priority: Minor > > Consider the following scenario: > > {code:java} > SET hive.exec.stagingdir=temp; > CREATE TABLE tempTableA(key int) location '/spark/temp/tempTableA'; > INSERT OVERWRITE TABLE tempTableA SELECT 1; > {code} > We expect the staging path to be under the table path, such as > '/spark/temp/tempTableA/.hive-stagingXXX' (SPARK-20594), but actually it is > '/spark/tempXXX'. > I'm not quite sure why we use the 'if ... else ...' when getting a > stagingDir, but it may be the cause of this bug. > > {code:java} > // SaveAsHiveFile.scala > private def getStagingDir( > inputPath: Path, > hadoopConf: Configuration, > stagingDir: String): Path = { > .. > var stagingPathName: String = > if (inputPathName.indexOf(stagingDir) == -1) { > new Path(inputPathName, stagingDir).toString > } else { > // The 'indexOf' may not return the expected position, and this may be the cause > of this bug. > inputPathName.substring(0, inputPathName.indexOf(stagingDir) + > stagingDir.length) > } > .. > } > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
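The indexOf pitfall described in the report can be reproduced with plain string logic. This is a minimal sketch, not the actual Spark source: paths are simplified to strings and the helper name is illustrative.

```java
public class StagingDirDemo {
    // Mirrors the branch logic quoted from getStagingDir: if the path already
    // contains the stagingDir string anywhere, the path is truncated at that
    // first occurrence.
    static String stagingPathName(String inputPathName, String stagingDir) {
        if (inputPathName.indexOf(stagingDir) == -1) {
            return inputPathName + "/" + stagingDir;
        }
        // indexOf finds the FIRST occurrence, which may be an unrelated path
        // component (here '/spark/temp'), not an existing staging directory.
        return inputPathName.substring(0,
            inputPathName.indexOf(stagingDir) + stagingDir.length());
    }

    public static void main(String[] args) {
        // Expected something under '/spark/temp/tempTableA', but the branch
        // truncates to '/spark/temp'.
        System.out.println(stagingPathName("/spark/temp/tempTableA", "temp"));
        // A path without the stagingDir string behaves as expected.
        System.out.println(stagingPathName("/data/tableB", "temp"));
    }
}
```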
[jira] [Assigned] (SPARK-25404) Staging path may not on the expected place when table path contains the stagingDir string
[ https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25404: Assignee: (was: Apache Spark) > Staging path may not on the expected place when table path contains the > stagingDir string > - > > Key: SPARK-25404 > URL: https://issues.apache.org/jira/browse/SPARK-25404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jinhua Fu >Priority: Minor > > Considering the follow scenario: > > {code:java} > SET hive.exec.stagingdir=temp; > CREATE TABLE tempTableA(key int) location '/spark/temp/tempTableA'; > INSERT OVERWRITE TABLE tempTableA SELECT 1; > {code} > We expect the staging path under the table path, such as > '/spark/temp/tempTableA/.hive-stagingXXX'(SPARK-20594), but actually it is > '/spark/tempXXX'. > I'm not quite sure why we use the 'if ... else ...' when getting a > stagingDir, but it maybe the cause of this bug. > > {code:java} > // SaveAsHiveFile.scala > private def getStagingDir( > inputPath: Path, > hadoopConf: Configuration, > stagingDir: String): Path = { > .. > var stagingPathName: String = > if (inputPathName.indexOf(stagingDir) == -1) { > new Path(inputPathName, stagingDir).toString > } else { > // The 'indexOf' may not get expected position, and this may be the cause > of this bug. > inputPathName.substring(0, inputPathName.indexOf(stagingDir) + > stagingDir.length) > } > .. > } > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25404) Staging path may not on the expected place when table path contains the stagingDir string
[ https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25404: Assignee: Apache Spark > Staging path may not on the expected place when table path contains the > stagingDir string > - > > Key: SPARK-25404 > URL: https://issues.apache.org/jira/browse/SPARK-25404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Jinhua Fu >Assignee: Apache Spark >Priority: Minor > > Considering the follow scenario: > > {code:java} > SET hive.exec.stagingdir=temp; > CREATE TABLE tempTableA(key int) location '/spark/temp/tempTableA'; > INSERT OVERWRITE TABLE tempTableA SELECT 1; > {code} > We expect the staging path under the table path, such as > '/spark/temp/tempTableA/.hive-stagingXXX'(SPARK-20594), but actually it is > '/spark/tempXXX'. > I'm not quite sure why we use the 'if ... else ...' when getting a > stagingDir, but it maybe the cause of this bug. > > {code:java} > // SaveAsHiveFile.scala > private def getStagingDir( > inputPath: Path, > hadoopConf: Configuration, > stagingDir: String): Path = { > .. > var stagingPathName: String = > if (inputPathName.indexOf(stagingDir) == -1) { > new Path(inputPathName, stagingDir).toString > } else { > // The 'indexOf' may not get expected position, and this may be the cause > of this bug. > inputPathName.substring(0, inputPathName.indexOf(stagingDir) + > stagingDir.length) > } > .. > } > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25421) Abstract an output path field in trait DataWritingCommand
[ https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25421: Assignee: Apache Spark > Abstract an output path field in trait DataWritingCommand > - > > Key: SPARK-25421 > URL: https://issues.apache.org/jira/browse/SPARK-25421 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > > [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] import a > metadata field in {{SparkPlanInfo}} and it could dump the input location for > read. Corresponding, we need add a field in {{DataWritingCommand}} for output > path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25421) Abstract an output path field in trait DataWritingCommand
[ https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613327#comment-16613327 ] Apache Spark commented on SPARK-25421: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22411 > Abstract an output path field in trait DataWritingCommand > - > > Key: SPARK-25421 > URL: https://issues.apache.org/jira/browse/SPARK-25421 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Lantao Jin >Priority: Major > > [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] import a > metadata field in {{SparkPlanInfo}} and it could dump the input location for > read. Corresponding, we need add a field in {{DataWritingCommand}} for output > path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25421) Abstract an output path field in trait DataWritingCommand
[ https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25421: Assignee: (was: Apache Spark) > Abstract an output path field in trait DataWritingCommand > - > > Key: SPARK-25421 > URL: https://issues.apache.org/jira/browse/SPARK-25421 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Lantao Jin >Priority: Major > > [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] import a > metadata field in {{SparkPlanInfo}} and it could dump the input location for > read. Corresponding, we need add a field in {{DataWritingCommand}} for output > path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25421) Abstract an output path field in trait DataWritingCommand
Lantao Jin created SPARK-25421: -- Summary: Abstract an output path field in trait DataWritingCommand Key: SPARK-25421 URL: https://issues.apache.org/jira/browse/SPARK-25421 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Lantao Jin [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] introduced a metadata field in {{SparkPlanInfo}} so it can dump the input location for reads. Correspondingly, we need to add a field in {{DataWritingCommand}} for the output path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
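The proposal can be read as the following sketch. The names are hypothetical and the real Spark trait is in Scala with more members; the point is only that every concrete write command exposes its destination so plan metadata can report it, symmetric to the input-location metadata from SPARK-25357.

```java
import java.net.URI;

// Hypothetical stand-in for the proposed abstraction in DataWritingCommand.
interface DataWritingCommandSketch {
    URI outputPath();
}

// One concrete write command reporting its destination.
class InsertIntoDirSketch implements DataWritingCommandSketch {
    private final URI path;
    InsertIntoDirSketch(URI path) { this.path = path; }
    @Override public URI outputPath() { return path; }
}

public class OutputPathDemo {
    public static void main(String[] args) {
        DataWritingCommandSketch cmd =
            new InsertIntoDirSketch(URI.create("hdfs://nn:9000/warehouse/t1"));
        // A SparkPlanInfo-like metadata layer could now record this uniformly
        // for any write command, without knowing the concrete type.
        System.out.println(cmd.outputPath());
    }
}
```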
[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613217#comment-16613217 ] Marco Gaido commented on SPARK-25420: - I think the reason here is that since we don't enforce any sorting on the incoming Dataset, and we take the first row among those with the same aggregation columns, the row chosen for each group by dropDuplicates is arbitrary. So this doesn't really seem a bug in itself; the output of dropDuplicates is simply non-deterministic unless a specific ordering on the input is enforced. > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > 
filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
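The mechanism described in the comment can be illustrated without Spark: a keep-first dedup over the same rows in two different arrival orders keeps different representatives, so a downstream filter counts differently. This is a minimal single-machine sketch; Spark's actual partial aggregation differs in detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DropDuplicatesDemo {
    // A row with a dedup key and one extra column that a later filter reads.
    static final class Row {
        final String key; final int status;
        Row(String key, int status) { this.key = key; this.status = status; }
    }

    // Keep the first row seen per key -- like dropDuplicates, the survivor
    // depends entirely on encounter order when no ordering is enforced.
    static List<Row> dropDuplicates(List<Row> rows) {
        Map<String, Row> first = new LinkedHashMap<>();
        for (Row r : rows) first.putIfAbsent(r.key, r);
        return new ArrayList<>(first.values());
    }

    static long filterCount(List<Row> rows) {
        return rows.stream().filter(r -> r.status == 1).count();
    }

    public static void main(String[] args) {
        // Same two rows, different arrival orders (as can happen across runs
        // when partitions finish in different orders).
        List<Row> orderA = List.of(new Row("k", 1), new Row("k", 0));
        List<Row> orderB = List.of(new Row("k", 0), new Row("k", 1));
        // Different surviving row => different filter counts.
        System.out.println(filterCount(dropDuplicates(orderA))); // 1
        System.out.println(filterCount(dropDuplicates(orderB))); // 0
    }
}
```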
[jira] [Comment Edited] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613209#comment-16613209 ] Marco Gaido edited comment on SPARK-25420 at 9/13/18 8:51 AM: -- Please do not use Critical/Blocker as they are reserved for committers. was (Author: mgaido): Please do not use Critical/Blocker as they are reserved for committers. Anyway, this seems a correctness issue, so I'd agree raising the priority. > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is 
filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25420: Labels: SQL (was: SQL correctness) > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613209#comment-16613209 ] Marco Gaido commented on SPARK-25420: - Please do not use Critical/Blocker as they are reserved for committers. Anyway, this seems a correctness issue, so I'd agree raising the priority. > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL, correctness > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25420: Labels: SQL correctness (was: ) > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL, correctness > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-25420: Priority: Major (was: Critical) > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25420) Dataset.count() every time is different.
huanghuai created SPARK-25420: - Summary: Dataset.count() every time is different. Key: SPARK-25420 URL: https://issues.apache.org/jira/browse/SPARK-25420 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Environment: spark2.3 standalone Reporter: huanghuai Dataset dataset = sparkSession.read().format("csv").option("sep", ",").option("inferSchema", "true") .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") .option("encoding", "UTF-8") .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); System.out.println("source count="+dataset.count()); Dataset dropDuplicates = dataset.dropDuplicates(new String[]\{"DATE","TIME","VEL","COMPANY"}); System.out.println("dropDuplicates count1="+dropDuplicates.count()); System.out.println("dropDuplicates count2="+dropDuplicates.count()); Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 and (status = 0 or status = 1)"); System.out.println("filter count1="+filter.count()); System.out.println("filter count2="+filter.count()); System.out.println("filter count3="+filter.count()); System.out.println("filter count4="+filter.count()); System.out.println("filter count5="+filter.count()); --The above is code --- console output: source count=459275 dropDuplicates count1=453987 dropDuplicates count2=453987 filter count1=445798 filter count2=445797 filter count3=445797 filter count4=445798 filter count5=445799 question: Why is filter.count() different everytime? if I remove dropDuplicates() everything will be ok!! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613157#comment-16613157 ] Liang-Chi Hsieh edited comment on SPARK-25378 at 9/13/18 8:33 AM: -- I think a quick fix is to use general `get` method for just `StringType` in `InternalRow.getAccessor`. This can allow the backward-compatible behavior for `StringType` when calling `toArray`. And we may consider to correct it back to `getUTF8String` by 3.0. WDYT? [~mengxr], [~cloud_fan], [~hyukjin.kwon]. was (Author: viirya): I think a quick fix is to use general `get` method for just `StringType` in `InternalRow.getAccessor`. This can allow the backward-compatible behavior for `StringType` when calling `toArray`. And we may consider to correct to `getUTF8String` by 3.0. WDYT? [~mengxr], [~cloud_fan], [~hyukjin.kwon]. > ArrayData.toArray(StringType) assume UTF8String in 2.4 > -- > > Key: SPARK-25378 > URL: https://issues.apache.org/jira/browse/SPARK-25378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Critical > > The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT: > {code} > import org.apache.spark.sql.catalyst.util._ > import org.apache.spark.sql.types.StringType > ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType) > res0: Array[String] = Array(a, b) > {code} > In 2.4.0-SNAPSHOT, the error is > {code}java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178) > ... 
51 elided > {code} > cc: [~cloud_fan] [~yogeshg] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
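The accessor mismatch behind the ClassCastException, and the backward-compatible quick fix suggested above, can be sketched as follows. The names are illustrative, not Spark's actual InternalRow/ArrayData API, and char[] stands in for UTF8String as the assumed internal representation.

```java
public class AccessorDemo {
    interface Row { Object get(int i); }

    // Specialized accessor: casts to the assumed internal representation and
    // fails if the caller stored a plain java.lang.String instead.
    static String getInternalString(Row r, int i) {
        return new String((char[]) r.get(i)); // ClassCastException on String
    }

    // Generic accessor: returns whatever was stored -- the tolerant,
    // backward-compatible behavior proposed for StringType.
    static Object getGeneric(Row r, int i) {
        return r.get(i);
    }

    public static void main(String[] args) {
        Row r = i -> "a"; // a row holding a plain java.lang.String
        System.out.println(getGeneric(r, 0));
        try {
            getInternalString(r, 0);
        } catch (ClassCastException e) {
            System.out.println("specialized accessor failed, as in the report");
        }
    }
}
```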
[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature
[ https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613181#comment-16613181 ] Vincent commented on SPARK-25412: - Thanks, Nick, for the reply. So the tradeoff is between a highly sparse vector (from increasing the numFeature size) and the risk of losing certain features to conflicting hash values (since changing the value/meaning of those features effectively makes them useless), correct? > FeatureHasher would change the value of output feature > -- > > Key: SPARK-25412 > URL: https://issues.apache.org/jira/browse/SPARK-25412 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.1 >Reporter: Vincent >Priority: Major > > In the current implementation of FeatureHasher.transform, a simple modulo on > the hashed value is used to determine the vector index, and it is suggested to use > a large integer value as the numFeature parameter. > We found several issues regarding the current implementation: > # We cannot get the feature name back by its index after the FeatureHasher > transform, for example when getting feature importance from decision tree > training that follows a FeatureHasher. > # When indices conflict, which is very likely to happen especially when > 'numFeature' is relatively small, the feature's value is changed to a new > value (the sum of the current and old values). > # To avoid conflicts, we should set 'numFeature' to a large number, but a > highly sparse vector increases the computation complexity of model training. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
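The collision-sum behavior under discussion is easy to see with the hashing trick in miniature. This sketch uses a hypothetical hash (Java's hashCode), not Spark's Murmur3-based implementation, and the method name is illustrative.

```java
import java.util.Arrays;
import java.util.Map;

public class HashingTrickDemo {
    // Hash each feature name to an index modulo numFeatures; on a collision
    // the colliding feature values are summed into the same slot, changing
    // the meaning of both features.
    static double[] hashFeatures(Map<String, Double> features, int numFeatures) {
        double[] vec = new double[numFeatures];
        for (Map.Entry<String, Double> e : features.entrySet()) {
            int idx = Math.floorMod(e.getKey().hashCode(), numFeatures);
            vec[idx] += e.getValue();
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, Double> features = Map.of("f1", 1.0, "f2", 2.0);
        // numFeatures = 1 forces every feature into slot 0: values are summed
        // and neither original feature is recoverable.
        System.out.println(Arrays.toString(hashFeatures(features, 1))); // [3.0]
        // A larger numFeatures makes collisions rarer, at the cost of a much
        // sparser vector -- exactly the tradeoff raised in the comment.
        System.out.println(hashFeatures(features, 1024).length); // 1024
    }
}
```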
[jira] [Reopened] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-24538: -- > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > LM-SHC-16502798:parquet-mr yumwang$ java -jar > ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta > /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > file: > file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]} > file schema: spark_schema > > id: REQUIRED INT64 R:0 D:0 > d1: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d2: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d3: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d4: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d5: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > d6: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:241867 TS:15480513 OFFSET:4 > > id: INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 
241866, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 > SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: > 241866, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 > SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, > max: 241866.00, num_nulls: 0] > row group 2: RC:241867 TS:15480513 OFFSET:8584904 > > id: INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 > SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, > max: 483733, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 > SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: > 241867.00, max: 483733.00, num_nulls: > 0]{noformat} -- This message 
was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
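The per-column min/max statistics in the dump above are what make pushdown worthwhile: a pushed-down predicate can skip whole row groups whose ranges cannot match. A minimal sketch of that pruning logic in plain Python (illustrative only; the function and data layout are ours, not Spark's or parquet-mr's), using the two row groups from the dump:

```python
def prune_row_groups(row_groups, column, predicate_min):
    """Keep only row groups whose [min, max] stats can satisfy `column > predicate_min`.

    Illustrative sketch of statistics-based row-group pruning; real readers
    also handle null counts and missing statistics.
    """
    kept = []
    for rg in row_groups:
        lo, hi = rg[column]
        if hi > predicate_min:  # some value in this group may match the filter
            kept.append(rg)
    return kept

# id ranges of the two row groups in the parquet-tools dump above
groups = [
    {"id": (0, 241866)},        # row group 1
    {"id": (241867, 483733)},   # row group 2
]
print(len(prune_row_groups(groups, "id", 500000)))  # 0 -- both groups skipped
print(len(prune_row_groups(groups, "id", 241866)))  # 1 -- only group 2 can match
```

With `id > 500000` neither group's max exceeds the bound, so no data pages are read at all.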
[jira] [Resolved] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24538. -- Resolution: Duplicate > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > (parquet-tools meta output elided; identical to the dump quoted in the first SPARK-24538 message above) > {noformat}
[jira] [Resolved] (SPARK-24549) Support DecimalType push down to the parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24549. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/21556 > Support DecimalType push down to the parquet data sources > - > > Key: SPARK-24549 > URL: https://issues.apache.org/jira/browse/SPARK-24549 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
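The schema in the dump above shows why DecimalType pushdown is feasible: Parquet stores small-precision decimals in order-comparable integer physical types. A minimal Python sketch of the standard Parquet convention (precision up to 9 digits fits INT32, up to 18 fits INT64, everything else uses FIXED_LEN_BYTE_ARRAY), matching the d1..d6 columns in the dump; the function name is ours, not Spark's:

```python
def parquet_decimal_physical_type(precision: int) -> str:
    """Map a decimal's precision to the Parquet physical type used to store it.

    Illustrative only: mirrors the convention visible in the parquet-tools
    dump above (decimal(9,0) -> INT32, decimal(18,4) -> INT64,
    decimal(38,18) -> FIXED_LEN_BYTE_ARRAY).
    """
    if precision <= 9:   # 10**9 - 1 fits in a signed 32-bit int
        return "INT32"
    if precision <= 18:  # 10**18 - 1 fits in a signed 64-bit int
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"

# The schema in the dump: d1=decimal(9,0), d3=decimal(18,0), d5=decimal(38,0)
for name, precision in [("d1", 9), ("d3", 18), ("d5", 38)]:
    print(name, parquet_decimal_physical_type(precision))
```

Pushdown for the INT32/INT64 cases only needs integer comparison on the unscaled value; the FIXED_LEN_BYTE_ARRAY case is what the ByteArrayDecimalType sub-task addresses.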
[jira] [Reopened] (SPARK-24549) Support DecimalType push down to the parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-24549: -- > Support DecimalType push down to the parquet data sources > - > > Key: SPARK-24549 > URL: https://issues.apache.org/jira/browse/SPARK-24549 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24538: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > (parquet-tools meta output elided; identical to the dump quoted in the first SPARK-24538 message above) > {noformat}
[jira] [Commented] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613170#comment-16613170 ] Yuming Wang commented on SPARK-24538: - [~cloud_fan] Could you please update this ticket to *Duplicate* and update SPARK-24549 to *Fixed*. > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > (parquet-tools meta output elided; identical to the dump quoted in the first SPARK-24538 message above) > {noformat}
[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24538: Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-25419) > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > (parquet-tools meta output elided; identical to the dump quoted in the first SPARK-24538 message above) > {noformat}
[jira] [Comment Edited] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613157#comment-16613157 ] Liang-Chi Hsieh edited comment on SPARK-25378 at 9/13/18 7:59 AM: -- I think a quick fix is to use general `get` method for just `StringType` in `InternalRow.getAccessor`. This can allow the backward-compatible behavior for `StringType` when calling `toArray`. And we may consider to correct to `getUTF8String` by 3.0. WDYT? [~mengxr], [~cloud_fan], [~hyukjin.kwon]. was (Author: viirya): I think a quick fix is to use general `get` method for just `StringType` in `InternalRow.getAccessor`. This can allow the backward-compatible behavior for `StringType` when calling `toArray`. And we may consider to correct to `getUTF8String` by 3.0. WDYT? [~mengxr][~cloud_fan][~hyukjin.kwon] > ArrayData.toArray(StringType) assume UTF8String in 2.4 > -- > > Key: SPARK-25378 > URL: https://issues.apache.org/jira/browse/SPARK-25378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Critical > > The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT: > {code} > import org.apache.spark.sql.catalyst.util._ > import org.apache.spark.sql.types.StringType > ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType) > res0: Array[String] = Array(a, b) > {code} > In 2.4.0-SNAPSHOT, the error is > {code}java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178) > ... 
51 elided > {code} > cc: [~cloud_fan] [~yogeshg] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25412) FeatureHasher would change the value of output feature
[ https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-25412. Resolution: Not A Bug > FeatureHasher would change the value of output feature > -- > > Key: SPARK-25412 > URL: https://issues.apache.org/jira/browse/SPARK-25412 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.1 >Reporter: Vincent >Priority: Major > > In the current implementation of FeatureHasher.transform, a simple modulo on the hashed value is used to determine the vector index, so it is suggested to use a large integer value as the numFeatures parameter. We found several issues with the current implementation: > # The feature name cannot be recovered from its index after the FeatureHasher transform, for example when getting feature importances from decision-tree training that follows a FeatureHasher. > # When indices collide, which is very likely when 'numFeatures' is relatively small, the feature value is changed to a new value (the sum of the current and old values). > # To avoid collisions, 'numFeatures' must be set to a large number, but the resulting highly sparse vectors increase the computational complexity of model training.
[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature
[ https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613160#comment-16613160 ] Nick Pentreath commented on SPARK-25412: (1) is by design. Feature hashing does not store the exact mapping from feature values to vector indices and so is a one-way transform. Hashing gives you speed and requires almost no memory, but you give up the reverse mapping and you have the potential for hash collisions. (2) is again by design for now. There are ways to have the sign of the feature value also be determined as part of a hash function, so that in expectation the collisions zero each other out. This may be added in future work. The impact of hash collisions can be reduced by increasing the {{numFeatures}} parameter. The default is probably reasonable for small to medium feature dimensions but should be increased when working with very high-cardinality features. I don't think this can be classed as a bug, as these are all design tradeoffs of using feature hashing. > FeatureHasher would change the value of output feature > -- > > Key: SPARK-25412 > URL: https://issues.apache.org/jira/browse/SPARK-25412 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.1 >Reporter: Vincent >Priority: Major > > (issue description elided; quoted in full in the preceding message)
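The collision behavior described in point (2) is easy to see in a tiny sketch of the hashing trick. This is plain Python with CRC32 standing in for the Murmur3 hash Spark actually uses, and the function name is ours; it is illustrative, not Spark's code:

```python
import zlib

def hash_features(features, num_features):
    """Minimal feature-hashing sketch: index = hash(name) mod num_features.

    Colliding features have their values summed into the same slot, which is
    the "value change" described in the issue. CRC32 stands in for Spark's
    Murmur3 hash; illustrative only.
    """
    vec = [0.0] * num_features
    for name, value in features.items():
        idx = zlib.crc32(name.encode("utf8")) % num_features
        vec[idx] += value  # on collision: sum of current and old value
    return vec

# With num_features=1 every feature collides, so the single slot holds the sum.
print(hash_features({"a": 1.0, "b": 2.0}, 1))  # [3.0]
```

Increasing `num_features` makes collisions rarer (at the cost of a sparser vector), which is exactly the tradeoff discussed in the comment above.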
[jira] [Updated] (SPARK-24549) Support DecimalType push down to the parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24549: Fix Version/s: 2.4.0 > Support DecimalType push down to the parquet data sources > - > > Key: SPARK-24549 > URL: https://issues.apache.org/jira/browse/SPARK-24549 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613157#comment-16613157 ] Liang-Chi Hsieh commented on SPARK-25378: - I think a quick fix is to use general `get` method for just `StringType` in `InternalRow.getAccessor`. This can allow the backward-compatible behavior for `StringType` when calling `toArray`. And we may consider to correct to `getUTF8String` by 3.0. WDYT? [~mengxr][~cloud_fan][~hyukjin.kwon] > ArrayData.toArray(StringType) assume UTF8String in 2.4 > -- > > Key: SPARK-25378 > URL: https://issues.apache.org/jira/browse/SPARK-25378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Critical > > The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT: > {code} > import org.apache.spark.sql.catalyst.util._ > import org.apache.spark.sql.types.StringType > ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType) > res0: Array[String] = Array(a, b) > {code} > In 2.4.0-SNAPSHOT, the error is > {code}java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178) > ... 51 elided > {code} > cc: [~cloud_fan] [~yogeshg] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
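The failure and the proposed quick fix can be modeled outside Spark: a specialized accessor that insists on the internal string representation fails on plain strings, while a generic getter preserves the 2.3 behavior. A toy Python sketch (class and method names are ours, with `bytes` standing in for `UTF8String`; not Spark's code):

```python
class GenericArrayData:
    """Toy stand-in for Spark's GenericArrayData holding arbitrary objects."""

    def __init__(self, values):
        self.values = values

    def get_utf8_string(self, i):
        # Specialized accessor: requires the internal representation (bytes
        # stands in for UTF8String) and fails on a plain str, mirroring the
        # ClassCastException in the stack trace above.
        v = self.values[i]
        if not isinstance(v, bytes):
            raise TypeError(f"cannot cast {type(v).__name__} to UTF8String")
        return v

    def get(self, i):
        # Generic accessor: returns whatever is stored, so arrays built from
        # plain strings keep working -- the backward-compatible quick fix.
        return self.values[i]

arr = GenericArrayData(["a", "b"])
print(arr.get(0))                  # generic access succeeds
try:
    arr.get_utf8_string(0)
except TypeError as e:
    print(e)                       # cast failure, as in 2.4.0-SNAPSHOT
```

The suggested fix amounts to having the `StringType` accessor call the generic getter until the internal-type contract can be enforced in a major release.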
[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24538: Fix Version/s: (was: 2.4.0) > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > LM-SHC-16502798:parquet-mr yumwang$ java -jar > ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta > /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > file: > file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]} > file schema: spark_schema > > id: REQUIRED INT64 R:0 D:0 > d1: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d2: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d3: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d4: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d5: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > d6: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:241867 TS:15480513 OFFSET:4 > > id: INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN 
ST:[min: 0, max: 241866, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 > SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: > 241866, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 > SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, > max: 241866.00, num_nulls: 0] > row group 2: RC:241867 TS:15480513 OFFSET:8584904 > > id: INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 > SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, > max: 483733, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 > SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: > 241867.00, max: 483733.00, num_nulls: > 0]{noformat} 
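The row-group statistics in the dump above are what make decimal pushdown possible. As a rough illustrative sketch (not the Spark patch itself): a decimal with precision <= 9 stores its unscaled value in an INT32 and precision <= 18 in an INT64, which is why the d1-d4 columns above are INT32/INT64, while wider decimals fall back to FIXED_LEN_BYTE_ARRAY:

```scala
import java.math.BigDecimal

// Illustrative only: convert a decimal literal to the unscaled INT32
// representation Parquet uses for precision <= 9 columns, so the literal
// can be compared against the min/max row-group statistics shown above.
def decimalToInt32(d: BigDecimal): java.lang.Integer =
  d.unscaledValue().intValueExact()

// decimalToInt32(new BigDecimal("241866.00")) == 24186600
```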
[jira] [Updated] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25207: Issue Type: Sub-task (was: Bug) Parent: SPARK-25419 > Case-insensitive field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > Attachments: image.png > > > Currently, filter pushdown will not work if the Parquet schema and the Hive metastore > schema are in different letter cases, even when spark.sql.caseSensitive is false, as in the case below: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > -No filter will be pushed down.- > {code} > scala> sql("select * from t where id > 0").explain // Filters are pushed > with `ID` > == Physical Plan == > *(1) Project [ID#90L] > +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) >+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], > PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: > struct > scala> sql("select * from t").show // Parquet returns NULL for `ID` > because it has `id`. > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. > +---+ > | ID| > +---+ > +---+ > {code}
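A minimal sketch of the kind of case-insensitive field resolution the issue calls for (this is illustrative, not the actual Spark fix): resolve the Spark attribute name against the Parquet physical field names, and refuse to push the filter when the match is missing or ambiguous, so `ID` vs `id` can never silently filter everything out:

```scala
// Illustrative sketch: resolve a Spark attribute name against the
// Parquet file's field names, honoring spark.sql.caseSensitive.
// Returning None means "do not push this filter down".
def resolveFieldName(
    sparkName: String,
    parquetFieldNames: Seq[String],
    caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) {
    parquetFieldNames.find(_ == sparkName)
  } else {
    parquetFieldNames.filter(_.equalsIgnoreCase(sparkName)) match {
      case Seq(single) => Some(single) // unambiguous match: safe to push
      case _           => None         // missing or ambiguous: skip pushdown
    }
  }
}

// resolveFieldName("id", Seq("ID"), caseSensitive = false)       == Some("ID")
// resolveFieldName("id", Seq("ID", "iD"), caseSensitive = false) == None
```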
[jira] [Updated] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-17091: Affects Version/s: 2.4.0 Component/s: SQL Issue Type: Sub-task (was: Bug) Parent: SPARK-25419 > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Duffy >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
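The rewrite the ticket describes can be sketched with Spark's public data source filter API (a sketch of the idea, not the merged patch): an `In` filter becomes an OR union of equality filters, each of which Parquet can already evaluate against row-group statistics:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

// Illustrative sketch: rewrite IN (v1, v2, ...) into
// EqualTo(v1) OR EqualTo(v2) OR ..., which is pushable to Parquet.
def rewriteIn(in: In): Option[Filter] = {
  in.values.toSeq match {
    case Seq() => None // empty IN list: nothing to push
    case vs =>
      Some(vs.map(v => EqualTo(in.attribute, v): Filter).reduceLeft(Or(_, _)))
  }
}

// rewriteIn(In("id", Array(1, 2, 3)))
//   == Some(Or(Or(EqualTo("id", 1), EqualTo("id", 2)), EqualTo("id", 3)))
```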
[jira] [Resolved] (SPARK-25419) Parquet predicate pushdown improvement
[ https://issues.apache.org/jira/browse/SPARK-25419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25419. - Resolution: Fixed Fix Version/s: 2.4.0 > Parquet predicate pushdown improvement > -- > > Key: SPARK-25419 > URL: https://issues.apache.org/jira/browse/SPARK-25419 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Parquet predicate pushdown support: ByteType, ShortType, DecimalType, > DateType, TimestampType. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24716) Refactor ParquetFilters
[ https://issues.apache.org/jira/browse/SPARK-24716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24716: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > Refactor ParquetFilters > --- > > Key: SPARK-24716 > URL: https://issues.apache.org/jira/browse/SPARK-24716 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24718) Timestamp support pushdown to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-24718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24718: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > Timestamp support pushdown to parquet data source > - > > Key: SPARK-24718 > URL: https://issues.apache.org/jira/browse/SPARK-24718 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Something like this: > {code:java} > case ParquetSchemaType(TIMESTAMP_MICROS, INT64, null) > if pushDownTimestamp => > (n: String, v: Any) => FilterApi.eq( > longColumn(n), > Option(v).map(t => (t.asInstanceOf[java.sql.Timestamp].getTime * 1000) > .asInstanceOf[java.lang.Long]).orNull) > case ParquetSchemaType(TIMESTAMP_MILLIS, INT64, null) > if pushDownTimestamp => > (n: String, v: Any) => FilterApi.eq( > longColumn(n), > Option(v).map(_.asInstanceOf[java.sql.Timestamp].getTime > .asInstanceOf[java.lang.Long]).orNull) > {code}
[jira] [Updated] (SPARK-24549) Support DecimalType push down to the parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24549: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > Support DecimalType push down to the parquet data sources > - > > Key: SPARK-24549 > URL: https://issues.apache.org/jira/browse/SPARK-24549 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24638) StringStartsWith support push down
[ https://issues.apache.org/jira/browse/SPARK-24638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24638: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > StringStartsWith support push down > -- > > Key: SPARK-24638 > URL: https://issues.apache.org/jira/browse/SPARK-24638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24706) Support ByteType and ShortType pushdown to parquet
[ https://issues.apache.org/jira/browse/SPARK-24706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24706: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > Support ByteType and ShortType pushdown to parquet > -- > > Key: SPARK-24706 > URL: https://issues.apache.org/jira/browse/SPARK-24706 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23727) Support DATE predicate push down in parquet
[ https://issues.apache.org/jira/browse/SPARK-23727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-23727: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > Support DATE predicate push down in parquet > - > > Key: SPARK-23727 > URL: https://issues.apache.org/jira/browse/SPARK-23727 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 2.4.0 > > > DATE predicate push down is missing and should be supported.
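For DATE pushdown, Parquet stores a DATE value as an INT32 count of days since the Unix epoch, so a date literal has to be converted before it can be compared against column statistics. A minimal illustrative sketch of that conversion (not the merged Spark code):

```scala
import java.sql.Date

// Illustrative sketch: convert a java.sql.Date literal to the INT32
// days-since-epoch representation that Parquet uses for DATE columns.
def dateToEpochDays(d: Date): java.lang.Integer =
  d.toLocalDate.toEpochDay.toInt

// dateToEpochDays(Date.valueOf("1970-01-02")) == 1
```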
[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24538: Issue Type: Sub-task (was: Improvement) Parent: SPARK-25419 > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > LM-SHC-16502798:parquet-mr yumwang$ java -jar > ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta > /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > file: > file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]} > file schema: spark_schema > > id: REQUIRED INT64 R:0 D:0 > d1: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d2: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d3: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d4: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d5: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > d6: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:241867 TS:15480513 OFFSET:4 > > id: INT64 SNAPPY DO:0 FPO:4 
SZ:968154/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 > SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: > 241866, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 > SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, > max: 241866.00, num_nulls: 0] > row group 2: RC:241867 TS:15480513 OFFSET:8584904 > > id: INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 > SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, > max: 483733, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 > SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: > 0]{noformat}
[jira] [Created] (SPARK-25419) Parquet predicate pushdown improvement
Yuming Wang created SPARK-25419: --- Summary: Parquet predicate pushdown improvement Key: SPARK-25419 URL: https://issues.apache.org/jira/browse/SPARK-25419 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang Parquet predicate pushdown support: ByteType, ShortType, DecimalType, DateType, TimestampType. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613135#comment-16613135 ] Yuming Wang commented on SPARK-24538: - [~cloud_fan] OK, I will do it. > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > LM-SHC-16502798:parquet-mr yumwang$ java -jar > ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta > /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > file: > file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]} > file schema: spark_schema > > id: REQUIRED INT64 R:0 D:0 > d1: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d2: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d3: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d4: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d5: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > d6: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:241867 TS:15480513 OFFSET:4 > > id: INT64 SNAPPY DO:0 
FPO:4 SZ:968154/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 > SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: > 241866, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 > SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, > max: 241866.00, num_nulls: 0] > row group 2: RC:241867 TS:15480513 OFFSET:8584904 > > id: INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 > SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, > max: 483733, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 > SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN 
ST:[min: > 241867.00, max: 483733.00, num_nulls: > 0]{noformat}
[jira] [Commented] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613113#comment-16613113 ] Sergei commented on SPARK-20937: do you remember what you did with it in the end? > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As a follow-up to SPARK-20297 (and SPARK-10400), in which the > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, the Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should document it. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options), which gives the option some wider publicity.
[jira] [Commented] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources
[ https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613105#comment-16613105 ] Wenchen Fan commented on SPARK-24538: - [~yumwang] can you create an umbrella JIRA ticket for all these parquet predicate pushdown improvement tickets? Then it will be easier to refer it in the release notes. Thanks! > ByteArrayDecimalType support push down to parquet data sources > -- > > Key: SPARK-24538 > URL: https://issues.apache.org/jira/browse/SPARK-24538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Latest parquet support decimal type statistics. then we can push down to the > data sources: > {noformat} > LM-SHC-16502798:parquet-mr yumwang$ java -jar > ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta > /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > file: > file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]} > file schema: spark_schema > > id: REQUIRED INT64 R:0 D:0 > d1: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d2: OPTIONAL INT32 O:DECIMAL R:0 D:1 > d3: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d4: OPTIONAL INT64 O:DECIMAL R:0 D:1 > d5: OPTIONAL 
FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > d6: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:241867 TS:15480513 OFFSET:4 > > id: INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 > SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: > 241866, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 > SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, > max: 241866.00, num_nulls: 0] > row group 2: RC:241867 TS:15480513 OFFSET:8584904 > > id: INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 > ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d1: INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d2: INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 > ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0] > d3: INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0] > d4: INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 > VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., > num_nulls: 0] > d5: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 > SZ:1270587/3870159/3.05 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, > max: 483733, num_nulls: 0] > d6: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 > SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: > 241867.00, max: 483733.00,