[jira] [Resolved] (SPARK-26005) Upgrade ANTLR to 4.7.1
[ https://issues.apache.org/jira/browse/SPARK-26005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26005. - Resolution: Fixed Fix Version/s: 3.0.0 > Upgrade ANTLR to 4.7.1 > -- > > Key: SPARK-26005 > URL: https://issues.apache.org/jira/browse/SPARK-26005 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19761) Creating InMemoryFileIndex with empty rootPaths fails when PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero
[ https://issues.apache.org/jira/browse/SPARK-19761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-19761. - Resolution: Fixed Fix Version/s: 2.2.0 > create InMemoryFileIndex with empty rootPaths when set > PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero > - > > Key: SPARK-19761 > URL: https://issues.apache.org/jira/browse/SPARK-19761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Major > Fix For: 2.2.0 > > > if we create a InMemoryFileIndex with an empty rootPaths when set > PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero, it will throw an exception: > {code} > Positive number of slices required > java.lang.IllegalArgumentException: Positive number of slices required > at > org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119) > at > org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256) > at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74) > at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.(InMemoryFileIndex.scala:50) > at > org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186) > at > org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105) > at > org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33) > at > org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185) > at > org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185) > at > org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185) > at > 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
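For context on the failure above: the leaf-file listing parallelizes the root paths, and with an empty path list plus a discovery threshold of zero it asks for zero slices, which ParallelCollectionRDD rejects. Below is a minimal sketch of the kind of guard that avoids this; the helper name is hypothetical and this is not the actual PartitioningAwareFileIndex code.

{code:scala}
import org.apache.spark.sql.SparkSession

object EmptyRootPathsSketch {
  // Hedged sketch: short-circuit on empty input and clamp the slice count to >= 1,
  // so the "Positive number of slices required" error above cannot be triggered.
  def listLeafFilesInParallel(spark: SparkSession, paths: Seq[String]): Seq[String] = {
    if (paths.isEmpty) {
      Seq.empty  // nothing to list, so skip launching a Spark job entirely
    } else {
      val parallelism = math.max(math.min(paths.size, 10000), 1)  // never request 0 slices
      spark.sparkContext
        .parallelize(paths, parallelism)  // stand-in for the real per-path listing job
        .collect()
        .toSeq
    }
  }
}
{code}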
[jira] [Commented] (SPARK-19784) Refresh datasource table after altering the location
[ https://issues.apache.org/jira/browse/SPARK-19784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683246#comment-16683246 ] Apache Spark commented on SPARK-19784: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22721 > Refresh datasource table after altering the location > - > > Key: SPARK-19784 > URL: https://issues.apache.org/jira/browse/SPARK-19784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Major > > Currently, if we alter the location of a datasource table and then select from > it, it still returns the data from the old location. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
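To illustrate the report, a small spark-shell style example, assuming an active SparkSession named spark; the table name and path are made up. Refreshing the table after the ALTER is the usual way to drop the stale cached relation.

{code:scala}
// Create a datasource table and then change its location.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("ALTER TABLE t SET LOCATION '/tmp/t_new_location'")

// As reported above, without a refresh this can still serve data from the old location.
spark.table("t").show()

// Invalidate the cached metadata and file listing, then re-read.
spark.catalog.refreshTable("t")
spark.table("t").show()
{code}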
[jira] [Assigned] (SPARK-26014) Deprecate R < 3.4 support
[ https://issues.apache.org/jira/browse/SPARK-26014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26014: Assignee: Apache Spark > Deprecate R < 3.4 support > - > > Key: SPARK-26014 > URL: https://issues.apache.org/jira/browse/SPARK-26014 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See > http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-SparkR-CRAN-feasibility-check-server-problem-td25605.html > R version 3.1.x is too old; it was released 4.5 years ago, while > R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0, > deprecating lower versions and bumping the minimum R version to 3.4 seems like a reasonable option. > It would be good to deprecate and drop support for R < 3.4. > In practice, nothing in particular is required in the R code > as far as I can tell. > We will just upgrade Jenkins's R version to 3.4, which means we will no longer > test against R 3.1 (but will test against 3.4 instead). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26014) Deprecate R < 3.4 support
[ https://issues.apache.org/jira/browse/SPARK-26014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683243#comment-16683243 ] Apache Spark commented on SPARK-26014: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23012 > Deprecate R < 3.4 support > - > > Key: SPARK-26014 > URL: https://issues.apache.org/jira/browse/SPARK-26014 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See > http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-SparkR-CRAN-feasibility-check-server-problem-td25605.html > R version 3.1.x is too old; it was released 4.5 years ago, while > R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0, > deprecating lower versions and bumping the minimum R version to 3.4 seems like a reasonable option. > It would be good to deprecate and drop support for R < 3.4. > In practice, nothing in particular is required in the R code > as far as I can tell. > We will just upgrade Jenkins's R version to 3.4, which means we will no longer > test against R 3.1 (but will test against 3.4 instead). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26014) Deprecate R < 3.4 support
[ https://issues.apache.org/jira/browse/SPARK-26014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26014: Assignee: (was: Apache Spark) > Deprecate R < 3.4 support > - > > Key: SPARK-26014 > URL: https://issues.apache.org/jira/browse/SPARK-26014 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See > http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-SparkR-CRAN-feasibility-check-server-problem-td25605.html > R version 3.1.x is too old; it was released 4.5 years ago, while > R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0, > deprecating lower versions and bumping the minimum R version to 3.4 seems like a reasonable option. > It would be good to deprecate and drop support for R < 3.4. > In practice, nothing in particular is required in the R code > as far as I can tell. > We will just upgrade Jenkins's R version to 3.4, which means we will no longer > test against R 3.1 (but will test against 3.4 instead). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26013) Upgrade R tools version to 3.5.1 in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-26013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683223#comment-16683223 ] Apache Spark commented on SPARK-26013: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23011 > Upgrade R tools version to 3.5.1 in AppVeyor build > -- > > Key: SPARK-26013 > URL: https://issues.apache.org/jira/browse/SPARK-26013 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > R tools 3.5.1 was released a few months ago. Spark currently uses the version pinned at > https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L119 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26013) Upgrade R tools version to 3.5.1 in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-26013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26013: Assignee: Apache Spark > Upgrade R tools version to 3.5.1 in AppVeyor build > -- > > Key: SPARK-26013 > URL: https://issues.apache.org/jira/browse/SPARK-26013 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > R tools 3.5.1 was released a few months ago. Spark currently uses the version pinned at > https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L119 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26014) Deprecate R < 3.4 support
Hyukjin Kwon created SPARK-26014: Summary: Deprecate R < 3.4 support Key: SPARK-26014 URL: https://issues.apache.org/jira/browse/SPARK-26014 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 3.0.0 Reporter: Hyukjin Kwon See http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-SparkR-CRAN-feasibility-check-server-problem-td25605.html R version 3.1.x is too old; it was released 4.5 years ago, while R 3.4.0 was released 1.5 years ago. Considering the timing for Spark 3.0, deprecating lower versions and bumping the minimum R version to 3.4 seems like a reasonable option. It would be good to deprecate and drop support for R < 3.4. In practice, nothing in particular is required in the R code as far as I can tell. We will just upgrade Jenkins's R version to 3.4, which means we will no longer test against R 3.1 (but will test against 3.4 instead). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26013) Upgrade R tools version to 3.5.1 in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-26013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26013: Assignee: (was: Apache Spark) > Upgrade R tools version to 3.5.1 in AppVeyor build > -- > > Key: SPARK-26013 > URL: https://issues.apache.org/jira/browse/SPARK-26013 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > R tools 3.5.1 was released a few months ago. Spark currently uses the version pinned at > https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L119 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26013) Upgrade R tools version to 3.5.1 in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-26013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683221#comment-16683221 ] Apache Spark commented on SPARK-26013: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23011 > Upgrade R tools version to 3.5.1 in AppVeyor build > -- > > Key: SPARK-26013 > URL: https://issues.apache.org/jira/browse/SPARK-26013 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > R tools 3.5.1 was released a few months ago. Spark currently uses the version pinned at > https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L119 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] eaton updated SPARK-26012: -- Description: Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously. For example, the test bellow will fail. test("Null and '' values should not cause dynamic partition failure of string types") { withTable("t1", "t2") { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" + " from t1").write.partitionBy("p").saveAsTable("t2") checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null))) } } The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'. Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists: [file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet|file:///F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet] at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) ... 10 more 20:43:55.460 WARN org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: was: Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously. 
For example, the test bellow will fail before this PR: test("Null and '' values should not cause dynamic partition failure of string types") { withTable("t1", "t2") { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" + " from t1").write.partitionBy("p").saveAsTable("t2") checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null))) } } The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'. Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists: file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
[jira] [Created] (SPARK-26013) Upgrade R tools version to 3.5.1 in AppVeyor build
Hyukjin Kwon created SPARK-26013: Summary: Upgrade R tools version to 3.5.1 in AppVeyor build Key: SPARK-26013 URL: https://issues.apache.org/jira/browse/SPARK-26013 Project: Spark Issue Type: Improvement Components: Build, SparkR Affects Versions: 3.0.0 Reporter: Hyukjin Kwon R tools 3.5.1 was released a few months ago. Spark currently uses the version pinned at https://github.com/apache/spark/blob/master/dev/appveyor-install-dependencies.ps1#L119 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683217#comment-16683217 ] Apache Spark commented on SPARK-26012: -- User 'eatoncys' has created a pull request for this issue: https://github.com/apache/spark/pull/23010 > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: eaton >Priority: Major > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test bellow will fail before this PR: > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") { > spark.range(3).write.saveAsTable("t1") > spark.sql("select id, cast(case when id = 1 then '' else null end as string) > as p" + > " from t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) > } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. > > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists: > file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) > at > org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) > ... 10 more > 20:43:55.460 WARN > org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26012: Assignee: (was: Apache Spark) > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: eaton >Priority: Major > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test bellow will fail before this PR: > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") { > spark.range(3).write.saveAsTable("t1") > spark.sql("select id, cast(case when id = 1 then '' else null end as string) > as p" + > " from t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) > } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. > > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists: > file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) > at > org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) > ... 
10 more > 20:43:55.460 WARN > org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26012: Assignee: Apache Spark > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: eaton >Assignee: Apache Spark >Priority: Major > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test bellow will fail before this PR: > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") { > spark.range(3).write.saveAsTable("t1") > spark.sql("select id, cast(case when id = 1 then '' else null end as string) > as p" + > " from t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) > } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. > > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists: > file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) > at > org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) > ... 
10 more > 20:43:55.460 WARN > org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
eaton created SPARK-26012: - Summary: Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously. Key: SPARK-26012 URL: https://issues.apache.org/jira/browse/SPARK-26012 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: eaton Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously. For example, the test bellow will fail before this PR: test("Null and '' values should not cause dynamic partition failure of string types") { withTable("t1", "t2") { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" + " from t1").write.partitionBy("p").saveAsTable("t2") checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null))) } } The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'. Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists: file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) at org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) ... 10 more 20:43:55.460 WARN org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
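The collision described above happens because an empty string and null are both written under the Hive default partition directory (the p=__HIVE_DEFAULT_PARTITION__ path in the stack trace), so two writers can target the same output file. Below is a minimal sketch of that mapping; it is illustrative only, not Spark's actual partition-path escaping code.

{code:scala}
// Both null and "" resolve to the same partition directory name, which is why the
// FileAlreadyExistsException above mentions p=__HIVE_DEFAULT_PARTITION__.
val defaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

def partitionDirValue(value: String): String =
  if (value == null || value.isEmpty) defaultPartitionName else value

assert(partitionDirValue(null) == partitionDirValue(""))  // the two values collide
{code}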
[jira] [Assigned] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
[ https://issues.apache.org/jira/browse/SPARK-26011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26011: Assignee: (was: Apache Spark) > pyspark app with "spark.jars.packages" config does not work > --- > > Key: SPARK-26011 > URL: https://issues.apache.org/jira/browse/SPARK-26011 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > > Command "pyspark --packages" works as expected, but if submitting a livy > pyspark job with "spark.jars.packages" config, the downloaded packages are > not added to python's sys.path therefore the package is not available to use. > For example, this command works: > pyspark --packages Azure:mmlspark:0.14 > However, using Jupyter notebook with sparkmagic kernel to open a pyspark > session failed: > %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} > import mmlspark > The root cause is that SparkSubmit determines pyspark app by the suffix of > primary resource but Livy uses "spark-internal" as the primary resource when > calling spark-submit, therefore args.isPython is set to false in > SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
[ https://issues.apache.org/jira/browse/SPARK-26011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26011: Assignee: Apache Spark > pyspark app with "spark.jars.packages" config does not work > --- > > Key: SPARK-26011 > URL: https://issues.apache.org/jira/browse/SPARK-26011 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Assignee: Apache Spark >Priority: Major > > Command "pyspark --packages" works as expected, but if submitting a livy > pyspark job with "spark.jars.packages" config, the downloaded packages are > not added to python's sys.path therefore the package is not available to use. > For example, this command works: > pyspark --packages Azure:mmlspark:0.14 > However, using Jupyter notebook with sparkmagic kernel to open a pyspark > session failed: > %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} > import mmlspark > The root cause is that SparkSubmit determines pyspark app by the suffix of > primary resource but Livy uses "spark-internal" as the primary resource when > calling spark-submit, therefore args.isPython is set to false in > SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
[ https://issues.apache.org/jira/browse/SPARK-26011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683154#comment-16683154 ] Apache Spark commented on SPARK-26011: -- User 'shanyu' has created a pull request for this issue: https://github.com/apache/spark/pull/23009 > pyspark app with "spark.jars.packages" config does not work > --- > > Key: SPARK-26011 > URL: https://issues.apache.org/jira/browse/SPARK-26011 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > > Command "pyspark --packages" works as expected, but if submitting a livy > pyspark job with "spark.jars.packages" config, the downloaded packages are > not added to python's sys.path therefore the package is not available to use. > For example, this command works: > pyspark --packages Azure:mmlspark:0.14 > However, using Jupyter notebook with sparkmagic kernel to open a pyspark > session failed: > %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} > import mmlspark > The root cause is that SparkSubmit determines pyspark app by the suffix of > primary resource but Livy uses "spark-internal" as the primary resource when > calling spark-submit, therefore args.isPython is set to false in > SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
[ https://issues.apache.org/jira/browse/SPARK-26011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-26011: Description: Command "pyspark --packages" works as expected, but if submitting a livy pyspark job with "spark.jars.packages" config, the downloaded packages are not added to python's sys.path therefore the package is not available to use. For example, this command works: pyspark --packages Azure:mmlspark:0.14 However, using Jupyter notebook with sparkmagic kernel to open a pyspark session failed: %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} import mmlspark The root cause is that SparkSubmit determines pyspark app by the suffix of primary resource but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. was: Command "pyspark --packages" works as expected, but if submitting a livy pyspark job with "spark.jars.packages" config, the downloaded packages are not added to python's sys.path therefore the package is not available to use. For example, this command works: pyspark --packages Azure:mmlspark:0.14 However, using Jupyter notebook with sparkmagic kernel to open a pyspark session failed: %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} import mmlspark The root cause is that SparkSubmit determines pyspark app by the suffix of primary resource but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is fails in SparkSubmit.scala. > pyspark app with "spark.jars.packages" config does not work > --- > > Key: SPARK-26011 > URL: https://issues.apache.org/jira/browse/SPARK-26011 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: shanyu zhao >Priority: Major > > Command "pyspark --packages" works as expected, but if submitting a livy > pyspark job with "spark.jars.packages" config, the downloaded packages are > not added to python's sys.path therefore the package is not available to use. > For example, this command works: > pyspark --packages Azure:mmlspark:0.14 > However, using Jupyter notebook with sparkmagic kernel to open a pyspark > session failed: > %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} > import mmlspark > The root cause is that SparkSubmit determines pyspark app by the suffix of > primary resource but Livy uses "spark-internal" as the primary resource when > calling spark-submit, therefore args.isPython is set to false in > SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26011) pyspark app with "spark.jars.packages" config does not work
shanyu zhao created SPARK-26011: --- Summary: pyspark app with "spark.jars.packages" config does not work Key: SPARK-26011 URL: https://issues.apache.org/jira/browse/SPARK-26011 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.0, 2.3.2 Reporter: shanyu zhao Command "pyspark --packages" works as expected, but if submitting a livy pyspark job with "spark.jars.packages" config, the downloaded packages are not added to python's sys.path therefore the package is not available to use. For example, this command works: pyspark --packages Azure:mmlspark:0.14 However, using Jupyter notebook with sparkmagic kernel to open a pyspark session failed: %%configure -f \{"conf": {spark.jars.packages": "Azure:mmlspark:0.14"}} import mmlspark The root cause is that SparkSubmit determines pyspark app by the suffix of primary resource but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is fails in SparkSubmit.scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
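A sketch of the suffix-based detection described in the root cause; the function name and checks are illustrative, not the actual SparkSubmit code. Because Livy passes "spark-internal" as the primary resource, a purely suffix-based test never classifies the app as Python, so the downloaded packages are not wired into the Python path.

{code:scala}
// Illustrative only: detecting a PySpark app from the primary resource name.
def looksLikePythonApp(primaryResource: String): Boolean =
  primaryResource.endsWith(".py") || primaryResource == "pyspark-shell"

assert(looksLikePythonApp("job.py"))
assert(!looksLikePythonApp("spark-internal"))  // the Livy case that is misclassified
{code}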
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683033#comment-16683033 ] Apache Spark commented on SPARK-22674: -- User 'superbobry' has created a pull request for this issue: https://github.com/apache/spark/pull/23008 > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.3.0 >Reporter: Jonas Amrich >Priority: Major > > Pyspark monkey patches the namedtuple class to make it serializable, however > this breaks serialization of its subclasses. With current implementation, any > subclass will be serialized (and deserialized) as it's parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit serialization hack only to > direct namedtuple subclasses like in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26010) SparkR vignette fails on CRAN on Java 11
[ https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682959#comment-16682959 ] Apache Spark commented on SPARK-26010: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/23007 > SparkR vignette fails on CRAN on Java 11 > > > Key: SPARK-26010 > URL: https://issues.apache.org/jira/browse/SPARK-26010 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > follow up to SPARK-25572 > but for vignettes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26010) SparkR vignette fails on CRAN on Java 11
[ https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26010: Assignee: Apache Spark > SparkR vignette fails on CRAN on Java 11 > > > Key: SPARK-26010 > URL: https://issues.apache.org/jira/browse/SPARK-26010 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Major > > follow up to SPARK-25572 > but for vignettes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26010) SparkR vignette fails on CRAN on Java 11
[ https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682958#comment-16682958 ] Apache Spark commented on SPARK-26010: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/23007 > SparkR vignette fails on CRAN on Java 11 > > > Key: SPARK-26010 > URL: https://issues.apache.org/jira/browse/SPARK-26010 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > follow up to SPARK-25572 > but for vignettes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26010) SparkR vignette fails on CRAN on Java 11
[ https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26010: Assignee: (was: Apache Spark) > SparkR vignette fails on CRAN on Java 11 > > > Key: SPARK-26010 > URL: https://issues.apache.org/jira/browse/SPARK-26010 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > follow up to SPARK-25572 > but for vignettes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26010) SparkR vignette fails on Java 11
Felix Cheung created SPARK-26010: Summary: SparkR vignette fails on Java 11 Key: SPARK-26010 URL: https://issues.apache.org/jira/browse/SPARK-26010 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.0, 3.0.0 Reporter: Felix Cheung follow up to SPARK-25572 but for vignettes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26010) SparkR vignette fails on CRAN on Java 11
[ https://issues.apache.org/jira/browse/SPARK-26010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-26010: - Summary: SparkR vignette fails on CRAN on Java 11 (was: SparkR vignette fails on Java 11) > SparkR vignette fails on CRAN on Java 11 > > > Key: SPARK-26010 > URL: https://issues.apache.org/jira/browse/SPARK-26010 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > follow up to SPARK-25572 > but for vignettes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682951#comment-16682951 ] Alan edited comment on SPARK-24421 at 11/11/18 6:13 PM: Sean: In the proposed release note text you say that access to the internal JDK classes "is no longer possible in Java 9 and later, because of the new module encapsulation system". I don't think this is quite right as the main issue you ran into is that the JDK's internal cleaner mechanism was refactored and moved from sun.misc to jdk.internal.ref. Also the comment about using add-opens may need update too as java.lang remains open to code on the class path in JDK 9/10/11. I suspect the the comments in this issue about add-opens meant to say jdk.internal.ref instead (although you just don't want to go there as directly using anything in that package may break at any time). was (Author: bateman): Sean - In the proposed release note text you say that access to the internal JDK classes "is no longer possible in Java 9 and later, because of the new module encapsulation system". I don't think this is quite right as the main issue you ran into is that the JDK's internal cleaner mechanism was refactored and moved from sun.misc to jdk.internal.ref. Also the comment about using `--add-opens` may need update too as java.lang remains open to code on the class path in JDK 9/10/11. I suspect the the comments in this issue about `--add-opens` meant to say jdk.internal.ref instead (although you just don't want to go there as directly using anything in that package may break at any time). > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682951#comment-16682951 ] Alan commented on SPARK-24421: -- Sean - In the proposed release note text you say that access to the internal JDK classes "is no longer possible in Java 9 and later, because of the new module encapsulation system". I don't think this is quite right as the main issue you ran into is that the JDK's internal cleaner mechanism was refactored and moved from sun.misc to jdk.internal.ref. Also the comment about using `--add-opens` may need update too as java.lang remains open to code on the class path in JDK 9/10/11. I suspect the the comments in this issue about `--add-opens` meant to say jdk.internal.ref instead (although you just don't want to go there as directly using anything in that package may break at any time). > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for detail. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26009) Unable to fetch jar from remote repo while running spark-submit on kubernetes
Bala Bharath Reddy Resapu created SPARK-26009: - Summary: Unable to fetch jar from remote repo while running spark-submit on kubernetes Key: SPARK-26009 URL: https://issues.apache.org/jira/browse/SPARK-26009 Project: Spark Issue Type: Question Components: Kubernetes Affects Versions: 2.3.2 Reporter: Bala Bharath Reddy Resapu I am trying to run spark on kubernetes with a docker image. My requirement is to download the jar from the external repo while running spark-submit. I am able to download the jar using wget in the container but it doesn't work when inputting in the spark-submit command. I am not packaging the jar with docker image. It works fine when I input the jar file inside the docker image. ./bin/spark-submit \ --master k8s://https://ip:port \ --deploy-mode cluster \ --name test3 \ --class hello \ --conf spark.kubernetes.container.image.pullSecrets=abcd \ --conf spark.kubernetes.container.image=spark:h2.0 \ [https://devops.com/artifactory/local/testing/testing_2.11/h|https://bala.bharath.reddy.resapu%40ibm.com:akcp5bcbktykg2ti28sju4gtebsqwkg2mqkaf9w6g5rdbo3iwrwx7qb1m5dokgd54hdru2...@na.artifactory.swg-devops.com/artifactory/txo-cedp-garage-artifacts-sbt-local/testing/testing_2.11/arithmetic.jar]ello.jar -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate
[ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25914: Target Version/s: 3.0.0 > Separate projection from grouping and aggregate in logical Aggregate > > > Key: SPARK-25914 > URL: https://issues.apache.org/jira/browse/SPARK-25914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Dilip Biswal >Priority: Major > > Currently the Spark SQL logical Aggregate has two expression fields: > {{groupingExpressions}} and {{aggregateExpressions}}, in which > {{aggregateExpressions}} is actually the result expressions, or in other > words, the project list in the SELECT clause. > > This would cause an exception while processing the following query: > {code:java} > SELECT concat('x', concat(a, 's')) > FROM testData2 > GROUP BY concat(a, 's'){code} > After optimization, the query becomes: > {code:java} > SELECT concat('x', a, 's') > FROM testData2 > GROUP BY concat(a, 's'){code} > The optimization rule {{CombineConcats}} optimizes the expressions by > flattening "concat" and causes the query to fail since the expression > {{concat('x', a, 's')}} in the SELECT clause is neither referencing a > grouping expression nor a aggregate expression. > > The problem is that we try to mix two operations in one operator, and worse, > in one field: the group-and-aggregate operation and the project operation. > There are two ways to solve this problem: > 1. Break the two operations into two logical operators, which means a > group-by query can usually be mapped into a Project-over-Aggregate pattern. > 2. Break the two operations into multiple fields in the Aggregate operator, > the same way we do for physical aggregate classes (e.g., > {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, > {{groupingExpressions}} would still be the expressions from the GROUP BY > clause (as before), but {{aggregateExpressions}} would contain aggregate > functions only, and {{resultExpressions}} would be the project list in the > SELECT clause holding references to either {{groupingExpressions}} or > {{aggregateExpressions}}. > > I would say option 1 is even clearer, but it would be more likely to break > the pattern matching in existing optimization rules and thus require more > changes in the compiler. So we'd probably wanna go with option 2. That said, > I suggest we achieve this goal through two iterative steps: > > Phase 1: Keep the current fields of logical Aggregate as > {{groupingExpressions}} and {{aggregateExpressions}}, but change the > semantics of {{aggregateExpressions}} by replacing the grouping expressions > with corresponding references to expressions in {{groupingExpressions}}. The > aggregate expressions in {{aggregateExpressions}} will remain the same. > > Phase 2: Add {{resultExpressions}} for the project list, and keep only > aggregate expressions in {{aggregateExpressions}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
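A sketch of what option 2's operator shape could look like, mirroring the field split used by the physical aggregate classes; the class name and exact field types are illustrative assumptions, not Spark's actual logical Aggregate.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative shape only: grouping, aggregate functions, and the SELECT-list
// projection kept in separate fields, as option 2 proposes.
case class AggregateSketch(
    groupingExpressions: Seq[Expression],            // GROUP BY clause, as today
    aggregateExpressions: Seq[AggregateExpression],  // aggregate functions only
    resultExpressions: Seq[NamedExpression],         // SELECT list, referencing the two above
    child: LogicalPlan)
{code}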
[jira] [Updated] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24421: -- Docs Text: Provisional release notes text: Spark 3 attempts to avoid the JVM's default limit on total size of memory allocated by direct buffers, for user convenience, by accessing some internal JDK classes directly. This is no longer possible in Java 9 and later, because of the new module encapsulation system. For many usages of Spark, this will not matter, as the default {{MaxDirectMemorySize}} may be more than sufficient for all direct buffer allocation. If it isn't, it can be made to work again by allowing the access explicitly with the JVM argument {{--add-opens java.base/java.lang=ALL-UNNAMED}}. Of course this can also be resolved by explicitly setting {{-XX:MaxDirectMemorySize=}} to a sufficiently large value. > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for details. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
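For completeness, the flags above can also be forwarded through Spark configuration; a minimal sketch, using a 2g limit purely as an example value (note that in client mode the driver JVM is already running, so the driver-side flag must instead be passed on the command line, e.g. via --driver-java-options):
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: forward the JVM flags from the release note via extraJavaOptions.
val jvmFlags = "--add-opens java.base/java.lang=ALL-UNNAMED -XX:MaxDirectMemorySize=2g"

val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", jvmFlags)
  .set("spark.executor.extraJavaOptions", jvmFlags)

val spark = SparkSession.builder().config(conf).appName("jdk11-flags-example").getOrCreate()
{code}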
[jira] [Commented] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682915#comment-16682915 ] Assaf Mendelson commented on SPARK-24421: - Would it be possible to add {{Add-Opens java.base/java.lang=ALL-UNNAMED}} to the manifest file to avoid the need to do so when running the jar? > Accessing sun.misc.Cleaner in JDK11 > --- > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Labels: release-notes > > Many internal APIs such as unsafe are encapsulated in JDK9+, see > http://openjdk.java.net/jeps/260 for details. > To use Unsafe, we need to add *jdk.unsupported* to our code’s module > declaration: > {code:java} > module java9unsafe { > requires jdk.unsupported; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
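A rough sketch of how that manifest attribute could be added, assuming an sbt build (adjust for Maven or Gradle); note that the JVM only honours {{Add-Opens}} in the manifest of an executable jar launched with {{java -jar}}.
{code:scala}
// build.sbt sketch (assumption: sbt is the build tool).
// Adds the Add-Opens attribute to META-INF/MANIFEST.MF of the packaged jar.
import java.util.jar.Attributes

packageOptions in (Compile, packageBin) += Package.ManifestAttributes(
  new Attributes.Name("Add-Opens") -> "java.base/java.lang=ALL-UNNAMED"
)
{code}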
[jira] [Resolved] (SPARK-19714) Clarify Bucketizer handling of invalid input
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19714. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23003 [https://github.com/apache/spark/pull/23003] > Clarify Bucketizer handling of invalid input > > > Key: SPARK-19714 > URL: https://issues.apache.org/jira/browse/SPARK-19714 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Bill Chambers >Assignee: Wojciech Szymanski >Priority: Minor > Fix For: 3.0.0 > > > {code} > val contDF = spark.range(500).selectExpr("cast(id as double) as id") > import org.apache.spark.ml.feature.Bucketizer > val splits = Array(5.0, 10.0, 250.0, 500.0) > val bucketer = new Bucketizer() > .setSplits(splits) > .setInputCol("id") > .setHandleInvalid("skip") > bucketer.transform(contDF).show() > {code} > You would expect that this would handle the invalid buckets. However, it fails: > {code} > Caused by: org.apache.spark.SparkException: Feature value 0.0 out of > Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the > lower/upper bound constraints. > {code} > It seems strange that handleInvalid doesn't actually handle invalid inputs. > Thoughts anyone? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
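For reference, the usual way to make the snippet above run without the bounds error is to let the splits cover the whole value range; a small sketch of that variant, assuming a spark-shell style {{spark}} session and the column names from the report:
{code:scala}
import org.apache.spark.ml.feature.Bucketizer

// Open-ended splits so every value of "id" lands in some bucket; handleInvalid
// then only has to deal with NaN/null feature values.
val contDF = spark.range(500).selectExpr("cast(id as double) as id")
val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)
val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setOutputCol("bucket")
  .setHandleInvalid("skip")
bucketer.transform(contDF).show()
{code}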
[jira] [Created] (SPARK-26008) Structured Streaming Manual clock for simulation
Tom Bar Yacov created SPARK-26008: - Summary: Structured Streaming Manual clock for simulation Key: SPARK-26008 URL: https://issues.apache.org/jira/browse/SPARK-26008 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0 Reporter: Tom Bar Yacov Structured Streaming's internal {{StreamTest}} class allows testing incremental logic and verifying outputs across multiple triggers. It supports changing the internal Spark clock to get a fully deterministic simulation of the incremental state and APIs. This is not possible outside tests, since {{DataStreamWriter}} hides the triggerClock parameter and is final. This can be very useful not only in unit tests but also for a real running query: for example, when all the historical Kafka data is persisted to HDFS with its Kafka timestamps and you want to "replay" the data and simulate the streaming application's output as if it were running live on this data, including the incremental output between triggers. Today I can simulate multiple triggers and incremental logic for some of the APIs, but for APIs that depend on the execution clock, such as {{mapGroupsWithState}} with an execution-time-based timeout, I did not find a way to do this. The question is: would it be possible to support a solution similar to StreamTest's, i.e. allow passing an external manual clock as a parameter to DataStreamWriter and give the user external control over this clock? And what failures could occur when running with a manual clock in real cluster mode? Thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
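For context, the StreamTest mechanism referred to above looks roughly like the sketch below; it only compiles inside Spark's own SQL test sources where the StreamTest trait is in scope, the fixture names ({{inputData}}, {{mappedWithStateDF}}) are hypothetical, and the helper signatures vary between Spark versions.
{code:scala}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.util.ManualClock

val clock = new ManualClock()

// Drive the query with a manual clock so each trigger fires deterministically.
testStream(mappedWithStateDF)(
  StartStream(Trigger.ProcessingTime("1 second"), triggerClock = clock),
  AddData(inputData, "a"),
  AdvanceManualClock(1000),   // advance the clock by one trigger interval
  CheckNewAnswer(("a", 1))
)
{code}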
[jira] [Commented] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate
[ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682890#comment-16682890 ] Hyukjin Kwon commented on SPARK-25914: -- Please avoid to set a target version which is usually reserved by committers. > Separate projection from grouping and aggregate in logical Aggregate > > > Key: SPARK-25914 > URL: https://issues.apache.org/jira/browse/SPARK-25914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Dilip Biswal >Priority: Major > > Currently the Spark SQL logical Aggregate has two expression fields: > {{groupingExpressions}} and {{aggregateExpressions}}, in which > {{aggregateExpressions}} is actually the result expressions, or in other > words, the project list in the SELECT clause. > > This would cause an exception while processing the following query: > {code:java} > SELECT concat('x', concat(a, 's')) > FROM testData2 > GROUP BY concat(a, 's'){code} > After optimization, the query becomes: > {code:java} > SELECT concat('x', a, 's') > FROM testData2 > GROUP BY concat(a, 's'){code} > The optimization rule {{CombineConcats}} optimizes the expressions by > flattening "concat" and causes the query to fail since the expression > {{concat('x', a, 's')}} in the SELECT clause is neither referencing a > grouping expression nor a aggregate expression. > > The problem is that we try to mix two operations in one operator, and worse, > in one field: the group-and-aggregate operation and the project operation. > There are two ways to solve this problem: > 1. Break the two operations into two logical operators, which means a > group-by query can usually be mapped into a Project-over-Aggregate pattern. > 2. Break the two operations into multiple fields in the Aggregate operator, > the same way we do for physical aggregate classes (e.g., > {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, > {{groupingExpressions}} would still be the expressions from the GROUP BY > clause (as before), but {{aggregateExpressions}} would contain aggregate > functions only, and {{resultExpressions}} would be the project list in the > SELECT clause holding references to either {{groupingExpressions}} or > {{aggregateExpressions}}. > > I would say option 1 is even clearer, but it would be more likely to break > the pattern matching in existing optimization rules and thus require more > changes in the compiler. So we'd probably wanna go with option 2. That said, > I suggest we achieve this goal through two iterative steps: > > Phase 1: Keep the current fields of logical Aggregate as > {{groupingExpressions}} and {{aggregateExpressions}}, but change the > semantics of {{aggregateExpressions}} by replacing the grouping expressions > with corresponding references to expressions in {{groupingExpressions}}. The > aggregate expressions in {{aggregateExpressions}} will remain the same. > > Phase 2: Add {{resultExpressions}} for the project list, and keep only > aggregate expressions in {{aggregateExpressions}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate
[ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25914: - Target Version/s: (was: 3.0.0) > Separate projection from grouping and aggregate in logical Aggregate > > > Key: SPARK-25914 > URL: https://issues.apache.org/jira/browse/SPARK-25914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Dilip Biswal >Priority: Major > > Currently the Spark SQL logical Aggregate has two expression fields: > {{groupingExpressions}} and {{aggregateExpressions}}, in which > {{aggregateExpressions}} is actually the result expressions, or in other > words, the project list in the SELECT clause. > > This would cause an exception while processing the following query: > {code:java} > SELECT concat('x', concat(a, 's')) > FROM testData2 > GROUP BY concat(a, 's'){code} > After optimization, the query becomes: > {code:java} > SELECT concat('x', a, 's') > FROM testData2 > GROUP BY concat(a, 's'){code} > The optimization rule {{CombineConcats}} optimizes the expressions by > flattening "concat" and causes the query to fail since the expression > {{concat('x', a, 's')}} in the SELECT clause is neither referencing a > grouping expression nor a aggregate expression. > > The problem is that we try to mix two operations in one operator, and worse, > in one field: the group-and-aggregate operation and the project operation. > There are two ways to solve this problem: > 1. Break the two operations into two logical operators, which means a > group-by query can usually be mapped into a Project-over-Aggregate pattern. > 2. Break the two operations into multiple fields in the Aggregate operator, > the same way we do for physical aggregate classes (e.g., > {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, > {{groupingExpressions}} would still be the expressions from the GROUP BY > clause (as before), but {{aggregateExpressions}} would contain aggregate > functions only, and {{resultExpressions}} would be the project list in the > SELECT clause holding references to either {{groupingExpressions}} or > {{aggregateExpressions}}. > > I would say option 1 is even clearer, but it would be more likely to break > the pattern matching in existing optimization rules and thus require more > changes in the compiler. So we'd probably wanna go with option 2. That said, > I suggest we achieve this goal through two iterative steps: > > Phase 1: Keep the current fields of logical Aggregate as > {{groupingExpressions}} and {{aggregateExpressions}}, but change the > semantics of {{aggregateExpressions}} by replacing the grouping expressions > with corresponding references to expressions in {{groupingExpressions}}. The > aggregate expressions in {{aggregateExpressions}} will remain the same. > > Phase 2: Add {{resultExpressions}} for the project list, and keep only > aggregate expressions in {{aggregateExpressions}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25972) Missed JSON options in streaming.py
[ https://issues.apache.org/jira/browse/SPARK-25972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25972: Assignee: Maxim Gekk > Missed JSON options in streaming.py > > > Key: SPARK-25972 > URL: https://issues.apache.org/jira/browse/SPARK-25972 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Trivial > > streaming.py misses JSON options compared to readwrite.py: > - dropFieldIfAllNull > - encoding -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25972) Missed JSON options in streaming.py
[ https://issues.apache.org/jira/browse/SPARK-25972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25972. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22973 [https://github.com/apache/spark/pull/22973] > Missed JSON options in streaming.py > > > Key: SPARK-25972 > URL: https://issues.apache.org/jira/browse/SPARK-25972 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Trivial > Fix For: 3.0.0 > > > streaming.py misses JSON options compared to readwrite.py: > - dropFieldIfAllNull > - encoding -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
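For context, the two options are already available on the Scala/Java reader; a short sketch with a hypothetical input path follows (the JIRA itself only concerns exposing them in PySpark's streaming.py), assuming a spark-shell style {{spark}} session.
{code:scala}
// /tmp/json-input is a placeholder path, not a real dataset.
val parsed = spark.read
  .option("dropFieldIfAllNull", "true")  // ignore fields that are null or empty in every record during schema inference
  .option("encoding", "UTF-8")           // charset of the input JSON files
  .json("/tmp/json-input")
{code}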
[jira] [Assigned] (SPARK-26007) DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-26007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26007: Assignee: Apache Spark > DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord > --- > > Key: SPARK-26007 > URL: https://issues.apache.org/jira/browse/SPARK-26007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The csv() method of DataFrameReader doesn't take into account the SQL config > spark.sql.columnNameOfCorruptRecord while creating an instance of CSVOptions: > https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L491-L494 > This should be fixed by passing > sparkSession.sessionState.conf.columnNameOfCorruptRecord as a constructor > parameter to CSVOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26007) DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-26007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682855#comment-16682855 ] Apache Spark commented on SPARK-26007: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23006 > DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord > --- > > Key: SPARK-26007 > URL: https://issues.apache.org/jira/browse/SPARK-26007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The csv() method of DataFrameReader doesn't take into account the SQL config > spark.sql.columnNameOfCorruptRecord while creating an instance of CSVOptions: > https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L491-L494 > This should be fixed by passing > sparkSession.sessionState.conf.columnNameOfCorruptRecord as a constructor > parameter to CSVOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26007) DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-26007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26007: Assignee: (was: Apache Spark) > DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord > --- > > Key: SPARK-26007 > URL: https://issues.apache.org/jira/browse/SPARK-26007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The csv() method of DataFrameReader doesn't take into account the SQL config > spark.sql.columnNameOfCorruptRecord while creating an instance of CSVOptions: > https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L491-L494 > This should be fixed by passing > sparkSession.sessionState.conf.columnNameOfCorruptRecord as a constructor > parameter to CSVOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26007) DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-26007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682856#comment-16682856 ] Apache Spark commented on SPARK-26007: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23006 > DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord > --- > > Key: SPARK-26007 > URL: https://issues.apache.org/jira/browse/SPARK-26007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The csv() method of DataFrameReader doesn't take into account the SQL config > spark.sql.columnNameOfCorruptRecord while creating an instance of CSVOptions: > https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L491-L494 > This should be fixed by passing > sparkSession.sessionState.conf.columnNameOfCorruptRecord as a constructor > parameter to CSVOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26007) DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord
Maxim Gekk created SPARK-26007: -- Summary: DataFrameReader.csv() should respect to spark.sql.columnNameOfCorruptRecord Key: SPARK-26007 URL: https://issues.apache.org/jira/browse/SPARK-26007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The csv() method of DataFrameReader doesn't take into account the SQL config spark.sql.columnNameOfCorruptRecord while creating an instance of CSVOptions: https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L491-L494 This should be fixed by passing sparkSession.sessionState.conf.columnNameOfCorruptRecord as a constructor parameter to CSVOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
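For context, a minimal usage sketch of the behaviour being fixed; the column and value names are illustrative, a spark-shell style {{spark}} session is assumed, and until the fix lands the corrupt-record column name can be passed as a per-read option as a workaround.
{code:scala}
import spark.implicits._

// Illustrative corrupt-record column name.
spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_bad_record")

val csvLines = Seq("1", "not-a-number").toDS()

val df = spark.read
  .schema("a INT, _bad_record STRING")
  .option("mode", "PERMISSIVE")
  // Workaround: pass the corrupt-record column name explicitly to the reader.
  .option("columnNameOfCorruptRecord", "_bad_record")
  .csv(csvLines)

df.show()
{code}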
[jira] [Created] (SPARK-26006) mllib Prefixspan
idan Levi created SPARK-26006: - Summary: mllib Prefixspan Key: SPARK-26006 URL: https://issues.apache.org/jira/browse/SPARK-26006 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.3.0 Environment: Unit test running on Windows Reporter: idan Levi MLlib's PrefixSpan run method leaves a cached RDD in the cache: val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt) .persist(StorageLevel.MEMORY_AND_DISK) After run() completes, this RDD remains in the cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
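The persisted RDD is internal to PrefixSpan, so callers cannot unpersist it directly; a hedged workaround sketch follows, assuming a spark-shell style {{spark}} session, with the caveat that it unpersists every cached RDD in the SparkContext, including any the application cached itself.
{code:scala}
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD

// Tiny example sequence database: each sequence is an array of itemsets.
val sequences: RDD[Array[Array[Int]]] = spark.sparkContext.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2)

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(sequences)

// Workaround for the leak described above: drop whatever run() left cached.
// Caveat: this also unpersists RDDs the caller cached intentionally.
spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist())
{code}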
[jira] [Commented] (SPARK-26005) Upgrade ANTRL to 4.7.1
[ https://issues.apache.org/jira/browse/SPARK-26005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682800#comment-16682800 ] Apache Spark commented on SPARK-26005: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/23005 > Upgrade ANTRL to 4.7.1 > -- > > Key: SPARK-26005 > URL: https://issues.apache.org/jira/browse/SPARK-26005 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26005) Upgrade ANTRL to 4.7.1
[ https://issues.apache.org/jira/browse/SPARK-26005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26005: Assignee: Apache Spark (was: Xiao Li) > Upgrade ANTRL to 4.7.1 > -- > > Key: SPARK-26005 > URL: https://issues.apache.org/jira/browse/SPARK-26005 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26005) Upgrade ANTRL to 4.7.1
[ https://issues.apache.org/jira/browse/SPARK-26005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26005: Assignee: Xiao Li (was: Apache Spark) > Upgrade ANTRL to 4.7.1 > -- > > Key: SPARK-26005 > URL: https://issues.apache.org/jira/browse/SPARK-26005 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26005) Upgrade ANTRL to 4.7.1
[ https://issues.apache.org/jira/browse/SPARK-26005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682799#comment-16682799 ] Apache Spark commented on SPARK-26005: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/23005 > Upgrade ANTRL to 4.7.1 > -- > > Key: SPARK-26005 > URL: https://issues.apache.org/jira/browse/SPARK-26005 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26005) Upgrade ANTRL to 4.7.1
Xiao Li created SPARK-26005: --- Summary: Upgrade ANTRL to 4.7.1 Key: SPARK-26005 URL: https://issues.apache.org/jira/browse/SPARK-26005 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: Xiao Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org