[jira] [Assigned] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable
[ https://issues.apache.org/jira/browse/SPARK-38651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38651: Assignee: Apache Spark > Writing out empty or nested empty schemas in Datasource should be configurable > -- > > Key: SPARK-38651 > URL: https://issues.apache.org/jira/browse/SPARK-38651 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thejdeep Gudivada >Assignee: Apache Spark >Priority: Major > > In SPARK-23372, we introduced a backwards incompatible change that would > remove support for writing out empty or nested empty schemas in file based > datasources. This introduces backward incompatibility for users who have been > using a schema that met the above condition since the datasource supported > it. Except for Parquet and text, other file based sources support this > behavior. > > We should either : > * Make it configurable to enable/disable writing out empty schemas > * Enable the validation check only for sources that do not support it - > Parquet / Text -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable
[ https://issues.apache.org/jira/browse/SPARK-38651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512202#comment-17512202 ] Apache Spark commented on SPARK-38651: -- User 'thejdeep' has created a pull request for this issue: https://github.com/apache/spark/pull/35969 > Writing out empty or nested empty schemas in Datasource should be configurable > -- > > Key: SPARK-38651 > URL: https://issues.apache.org/jira/browse/SPARK-38651 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thejdeep Gudivada >Priority: Major > > In SPARK-23372, we introduced a backwards incompatible change that would > remove support for writing out empty or nested empty schemas in file based > datasources. This introduces backward incompatibility for users who have been > using a schema that met the above condition since the datasource supported > it. Except for Parquet and text, other file based sources support this > behavior. > > We should either : > * Make it configurable to enable/disable writing out empty schemas > * Enable the validation check only for sources that do not support it - > Parquet / Text -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable
[ https://issues.apache.org/jira/browse/SPARK-38651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38651: Assignee: (was: Apache Spark) > Writing out empty or nested empty schemas in Datasource should be configurable > -- > > Key: SPARK-38651 > URL: https://issues.apache.org/jira/browse/SPARK-38651 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thejdeep Gudivada >Priority: Major > > In SPARK-23372, we introduced a backwards incompatible change that would > remove support for writing out empty or nested empty schemas in file based > datasources. This introduces backward incompatibility for users who have been > using a schema that met the above condition since the datasource supported > it. Except for Parquet and text, other file based sources support this > behavior. > > We should either : > * Make it configurable to enable/disable writing out empty schemas > * Enable the validation check only for sources that do not support it - > Parquet / Text -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
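The second option in the ticket above hinges on detecting schemas that are empty at the top level or at any nesting depth. As an illustration only (this models a struct schema as nested lists of pairs; it is not Spark's StructType API), a recursive check might look like:

```python
def has_empty_schema(schema):
    """Return True if a struct schema is empty, or contains a nested
    struct field that is itself empty.

    `schema` is modeled as a list of (name, dtype) pairs, where a dtype
    may itself be a nested list of pairs (a struct). This is an
    illustrative stand-in for Spark's StructType, not the real API.
    """
    if not schema:
        return True
    return any(
        isinstance(dtype, list) and has_empty_schema(dtype)
        for _, dtype in schema
    )

# A top-level empty schema and a schema whose nested struct is empty
# would both be rejected by such a validation; a plain non-empty
# schema passes.
print(has_empty_schema([]))                         # True
print(has_empty_schema([("a", "int"), ("s", [])]))  # True
print(has_empty_schema([("a", "int")]))             # False
```

Under the ticket's first option, a data source that can represent empty structs (unlike Parquet or text) would simply skip this check when the proposed configuration flag is set.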
[jira] [Commented] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512198#comment-17512198 ] Apache Spark commented on SPARK-38654: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35968 > Show default index type in SQL plans for pandas API on Spark > > > Key: SPARK-38654 > URL: https://issues.apache.org/jira/browse/SPARK-38654 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, it's difficult for users to tell which plan and expressions are > for default index from explain API. > We should mark and show which plan/expression is for the default index in > pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512197#comment-17512197 ] Apache Spark commented on SPARK-38654: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35968 > Show default index type in SQL plans for pandas API on Spark > > > Key: SPARK-38654 > URL: https://issues.apache.org/jira/browse/SPARK-38654 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, it's difficult for users to tell which plan and expressions are > for default index from explain API. > We should mark and show which plan/expression is for the default index in > pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38654: Assignee: Apache Spark > Show default index type in SQL plans for pandas API on Spark > > > Key: SPARK-38654 > URL: https://issues.apache.org/jira/browse/SPARK-38654 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Currently, it's difficult for users to tell which plan and expressions are > for default index from explain API. > We should mark and show which plan/expression is for the default index in > pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38654: Assignee: (was: Apache Spark) > Show default index type in SQL plans for pandas API on Spark > > > Key: SPARK-38654 > URL: https://issues.apache.org/jira/browse/SPARK-38654 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, it's difficult for users to tell which plan and expressions are > for default index from explain API. > We should mark and show which plan/expression is for the default index in > pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38654: - Priority: Minor (was: Major) > Show default index type in SQL plans for pandas API on Spark > > > Key: SPARK-38654 > URL: https://issues.apache.org/jira/browse/SPARK-38654 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, it's difficult for users to tell which plan and expressions are > for default index from explain API. > We should mark and show which plan/expression is for the default index in > pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark
Hyukjin Kwon created SPARK-38654: Summary: Show default index type in SQL plans for pandas API on Spark Key: SPARK-38654 URL: https://issues.apache.org/jira/browse/SPARK-38654 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Currently, it's difficult for users to tell which plan and expressions are for default index from explain API. We should mark and show which plan/expression is for the default index in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512195#comment-17512195 ] Apache Spark commented on SPARK-38570: -- User 'mcdull-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/35967 > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Minor > Fix For: 3.3.0 > > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
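The root cause described in the snippet above is a general property of sets: the empty set is a subset of every set, so a Literal, whose `references` are empty, trivially passes the `subsetOf(partitionColumns)` check and is mistaken for a partition column. A minimal Python illustration of the same pitfall (column names are made up for the example):

```python
# The empty set is a subset of *any* set, so a references-based check
# like `resExp.references.subsetOf(partitionColumns)` passes trivially
# for a Literal, which references no attributes at all.
partition_columns = {"dt", "region"}

literal_references = frozenset()            # a Literal references nothing
column_references = frozenset({"user_id"})  # a non-partition column

print(literal_references.issubset(partition_columns))  # True (the bug's trigger)
print(column_references.issubset(partition_columns))   # False
```

This is why the fix needs an explicit guard for the Literal case rather than relying on the subset test alone.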
[jira] [Resolved] (SPARK-38610) UI for Pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-38610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38610. -- Resolution: Invalid Will focus on improving the existing SQL UI to show the related info. > UI for Pandas API on Spark > -- > > Key: SPARK-38610 > URL: https://issues.apache.org/jira/browse/SPARK-38610 > Project: Spark > Issue Type: Improvement > Components: PySpark, Web UI >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Critical > > Currently the pandas API on Spark does not have its own dedicated UI; its > activity is mixed in with the SQL UI tab. It would be better to have a > dedicated page -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38646) Pull a trait out for Python functions
[ https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38646: Assignee: Zhen Li > Pull a trait out for Python functions > - > > Key: SPARK-38646 > URL: https://issues.apache.org/jira/browse/SPARK-38646 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.2.2 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > Fix For: 3.4.0 > > > Currently PySpark uses the case class PythonFunction in PythonRDD and many > other interfaces/classes. We propose switching to a trait instead, to avoid > tying the implementation to the APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38646) Pull a trait out for Python functions
[ https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38646. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35964 [https://github.com/apache/spark/pull/35964] > Pull a trait out for Python functions > - > > Key: SPARK-38646 > URL: https://issues.apache.org/jira/browse/SPARK-38646 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.2.2 >Reporter: Zhen Li >Priority: Major > Fix For: 3.4.0 > > > Currently PySpark uses the case class PythonFunction in PythonRDD and many > other interfaces/classes. We propose switching to a trait instead, to avoid > tying the implementation to the APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
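The refactoring proposed above, decoupling callers from a concrete case class by programming against a trait, can be sketched language-neutrally. A hypothetical Python analogue using an abstract base class (all names here are illustrative, not Spark's actual PySpark internals):

```python
from abc import ABC, abstractmethod

class PythonFunction(ABC):
    """Trait-like interface that callers depend on, instead of a
    concrete class. Name borrowed from the ticket for illustration."""

    @abstractmethod
    def command(self) -> bytes:
        """The serialized function payload."""

class SimplePythonFunction(PythonFunction):
    """One concrete implementation; alternatives can be swapped in
    without touching any caller."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def command(self) -> bytes:
        return self._payload

def run(fn: PythonFunction) -> int:
    # Callers see only the abstract interface, not the implementation,
    # which is the decoupling the ticket is after.
    return len(fn.command())

print(run(SimplePythonFunction(b"abc")))  # 3
```

In Scala the same idea is a `trait PythonFunction` with the existing case class as one implementation, so APIs that previously took the case class take the trait instead.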
[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)
[ https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Engelhart updated SPARK-38653: --- Description: My understanding is when you call .repartition(a column). For each unique key in that field. The data will go to that partition. There should never be two keys repartitioned to the same part. That behavior is true with a String column. That behavior is also true with an Int column except on certain numbers. In my use case. The magic numbers 11 and 33. {code:java} //Int based column repartition spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path") //Produces two part files //String based column repartition spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path1") //Produces three part files {code} {code:java} //Not working as expected spark.read.parquet("path/part-0...").distinct.show spark.read.parquet("path/part-1...").distinct.show //Working as expected spark.read.parquet("path1/part-0...").distinct.show spark.read.parquet("path1/part-1...").distinct.show spark.read.parquet("path1/part-2...").distinct.show {code} !image-2022-03-24-22-16-44-560.png! This problem really manifested itself when doing something like {code:java} spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").partitionBy("collectionIndex").parquet("path") {code} Because you end up with incorrect partitions where the data is commingled. was: My understanding is when you call .repartition(a column). For each unique key in that field. The data will go to that partition. There should never be two keys repartitioned to the same part. That behavior is true with a String column. That behavior is also true with an Int column except on certain numbers. In my use case. 
The magic numbers 11 and 33. {code:java} //Int based column repartition spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path") //Produces two part files //String based column repartition spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path1") //Produces three part files {code} {code:java} //Not working as expected spark.read.parquet("path/part-0...").distinct.show spark.read.parquet("path/part-1...").distinct.show //Working as expected spark.read.parquet("path1/part-0...").distinct.show spark.read.parquet("path1/part-1...").distinct.show spark.read.parquet("path1/part-2...").distinct.show {code} !image-2022-03-24-22-09-26-917.png! > Repartition by Column that is Int not working properly only on particular > numbers. (11, 33) > --- > > Key: SPARK-38653 > URL: https://issues.apache.org/jira/browse/SPARK-38653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an > EMR Notebook writing to S3 >Reporter: John Engelhart >Priority: Major > > My understanding is when you call .repartition(a column). For each unique key > in that field. The data will go to that partition. There should never be two > keys repartitioned to the same part. That behavior is true with a String > column. That behavior is also true with an Int column except on certain > numbers. In my use case. The magic numbers 11 and 33. > {code:java} > //Int based column repartition > spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). > repartition($"collectionIndex").write.mode("overwrite").parquet("path") > //Produces two part files > //String based column repartition > spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). 
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1") > //Produces three part files {code} > > {code:java} > //Not working as expected > spark.read.parquet("path/part-0...").distinct.show > spark.read.parquet("path/part-1...").distinct.show > //Working as expected > spark.read.parquet("path1/part-0...").distinct.show > spark.read.parquet("path1/part-1...").distinct.show > spark.read.parquet("path1/part-2...").distinct.show {code} > !image-2022-03-24-22-16-44-560.png! > This problem really manifested itself when doing something like > {code:java} > spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). > repartition($"collectionIndex").write.mode("overwrite").partitionBy("collectionIndex").parquet("path") >
[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)
[ https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Engelhart updated SPARK-38653: --- Description: My understanding is when you call .repartition(a column). For each unique key in that field. The data will go to that partition. There should never be two keys repartitioned to the same part. That behavior is true with a String column. That behavior is also true with an Int column except on certain numbers. In my use case. The magic numbers 11 and 33. {code:java} //Int based column repartition spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path") //Produces two part files //String based column repartition spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path1") //Produces three part files {code} {code:java} //Not working as expected spark.read.parquet("path/part-0...").distinct.show spark.read.parquet("path/part-1...").distinct.show //Working as expected spark.read.parquet("path1/part-0...").distinct.show spark.read.parquet("path1/part-1...").distinct.show spark.read.parquet("path1/part-2...").distinct.show {code} !image-2022-03-24-22-09-26-917.png! was: My understanding is when you call .repartition(a column). For each unique key in that field. The data will go to that partition. There should never be two keys repartitioned the same part. That behavior is true with a String column. That behavior is also true with an Int column except on certain numbers. In my use case. The magic numbers 11 and 33. {code:java} //Int based column repartition spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). 
repartition($"collectionIndex").write.mode("overwrite").parquet("path") //Produces two part files //String based column repartition spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path1") //Produces three part files {code} {code:java} //Not working as expected spark.read.parquet("path/part-0...").distinct.show spark.read.parquet("path/part-1...").distinct.show //Working as expected spark.read.parquet("path1/part-0...").distinct.show spark.read.parquet("path1/part-1...").distinct.show spark.read.parquet("path1/part-2...").distinct.show {code} !image-2022-03-24-22-09-26-917.png! > Repartition by Column that is Int not working properly only on particular > numbers. (11, 33) > --- > > Key: SPARK-38653 > URL: https://issues.apache.org/jira/browse/SPARK-38653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an > EMR Notebook writing to S3 >Reporter: John Engelhart >Priority: Major > > My understanding is when you call .repartition(a column). For each unique key > in that field. The data will go to that partition. There should never be two > keys repartitioned to the same part. That behavior is true with a String > column. That behavior is also true with an Int column except on certain > numbers. In my use case. The magic numbers 11 and 33. > {code:java} > //Int based column repartition > spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). > repartition($"collectionIndex").write.mode("overwrite").parquet("path") > //Produces two part files > //String based column repartition > spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). 
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1") > //Produces three part files {code} > > {code:java} > //Not working as expected > spark.read.parquet("path/part-0...").distinct.show > spark.read.parquet("path/part-1...").distinct.show > //Working as expected > spark.read.parquet("path1/part-0...").distinct.show > spark.read.parquet("path1/part-1...").distinct.show > spark.read.parquet("path1/part-2...").distinct.show {code} > !image-2022-03-24-22-09-26-917.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)
[ https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Engelhart updated SPARK-38653: --- Summary: Repartition by Column that is Int not working properly only on particular numbers. (11, 33) (was: Repartition by Column that is Int not working properly only particular numbers. (11, 33)) > Repartition by Column that is Int not working properly only on particular > numbers. (11, 33) > --- > > Key: SPARK-38653 > URL: https://issues.apache.org/jira/browse/SPARK-38653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an > EMR Notebook writing to S3 >Reporter: John Engelhart >Priority: Major > > My understanding is when you call .repartition(a column). For each unique key > in that field. The data will go to that partition. There should never be two > keys repartitioned the same part. That behavior is true with a String column. > That behavior is also true with an Int column except on certain numbers. In > my use case. The magic numbers 11 and 33. > {code:java} > //Int based column repartition > spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). > repartition($"collectionIndex").write.mode("overwrite").parquet("path") > //Produces two part files > //String based column repartition > spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). > repartition($"collectionIndex").write.mode("overwrite").parquet("path1") > //Produces three part files {code} > > {code:java} > //Not working as expected > spark.read.parquet("path/part-0...").distinct.show > spark.read.parquet("path/part-1...").distinct.show > //Working as expected > spark.read.parquet("path1/part-0...").distinct.show > spark.read.parquet("path1/part-1...").distinct.show > spark.read.parquet("path1/part-2...").distinct.show {code} > !image-2022-03-24-22-09-26-917.png! 
[jira] [Created] (SPARK-38653) Repartition by Column that is Int not working properly only particular numbers. (11, 33)
John Engelhart created SPARK-38653: -- Summary: Repartition by Column that is Int not working properly only particular numbers. (11, 33) Key: SPARK-38653 URL: https://issues.apache.org/jira/browse/SPARK-38653 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an EMR Notebook writing to S3 Reporter: John Engelhart My understanding is when you call .repartition(a column). For each unique key in that field. The data will go to that partition. There should never be two keys repartitioned the same part. That behavior is true with a String column. That behavior is also true with an Int column except on certain numbers. In my use case. The magic numbers 11 and 33. {code:java} //Int based column repartition spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path") //Produces two part files //String based column repartition spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex"). repartition($"collectionIndex").write.mode("overwrite").parquet("path1") //Produces three part files {code} {code:java} //Not working as expected spark.read.parquet("path/part-0...").distinct.show spark.read.parquet("path/part-1...").distinct.show //Working as expected spark.read.parquet("path1/part-0...").distinct.show spark.read.parquet("path1/part-1...").distinct.show spark.read.parquet("path1/part-2...").distinct.show {code} !image-2022-03-24-22-09-26-917.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
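The symptom reported above, distinct integer keys landing in the same part file, is what hash partitioning produces whenever hash(key) mod numPartitions collides for two keys; the collision pattern differs between Int and String columns because the two types hash differently. A simplified Python model of the assignment (using Python's built-in integer hash as a stand-in; Spark actually uses Murmur3, so the specific colliding keys differ):

```python
def assign_partition(key, num_partitions):
    # Simplified model of hash partitioning: hash the key (Murmur3 in
    # Spark; plain hash() here) and take it modulo the partition count.
    return hash(key) % num_partitions

# With 8 partitions in this model, the distinct keys 1 and 33 collide,
# so their rows land in the same part file even though the keys differ.
for key in (1, 11, 33):
    print(key, "->", assign_partition(key, 8))
```

Collisions like this explain seeing two part files for three keys; the more serious report in the ticket is the data commingling observed when combining `repartition` with `partitionBy` on the same column.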
[jira] [Updated] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
[ https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qian updated SPARK-38652: - Description: DepsTestsSuite in k8s IT test is blocked with PathIOException in hadoop-aws-3.3.2. Exception Message is as follow {code:java} Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar failed... at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332) at org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275) at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187) at scala.collection.immutable.List.foreach(List.scala:431) at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86) at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) at scala.collection.immutable.List.foldLeft(List.scala:91) at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84) at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: org.apache.spark.SparkException: Error uploading file spark-examples_2.12-3.4.0-SNAPSHOT.jar at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355) at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328) ... 
30 more Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar': Input/output error at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365) at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:226) at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:170) at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444) at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337) at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356) at
[jira] [Updated] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
[ https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qian updated SPARK-38652: - Description: DepsTestsSuite in the k8s IT tests is blocked with a PathIOException in hadoop-aws-3.3.2. The exception message is as follows: {code:java} Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar failed... at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332) at org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275) at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187) at scala.collection.immutable.List.foreach(List.scala:431) at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86) at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) at scala.collection.immutable.List.foldLeft(List.scala:91) at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84) at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: org.apache.spark.SparkException: Error uploading file spark-examples_2.12-3.4.0-SNAPSHOT.jar at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355) at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328) ... 
30 more Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar': Input/output error at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365) at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:226) at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:170) at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444) at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337) at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356) at
[jira] [Commented] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
[ https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512163#comment-17512163 ] qian commented on SPARK-38652: -- I am working on it. cc [~chaosun] & [~dongjoon] > K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2 > -- > > Key: SPARK-38652 > URL: https://issues.apache.org/jira/browse/SPARK-38652 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 3.3.0 >Reporter: qian >Priority: Major > > DepsTestsSuite in k8s IT test is blocked with PathIOException in > hadoop-aws-3.3.2. Exception Message is as follow > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Uploading file > /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar > failed... > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332) > > at > org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277) > > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275) > > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187) > > at scala.collection.immutable.List.foreach(List.scala:431) > at > 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > > at scala.collection.immutable.List.foldLeft(List.scala:91) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84) > > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104) > > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248) > > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738) > > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242) > > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214) > > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) > > at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) > > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: > org.apache.spark.SparkException: Error uploading file > spark-examples_2.12-3.4.0-SNAPSHOT.jar > at > 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355) > > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328) > > ... 30 more > Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path > for > URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar': > Input/output error at > org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365) > > at >
[jira] [Created] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
qian created SPARK-38652: Summary: K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2 Key: SPARK-38652 URL: https://issues.apache.org/jira/browse/SPARK-38652 Project: Spark Issue Type: Bug Components: Kubernetes, Tests Affects Versions: 3.3.0 Reporter: qian DepsTestsSuite in the k8s IT tests is blocked with a PathIOException in hadoop-aws-3.3.2. The exception message is as follows: {code:java} Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar failed... at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332) at org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275) at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187) at scala.collection.immutable.List.foreach(List.scala:431) at org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86) at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) at scala.collection.immutable.List.foldLeft(List.scala:91) at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84) at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: org.apache.spark.SparkException: Error uploading file spark-examples_2.12-3.4.0-SNAPSHOT.jar at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355) at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328) ... 
30 more Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar': Input/output error at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365) at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:226) at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:170) at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444) at
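One way a "Cannot get relative path" failure of this shape can arise: the copy operation needs the source expressed relative to an upload root, and relativization has no result when the source does not sit under that root. The sketch below is a hypothetical, simplified illustration using `java.net.URI`; the real logic lives in hadoop-aws's `CopyFromLocalOperation.getFinalPath` and may differ.

```scala
// Hypothetical illustration only -- not the hadoop-aws implementation.
import java.net.URI

val uploadRoot = new URI("file:///tmp/spark-upload/")
val inside     = new URI("file:///tmp/spark-upload/spark-examples.jar")
val outside    = new URI("file:///Users/someone/spark-examples.jar")

// A source under the root relativizes cleanly:
assert(uploadRoot.relativize(inside).toString == "spark-examples.jar")

// A source outside the root is returned unchanged -- no relative path
// exists, which is the kind of condition that surfaces as a PathIOException:
assert(uploadRoot.relativize(outside) == outside)
```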
[jira] [Created] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable
Thejdeep Gudivada created SPARK-38651: - Summary: Writing out empty or nested empty schemas in Datasource should be configurable Key: SPARK-38651 URL: https://issues.apache.org/jira/browse/SPARK-38651 Project: Spark Issue Type: Task Components: SQL Affects Versions: 2.4.0 Reporter: Thejdeep Gudivada In SPARK-23372, we introduced a backwards incompatible change that would remove support for writing out empty or nested empty schemas in file based datasources. This introduces backward incompatibility for users who have been using a schema that met the above condition since the datasource supported it. Except for Parquet and text, other file based sources support this behavior. We should either : * Make it configurable to enable/disable writing out empty schemas * Enable the validation check only for sources that do not support it - Parquet / Text -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
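The first option proposed above (a flag gating the empty-schema validation) could look roughly like this sketch. It is a simplified model, not Spark's actual validation code, and the flag name is illustrative only.

```scala
// Toy schema model: a struct is "empty" if it has no fields, and a schema is
// "nested empty" if any struct anywhere inside it has no fields.
sealed trait DType
case object Atomic extends DType
case class Struct(fields: Seq[(String, DType)]) extends DType

def hasEmptyStruct(dt: DType): Boolean = dt match {
  case Struct(fs) => fs.isEmpty || fs.exists { case (_, t) => hasEmptyStruct(t) }
  case Atomic     => false
}

// Hypothetical config gate: only reject empty schemas when the flag is off.
def validateSchemaForWrite(schema: DType, allowEmptySchemaWrite: Boolean): Unit =
  if (!allowEmptySchemaWrite && hasEmptyStruct(schema))
    throw new IllegalArgumentException(
      "Writing empty or nested empty schemas is not supported")

validateSchemaForWrite(Struct(Seq("a" -> Atomic)), allowEmptySchemaWrite = false) // passes
validateSchemaForWrite(Struct(Nil), allowEmptySchemaWrite = true)                 // passes with flag on
```

The second option would instead move the `validateSchemaForWrite` call behind a per-source capability check, so only Parquet and text enforce it.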
[jira] [Resolved] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-38570. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35878 https://github.com/apache/spark/pull/35878 > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Minor > Fix For: 3.3.0 > > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
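The misclassification in the quoted check can be reproduced with a toy model: a literal has an empty reference set, and an empty set is a subset of any partition-column set. This is a hypothetical simplification of Catalyst's `AttributeSet`, not the real classes, and the "fix" shown is one possible guard rather than the merged patch.

```scala
// Toy model -- only the "references" behavior matters here.
sealed trait Expression { def references: Set[String] }
case class AttributeRef(name: String) extends Expression { val references = Set(name) }
case class Literal(value: Any) extends Expression { val references = Set.empty[String] }

val partitionColumns = Set("dt", "region")

// The quoted (buggy) condition: an empty set is a subset of anything,
// so a Literal passes the partition-column test.
def isPartitionRef(e: Expression): Boolean =
  e.references.subsetOf(partitionColumns)

// One possible guard: additionally require a non-empty reference set.
def isPartitionRefFixed(e: Expression): Boolean =
  e.references.nonEmpty && e.references.subsetOf(partitionColumns)

assert(isPartitionRef(Literal(1)))        // bug: literal accepted as partition column
assert(!isPartitionRefFixed(Literal(1)))  // guard: literal rejected
assert(isPartitionRefFixed(AttributeRef("dt")))
```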
[jira] [Assigned] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-38570: --- Assignee: mcdull_zhang > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource
[ https://issues.apache.org/jira/browse/SPARK-38645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38645: - Fix Version/s: (was: 3.2.1) > Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen > cleanedSource > -- > > Key: SPARK-38645 > URL: https://issues.apache.org/jira/browse/SPARK-38645 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: tonydoen >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > When we use spark-sql, encountering problems in codegen source, we often > have to change the log level to DEBUG, but there are too many logs in this > mode (DEBUG) . > > Then `spark.sql.codegen.cleanedSourcePrint` can ensure that just printing > codegen source. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38650) Better ParseException message for char without length
[ https://issues.apache.org/jira/browse/SPARK-38650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512126#comment-17512126 ] Apache Spark commented on SPARK-38650: -- User 'anchovYu' has created a pull request for this issue: https://github.com/apache/spark/pull/35966 > Better ParseException message for char without length > - > > Key: SPARK-38650 > URL: https://issues.apache.org/jira/browse/SPARK-38650 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > We support char and varchar types. But when users input the type without > length, the message is confusing and not helpful at all: > {code:sql} > > SELECT cast('a' as CHAR) > DataType char is not supported.(line 1, pos 19) > == SQL == > SELECT cast('a' AS CHAR) > ---^^^{code} > This ticket would like to improve the error message for these special cases > to: > {code:java} > Datatype char requires a length parameter, for example char(10). Please > specify the length.{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
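A sketch of the proposed special-casing (illustrative only; the real change lives in Spark's ANTLR-based SQL parser, not in a string match like this):

```scala
// Hypothetical: intercept a bare CHAR/VARCHAR before the generic
// "not supported" handling and emit the clearer message from the ticket.
def dataTypeErrorMessage(typeName: String): String = typeName.toLowerCase match {
  case n @ ("char" | "varchar") =>
    s"Datatype $n requires a length parameter, for example $n(10). " +
      "Please specify the length."
  case other =>
    s"DataType $other is not supported."
}
```

For example, `dataTypeErrorMessage("CHAR")` yields the length-parameter message, while other unsupported names keep the generic one.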
[jira] [Assigned] (SPARK-38650) Better ParseException message for char without length
[ https://issues.apache.org/jira/browse/SPARK-38650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38650: Assignee: Apache Spark > Better ParseException message for char without length > - > > Key: SPARK-38650 > URL: https://issues.apache.org/jira/browse/SPARK-38650 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Assignee: Apache Spark >Priority: Major > > We support char and varchar types. But when users input the type without > length, the message is confusing and not helpful at all: > {code:sql} > > SELECT cast('a' as CHAR) > DataType char is not supported.(line 1, pos 19) > == SQL == > SELECT cast('a' AS CHAR) > ---^^^{code} > This ticket would like to improve the error message for these special cases > to: > {code:java} > Datatype char requires a length parameter, for example char(10). Please > specify the length.{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38650) Better ParseException message for char without length
[ https://issues.apache.org/jira/browse/SPARK-38650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38650: Assignee: (was: Apache Spark) > Better ParseException message for char without length > - > > Key: SPARK-38650 > URL: https://issues.apache.org/jira/browse/SPARK-38650 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > We support char and varchar types. But when users input the type without > length, the message is confusing and not helpful at all: > {code:sql} > > SELECT cast('a' as CHAR) > DataType char is not supported.(line 1, pos 19) > == SQL == > SELECT cast('a' AS CHAR) > ---^^^{code} > This ticket would like to improve the error message for these special cases > to: > {code:java} > Datatype char requires a length parameter, for example char(10). Please > specify the length.{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38650) Better ParseException message for char without length
Xinyi Yu created SPARK-38650: Summary: Better ParseException message for char without length Key: SPARK-38650 URL: https://issues.apache.org/jira/browse/SPARK-38650 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Xinyi Yu We support char and varchar types. But when users input the type without length, the message is confusing and not helpful at all: {code:sql} > SELECT cast('a' as CHAR) DataType char is not supported.(line 1, pos 19) == SQL == SELECT cast('a' AS CHAR) ---^^^{code} This ticket would like to improve the error message for these special cases to: {code:java} Datatype char requires a length parameter, for example char(10). Please specify the length.{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38641) Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml
[ https://issues.apache.org/jira/browse/SPARK-38641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38641. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35956 [https://github.com/apache/spark/pull/35956] > Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml > - > > Key: SPARK-38641 > URL: https://issues.apache.org/jira/browse/SPARK-38641 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: morvenhuang >Assignee: morvenhuang >Priority: Trivial > Fix For: 3.4.0 > > Attachments: mvn_scalafmt.jpg > > > After loading latest spark code into IntelliJ IDEA, it complains that > configuration 'parameters' and 'skip' under mvn_scalafmt plugin are not > allowed, see screenshot attached for details. > > I've contacted the author of mvn_scalafmt, Ciaran Kearney, to confirm if > these 2 configuration items are no longer there since v 1.0.0, here's his > return, > > {quote}That's correct. The command line parameters were removed by scalafmt > itself a few versions ago and skip was replaced by validateOnly (which checks > formatting without changing files. > {quote} > > I think we should get rid of the 'parameters' since it's invalid , and > replace 'skip' with 'validateOnly' as Ciaran said. > > I can make a quick fix for this. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38641) Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml
[ https://issues.apache.org/jira/browse/SPARK-38641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38641: Assignee: morvenhuang > Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml > - > > Key: SPARK-38641 > URL: https://issues.apache.org/jira/browse/SPARK-38641 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: morvenhuang >Assignee: morvenhuang >Priority: Trivial > Attachments: mvn_scalafmt.jpg > > > After loading latest spark code into IntelliJ IDEA, it complains that > configuration 'parameters' and 'skip' under mvn_scalafmt plugin are not > allowed, see screenshot attached for details. > > I've contacted the author of mvn_scalafmt, Ciaran Kearney, to confirm if > these 2 configuration items are no longer there since v 1.0.0, here's his > return, > > {quote}That's correct. The command line parameters were removed by scalafmt > itself a few versions ago and skip was replaced by validateOnly (which checks > formatting without changing files. > {quote} > > I think we should get rid of the 'parameters' since it's invalid , and > replace 'skip' with 'validateOnly' as Ciaran said. > > I can make a quick fix for this. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511358#comment-17511358 ] Stu edited comment on SPARK-26639 at 3/24/22, 10:13 PM: Here's another example of this happening, in Spark 3.1.2. I'm running the following code: {code:java} WITH t AS ( SELECT random() as a ) SELECT * FROM t UNION SELECT * FROM t {code} The CTE has a non-deterministic function. If it was pre-calculated, the same random value would be chosen for `a` in both unioned queries, and the output would be deduplicated into a single record. This is not the case. The output is two records, with different random values. In our platform, some folks like to write complex CTEs and reference them multiple times. Recalculating these for every reference is quite computationally expensive, so we recommend to create separate tables in these cases, but don't have any way to enforce this. Fixing this bug would save a good number of compute hours! was (Author: stubartmess): Here's another example of this happening, in Spark 3.1.2. I'm running the following code: {code:java} WITH t AS ( SELECT random() as a ) SELECT * FROM t UNION SELECT * FROM t {code} The CTE has a non-deterministic function. If it was pre-calculated, the same random value would be chosen for `a` in both unioned queries, and the output would be deduplicated into a single record. This is not the case. The output is two records, with different random values. > The reuse subquery function maybe does not work in SPARK SQL > > > Key: SPARK-26639 > URL: https://issues.apache.org/jira/browse/SPARK-26639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ke Jia >Priority: Major > > The subquery reuse feature has done in > [https://github.com/apache/spark/pull/14548] > In my test, I found the visualized plan do show the subquery is executed > once. But the stage of same subquery execute maybe not once. 
> -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
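The semantics at issue in that comment can be mimicked outside Spark: a CTE that is re-evaluated per reference behaves like a `def`, while a CTE computed once and reused behaves like a `lazy val`. This is a hypothetical analogy, not Spark code.

```scala
import scala.util.Random

// Re-evaluated per reference (what the quoted UNION query observes):
def recomputedCte: Double = Random.nextDouble()

// Evaluated once and reused (what the commenter expected):
lazy val reusedCte: Double = Random.nextDouble()

// UNION deduplicates, so a reused CTE collapses to a single row:
assert(Set(reusedCte, reusedCte).size == 1)

// Two independent evaluations almost surely stay distinct and survive
// deduplication (not asserted, to keep this example deterministic):
val unionOfRecomputed = Set(recomputedCte, recomputedCte)
```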
[jira] [Created] (SPARK-38649) Fix SECURITY.md
Bjørn Jørgensen created SPARK-38649: --- Summary: Fix SECURITY.md Key: SPARK-38649 URL: https://issues.apache.org/jira/browse/SPARK-38649 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen At [Github Security -> Security policy|https://github.com/apache/spark/security/policy] The info there does not tell users what to do, if they have found a security issue. The default text for this page is " # Security Policy ## Supported Versions Use this section to tell people about which versions of your project are currently being supported with security updates. | Version | Supported | | --- | -- | | 5.1.x | :white_check_mark: | | 5.0.x | :x:| | 4.0.x | :white_check_mark: | | < 4.0 | :x:| ## Reporting a Vulnerability Use this section to tell people how to report a vulnerability. Tell them where to go, how often they can expect to get an update on a reported vulnerability, what to expect if the vulnerability is accepted or declined, etc. " We should change this to something like: " Reporting security issues Apache Spark uses the standard process outlined by the Apache Security Team for reporting vulnerabilities. Note that vulnerabilities should not be publicly disclosed until the project has responded. To report a possible security vulnerability, please email secur...@spark.apache.org. This is a non-public list that will reach the Apache Security team, as well as the Spark PMC. For more info https://spark.apache.org/security.html " -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan
[ https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38647: Assignee: (was: Apache Spark) > Add SupportsReportOrdering mix in interface for Scan > > > Key: SPARK-38647 > URL: https://issues.apache.org/jira/browse/SPARK-38647 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Enrico Minack >Priority: Major > > As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide > Spark with information about the existing partitioning of data read by a > {{{}DataSourceV2{}}}, a similar mix-in interface {{SupportsReportOrdering}} > should provide ordering information. > This prevents Spark from sorting data if it already exhibits a certain order > provided by the source. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan
[ https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511943#comment-17511943 ] Apache Spark commented on SPARK-38647: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/35965 > Add SupportsReportOrdering mix in interface for Scan > > > Key: SPARK-38647 > URL: https://issues.apache.org/jira/browse/SPARK-38647 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Enrico Minack >Priority: Major > > As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide > Spark with information about the existing partitioning of the data read by a > {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} > should provide ordering information. > This prevents Spark from sorting data that already exhibits a certain order > provided by the source. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan
[ https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38647: Assignee: Apache Spark > Add SupportsReportOrdering mix in interface for Scan > > > Key: SPARK-38647 > URL: https://issues.apache.org/jira/browse/SPARK-38647 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Enrico Minack >Assignee: Apache Spark >Priority: Major > > As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide > Spark with information about the existing partitioning of the data read by a > {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} > should provide ordering information. > This prevents Spark from sorting data that already exhibits a certain order > provided by the source. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38438) Can't update spark.jars.packages on existing global/default context
[ https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-38438: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) This is not a bug. As discussed in email, this kind of thing raises so many questions of semantics when classes are updated or unloaded that I think it won't happen. > Can't update spark.jars.packages on existing global/default context > --- > > Key: SPARK-38438 > URL: https://issues.apache.org/jira/browse/SPARK-38438 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.2.1 > Environment: py: 3.9 > spark: 3.2.1 >Reporter: Rafal Wojdyla >Priority: Minor > > Reproduction: > {code:python} > from pyspark.sql import SparkSession > # default session: > s = SparkSession.builder.getOrCreate() > # later on we want to update jars.packages, here's e.g. spark-hats > s = (SparkSession.builder > .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2") > .getOrCreate()) > # line below returns None, the config was not propagated: > s._sc._conf.get("spark.jars.packages") > {code} > Stopping the context doesn't help, in fact it's even more confusing, because > the configuration is updated, but doesn't have an effect: > {code:python} > from pyspark.sql import SparkSession > # default session: > s = SparkSession.builder.getOrCreate() > s.stop() > s = (SparkSession.builder > .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2") > .getOrCreate()) > # now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context > # doesn't download the jar/package, as it would if there was no global context > # thus the extra package is unusable. It's not downloaded, or added to the > # classpath. 
> s._sc._conf.get("spark.jars.packages") > {code} > One workaround is to stop the context AND kill the JVM gateway, which seems > to be a kind of hard reset: > {code:python} > from pyspark import SparkContext > from pyspark.sql import SparkSession > # default session: > s = SparkSession.builder.getOrCreate() > # Hard reset: > s.stop() > s._sc._gateway.shutdown() > s._sc._gateway.proc.stdin.close() > SparkContext._gateway = None > SparkContext._jvm = None > s = (SparkSession.builder > .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2") > .getOrCreate()) > # Now we are guaranteed there's a new spark session, and packages > # are downloaded, added to the classpath etc. > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38346) Add cache in MLlib BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-38346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38346. -- Resolution: Won't Fix > Add cache in MLlib BinaryClassificationMetrics > -- > > Key: SPARK-38346 > URL: https://issues.apache.org/jira/browse/SPARK-38346 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.3 > Environment: Windows10/macOS12.2; spark_2.11-2.2.3; > mmlspark_2.11-0.18.0; lightgbmlib-2.2.350 >Reporter: Mingchao Wu >Priority: Minor > > We ran some example code using BinaryClassificationEvaluator in MLlib and found > that ShuffledRDD[28] at BinaryClassificationMetrics.scala:155 and UnionRDD[36] at > BinaryClassificationMetrics.scala:90 were used more than once but not cached. > We use spark-2.2.3 and found the code on branch master is still without cache, > so we hope to improve it. > The example code is as follows:
> {code:java}
> import com.microsoft.ml.spark.lightgbm.LightGBMRegressor
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.sql.types.{DoubleType, IntegerType}
> import org.apache.spark.sql.{DataFrame, SparkSession}
>
> object LightGBMRegressorTest {
>   def main(args: Array[String]): Unit = {
>     val spark: SparkSession = SparkSession.builder()
>       .appName("LightGBMRegressorTest")
>       .master("local[*]")
>       .getOrCreate()
>     val startTime = System.currentTimeMillis()
>     var originalData: DataFrame = spark.read.option("header", "true")
>       .option("inferSchema", "true")
>       .csv("data/hour.csv")
>     val labelCol = "workingday"
>     val cateCols = Array("season", "yr", "mnth", "hr")
>     val conCols: Array[String] = Array("temp", "atemp", "hum", "casual", "cnt")
>     val vecCols = conCols ++ cateCols
>     import spark.implicits._
>     vecCols.foreach(col => {
>       originalData = originalData.withColumn(col, $"$col".cast(DoubleType))
>     })
>     originalData = originalData.withColumn(labelCol, $"$labelCol".cast(IntegerType))
>     val assembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("features")
>     val classifier: LightGBMRegressor = new LightGBMRegressor().setNumIterations(100).setNumLeaves(31)
>       .setBoostFromAverage(false).setFeatureFraction(1.0).setMaxDepth(-1).setMaxBin(255)
>       .setLearningRate(0.1).setMinSumHessianInLeaf(0.001).setLambdaL1(0.0).setLambdaL2(0.0)
>       .setBaggingFraction(0.5).setBaggingFreq(1).setBaggingSeed(1).setObjective("binary")
>       .setLabelCol(labelCol).setCategoricalSlotNames(cateCols).setFeaturesCol("features")
>       .setBoostingType("gbdt")
>     val pipeline: Pipeline = new Pipeline().setStages(Array(assembler, classifier))
>     val Array(tr, te) = originalData.randomSplit(Array(0.7, 0.3), 666)
>     val model = pipeline.fit(tr)
>     val modelDF = model.transform(te)
>     val evaluator = new BinaryClassificationEvaluator().setLabelCol(labelCol).setRawPredictionCol("prediction")
>     println(evaluator.evaluate(modelDF))
>     println(s"time: ${System.currentTimeMillis() - startTime}")
>     System.in.read()
>   }
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
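The caching concern in the report above can be modeled with a small plain-Python sketch (no Spark involved; `ToyRDD` and `expensive_lineage` are made-up names): an RDD consumed by more than one action recomputes its lineage each time unless it is cached.

```python
# Toy model of RDD recomputation vs. cache(), not Spark code.
# We count how many times the "expensive lineage" actually runs.

compute_count = 0

def expensive_lineage():
    global compute_count
    compute_count += 1
    return [1, 2, 3]

class ToyRDD:
    def __init__(self, fn):
        self.fn, self._cached = fn, None
    def cache(self):
        self._cached = self.fn()   # materialize once (eager, for simplicity)
        return self
    def collect(self):
        return self._cached if self._cached is not None else self.fn()

uncached = ToyRDD(expensive_lineage)
uncached.collect(); uncached.collect()
print(compute_count)   # 2: the lineage ran twice

compute_count = 0
cached = ToyRDD(expensive_lineage).cache()
cached.collect(); cached.collect()
print(compute_count)   # 1: the lineage ran once
```

This is the effect the ticket asks for inside BinaryClassificationMetrics, where the same intermediate RDDs feed several downstream computations.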
[jira] [Resolved] (SPARK-38210) Spark documentation build README is stale
[ https://issues.apache.org/jira/browse/SPARK-38210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38210. -- Resolution: Not A Problem > Spark documentation build README is stale > - > > Key: SPARK-38210 > URL: https://issues.apache.org/jira/browse/SPARK-38210 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Khalid Mammadov >Priority: Minor > > I was following docs/README.md to build the documentation and found out that it's > not complete. I had to install additional packages that are not documented but > are available in the [CI/CD phase|https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526] > and a few more to finish the build process. > I will file a PR to change README.md to include these packages and improve > the guide. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38202) Invalid URL in SparkContext.addedJars will constantly fails Executor.run()
[ https://issues.apache.org/jira/browse/SPARK-38202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38202. -- Resolution: Not A Problem > Invalid URL in SparkContext.addedJars will constantly fails Executor.run() > -- > > Key: SPARK-38202 > URL: https://issues.apache.org/jira/browse/SPARK-38202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Bo Zhang >Priority: Major > > When an invalid URL is used in SparkContext.addJar(), all subsequent query > executions will fail, since downloading the jar is on the critical path of > Executor.run(), even when the query has nothing to do with the jar. > A simple reproduction of the issue: > {code:java} > sc.addJar("http://invalid/library.jar") > (0 to 1).toDF.count > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37532) RDD name could be very long and memory costly
[ https://issues.apache.org/jira/browse/SPARK-37532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37532. -- Resolution: Won't Fix > RDD name could be very long and memory costly > - > > Key: SPARK-37532 > URL: https://issues.apache.org/jira/browse/SPARK-37532 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Kent Yao >Priority: Minor > > Take sc.newHadoopFile as an example: the path parameter can be a very long > string and turns into a very unfriendly name, both for the UI and for driver > memory. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
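A fix along the lines the ticket suggests could simply cap the displayed name. The `abbreviate_name` helper below is hypothetical, a plain-Python sketch of the idea rather than Spark code.

```python
# Hypothetical name-truncation helper: cap an RDD's display name so a
# long input path does not bloat the UI or driver memory.

def abbreviate_name(name: str, max_len: int = 40) -> str:
    if len(name) <= max_len:
        return name
    return name[: max_len - 3] + "..."

long_path = "hdfs://namenode/warehouse/" + "/".join(f"part={i}" for i in range(50))
short = abbreviate_name(long_path)
print(len(short))              # 40
print(short.endswith("..."))   # True
```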
[jira] [Created] (SPARK-38648) SPIP: Simplified API for DL Inferencing
Lee Yang created SPARK-38648: Summary: SPIP: Simplified API for DL Inferencing Key: SPARK-38648 URL: https://issues.apache.org/jira/browse/SPARK-38648 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Lee Yang h1. Background and Motivation The deployment of deep learning (DL) models to Spark clusters can be a point of friction today. DL practitioners often aren't well-versed with Spark, and Spark experts often aren't well-versed with the fast-changing DL frameworks. Currently, the deployment of trained DL models is done in a fairly ad-hoc manner, with each model integration usually requiring significant effort. To simplify this process, we propose adding an integration layer for each major DL framework that can introspect their respective saved models to more-easily integrate these models into Spark applications. You can find a detailed proposal [here|https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] h1. Goals - Simplify the deployment of trained single-node DL models to Spark inference applications. - Follow pandas_udf for simple inference use-cases. - Follow Spark ML Pipelines APIs for transfer-learning use-cases. - Enable integrations with popular third-party DL frameworks like TensorFlow, PyTorch, and Huggingface. - Focus on PySpark, since most of the DL frameworks use Python. - Take advantage of built-in Spark features like GPU scheduling and Arrow integration. - Enable inference on both CPU and GPU. h1. Non-goals - DL model training. - Inference w/ distributed models, i.e. "model parallel" inference. h1. Target Personas - Data scientists who need to deploy DL models on Spark. - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan
[ https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-38647: -- Description: As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide Spark with information about the existing partitioning of the data read by a {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} should provide ordering information. This prevents Spark from sorting data that already exhibits a certain order provided by the source. was: As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide Spark with information about the existing partitioning of the data read by a {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} should provide ordering information. This prevents Spark from sorting data that exhibits a certain order provided by the source. > Add SupportsReportOrdering mix in interface for Scan > > > Key: SPARK-38647 > URL: https://issues.apache.org/jira/browse/SPARK-38647 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Enrico Minack >Priority: Major > > As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide > Spark with information about the existing partitioning of the data read by a > {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} > should provide ordering information. > This prevents Spark from sorting data that already exhibits a certain order > provided by the source. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan
Enrico Minack created SPARK-38647: - Summary: Add SupportsReportOrdering mix in interface for Scan Key: SPARK-38647 URL: https://issues.apache.org/jira/browse/SPARK-38647 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.0 Reporter: Enrico Minack As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide Spark with information about the existing partitioning of the data read by a {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} should provide ordering information. This prevents Spark from sorting data that exhibits a certain order provided by the source. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses int64
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37463. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34984 [https://github.com/apache/spark/pull/34984] > Read/Write Timestamp ntz from/to Orc uses int64 > --- > > Key: SPARK-37463 > URL: https://issues.apache.org/jira/browse/SPARK-37463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > Here is some example code: > import java.util.TimeZone > TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) > sql("set spark.sql.session.timeZone=America/Los_Angeles") > val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp > '2021-06-01 00:00:00' ts") > df.write.mode("overwrite").orc("ts_ntz_orc") > df.write.mode("overwrite").parquet("ts_ntz_parquet") > df.write.mode("overwrite").format("avro").save("ts_ntz_avro") > val query = """ > select 'orc', * > from `orc`.`ts_ntz_orc` > union all > select 'parquet', * > from `parquet`.`ts_ntz_parquet` > union all > select 'avro', * > from `avro`.`ts_ntz_avro` > """ > val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") > for (tz <- tzs) { > TimeZone.setDefault(TimeZone.getTimeZone(tz)) > sql(s"set spark.sql.session.timeZone=$tz") > println(s"Time zone is ${TimeZone.getDefault.getID}") > sql(query).show(false) > } > The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
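The intended semantics behind the reported output can be checked in plain Python (not Spark code): a timestamp (ltz) names an instant, so its display shifts with the session time zone, while a timestamp_ntz is wall-clock time and should render identically everywhere. The parquet/avro `ts` column matches this; ORC's int64-encoded `ts_ntz` drifting with the zone is the anomaly the ticket fixes.

```python
# Plain-Python check of the expected ltz behavior from the example above.
from datetime import datetime
from zoneinfo import ZoneInfo

# The ltz value from the repro: an instant fixed at LA wall time.
instant = datetime(2021, 6, 1, 0, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

for tz in ("America/Los_Angeles", "UTC", "Europe/Amsterdam"):
    # Display of the same instant shifts with the zone, matching the
    # parquet/avro `ts` column; a ts_ntz value would never shift.
    print(instant.astimezone(ZoneInfo(tz)).strftime("%Y-%m-%d %H:%M:%S"))
# 2021-06-01 00:00:00
# 2021-06-01 07:00:00
# 2021-06-01 09:00:00
```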
[jira] [Assigned] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses int64
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37463: --- Assignee: jiaan.geng > Read/Write Timestamp ntz from/to Orc uses int64 > --- > > Key: SPARK-37463 > URL: https://issues.apache.org/jira/browse/SPARK-37463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Here is some example code: > import java.util.TimeZone > TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) > sql("set spark.sql.session.timeZone=America/Los_Angeles") > val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp > '2021-06-01 00:00:00' ts") > df.write.mode("overwrite").orc("ts_ntz_orc") > df.write.mode("overwrite").parquet("ts_ntz_parquet") > df.write.mode("overwrite").format("avro").save("ts_ntz_avro") > val query = """ > select 'orc', * > from `orc`.`ts_ntz_orc` > union all > select 'parquet', * > from `parquet`.`ts_ntz_parquet` > union all > select 'avro', * > from `avro`.`ts_ntz_avro` > """ > val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") > for (tz <- tzs) { > TimeZone.setDefault(TimeZone.getTimeZone(tz)) > sql(s"set spark.sql.session.timeZone=$tz") > println(s"Time zone is ${TimeZone.getDefault.getID}") > sql(query).show(false) > } > The output shown below looks strange.
> Time zone is America/Los_Angeles
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +-------+-------------------+-------------------+
> Time zone is UTC
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +-------+-------------------+-------------------+
> Time zone is Europe/Amsterdam
> +-------+-------------------+-------------------+
> |orc    |ts_ntz             |ts                 |
> +-------+-------------------+-------------------+
> |orc    |2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +-------+-------------------+-------------------+
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38646) Pull a trait out for Python functions
[ https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38646: Assignee: (was: Apache Spark) > Pull a trait out for Python functions > - > > Key: SPARK-38646 > URL: https://issues.apache.org/jira/browse/SPARK-38646 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.2.2 >Reporter: Zhen Li >Priority: Major > > Currently PySpark uses a case class, PythonFunction, in PythonRDD and many other > interfaces/classes. We propose changing it to a trait to avoid tying the > implementation to the APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38646) Pull a trait out for Python functions
[ https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511886#comment-17511886 ] Apache Spark commented on SPARK-38646: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/35964 > Pull a trait out for Python functions > - > > Key: SPARK-38646 > URL: https://issues.apache.org/jira/browse/SPARK-38646 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.2.2 >Reporter: Zhen Li >Priority: Major > > Currently PySpark uses a case class, PythonFunction, in PythonRDD and many other > interfaces/classes. We propose changing it to a trait to avoid tying the > implementation to the APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38646) Pull a trait out for Python functions
[ https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38646: Assignee: Apache Spark > Pull a trait out for Python functions > - > > Key: SPARK-38646 > URL: https://issues.apache.org/jira/browse/SPARK-38646 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0, 3.2.2 >Reporter: Zhen Li >Assignee: Apache Spark >Priority: Major > > Currently PySpark uses a case class, PythonFunction, in PythonRDD and many other > interfaces/classes. We propose changing it to a trait to avoid tying the > implementation to the APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-37568: - Fix Version/s: 3.3.0 > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
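The two rules in the description can be sketched in plain Python, with naive datetimes standing in for timestamp_ntz values and aware ones for timestamp_ltz. This is illustrative only; Spark's convert_timezone is implemented in SQL, and `SESSION_TZ` below is a stand-in for spark.sql.session.timeZone.

```python
# Illustrative 2-argument convert_timezone, following the rules above:
# 1. ntz input: interpret it in the session time zone, then convert.
# 2. ltz input: convert the instant to the target zone.
# Either way the result is an ntz (naive) wall-clock value.

from datetime import datetime
from zoneinfo import ZoneInfo

SESSION_TZ = "America/Los_Angeles"  # stand-in for spark.sql.session.timeZone

def convert_timezone(target_tz: str, source_ts: datetime) -> datetime:
    if source_ts.tzinfo is None:                          # timestamp_ntz
        source_ts = source_ts.replace(tzinfo=ZoneInfo(SESSION_TZ))
    # Convert to the target zone, then drop the zone: a timestamp_ntz.
    return source_ts.astimezone(ZoneInfo(target_tz)).replace(tzinfo=None)

ntz = datetime(2021, 12, 6, 0, 0)   # interpreted in the session time zone (PST)
print(convert_timezone("UTC", ntz))  # 2021-12-06 08:00:00
```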
[jira] [Created] (SPARK-38646) Pull a trait out for Python functions
Zhen Li created SPARK-38646: --- Summary: Pull a trait out for Python functions Key: SPARK-38646 URL: https://issues.apache.org/jira/browse/SPARK-38646 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0, 3.2.2 Reporter: Zhen Li Currently PySpark uses a case class, PythonFunction, in PythonRDD and many other interfaces/classes. We propose changing it to a trait to avoid tying the implementation to the APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
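The shape of this refactor can be shown with a Python analogue (the names below are hypothetical stand-ins for the Scala types): APIs accept an abstract "trait" rather than one concrete case class, so alternative implementations can be swapped in without touching the API surface.

```python
# Python analogue of pulling a trait out of a concrete class.
# PythonFunctionLike / SimplePythonFunction / run_python_rdd are all
# made-up names illustrating the pattern, not Spark's actual types.

from abc import ABC, abstractmethod
from dataclasses import dataclass

class PythonFunctionLike(ABC):          # the "trait"
    @abstractmethod
    def command(self) -> bytes: ...

@dataclass
class SimplePythonFunction(PythonFunctionLike):   # one concrete impl
    payload: bytes
    def command(self) -> bytes:
        return self.payload

def run_python_rdd(fn: PythonFunctionLike) -> int:
    # The API depends only on the trait, not on a concrete case class,
    # so any PythonFunctionLike implementation works here.
    return len(fn.command())

print(run_python_rdd(SimplePythonFunction(b"pickled-code")))  # 12
```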
[jira] [Commented] (SPARK-38639) Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly
[ https://issues.apache.org/jira/browse/SPARK-38639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511874#comment-17511874 ] Apache Spark commented on SPARK-38639: -- User 'TonyDoen' has created a pull request for this issue: https://github.com/apache/spark/pull/35963 > Support ignoreCorruptRecord flag to ensure querying broken sequence file > table smoothly > --- > > Key: SPARK-38639 > URL: https://issues.apache.org/jira/browse/SPARK-38639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: tonydoen >Priority: Minor > Fix For: 3.2.1 > > Original Estimate: 48h > Remaining Estimate: 48h > > There are existing flags, "spark.sql.files.ignoreCorruptFiles" and > "spark.sql.files.ignoreMissingFiles", that quietly ignore attempted reads > from files that have been corrupted, but queries on sequence file tables can > still fail. > > Being able to ignore corrupt records is useful in scenarios where users > want queries to succeed on dirty data (mixed schemas in one table). > > We would like to add a "spark.sql.hive.ignoreCorruptRecord" flag to fill out > the functionality. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
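The record-level analogue of ignoreCorruptFiles can be sketched in plain Python (this is not the proposed Spark patch; the names here are illustrative): wrap each record read in a guard and skip, rather than fail, when the flag is set.

```python
# Illustrative sketch of record-level corruption skipping.
# IGNORE_CORRUPT_RECORD stands in for the proposed
# spark.sql.hive.ignoreCorruptRecord configuration flag.

IGNORE_CORRUPT_RECORD = True

def read_records(raw_records):
    out = []
    for raw in raw_records:
        try:
            out.append(int(raw))         # stand-in for deserializing a record
        except ValueError:
            if not IGNORE_CORRUPT_RECORD:
                raise                    # old behavior: the whole query fails
            # flag on: silently skip the corrupt record
    return out

print(read_records(["1", "garbage", "3"]))   # [1, 3]
```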
[jira] [Commented] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource
[ https://issues.apache.org/jira/browse/SPARK-38645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511858#comment-17511858 ] tonydoen commented on SPARK-38645: -- Related PR: [https://github.com/apache/spark/pull/35962] > Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen > cleanedSource > -- > > Key: SPARK-38645 > URL: https://issues.apache.org/jira/browse/SPARK-38645 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: tonydoen >Priority: Trivial > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > When we use spark-sql and encounter problems in the codegen source, we often > have to change the log level to DEBUG, but there are too many logs in that > mode. > > A `spark.sql.codegen.cleanedSourcePrint` flag can ensure that only the > codegen source is printed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38639) Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly
[ https://issues.apache.org/jira/browse/SPARK-38639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511855#comment-17511855 ] Apache Spark commented on SPARK-38639: -- User 'TonyDoen' has created a pull request for this issue: https://github.com/apache/spark/pull/35962 > Support ignoreCorruptRecord flag to ensure querying broken sequence file > table smoothly > --- > > Key: SPARK-38639 > URL: https://issues.apache.org/jira/browse/SPARK-38639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: tonydoen >Priority: Minor > Fix For: 3.2.1 > > Original Estimate: 48h > Remaining Estimate: 48h > > There are existing flags, "spark.sql.files.ignoreCorruptFiles" and > "spark.sql.files.ignoreMissingFiles", that quietly ignore attempted reads > from files that have been corrupted, but queries on sequence file tables can > still fail. > > Being able to ignore corrupt records is useful in scenarios where users > want queries to succeed on dirty data (mixed schemas in one table). > > We would like to add a "spark.sql.hive.ignoreCorruptRecord" flag to fill out > the functionality. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource
tonydoen created SPARK-38645: Summary: Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource Key: SPARK-38645 URL: https://issues.apache.org/jira/browse/SPARK-38645 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: tonydoen Fix For: 3.2.1 When we use spark-sql and hit problems in the generated (codegen) source, we often have to change the log level to DEBUG, but that mode produces far too many unrelated logs. A `spark.sql.codegen.cleanedSourcePrint` flag would ensure that just the cleaned codegen source is printed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
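As a sketch, the proposed flag would be set like any other SQL config, e.g. in spark-defaults.conf. The flag name is the one proposed in this ticket; it does not exist in released Spark:

```
# Proposed in SPARK-38645: print only the cleaned codegen source,
# without switching the whole logger to DEBUG
spark.sql.codegen.cleanedSourcePrint true
```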
[jira] [Assigned] (SPARK-38644) DS V2 topN push-down supports project with alias
[ https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38644: Assignee: (was: Apache Spark) > DS V2 topN push-down supports project with alias > > > Key: SPARK-38644 > URL: https://issues.apache.org/jira/browse/SPARK-38644 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38644) DS V2 topN push-down supports project with alias
[ https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511846#comment-17511846 ] Apache Spark commented on SPARK-38644: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/35961 > DS V2 topN push-down supports project with alias > > > Key: SPARK-38644 > URL: https://issues.apache.org/jira/browse/SPARK-38644 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38644) DS V2 topN push-down supports project with alias
[ https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38644: Assignee: Apache Spark > DS V2 topN push-down supports project with alias > > > Key: SPARK-38644 > URL: https://issues.apache.org/jira/browse/SPARK-38644 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38644) DS V2 topN push-down supports project with alias
[ https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-38644: --- Summary: DS V2 topN push-down supports project with alias (was: DS V2 aggregate push-down supports project with alias) > DS V2 topN push-down supports project with alias > > > Key: SPARK-38644 > URL: https://issues.apache.org/jira/browse/SPARK-38644 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38644) DS V2 aggregate push-down supports project with alias
[ https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-38644: --- Issue Type: Improvement (was: New Feature) > DS V2 aggregate push-down supports project with alias > - > > Key: SPARK-38644 > URL: https://issues.apache.org/jira/browse/SPARK-38644 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38644) DS V2 aggregate push-down supports project with alias
jiaan.geng created SPARK-38644: -- Summary: DS V2 aggregate push-down supports project with alias Key: SPARK-38644 URL: https://issues.apache.org/jira/browse/SPARK-38644 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38461) Use error classes in org.apache.spark.broadcast
[ https://issues.apache.org/jira/browse/SPARK-38461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38461: Assignee: (was: Apache Spark) > Use error classes in org.apache.spark.broadcast > --- > > Key: SPARK-38461 > URL: https://issues.apache.org/jira/browse/SPARK-38461 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38461) Use error classes in org.apache.spark.broadcast
[ https://issues.apache.org/jira/browse/SPARK-38461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38461: Assignee: Apache Spark > Use error classes in org.apache.spark.broadcast > --- > > Key: SPARK-38461 > URL: https://issues.apache.org/jira/browse/SPARK-38461 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38461) Use error classes in org.apache.spark.broadcast
[ https://issues.apache.org/jira/browse/SPARK-38461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511834#comment-17511834 ] Apache Spark commented on SPARK-38461: -- User 'bozhang2820' has created a pull request for this issue: https://github.com/apache/spark/pull/35960 > Use error classes in org.apache.spark.broadcast > --- > > Key: SPARK-38461 > URL: https://issues.apache.org/jira/browse/SPARK-38461 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511815#comment-17511815 ] hujiahua commented on SPARK-18105: -- It's working in my case by setting spark.file.transferTo=false. Thanks to [~zhangweilst] . And my spark version was 3.1.2. > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
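The workaround reported in this thread can be applied as a submit-time or spark-defaults.conf setting; a sketch (whether it helps appears to depend on the environment, per the comments above):

```
# Workaround reported for "Stream is corrupted" during shuffle reads
spark.file.transferTo false
```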
> at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled
[ https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511809#comment-17511809 ] Apache Spark commented on SPARK-38640: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/35959 > NPE with unpersisting memory-only RDD with RDD fetching from shuffle service > enabled > > > Key: SPARK-38640 > URL: https://issues.apache.org/jira/browse/SPARK-38640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Adam Binford >Priority: Major > > If you have RDD fetching from shuffle service enabled, memory-only cached > RDDs will fail to unpersist. > > > {code:java} > // spark.shuffle.service.fetch.rdd.enabled=true > val df = spark.range(5) > .persist(StorageLevel.MEMORY_ONLY) > df.count() > df.unpersist(true) > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled
[ https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38640: Assignee: (was: Apache Spark) > NPE with unpersisting memory-only RDD with RDD fetching from shuffle service > enabled > > > Key: SPARK-38640 > URL: https://issues.apache.org/jira/browse/SPARK-38640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Adam Binford >Priority: Major > > If you have RDD fetching from shuffle service enabled, memory-only cached > RDDs will fail to unpersist. > > > {code:java} > // spark.shuffle.service.fetch.rdd.enabled=true > val df = spark.range(5) > .persist(StorageLevel.MEMORY_ONLY) > df.count() > df.unpersist(true) > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled
[ https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38640: Assignee: Apache Spark > NPE with unpersisting memory-only RDD with RDD fetching from shuffle service > enabled > > > Key: SPARK-38640 > URL: https://issues.apache.org/jira/browse/SPARK-38640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Apache Spark >Priority: Major > > If you have RDD fetching from shuffle service enabled, memory-only cached > RDDs will fail to unpersist. > > > {code:java} > // spark.shuffle.service.fetch.rdd.enabled=true > val df = spark.range(5) > .persist(StorageLevel.MEMORY_ONLY) > df.count() > df.unpersist(true) > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series
[ https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511790#comment-17511790 ] Hyukjin Kwon commented on SPARK-38627: -- I used a Mac. I haven't tested it on Spark 3.2. Can you show the full error message? > TypeError: Datetime subtraction can only be applied to datetime series > -- > > Key: SPARK-38627 > URL: https://issues.apache.org/jira/browse/SPARK-38627 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas with the pyspark.pandas library. When I tried this > (pdf is a pyspark.pandas dataframe): > {code:java} > pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code} > I got the below error : > {code:java} > File > "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py", > line 75, in sub > raise TypeError("Datetime subtraction can only be applied to datetime > series.") {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
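The computation the reporter is after, a datetime difference scaled to 30-day "months", can be illustrated with the standard library alone (hypothetical stand-in dates; this sketch does not involve pyspark.pandas and only shows the intended arithmetic):

```python
from datetime import datetime, timedelta

# Hypothetical stand-ins for the report's date1/date2 values
date1 = datetime(2015, 3, 26)
date2 = datetime(2015, 2, 24)

# Equivalent of (date1 - date2) / Timedelta(days=30): dividing one
# timedelta by another yields a plain float number of 30-day periods
date_diff = (date1 - date2) / timedelta(days=30)
print(date_diff)  # 1.0
```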
[jira] [Resolved] (SPARK-38636) AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'
[ https://issues.apache.org/jira/browse/SPARK-38636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38636. -- Resolution: Not A Problem > AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp' > > > Key: SPARK-38636 > URL: https://issues.apache.org/jira/browse/SPARK-38636 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas library with pyspark.pandas. > Tried something like below - > {code:java} > List[pd.Timestamp] {code} > But it does not work and instead thrown the below error > {code:java} > AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38643) Validate input dataset of ml.regression
[ https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511771#comment-17511771 ] Apache Spark commented on SPARK-38643: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/35958 > Validate input dataset of ml.regression > --- > > Key: SPARK-38643 > URL: https://issues.apache.org/jira/browse/SPARK-38643 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38643) Validate input dataset of ml.regression
[ https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38643: Assignee: (was: Apache Spark) > Validate input dataset of ml.regression > --- > > Key: SPARK-38643 > URL: https://issues.apache.org/jira/browse/SPARK-38643 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38643) Validate input dataset of ml.regression
[ https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38643: Assignee: Apache Spark > Validate input dataset of ml.regression > --- > > Key: SPARK-38643 > URL: https://issues.apache.org/jira/browse/SPARK-38643 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38643) Validate input dataset of ml.regression
zhengruifeng created SPARK-38643: Summary: Validate input dataset of ml.regression Key: SPARK-38643 URL: https://issues.apache.org/jira/browse/SPARK-38643 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.4.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series
[ https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511765#comment-17511765 ] Prakhar Sandhu commented on SPARK-38627: Hi [~hyukjin.kwon] , Great ^^ # Did it work on spark 3.3 or spark 3.2? # . What environment are you using? I have set up a conda environment in my local system with spark 3.2. I specified the numpy explicitly but got the below error : {code:java} df = pd.DataFrame({ 'Date1': rng.to_numpy(), 'Date2': rng.to_numpy()}) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\indexes\base.py", line 519, in to_numpy result = np.asarray(self._to_internal_pandas()._values, dtype=dtype) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\indexes\base.py", line 472, in _to_internal_pandas return self._psdf._internal.to_pandas_frame.index{code} > TypeError: Datetime subtraction can only be applied to datetime series > -- > > Key: SPARK-38627 > URL: https://issues.apache.org/jira/browse/SPARK-38627 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas with pyspark.pandas library, when I tried this : > pdf is a pyspark.pandas dataframe > {code:java} > pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code} > I got the below error : > {code:java} > File > "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py", > line 75, in sub > raise TypeError("Datetime subtraction can only be applied to datetime > series.") {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series
[ https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511763#comment-17511763 ] Prakhar Sandhu commented on SPARK-38627: Hi [~hyukjin.kwon] , Nice ^^ # Did it work on spark 3.3? # What environment are you using? I have set up a conda environment in my local system with spark 3.2. I specified the numpy explicitly {code:java} df = pd.DataFrame({ 'Date1': rng.to_numpy, 'Date2': rng.to_numpy}) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\frame.py", line 519, in __init__ pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\frame.py", line 435, in __init__ mgr = init_dict(data, index, columns, dtype=dtype) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py", line 254, in init_dict return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py", line 64, in arrays_to_mgr index = extract_index(arrays) File "C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py", line 355, in extract_index raise ValueError("If using all scalar values, you must pass an index") ValueError: If using all scalar values, you must pass an index {code} > TypeError: Datetime subtraction can only be applied to datetime series > -- > > Key: SPARK-38627 > URL: https://issues.apache.org/jira/browse/SPARK-38627 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas with pyspark.pandas library, when I tried this : > pdf is a pyspark.pandas dataframe > {code:java} > pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code} > I got the below error : > {code:java} > File > 
"C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py", > line 75, in sub > raise TypeError("Datetime subtraction can only be applied to datetime > series.") {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series
[ https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511752#comment-17511752 ] Hyukjin Kwon commented on SPARK-38627: -- ^^ this works. Although you have to explicitly call to_numpy when creating a dataframe: {code} import pyspark.pandas as pd import numpy as np np.random.seed(0) rng = pd.date_range('2015-02-24', periods=5, freq='T') df = pd.DataFrame({ 'Date1': rng.to_numpy(), 'Date2': rng.to_numpy()}) print(df) df["x"] = df["Date1"] - df["Date2"] print(df) {code} {code} Date1 Date2 x 0 2015-02-24 00:00:00 2015-02-24 00:00:00 0 1 2015-02-24 00:01:00 2015-02-24 00:01:00 0 2 2015-02-24 00:02:00 2015-02-24 00:02:00 0 3 2015-02-24 00:03:00 2015-02-24 00:03:00 0 4 2015-02-24 00:04:00 2015-02-24 00:04:00 0 {code} > TypeError: Datetime subtraction can only be applied to datetime series > -- > > Key: SPARK-38627 > URL: https://issues.apache.org/jira/browse/SPARK-38627 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas with pyspark.pandas library, when I tried this : > pdf is a pyspark.pandas dataframe > {code:java} > pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code} > I got the below error : > {code:java} > File > "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py", > line 75, in sub > raise TypeError("Datetime subtraction can only be applied to datetime > series.") {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38636) AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'
[ https://issues.apache.org/jira/browse/SPARK-38636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511751#comment-17511751 ] Prakhar Sandhu commented on SPARK-38636: Hi [~hyukjin.kwon] , I was able to pass the above error by replacing pd.Timestamp with pd.to_datetime. {code:java} import pyspark.pandas as pd List[pd.to_datetime]{code} > AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp' > > > Key: SPARK-38636 > URL: https://issues.apache.org/jira/browse/SPARK-38636 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas library with pyspark.pandas. > Tried something like below - > {code:java} > List[pd.Timestamp] {code} > But it does not work and instead thrown the below error > {code:java} > AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series
[ https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511748#comment-17511748 ] Prakhar Sandhu commented on SPARK-38627: Hi [~hyukjin.kwon] , I am not sure I can share the full repo code, but please try running the commands below on Spark 3.3. The code snippet below runs fine with the pandas library but fails when pandas is replaced with pyspark.pandas: {code:java} import pyspark.pandas as pd import numpy as np np.random.seed(0) rng = pd.date_range('2015-02-24', periods=5, freq='T') df = pd.DataFrame({ 'Date1': rng, 'Date2': rng}) print(df) df["x"] = df["Date1"] - df["Date2"] print(df) {code} > TypeError: Datetime subtraction can only be applied to datetime series > -- > > Key: SPARK-38627 > URL: https://issues.apache.org/jira/browse/SPARK-38627 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I am trying to replace pandas with pyspark.pandas library, when I tried this : > pdf is a pyspark.pandas dataframe > {code:java} > pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code} > I got the below error : > {code:java} > File > "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py", > line 75, in sub > raise TypeError("Datetime subtraction can only be applied to datetime > series.") {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
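The month arithmetic the reporter is ultimately after can be sketched with the Python standard library (plain datetime objects standing in for the pandas series; this only illustrates the intended semantics, not the pyspark.pandas code path):

```python
from datetime import datetime, timedelta

# Sketch of (date1 - date2) / Timedelta(days=30): subtracting two
# datetimes yields a timedelta, and dividing two timedeltas yields a float.
date1 = datetime(2015, 3, 26)
date2 = datetime(2015, 2, 24)

diff = date1 - date2                # timedelta of 30 days
months = diff / timedelta(days=30)  # float number of 30-day "months"
print(months)                       # 1.0
```

pandas follows the same semantics: a datetime-minus-datetime subtraction produces a timedelta-typed result, which is what the pyspark.pandas datetime ops reject here.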
[jira] [Commented] (SPARK-25789) Support for Dataset of Avro
[ https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511747#comment-17511747 ] IKozar commented on SPARK-25789: My development is also blocked by this issue. What is the expected timeline for a fix? > Support for Dataset of Avro > --- > > Key: SPARK-25789 > URL: https://issues.apache.org/jira/browse/SPARK-25789 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aleksander Eskilson >Priority: Major > > Support for Dataset of Avro records in an API that would allow the user to > provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} > encoder. This functionality was previously to be provided by SPARK-22739 and > [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro > functionality was folded into Spark-proper by SPARK-24768, eliminating the > need to maintain a separate library for Avro in Spark. Resolution of this > issue would: > * Add necessary {{Expression}} elements to Spark > * Add an {{AvroEncoder}} for Datasets of Avro records to Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results
[ https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZygD updated SPARK-38614: - Summary: After Spark update, df.show() shows incorrect F.percent_rank results (was: df.show() shows incorrect F.percent_rank results) > After Spark update, df.show() shows incorrect F.percent_rank results > > > Key: SPARK-38614 > URL: https://issues.apache.org/jira/browse/SPARK-38614 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0, 3.2.1 >Reporter: ZygD >Priority: Major > Labels: correctness > > Expected result is obtained using Spark 3.1.2, but not 3.2.0 or 3.2.1 > *Minimal reproducible example* > {code:java} > from pyspark.sql import SparkSession, functions as F, Window as W > spark = SparkSession.builder.getOrCreate() > > df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id'))) > df.show(3) > df.show(5) {code} > *Expected result* > {code:java} > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > +---++ > only showing top 3 rows > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > | 3|0.03| > | 4|0.04| > +---++ > only showing top 5 rows{code} > *Actual result* > {code:java} > +---+--+ > | id|pr| > +---+--+ > | 0| 0.0| > | 1|0.| > | 2|0.| > +---+--+ > only showing top 3 rows > +---+---+ > | id| pr| > +---+---+ > | 0|0.0| > | 1|0.2| > | 2|0.4| > | 3|0.6| > | 4|0.8| > +---+---+ > only showing top 5 rows{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
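The pattern in the actual output is consistent with the display limit being applied before the window function, so that percent_rank sees only the handful of rows fetched for show() (show(5) fetches 6 rows to know whether more exist) rather than all 101. A plain-Python sketch of the percent_rank formula, rank/(n-1) over 0-based ranks, shows how shrinking n reproduces the observed values (this is an illustration of the symptom, not Spark's implementation):

```python
def percent_rank(n):
    # percent_rank of the i-th row (0-based rank over distinct ordered
    # values) is i / (n - 1), where n is the partition row count
    return [i / (n - 1) for i in range(n)]

print(percent_rank(101)[:5])  # expected: 0.0, 0.01, 0.02, 0.03, 0.04
# If only the 6 rows fetched for show(5) reach the window function,
# the first five values become the incorrect ones actually observed:
print(percent_rank(6)[:5])    # 0.0, 0.2, 0.4, 0.6, 0.8
```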
[jira] [Commented] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511693#comment-17511693 ] Apache Spark commented on SPARK-37568: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35957 > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
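The first rule above (a timestamp_ntz input is interpreted in the session time zone, converted, and returned without zone information) can be sketched in plain Python. The convert_timezone helper below is hypothetical: fixed-offset tzinfo objects stand in for zone names like 'America/Los_Angeles', and the session zone is passed explicitly where Spark would read spark.sql.session.timeZone:

```python
from datetime import datetime, timezone, timedelta

def convert_timezone(target_tz, ts, session_tz=timezone.utc):
    # Interpret the naive timestamp ("timestamp_ntz") in the session
    # time zone, shift it to target_tz, then drop the zone again so the
    # result is still a naive timestamp.
    return ts.replace(tzinfo=session_tz).astimezone(target_tz).replace(tzinfo=None)

# Fixed UTC-7 offset standing in for 'America/Los_Angeles' during DST.
la = timezone(timedelta(hours=-7))
print(convert_timezone(la, datetime(2022, 3, 25, 12, 0)))  # 2022-03-25 05:00:00
```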
[jira] [Commented] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511692#comment-17511692 ] Apache Spark commented on SPARK-37568: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35957 > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38642) spark-sql can not enable isolatedClientLoader to extend dsv2 catalog when using builtin hiveMetastoreJar
suheng.cloud created SPARK-38642: Summary: spark-sql can not enable isolatedClientLoader to extend dsv2 catalog when using builtin hiveMetastoreJar Key: SPARK-38642 URL: https://issues.apache.org/jira/browse/SPARK-38642 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.1.2 Reporter: suheng.cloud Hi all: I use IsolatedClientLoader to enable a DataSource V2 catalog on Hive. It works well via the API and spark-shell, but fails with the spark-sql CLI. After digging into the source, I found that SparkSQLCLIDriver (spark-sql) initializes differently: it uses a CliSessionState, which is reused throughout the CLI lifecycle. The IsolatedClientLoader creator in HiveUtils therefore decides to turn isolation off when it encounters a global SessionState of that type. In my case, namespaces/tables from another Hive catalog are not recognized, since the CliSessionState in the SparkSession is always the one used for the connection. I noticed [SPARK-21428|https://issues.apache.org/jira/browse/SPARK-21428], but since the DataSource V2 API should become more popular, shouldn't SparkSQLCLIDriver also be adjusted? My env: spark-3.1.2, hadoop-cdh5.13.0, hive-2.3.6; for each V2 catalog we set spark.sql.hive.metastore.jars=builtin (we have no permission to deploy jars on the target clusters). For now, to work around this, we have to deploy jars on HDFS and use the 'path' option, which causes a significant delay in catalog initialization. Any help is appreciated, thanks. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38588) Validate input dataset of ml.classification
[ https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-38588: - Fix Version/s: 3.4.0 > Validate input dataset of ml.classification > --- > > Key: SPARK-38588 > URL: https://issues.apache.org/jira/browse/SPARK-38588 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major > Fix For: 3.4.0 > > > LinearSVC should fail fast if the input dataset contains invalid values. > > {code:java} > import org.apache.spark.ml.feature._ > import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.classification._ > import org.apache.spark.ml.clustering._ > val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, > Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, > 2.0)))).toDF() > val svc = new LinearSVC() > val model = svc.fit(df) > scala> model.intercept > res0: Double = NaN > scala> model.coefficients > res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
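The fail-fast validation the ticket asks for can be sketched in plain Python (validate_rows is a hypothetical helper over (label, features) pairs, not Spark's ML code path):

```python
import math

def validate_rows(rows):
    # Raise as soon as any label or feature value is NaN or infinite,
    # instead of silently fitting a model with NaN coefficients.
    for i, (label, features) in enumerate(rows):
        for v in (label, *features):
            if math.isnan(v) or math.isinf(v):
                raise ValueError(f"row {i} contains invalid value {v}")

# The rows from the JIRA reproduction above:
rows = [(1.0, [1.0, float("nan")]), (0.0, [float("inf"), 2.0])]
# validate_rows(rows)  # raises ValueError instead of training on bad data
```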
[jira] [Resolved] (SPARK-38588) Validate input dataset of ml.classification
[ https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-38588. -- Resolution: Resolved > Validate input dataset of ml.classification > --- > > Key: SPARK-38588 > URL: https://issues.apache.org/jira/browse/SPARK-38588 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major > > LinearSVC should fail fast if the input dataset contains invalid values. > > {code:java} > import org.apache.spark.ml.feature._ > import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.classification._ > import org.apache.spark.ml.clustering._ > val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, > Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, > 2.0)))).toDF() > val svc = new LinearSVC() > val model = svc.fit(df) > scala> model.intercept > res0: Double = NaN > scala> model.coefficients > res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37568. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35951 [https://github.com/apache/spark/pull/35951] > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37568) Support 2-arguments by the convert_timezone() function
[ https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37568: Assignee: Max Gekk > Support 2-arguments by the convert_timezone() function > -- > > Key: SPARK-37568 > URL: https://issues.apache.org/jira/browse/SPARK-37568 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > # If sourceTs is a timestamp_ntz, take the sourceTz from the session time > zone, see the SQL config spark.sql.session.timeZone > # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the > targetTz -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38623) Add more comments and tests for HashShuffleSpec
[ https://issues.apache.org/jira/browse/SPARK-38623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38623: Summary: Add more comments and tests for HashShuffleSpec (was: Simplify the compatibility check in HashShuffleSpec) > Add more comments and tests for HashShuffleSpec > --- > > Key: SPARK-38623 > URL: https://issues.apache.org/jira/browse/SPARK-38623 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38063) Support SQL split_part function
[ https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38063. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35352 [https://github.com/apache/spark/pull/35352] > Support SQL split_part function > --- > > Key: SPARK-38063 > URL: https://issues.apache.org/jira/browse/SPARK-38063 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.3.0 > > > `split_part()` is a function commonly supported by other systems such as > Postgres. The Spark equivalent is > `element_at(split(arg, delim), part)` > h5. Function Specification > h6. Syntax > {code:java} > split_part(str, delimiter, partNum) > {code} > h6. Arguments > {code:java} > str: string type > delimiter: string type > partNum: Integer type > {code} > h6. Note > {code:java} > 1. This function splits `str` by `delimiter` and returns the requested part of the > split (1-based). > 2. If any input parameter is NULL, returns NULL. > 3. If the index is out of range of the split parts, returns an empty string. > 4. If `partNum` is 0, throws an error. > 5. If `partNum` is negative, the parts are counted backward from the end of > the string. > 6. When `delimiter` is empty, `str` is considered not split, so there is just 1 > split part. > {code} > h6. Examples > {code:java} > > SELECT _FUNC_('11.12.13', '.', 3); > 13 > > SELECT _FUNC_(NULL, '.', 3); > NULL > > SELECT _FUNC_('11.12.13', '', 1); > '11.12.13' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
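The six notes in the specification above can be sketched as a plain-Python reference implementation (a hypothetical helper mirroring the documented semantics, not Spark's actual code):

```python
def split_part(s, delimiter, part_num):
    # Returns the part_num-th (1-based) piece of s split by delimiter.
    if s is None or delimiter is None or part_num is None:
        return None                                   # rule 2: NULL in, NULL out
    if part_num == 0:
        raise ValueError("part_num must not be 0")    # rule 4
    parts = s.split(delimiter) if delimiter else [s]  # rule 6: empty delimiter -> 1 part
    # rule 5: negative part_num counts backward from the end
    idx = part_num - 1 if part_num > 0 else len(parts) + part_num
    return parts[idx] if 0 <= idx < len(parts) else ""  # rule 3: out of range -> ""

print(split_part("11.12.13", ".", 3))   # 13
print(split_part(None, ".", 3))         # None
print(split_part("11.12.13", "", 1))    # 11.12.13
```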
[jira] [Assigned] (SPARK-38063) Support SQL split_part function
[ https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38063: --- Assignee: Rui Wang > Support SQL split_part function > --- > > Key: SPARK-38063 > URL: https://issues.apache.org/jira/browse/SPARK-38063 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > > `split_part()` is a function commonly supported by other systems such as > Postgres. The Spark equivalent is > `element_at(split(arg, delim), part)` > h5. Function Specification > h6. Syntax > {code:java} > split_part(str, delimiter, partNum) > {code} > h6. Arguments > {code:java} > str: string type > delimiter: string type > partNum: Integer type > {code} > h6. Note > {code:java} > 1. This function splits `str` by `delimiter` and returns the requested part of the > split (1-based). > 2. If any input parameter is NULL, returns NULL. > 3. If the index is out of range of the split parts, returns an empty string. > 4. If `partNum` is 0, throws an error. > 5. If `partNum` is negative, the parts are counted backward from the end of > the string. > 6. When `delimiter` is empty, `str` is considered not split, so there is just 1 > split part. > {code} > h6. Examples > {code:java} > > SELECT _FUNC_('11.12.13', '.', 3); > 13 > > SELECT _FUNC_(NULL, '.', 3); > NULL > > SELECT _FUNC_('11.12.13', '', 1); > '11.12.13' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38585) Simplify the code of TreeNode.clone()
[ https://issues.apache.org/jira/browse/SPARK-38585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38585: --- Assignee: Yang Jie > Simplify the code of TreeNode.clone() > - > > Key: SPARK-38585 > URL: https://issues.apache.org/jira/browse/SPARK-38585 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > SPARK-28057 added {{forceCopy}} to the private {{mapChildren}} method in > {{TreeNode}} to implement the {{clone()}} method. > After SPARK-34989, the call corresponding to {{forceCopy=false}} was changed > to use {{{}withNewChildren{}}}, so {{forceCopy}} is always true and the private > {{mapChildren}} is only used by the {{clone()}} method. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
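The relationship between mapChildren and clone() described above can be sketched with a toy tree in plain Python (a minimal stand-in, not Spark's Catalyst TreeNode):

```python
class TreeNode:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

    def map_children(self, f):
        # Always returns a fresh node with f applied to each child --
        # the surviving forceCopy=true behavior.
        return TreeNode(self.value, [f(c) for c in self.children])

    def clone(self):
        # clone() is now map_children's only caller: copy this node and
        # recursively clone every child.
        return self.map_children(lambda c: c.clone())

tree = TreeNode("root", [TreeNode("left"), TreeNode("right")])
copy = tree.clone()
print(copy is tree)  # False: every node in the copy is a new object
```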
[jira] [Resolved] (SPARK-38585) Simplify the code of TreeNode.clone()
[ https://issues.apache.org/jira/browse/SPARK-38585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38585. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35890 [https://github.com/apache/spark/pull/35890] > Simplify the code of TreeNode.clone() > - > > Key: SPARK-38585 > URL: https://issues.apache.org/jira/browse/SPARK-38585 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > SPARK-28057 added {{forceCopy}} to the private {{mapChildren}} method in > {{TreeNode}} to implement the {{clone()}} method. > After SPARK-34989, the call corresponding to {{forceCopy=false}} was changed > to use {{{}withNewChildren{}}}, so {{forceCopy}} is always true and the private > {{mapChildren}} is only used by the {{clone()}} method. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org