[jira] [Commented] (SPARK-41799) Combine plan-related tests
[ https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653272#comment-17653272 ] Apache Spark commented on SPARK-41799: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39323 > Combine plan-related tests > -- > > Key: SPARK-41799 > URL: https://issues.apache.org/jira/browse/SPARK-41799 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41799) Combine plan-related tests
[ https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653271#comment-17653271 ] Apache Spark commented on SPARK-41799: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39323 > Combine plan-related tests > -- > > Key: SPARK-41799 > URL: https://issues.apache.org/jira/browse/SPARK-41799 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41799) Combine plan-related tests
[ https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41799: Assignee: (was: Apache Spark) > Combine plan-related tests > -- > > Key: SPARK-41799 > URL: https://issues.apache.org/jira/browse/SPARK-41799 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41799) Combine plan-related tests
[ https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41799: Assignee: Apache Spark > Combine plan-related tests > -- > > Key: SPARK-41799 > URL: https://issues.apache.org/jira/browse/SPARK-41799 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41799) Combine plan-related tests
Ruifeng Zheng created SPARK-41799: - Summary: Combine plan-related tests Key: SPARK-41799 URL: https://issues.apache.org/jira/browse/SPARK-41799 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiale He updated SPARK-41741: - Affects Version/s: 3.4.0 (was: 2.4.0) > [SQL] ParquetFilters StringStartsWith push down matching string does not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Priority: Major > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem; there are two known ways to work around it. > > The Parquet filter is pushed down. When a query uses a LIKE '***%' predicate and the > system default encoding is not UTF-8, the pushed-down filter can match the wrong bytes and return incorrect results. > > As far as I know, there are two ways to bypass this problem: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following reproduces the problem; the Parquet sample file is in the attachments. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653269#comment-17653269 ] Jiale He commented on SPARK-41741: -- [~bjornjorgensen] done > [SQL] ParquetFilters StringStartsWith push down matching string does not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Priority: Major > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem; there are two known ways to work around it. > > The Parquet filter is pushed down. When a query uses a LIKE '***%' predicate and the > system default encoding is not UTF-8, the pushed-down filter can match the wrong bytes and return incorrect results. > > As far as I know, there are two ways to bypass this problem: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following reproduces the problem; the Parquet sample file is in the attachments. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
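The root cause reported above can be illustrated without Spark: Parquet stores string columns as UTF-8 bytes, while `v.getBytes()` with no charset argument uses the JVM's platform-default charset. A minimal Python sketch of the byte-level mismatch (the sample strings and the GBK charset are illustrative, not taken from the attachment):

```python
# Parquet stores string data as UTF-8 bytes.
stored = "啦啦乐乐真好".encode("utf-8")  # hypothetical stored cell value

prefix = "啦啦乐乐"
utf8_prefix = prefix.encode("utf-8")  # what the proposed fix produces
gbk_prefix = prefix.encode("gbk")     # what getBytes() yields under a GBK default charset

assert stored.startswith(utf8_prefix)     # filter matches, as expected
assert not stored.startswith(gbk_prefix)  # wrong bytes: pushed-down filter matches nothing
assert utf8_prefix != gbk_prefix
```

This is why both workarounds hide the bug (forcing `-Dfile.encoding=UTF-8`, or disabling the startsWith pushdown so comparison happens on decoded strings), while `v.getBytes(StandardCharsets.UTF_8)` fixes it at the source.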
[jira] [Resolved] (SPARK-41797) Enable test for `array_repeat`
[ https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41797. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39319 [https://github.com/apache/spark/pull/39319] > Enable test for `array_repeat` > -- > > Key: SPARK-41797 > URL: https://issues.apache.org/jira/browse/SPARK-41797 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41797) Enable test for `array_repeat`
[ https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41797: - Assignee: Ruifeng Zheng > Enable test for `array_repeat` > -- > > Key: SPARK-41797 > URL: https://issues.apache.org/jira/browse/SPARK-41797 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41798) Upgrade hive-storage-api to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653257#comment-17653257 ] Apache Spark commented on SPARK-41798: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39322 > Upgrade hive-storage-api to 2.8.1 > - > > Key: SPARK-41798 > URL: https://issues.apache.org/jira/browse/SPARK-41798 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41798) Upgrade hive-storage-api to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41798: Assignee: Apache Spark > Upgrade hive-storage-api to 2.8.1 > - > > Key: SPARK-41798 > URL: https://issues.apache.org/jira/browse/SPARK-41798 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41798) Upgrade hive-storage-api to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41798: Assignee: (was: Apache Spark) > Upgrade hive-storage-api to 2.8.1 > - > > Key: SPARK-41798 > URL: https://issues.apache.org/jira/browse/SPARK-41798 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41798) Upgrade hive-storage-api to 2.8.1
[ https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653256#comment-17653256 ] Apache Spark commented on SPARK-41798: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39322 > Upgrade hive-storage-api to 2.8.1 > - > > Key: SPARK-41798 > URL: https://issues.apache.org/jira/browse/SPARK-41798 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41773) Window.partitionBy is not respected with row_number
[ https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41773. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39318 [https://github.com/apache/spark/pull/39318] > Window.partitionBy is not respected with row_number > > > Key: SPARK-41773 > URL: https://issues.apache.org/jira/browse/SPARK-41773 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code} > File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in > pyspark.sql.connect.window.Window.orderBy > Failed example: > df.withColumn("row_number", row_number().over(window)).show() > Expected: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       a|         1| > |  1|       a|         2| > |  1|       b|         3| > |  2|       a|         1| > |  2|       b|         2| > |  3|       b|         1| > +---+--------+----------+ > Got: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       b|         1| > |  1|       a|         2| > |  1|       a|         3| > |  2|       b|         1| > |  2|       a|         2| > |  3|       b|         1| > +---+--------+----------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
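The expected semantics in the doctest above — partition by `id`, order by `category`, then number rows within each partition — can be sketched in plain Python (no Spark needed; the data mirrors the table in the report):

```python
from itertools import groupby

rows = [(1, "a"), (1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "b")]

# row_number() over Window.partitionBy("id").orderBy("category"):
# sort by (partition key, order key), then number rows within each partition.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
numbered = [
    (pid, cat, n)
    for _, grp in groupby(ordered, key=lambda r: r[0])
    for n, (pid, cat) in enumerate(grp, start=1)
]
# numbered == [(1,'a',1), (1,'a',2), (1,'b',3), (2,'a',1), (2,'b',2), (3,'b',1)]
```

This matches the "Expected" table; in the "Got" output the numbering resets per `id` but the within-partition order is wrong, which is the bug the pull request fixes.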
[jira] [Created] (SPARK-41798) Upgrade hive-storage-api to 2.8.1
Dongjoon Hyun created SPARK-41798: - Summary: Upgrade hive-storage-api to 2.8.1 Key: SPARK-41798 URL: https://issues.apache.org/jira/browse/SPARK-41798 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41773) Window.partitionBy is not respected with row_number
[ https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41773: Assignee: Ruifeng Zheng > Window.partitionBy is not respected with row_number > > > Key: SPARK-41773 > URL: https://issues.apache.org/jira/browse/SPARK-41773 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in > pyspark.sql.connect.window.Window.orderBy > Failed example: > df.withColumn("row_number", row_number().over(window)).show() > Expected: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       a|         1| > |  1|       a|         2| > |  1|       b|         3| > |  2|       a|         1| > |  2|       b|         2| > |  3|       b|         1| > +---+--------+----------+ > Got: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       b|         1| > |  1|       a|         2| > |  1|       a|         3| > |  2|       b|         1| > |  2|       a|         2| > |  3|       b|         1| > +---+--------+----------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41383) Implement `DataFrame.cube`
[ https://issues.apache.org/jira/browse/SPARK-41383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653251#comment-17653251 ] Apache Spark commented on SPARK-41383: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39321 > Implement `DataFrame.cube` > -- > > Key: SPARK-41383 > URL: https://issues.apache.org/jira/browse/SPARK-41383 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41069) Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile`
[ https://issues.apache.org/jira/browse/SPARK-41069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41069. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39262 [https://github.com/apache/spark/pull/39262] > Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile` > > > Key: SPARK-41069 > URL: https://issues.apache.org/jira/browse/SPARK-41069 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
[ https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41796: Assignee: (was: Apache Spark) > Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE > > > Key: SPARK-41796 > URL: https://issues.apache.org/jira/browse/SPARK-41796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
[ https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41796: Assignee: Apache Spark > Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE > > > Key: SPARK-41796 > URL: https://issues.apache.org/jira/browse/SPARK-41796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > > UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
[ https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-41796: Summary: Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE (was: Test the error class: UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE) > Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE > > > Key: SPARK-41796 > URL: https://issues.apache.org/jira/browse/SPARK-41796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
[ https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653245#comment-17653245 ] Apache Spark commented on SPARK-41796: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/39320 > Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE > > > Key: SPARK-41796 > URL: https://issues.apache.org/jira/browse/SPARK-41796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
[ https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-41796: Description: UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE > Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE > > > Key: SPARK-41796 > URL: https://issues.apache.org/jira/browse/SPARK-41796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41797) Enable test for `array_repeat`
[ https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41797: Assignee: Apache Spark > Enable test for `array_repeat` > -- > > Key: SPARK-41797 > URL: https://issues.apache.org/jira/browse/SPARK-41797 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41797) Enable test for `array_repeat`
[ https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41797: Assignee: (was: Apache Spark) > Enable test for `array_repeat` > -- > > Key: SPARK-41797 > URL: https://issues.apache.org/jira/browse/SPARK-41797 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41797) Enable test for `array_repeat`
[ https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653244#comment-17653244 ] Apache Spark commented on SPARK-41797: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39319 > Enable test for `array_repeat` > -- > > Key: SPARK-41797 > URL: https://issues.apache.org/jira/browse/SPARK-41797 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41797) Enable test for `array_repeat`
Ruifeng Zheng created SPARK-41797: - Summary: Enable test for `array_repeat` Key: SPARK-41797 URL: https://issues.apache.org/jira/browse/SPARK-41797 Project: Spark Issue Type: Improvement Components: Connect, PySpark, Tests Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41797) Enable test for `array_repeat`
[ https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-41797: -- Parent: SPARK-41283 Issue Type: Sub-task (was: Improvement) > Enable test for `array_repeat` > -- > > Key: SPARK-41797 > URL: https://issues.apache.org/jira/browse/SPARK-41797 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41796) Test the error class: UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
BingKun Pan created SPARK-41796: --- Summary: Test the error class: UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE Key: SPARK-41796 URL: https://issues.apache.org/jira/browse/SPARK-41796 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.4.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41786) Deduplicate helper functions
[ https://issues.apache.org/jira/browse/SPARK-41786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41786. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39307 [https://github.com/apache/spark/pull/39307] > Deduplicate helper functions > > > Key: SPARK-41786 > URL: https://issues.apache.org/jira/browse/SPARK-41786 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41786) Deduplicate helper functions
[ https://issues.apache.org/jira/browse/SPARK-41786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41786: - Assignee: Ruifeng Zheng > Deduplicate helper functions > > > Key: SPARK-41786 > URL: https://issues.apache.org/jira/browse/SPARK-41786 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41773) Window.partitionBy is not respected with row_number
[ https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41773: Assignee: (was: Apache Spark) > Window.partitionBy is not respected with row_number > > > Key: SPARK-41773 > URL: https://issues.apache.org/jira/browse/SPARK-41773 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in > pyspark.sql.connect.window.Window.orderBy > Failed example: > df.withColumn("row_number", row_number().over(window)).show() > Expected: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       a|         1| > |  1|       a|         2| > |  1|       b|         3| > |  2|       a|         1| > |  2|       b|         2| > |  3|       b|         1| > +---+--------+----------+ > Got: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       b|         1| > |  1|       a|         2| > |  1|       a|         3| > |  2|       b|         1| > |  2|       a|         2| > |  3|       b|         1| > +---+--------+----------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41773) Window.partitionBy is not respected with row_number
[ https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653242#comment-17653242 ] Apache Spark commented on SPARK-41773: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39318 > Window.partitionBy is not respected with row_number > > > Key: SPARK-41773 > URL: https://issues.apache.org/jira/browse/SPARK-41773 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in > pyspark.sql.connect.window.Window.orderBy > Failed example: > df.withColumn("row_number", row_number().over(window)).show() > Expected: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       a|         1| > |  1|       a|         2| > |  1|       b|         3| > |  2|       a|         1| > |  2|       b|         2| > |  3|       b|         1| > +---+--------+----------+ > Got: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       b|         1| > |  1|       a|         2| > |  1|       a|         3| > |  2|       b|         1| > |  2|       a|         2| > |  3|       b|         1| > +---+--------+----------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41773) Window.partitionBy is not respected with row_number
[ https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41773: Assignee: Apache Spark > Window.partitionBy is not respected with row_number > > > Key: SPARK-41773 > URL: https://issues.apache.org/jira/browse/SPARK-41773 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in > pyspark.sql.connect.window.Window.orderBy > Failed example: > df.withColumn("row_number", row_number().over(window)).show() > Expected: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       a|         1| > |  1|       a|         2| > |  1|       b|         3| > |  2|       a|         1| > |  2|       b|         2| > |  3|       b|         1| > +---+--------+----------+ > Got: > +---+--------+----------+ > | id|category|row_number| > +---+--------+----------+ > |  1|       b|         1| > |  1|       a|         2| > |  1|       a|         3| > |  2|       b|         1| > |  2|       a|         2| > |  3|       b|         1| > +---+--------+----------+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41049: --- Assignee: Wenchen Fan > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|2028|
> |8320|8320|8320|1640|
> |7937|7937|7937|769|
> |436|436|436|8924|
> |8924|8924|2827|2731|
> It is unclear why the first call via the CodegenFallback path is correct while subsequent calls are not.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() call, so that the CodegenFallback only refers to a column reference, the problem goes away. But this workaround may not be reliable if optimization is ever able to restructure adjacent select()s.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
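A toy model (plain Python, not Spark internals) of the invariant the ticket describes: a nondeterministic expression must be evaluated once per row, so every reference within that row sees the same value. Re-evaluating it on a second code path is what the CodegenFallback bug amounts to; the workaround materializes the value first.

```python
import random

def project_unstable(n_rows):
    # each reference re-runs the generator, so the two "columns" of a row
    # are independent draws and can disagree (the buggy behaviour)
    return [(random.randrange(10_000), random.randrange(10_000))
            for _ in range(n_rows)]

def project_stable(n_rows):
    # evaluate once per row, then let every consumer read the cached value
    # (what the earlier select() in the workaround achieves)
    out = []
    for _ in range(n_rows):
        v1 = random.randrange(10_000)
        out.append((v1, v1))
    return out
```

In `project_stable` the two columns are equal by construction, matching the expectation stated under "Expectation" above.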
[jira] [Resolved] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41049. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39248 [https://github.com/apache/spark/pull/39248] > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Priority: Major > Fix For: 3.4.0 > > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|2028|
> |8320|8320|8320|1640|
> |7937|7937|7937|769|
> |436|436|436|8924|
> |8924|8924|2827|2731|
> It is unclear why the first call via the CodegenFallback path is correct while subsequent calls are not.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() call, so that the CodegenFallback only refers to a column reference, the problem goes away. But this workaround may not be reliable if optimization is ever able to restructure adjacent select()s.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41731) Implement the column accessor
[ https://issues.apache.org/jira/browse/SPARK-41731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653240#comment-17653240 ] Apache Spark commented on SPARK-41731: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39317 > Implement the column accessor > - > > Key: SPARK-41731 > URL: https://issues.apache.org/jira/browse/SPARK-41731 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41731) Implement the column accessor
[ https://issues.apache.org/jira/browse/SPARK-41731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653239#comment-17653239 ] Apache Spark commented on SPARK-41731: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39317 > Implement the column accessor > - > > Key: SPARK-41731 > URL: https://issues.apache.org/jira/browse/SPARK-41731 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41795) Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column
Hyukjin Kwon created SPARK-41795: Summary: Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column Key: SPARK-41795 URL: https://issues.apache.org/jira/browse/SPARK-41795 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Hyukjin Kwon See -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41795) Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column
[ https://issues.apache.org/jira/browse/SPARK-41795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41795: - Description: See SPARK-41794 (was: See ) > Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column > -- > > Key: SPARK-41795 > URL: https://issues.apache.org/jira/browse/SPARK-41795 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-41794 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41794) Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_column
Hyukjin Kwon created SPARK-41794: Summary: Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_column Key: SPARK-41794 URL: https://issues.apache.org/jira/browse/SPARK-41794 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Hyukjin Kwon Assignee: Ruifeng Zheng {code} == ERROR [0.901s]: test_column_accessor (pyspark.sql.tests.connect.test_connect_column.SparkConnectTests) -- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests/connect/test_connect_column.py", line 744, in test_column_accessor cdf.select(CF.col("z")[0], cdf.z[10], CF.col("z")[-10]).toPandas(), File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 949, in toPandas return self._session.client.to_pandas(query) File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/.../spark/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkArrayIndexOutOfBoundsException) [INVALID_ARRAY_INDEX] The index 10 is out of bounds. The array has 3 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 
== ERROR [0.245s]: test_column_arithmetic_ops (pyspark.sql.tests.connect.test_connect_column.SparkConnectTests) -- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests/connect/test_connect_column.py", line 799, in test_column_arithmetic_ops cdf.select(cdf.a % cdf["b"], cdf["a"] % 2, 12 % cdf.c).toPandas(), File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 949, in toPandas return self._session.client.to_pandas(query) File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas return self._execute_and_fetch(req) File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in _execute_and_fetch self._handle_error(rpc_error) File "/.../spark/python/pyspark/sql/connect/client.py", line 623, in _handle_error raise SparkConnectException(status.message, info.reason) from None pyspark.sql.connect.client.SparkConnectException: (org.apache.spark.SparkArithmeticException) [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
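Pure-Python sketches (not the PySpark API) of the lenient behaviour the two ANSI error messages above point to: SQL `get()` returns NULL for an invalid array index, and `try_divide()` returns NULL instead of raising DIVIDE_BY_ZERO. With `spark.sql.ansi.enabled=false`, plain indexing and division behave this way, which is why the tests only pass in that mode.

```python
def sql_get(arr, idx):
    # NULL (None) for any index outside [0, len(arr)), instead of
    # INVALID_ARRAY_INDEX
    return arr[idx] if 0 <= idx < len(arr) else None

def try_divide(a, b):
    # NULL (None) for a zero divisor, instead of DIVIDE_BY_ZERO
    return a / b if b != 0 else None

print(sql_get([1, 2, 3], 10), try_divide(12, 0))  # None None
```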
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653238#comment-17653238 ] Gera Shegalov commented on SPARK-41793: --- Similarly in SQLite {code} .header on create table test_table(a long, b decimal(38,2)); insert into test_table values ('9223372036854775807', '11342371013783243717493546650944543.47'), ('9223372036854775807', '.99'); select * from test_table; select count(1) over( partition by a order by b asc range between 10.2345 preceding and 6.7890 following) as cnt_1 from test_table; {code} yields {code} a|b 9223372036854775807|1.13423710137832e+34 9223372036854775807|1.0e+36 cnt_1 1 1 {code} > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Major > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41553) Fix the documentation for num_files
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41553: Assignee: Bjørn Jørgensen > Fix the documentation for num_files > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > The deprecation warning reads: "num_files has been deprecated and might be removed in a future version. Use DataFrame.spark.repartition instead." > The num_files argument doesn't control the number of output files; it specifies the number of partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41553) Fix the documentation for num_files
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41553. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39098 [https://github.com/apache/spark/pull/39098] > Fix the documentation for num_files > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > > The deprecation warning reads: "num_files has been deprecated and might be removed in a future version. Use DataFrame.spark.repartition instead." > The num_files argument doesn't control the number of output files; it specifies the number of partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-41793: -- Summary: Incorrect result for window frames defined by a range clause on large decimals (was: Incorrect result for window frames defined as ranges on large decimals ) > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Major > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41793) Incorrect result for window frames defined as ranges on large decimals
Gera Shegalov created SPARK-41793: - Summary: Incorrect result for window frames defined as ranges on large decimals Key: SPARK-41793 URL: https://issues.apache.org/jira/browse/SPARK-41793 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Gera Shegalov Context https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 The following windowing query on a simple two-row input should produce two non-empty windows as a result {code} from pprint import pprint data = [ ('9223372036854775807', '11342371013783243717493546650944543.47'), ('9223372036854775807', '.99') ] df1 = spark.createDataFrame(data, 'a STRING, b STRING') df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) df2.createOrReplaceTempView('test_table') df = sql(''' SELECT COUNT(1) OVER ( PARTITION BY a ORDER BY b ASC RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING ) AS CNT_1 FROM test_table ''') res = df.collect() df.explain(True) pprint(res) {code} Spark 3.4.0-SNAPSHOT output: {code} [Row(CNT_1=1), Row(CNT_1=0)] {code} Spark 3.3.1 output as expected: {code} Row(CNT_1=1), Row(CNT_1=1)] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row
[ https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41745. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39313 [https://github.com/apache/spark/pull/39313] > SparkSession.createDataFrame does not respect the column names in the row > - > > Key: SPARK-41745 > URL: https://issues.apache.org/jira/browse/SPARK-41745 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code} > File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in > pyspark.sql.connect.group.GroupedData.pivot > Failed example: > df1.show() > Differences (ndiff with -expected +actual): > - +--+++ > ? --- > + +--++-+ > - |course|year|earnings| > + |_1| _2| _3| > - +--+++ > ? --- > + +--++-+ > - |dotNET|2012| 1| > ? --- > + |dotNET|2012|1| > - | Java|2012| 2| > ? --- > + | Java|2012|2| > - |dotNET|2012|5000| > ? --- > + |dotNET|2012| 5000| > - |dotNET|2013| 48000| > ? --- > + |dotNET|2013|48000| > - | Java|2013| 3| > ? --- > + | Java|2013|3| > - +--+++ > ? --- > + +--++-+ > + > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
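A toy sketch of the behaviour the fix above restores, using `collections.namedtuple` as a stand-in for pyspark Row: when the incoming rows carry field names, createDataFrame should take the column names from them rather than falling back to positional `_1`, `_2`, ... names, which is what the failing pivot doctest showed.

```python
from collections import namedtuple

def infer_column_names(rows):
    first = rows[0]
    if hasattr(first, "_fields"):
        # Row-like input: the named fields win
        return list(first._fields)
    # plain tuples: fall back to positional names
    return [f"_{i + 1}" for i in range(len(first))]

Row = namedtuple("Row", ["course", "year", "earnings"])
print(infer_column_names([Row("dotNET", 2012, 10000)]))  # ['course', 'year', 'earnings']
print(infer_column_names([("dotNET", 2012, 10000)]))     # ['_1', '_2', '_3']
```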
[jira] [Resolved] (SPARK-41789) Make `createDataFrame` support list of Rows
[ https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41789. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39313 [https://github.com/apache/spark/pull/39313] > Make `createDataFrame` support list of Rows > --- > > Key: SPARK-41789 > URL: https://issues.apache.org/jira/browse/SPARK-41789 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row
[ https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41745: Assignee: Ruifeng Zheng > SparkSession.createDataFrame does not respect the column names in the row > - > > Key: SPARK-41745 > URL: https://issues.apache.org/jira/browse/SPARK-41745 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > > {code} > File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in > pyspark.sql.connect.group.GroupedData.pivot > Failed example: > df1.show() > Differences (ndiff with -expected +actual): > - +--+++ > ? --- > + +--++-+ > - |course|year|earnings| > + |_1| _2| _3| > - +--+++ > ? --- > + +--++-+ > - |dotNET|2012| 1| > ? --- > + |dotNET|2012|1| > - | Java|2012| 2| > ? --- > + | Java|2012|2| > - |dotNET|2012|5000| > ? --- > + |dotNET|2012| 5000| > - |dotNET|2013| 48000| > ? --- > + |dotNET|2013|48000| > - | Java|2013| 3| > ? --- > + | Java|2013|3| > - +--+++ > ? --- > + +--++-+ > + > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41787) Upgrade silencer from 1.7.10 to 1.7.12
[ https://issues.apache.org/jira/browse/SPARK-41787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41787. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39309 [https://github.com/apache/spark/pull/39309] > Upgrade silencer from 1.7.10 to 1.7.12 > -- > > Key: SPARK-41787 > URL: https://issues.apache.org/jira/browse/SPARK-41787 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2022-12-30-16-57-32-736.png > > > !image-2022-12-30-16-57-32-736.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41787) Upgrade silencer from 1.7.10 to 1.7.12
[ https://issues.apache.org/jira/browse/SPARK-41787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41787: Assignee: BingKun Pan > Upgrade silencer from 1.7.10 to 1.7.12 > -- > > Key: SPARK-41787 > URL: https://issues.apache.org/jira/browse/SPARK-41787 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Attachments: image-2022-12-30-16-57-32-736.png > > > !image-2022-12-30-16-57-32-736.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41785) Implement `GroupedData.mean`
[ https://issues.apache.org/jira/browse/SPARK-41785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41785. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39304 [https://github.com/apache/spark/pull/39304] > Implement `GroupedData.mean` > > > Key: SPARK-41785 > URL: https://issues.apache.org/jira/browse/SPARK-41785 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41784) Add missing `__rmod__`
[ https://issues.apache.org/jira/browse/SPARK-41784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41784: Assignee: Ruifeng Zheng > Add missing `__rmod__` > -- > > Key: SPARK-41784 > URL: https://issues.apache.org/jira/browse/SPARK-41784 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41785) Implement `GroupedData.mean`
[ https://issues.apache.org/jira/browse/SPARK-41785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41785: Assignee: Ruifeng Zheng > Implement `GroupedData.mean` > > > Key: SPARK-41785 > URL: https://issues.apache.org/jira/browse/SPARK-41785 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41784) Add missing `__rmod__`
[ https://issues.apache.org/jira/browse/SPARK-41784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41784. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39303 [https://github.com/apache/spark/pull/39303] > Add missing `__rmod__` > -- > > Key: SPARK-41784 > URL: https://issues.apache.org/jira/browse/SPARK-41784 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41770) eqNullSafe does not support None as its argument
[ https://issues.apache.org/jira/browse/SPARK-41770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41770: Assignee: Ruifeng Zheng > eqNullSafe does not support None as its argument > > > Key: SPARK-41770 > URL: https://issues.apache.org/jira/browse/SPARK-41770 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code} > ** > File "/.../spark/python/pyspark/sql/connect/column.py", line 90, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df1.select( > df1['value'] == 'foo', > df1['value'].eqNullSafe('foo'), > df1['value'].eqNullSafe(None) > ).show() > Exception raised: > Traceback (most recent call last): > File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line > 1336, in __run > exec(compile(example.source, filename, "single", > File "", line > 4, in > df1['value'].eqNullSafe(None) > File > "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 78, > in wrapped > return scalar_function(name, self, other) > File > "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, > in scalar_function > return Column(UnresolvedFunction(op, [arg._expr for arg in args])) > File > "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, > in > return Column(UnresolvedFunction(op, [arg._expr for arg in args])) > AttributeError: 'NoneType' object has no attribute '_expr' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41770) eqNullSafe does not support None as its argument
[ https://issues.apache.org/jira/browse/SPARK-41770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41770. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39302 [https://github.com/apache/spark/pull/39302] > eqNullSafe does not support None as its argument > > > Key: SPARK-41770 > URL: https://issues.apache.org/jira/browse/SPARK-41770 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > {code} > ** > File "/.../spark/python/pyspark/sql/connect/column.py", line 90, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df1.select( > df1['value'] == 'foo', > df1['value'].eqNullSafe('foo'), > df1['value'].eqNullSafe(None) > ).show() > Exception raised: > Traceback (most recent call last): > File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line > 1336, in __run > exec(compile(example.source, filename, "single", > File "", line > 4, in > df1['value'].eqNullSafe(None) > File > "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 78, > in wrapped > return scalar_function(name, self, other) > File > "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, > in scalar_function > return Column(UnresolvedFunction(op, [arg._expr for arg in args])) > File > "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, > in > return Column(UnresolvedFunction(op, [arg._expr for arg in args])) > AttributeError: 'NoneType' object has no attribute '_expr' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41783) Make column op support None
[ https://issues.apache.org/jira/browse/SPARK-41783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41783: Assignee: Ruifeng Zheng > Make column op support None > --- > > Key: SPARK-41783 > URL: https://issues.apache.org/jira/browse/SPARK-41783 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41783) Make column op support None
[ https://issues.apache.org/jira/browse/SPARK-41783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41783. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39302 [https://github.com/apache/spark/pull/39302] > Make column op support None > --- > > Key: SPARK-41783 > URL: https://issues.apache.org/jira/browse/SPARK-41783 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB
[ https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653210#comment-17653210 ] Apache Spark commented on SPARK-41792: -- User 'mridulm' has created a pull request for this issue: https://github.com/apache/spark/pull/39316 > Shuffle merge finalization removes the wrong finalization state from the DB > --- > > Key: SPARK-41792 > URL: https://issues.apache.org/jira/browse/SPARK-41792 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0, 3.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > > During `finalizeShuffleMerge` in external shuffle service, if the > finalization request is for a newer shuffle merge id, then we cleanup the > existing (older) shuffle details and add the newer entry (for which we got no > pushed blocks) to the DB. > Unfortunately, when cleaning up from the DB, we are using the incorrect > AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of > the existing entry. > Proposed Fix: > {code} > diff --git > a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > > b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > index 816d1082850..551104d0eba 100644 > --- > a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > +++ > b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements > MergedShuffleFileManager { > } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) { >// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId > then return >// empty MergeStatuses but cleanup the older shuffleMergeId files. 
> + AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new > AppAttemptShuffleMergeId( > + msg.appId, msg.appAttemptId, msg.shuffleId, > mergePartitionsInfo.shuffleMergeId); >submitCleanupTask(() -> >closeAndDeleteOutdatedPartitions( > - appAttemptShuffleMergeId, > mergePartitionsInfo.shuffleMergePartitions)); > + currentAppAttemptShuffleMergeId, > mergePartitionsInfo.shuffleMergePartitions)); > } else { >// This block covers: >// 1. finalization of determinate stage > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
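To make the keying mistake concrete, here is a minimal, self-contained sketch (plain Java; `MergeId` is a simplified stand-in for Spark's AppAttemptShuffleMergeId, and the map stands in for the shuffle service's DB) showing why building the cleanup key from the request's newer shuffleMergeId leaves the stale entry behind, while keying on the existing entry's merge id removes it:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the bug: the cleanup key must carry the EXISTING
// (older) shuffleMergeId from mergePartitionsInfo, not the newer one from
// the incoming finalize request. "MergeId" stands in for Spark's
// AppAttemptShuffleMergeId; records give us the value equality the real
// class implements via equals/hashCode.
public class MergeCleanupSketch {
    record MergeId(String appId, int attemptId, int shuffleId, int shuffleMergeId) {}

    public static void main(String[] args) {
        Map<MergeId, String> db = new HashMap<>();
        MergeId existing = new MergeId("app-1", 0, 7, 1);  // older entry already in the DB
        db.put(existing, "finalized");

        int requestMergeId = 2;  // msg.shuffleMergeId, newer than the existing entry

        // Buggy cleanup: key built from the request's merge id removes nothing.
        db.remove(new MergeId("app-1", 0, 7, requestMergeId));
        System.out.println(db.containsKey(existing));  // true: stale entry survives

        // Fixed cleanup: key built from the existing entry's merge id.
        db.remove(new MergeId("app-1", 0, 7, existing.shuffleMergeId()));
        System.out.println(db.isEmpty());  // true: stale state is gone
    }
}
```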
[jira] [Assigned] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB
[ https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41792: Assignee: Apache Spark > Shuffle merge finalization removes the wrong finalization state from the DB > --- > > Key: SPARK-41792 > URL: https://issues.apache.org/jira/browse/SPARK-41792 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0, 3.4.0 >Reporter: Mridul Muralidharan >Assignee: Apache Spark >Priority: Minor > > During `finalizeShuffleMerge` in external shuffle service, if the > finalization request is for a newer shuffle merge id, then we cleanup the > existing (older) shuffle details and add the newer entry (for which we got no > pushed blocks) to the DB. > Unfortunately, when cleaning up from the DB, we are using the incorrect > AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of > the existing entry. > Proposed Fix: > {code} > diff --git > a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > > b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > index 816d1082850..551104d0eba 100644 > --- > a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > +++ > b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements > MergedShuffleFileManager { > } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) { >// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId > then return >// empty MergeStatuses but cleanup the older shuffleMergeId files. 
> + AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new > AppAttemptShuffleMergeId( > + msg.appId, msg.appAttemptId, msg.shuffleId, > mergePartitionsInfo.shuffleMergeId); >submitCleanupTask(() -> >closeAndDeleteOutdatedPartitions( > - appAttemptShuffleMergeId, > mergePartitionsInfo.shuffleMergePartitions)); > + currentAppAttemptShuffleMergeId, > mergePartitionsInfo.shuffleMergePartitions)); > } else { >// This block covers: >// 1. finalization of determinate stage > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB
[ https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41792: Assignee: (was: Apache Spark) > Shuffle merge finalization removes the wrong finalization state from the DB > --- > > Key: SPARK-41792 > URL: https://issues.apache.org/jira/browse/SPARK-41792 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0, 3.4.0 >Reporter: Mridul Muralidharan >Priority: Minor > > During `finalizeShuffleMerge` in external shuffle service, if the > finalization request is for a newer shuffle merge id, then we cleanup the > existing (older) shuffle details and add the newer entry (for which we got no > pushed blocks) to the DB. > Unfortunately, when cleaning up from the DB, we are using the incorrect > AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of > the existing entry. > Proposed Fix: > {code} > diff --git > a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > > b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > index 816d1082850..551104d0eba 100644 > --- > a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > +++ > b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java > @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements > MergedShuffleFileManager { > } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) { >// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId > then return >// empty MergeStatuses but cleanup the older shuffleMergeId files. 
> + AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new > AppAttemptShuffleMergeId( > + msg.appId, msg.appAttemptId, msg.shuffleId, > mergePartitionsInfo.shuffleMergeId); >submitCleanupTask(() -> >closeAndDeleteOutdatedPartitions( > - appAttemptShuffleMergeId, > mergePartitionsInfo.shuffleMergePartitions)); > + currentAppAttemptShuffleMergeId, > mergePartitionsInfo.shuffleMergePartitions)); > } else { >// This block covers: >// 1. finalization of determinate stage > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB
Mridul Muralidharan created SPARK-41792: --- Summary: Shuffle merge finalization removes the wrong finalization state from the DB Key: SPARK-41792 URL: https://issues.apache.org/jira/browse/SPARK-41792 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.3.0, 3.4.0 Reporter: Mridul Muralidharan During `finalizeShuffleMerge` in external shuffle service, if the finalization request is for a newer shuffle merge id, then we cleanup the existing (older) shuffle details and add the newer entry (for which we got no pushed blocks) to the DB. Unfortunately, when cleaning up from the DB, we are using the incorrect AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of the existing entry. Proposed Fix: {code} diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java index 816d1082850..551104d0eba 100644 --- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java +++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager { } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) { // If no blocks pushed for the finalizeShuffleMerge shuffleMergeId then return // empty MergeStatuses but cleanup the older shuffleMergeId files. + AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new AppAttemptShuffleMergeId( + msg.appId, msg.appAttemptId, msg.shuffleId, mergePartitionsInfo.shuffleMergeId); submitCleanupTask(() -> closeAndDeleteOutdatedPartitions( - appAttemptShuffleMergeId, mergePartitionsInfo.shuffleMergePartitions)); + currentAppAttemptShuffleMergeId, mergePartitionsInfo.shuffleMergePartitions)); } else { // This block covers: // 1. 
finalization of determinate stage {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41423) Protobuf serializer for StageDataWrapper
[ https://issues.apache.org/jira/browse/SPARK-41423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-41423: -- Assignee: BingKun Pan > Protobuf serializer for StageDataWrapper > > > Key: SPARK-41423 > URL: https://issues.apache.org/jira/browse/SPARK-41423 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: BingKun Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41423) Protobuf serializer for StageDataWrapper
[ https://issues.apache.org/jira/browse/SPARK-41423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41423. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39192 [https://github.com/apache/spark/pull/39192] > Protobuf serializer for StageDataWrapper > > > Key: SPARK-41423 > URL: https://issues.apache.org/jira/browse/SPARK-41423 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: BingKun Pan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41754) Add simple developer guides for UI protobuf serializer
[ https://issues.apache.org/jira/browse/SPARK-41754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41754. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39270 [https://github.com/apache/spark/pull/39270] > Add simple developer guides for UI protobuf serializer > -- > > Key: SPARK-41754 > URL: https://issues.apache.org/jira/browse/SPARK-41754 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653185#comment-17653185 ] Bjørn Jørgensen commented on SPARK-41741: - [~jlelehe] can you change Affects Version/s: from 2.4.0 to 3.4.0 ? > [SQL] ParquetFilters StringStartsWith push down matching string do not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jiale He >Priority: Major > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem, but there are two ways to solve it. > > The parquet filter is pushed down. When using the like '***%' statement to > query, if the system default encoding is not UTF-8, it may cause an error. > > There are two ways to bypass this problem as far as I know > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem > The parquet sample file is in the attachment > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp”) > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
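A minimal sketch of why the reporter's one-line fix matters: `String.getBytes()` with no argument encodes with the JVM's default charset, so on a non-UTF-8 JVM the prefix bytes handed to the pushed-down StartsWith predicate can differ from the UTF-8 bytes Parquet actually stores. (Plain Java; the string is taken from the report, everything else is illustrative.)

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// getBytes() depends on the JVM default charset; getBytes(UTF_8) does not.
// Parquet stores string columns as UTF-8, so a pushed-down prefix must be
// encoded explicitly with UTF-8 or the filter may compare unrelated bytes.
public class PrefixEncodingSketch {
    public static void main(String[] args) {
        String prefix = "啦啦乐乐";                               // CJK prefix from the report
        byte[] stored = prefix.getBytes(StandardCharsets.UTF_8); // what Parquet holds
        byte[] pushed = prefix.getBytes();                       // platform-dependent (the bug)

        // Each of these CJK characters is 3 bytes in UTF-8: 4 chars -> 12 bytes.
        System.out.println(stored.length);  // 12

        // Under a UTF-8 default charset these agree; under e.g. GBK they would
        // not, and the StartsWith filter would silently prune matching rows.
        System.out.println(Arrays.equals(stored, pushed));
    }
}
```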
[jira] [Updated] (SPARK-41553) Fix the documentation for num_files
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-41553: Description: num_files has been deprecated and might be removed in a future version; use DataFrame.spark.repartition instead. The num_files argument doesn't manage the number of files, but specifies the number of partitions. was: Functions have this signature. def to_json( (..) num_files: Optional[int] = None, .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and writes multiple `part-...` files in the directory when `path` is specified. This behavior was inherited from Apache Spark. The number of files can be controlled by `num_files`. if num_files is not None: warnings.warn( "`num_files` has been deprecated and might be removed in a future version. " "Use `DataFrame.spark.repartition` instead.", FutureWarning, ) I will change num_files to repartition > Fix the documentation for num_files > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > num_files has been deprecated and might be removed in a future version; use > DataFrame.spark.repartition instead. > The num_files argument doesn't manage the number of files, but specifies the > number of partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41553) Fix the documentation for num_files
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-41553: Summary: Fix the documentation for num_files (was: Change num_files to repartition) > Fix the documentation for num_files > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Functions have this signature. > > def to_json( > (..) > num_files: Optional[int] = None, > > > .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and > writes > multiple `part-...` files in the directory when `path` is specified. > This behavior was inherited from Apache Spark. The number of files can > be controlled by `num_files`. > > > > if num_files is not None: > warnings.warn( > "`num_files` has been deprecated and might be removed in a future version. " > "Use `DataFrame.spark.repartition` instead.", > FutureWarning, > ) > > > I will change num_files to repartition -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan
[ https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41791: Assignee: (was: Apache Spark) > Create distinct metadata attributes for metadata that is constant or file and > metadata that is generated during the scan > > > Key: SPARK-41791 > URL: https://issues.apache.org/jira/browse/SPARK-41791 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.1 >Reporter: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > There are two types or Metadata in Spark > * Metadata that is constant per file (file_name, file_size, ...) > * Metadata that is not contant (currently only row_index) > The two types are generated differently > * File constant metadata is appended to the output after scan > * non-constant metadata is generated during the scan > The proposal here is to create different metadata attributes to distinguish > those different types throughout the code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan
[ https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653143#comment-17653143 ] Apache Spark commented on SPARK-41791: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/39314 > Create distinct metadata attributes for metadata that is constant or file and > metadata that is generated during the scan > > > Key: SPARK-41791 > URL: https://issues.apache.org/jira/browse/SPARK-41791 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.1 >Reporter: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > There are two types or Metadata in Spark > * Metadata that is constant per file (file_name, file_size, ...) > * Metadata that is not contant (currently only row_index) > The two types are generated differently > * File constant metadata is appended to the output after scan > * non-constant metadata is generated during the scan > The proposal here is to create different metadata attributes to distinguish > those different types throughout the code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan
[ https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41791: Assignee: Apache Spark > Create distinct metadata attributes for metadata that is constant or file and > metadata that is generated during the scan > > > Key: SPARK-41791 > URL: https://issues.apache.org/jira/browse/SPARK-41791 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.3.1 >Reporter: Jan-Ole Sasse >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > There are two types or Metadata in Spark > * Metadata that is constant per file (file_name, file_size, ...) > * Metadata that is not contant (currently only row_index) > The two types are generated differently > * File constant metadata is appended to the output after scan > * non-constant metadata is generated during the scan > The proposal here is to create different metadata attributes to distinguish > those different types throughout the code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan
Jan-Ole Sasse created SPARK-41791: - Summary: Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan Key: SPARK-41791 URL: https://issues.apache.org/jira/browse/SPARK-41791 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.3.1 Reporter: Jan-Ole Sasse Fix For: 3.4.0 There are two types of Metadata in Spark * Metadata that is constant per file (file_name, file_size, ...) * Metadata that is not constant (currently only row_index) The two types are generated differently * File-constant metadata is appended to the output after the scan * Non-constant metadata is generated during the scan The proposal here is to create different metadata attributes to distinguish those different types throughout the code -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
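The proposed split can be sketched as two distinct attribute kinds. This is a hypothetical model in plain Java, not Spark's actual Catalyst attribute classes; the type names are invented for illustration:

```java
// Hypothetical model of the proposal: give per-file-constant metadata and
// scan-generated metadata distinct attribute types, so code paths can tell
// them apart statically instead of inspecting a shared marker.
public class MetadataAttrSketch {
    sealed interface MetadataAttr permits ConstantFileAttr, GeneratedAttr {
        String name();
    }

    // Constant per file: the value (file_name, file_size, ...) is known up
    // front and can be appended to the scan output as a per-file constant.
    record ConstantFileAttr(String name) implements MetadataAttr {}

    // Generated during the scan: the value (currently only row_index)
    // depends on the row being read, so the reader itself must produce it.
    record GeneratedAttr(String name) implements MetadataAttr {}

    static boolean constantPerFile(MetadataAttr attr) {
        return attr instanceof ConstantFileAttr;
    }

    public static void main(String[] args) {
        System.out.println(constantPerFile(new ConstantFileAttr("file_name"))); // true
        System.out.println(constantPerFile(new GeneratedAttr("row_index")));    // false
    }
}
```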
[jira] [Commented] (SPARK-41790) Set TRANSFORM reader and writer's format correctly
[ https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653137#comment-17653137 ] Apache Spark commented on SPARK-41790: -- User 'mattshma' has created a pull request for this issue: https://github.com/apache/spark/pull/39315 > Set TRANSFORM reader and writer's format correctly > -- > > Key: SPARK-41790 > URL: https://issues.apache.org/jira/browse/SPARK-41790 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: mattshma >Priority: Major > > We get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only the > reader or only the writer, because the wrong format is used to feed/fetch > data to/from the running script. In theory, the writer uses inFormat to feed > input data into the running script and the reader uses outFormat to read the > output from the running script, but inFormat and outFormat are currently > assigned the wrong values in the following code: > {code:java} > val (inFormat, inSerdeClass, inSerdeProps, reader) = > format( > inRowFormat, "hive.script.recordreader", > "org.apache.hadoop.hive.ql.exec.TextRecordReader") > val (outFormat, outSerdeClass, outSerdeProps, writer) = > format( > outRowFormat, "hive.script.recordwriter", > "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code} > > Example SQL: > {code:java} > spark-sql> CREATE TABLE t1 (a string, b string); > spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4"); > spark-sql> SELECT TRANSFORM(a, b) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > USING 'cat' > > AS (c) > > FROM t1; > c > spark-sql> SELECT TRANSFORM(a, b) > > USING 'cat' > > AS (c) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > FROM t1; > c > 1 23 4{code} > > The same SQL in Hive: > {code:java} > hive> SELECT TRANSFORM(a, b) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > USING 'cat' > > AS (c) > > FROM t1; > c > 1,2 > 3,4 > hive> SELECT TRANSFORM(a, b) > > USING 'cat' > > AS (c) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > FROM t1; > c > 1 2 > 3 4 {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
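To see why the wiring matters, here is a toy model of TRANSFORM's data flow (plain Java; not Spark's actual ScriptTransformation code, and `feed`/`fetch` are invented names). The writer must serialize input rows with the input row format, and the reader must parse the script's output with the output row format. With FIELDS TERMINATED BY ',' given only on the input side and 'cat' as the script, the output side still uses the default tab delimiter, so the whole echoed line lands in the single output column, which matches the Hive output quoted above:

```java
import java.util.regex.Pattern;

// Toy model of TRANSFORM's data flow. feed() plays the writer (serialize a
// row for the script's stdin with the INPUT delimiter); fetch() plays the
// reader (split the script's stdout with the OUTPUT delimiter). The bug
// was wiring the delimiters to the wrong side.
public class TransformDelimiterSketch {
    static String feed(String[] row, String inDelimiter) {
        return String.join(inDelimiter, row);
    }

    static String[] fetch(String line, String outDelimiter) {
        return line.split(Pattern.quote(outDelimiter), -1);
    }

    public static void main(String[] args) {
        // SELECT TRANSFORM(a, b) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        // USING 'cat' AS (c): only the input side is comma-delimited.
        String toScript = feed(new String[]{"1", "2"}, ",");  // "1,2" on cat's stdin
        String fromScript = toScript;                          // 'cat' echoes its input

        // The output row format was not overridden, so the reader splits on
        // the default tab; there is none, so column c gets the whole line.
        String[] columns = fetch(fromScript, "\t");
        System.out.println(columns.length + " column(s): " + columns[0]);  // 1 column(s): 1,2
    }
}
```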
[jira] [Assigned] (SPARK-41790) Set TRANSFORM reader and writer's format correctly
[ https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41790: Assignee: (was: Apache Spark) > Set TRANSFORM reader and writer's format correctly > -- > > Key: SPARK-41790 > URL: https://issues.apache.org/jira/browse/SPARK-41790 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: mattshma >Priority: Major > > We'll get wrong data when transform only specify reader or writer 's row > format delimited, the reason is using the wrong format to feed/fetch data > to/from running script now. In theory, writer uses inFormat to feed to input > data into the running script and reader uses outFormat to read the output > from the running script, but inFormat and outFormat are set wrong value > currently in the following code: > {code:java} > val (inFormat, inSerdeClass, inSerdeProps, reader) = > format( > inRowFormat, "hive.script.recordreader", > "org.apache.hadoop.hive.ql.exec.TextRecordReader") > val (outFormat, outSerdeClass, outSerdeProps, writer) = > format( > outRowFormat, "hive.script.recordwriter", > "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code} > > Example SQL: > {code:java} > spark-sql> CREATE TABLE t1 (a string, b string); > spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4"); > spark-sql> SELECT TRANSFORM(a, b) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > USING 'cat' > > AS (c) > > FROM t1; > c > spark-sql> SELECT TRANSFORM(a, b) > > USING 'cat' > > AS (c) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > FROM t1; > c > 1 23 4{code} > > The same sql in hive: > {code:java} > hive> SELECT TRANSFORM(a, b) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > USING 'cat' > > AS (c) > > FROM t1; > c > 1,2 > 3,4 > hive> SELECT TRANSFORM(a, b) > > USING 'cat' > > AS (c) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > FROM t1; > c > 1 2 > 3 4 {code} > -- This message was sent by Atlassian 
Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41790) Set TRANSFORM reader and writer's format correctly
[ https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41790: Assignee: Apache Spark

> Set TRANSFORM reader and writer's format correctly
> --------------------------------------------------
>
>                 Key: SPARK-41790
>                 URL: https://issues.apache.org/jira/browse/SPARK-41790
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: mattshma
>            Assignee: Apache Spark
>            Priority: Major
>
> We get wrong data when a TRANSFORM query specifies ROW FORMAT DELIMITED for only the reader or only the writer; the cause is that the wrong format is used to feed/fetch data to/from the running script. In theory, the writer uses inFormat to feed the input data into the running script and the reader uses outFormat to read the output from the running script, but inFormat and outFormat are currently assigned the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
>     inRowFormat, "hive.script.recordreader",
>     "org.apache.hadoop.hive.ql.exec.TextRecordReader")
>
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
>     outRowFormat, "hive.script.recordwriter",
>     "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string);
> spark-sql> INSERT OVERWRITE t1 VALUES ("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          > ROW FORMAT DELIMITED
>          > FIELDS TERMINATED BY ','
>          > USING 'cat'
>          > AS (c)
>          > FROM t1;
> c
> spark-sql> SELECT TRANSFORM(a, b)
>          > USING 'cat'
>          > AS (c)
>          > ROW FORMAT DELIMITED
>          > FIELDS TERMINATED BY ','
>          > FROM t1;
> c
> 1 23 4{code}
>
> The same SQL in Hive:
> {code:java}
> hive> SELECT TRANSFORM(a, b)
>     > ROW FORMAT DELIMITED
>     > FIELDS TERMINATED BY ','
>     > USING 'cat'
>     > AS (c)
>     > FROM t1;
> c
> 1,2
> 3,4
> hive> SELECT TRANSFORM(a, b)
>     > USING 'cat'
>     > AS (c)
>     > ROW FORMAT DELIMITED
>     > FIELDS TERMINATED BY ','
>     > FROM t1;
> c
> 1 2
> 3 4 {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41790) Set TRANSFORM reader and writer's format correctly
[ https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mattshma updated SPARK-41790: Summary: Set TRANSFORM reader and writer's format correctly (was: Transform will get wrong date when only specify reader or writer 's row format delimited)
[jira] [Updated] (SPARK-41790) Transform will get wrong date when only specify reader or writer 's row format delimited
[ https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mattshma updated SPARK-41790: Description: (wording revised)
[jira] [Updated] (SPARK-41790) Transform will get wrong date when only specify reader or writer 's row format delimited
[ https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mattshma updated SPARK-41790: Description: (wording revised)
[jira] [Created] (SPARK-41790) Transform will get wrong date when only specify reader or writer 's row format delimited
mattshma created SPARK-41790:
--------------------------------

             Summary: Transform will get wrong date when only specify reader or writer 's row format delimited
                 Key: SPARK-41790
                 URL: https://issues.apache.org/jira/browse/SPARK-41790
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.1
            Reporter: mattshma
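The format mix-up reported above can be illustrated outside Spark. The sketch below is a hypothetical Python model of the TRANSFORM pipeline, not Spark code: the `run_transform` helper and the delimiter arguments are stand-ins. The point it demonstrates is that the script *writer* must serialize rows with the input row format and the script *reader* must parse output lines with the output row format; swapping the two produces the kind of mismatch seen between the spark-sql and Hive outputs.

```python
def run_transform(rows, script, writer_delim, reader_delim):
    """Toy model of SELECT TRANSFORM: serialize rows, pipe them through
    a script, then split the script's output back into columns."""
    # Writer side: join each input row's fields and feed them to the script.
    fed = "".join(writer_delim.join(row) + "\n" for row in rows)
    out = script(fed)
    # Reader side: split each output line back into columns.
    return [line.split(reader_delim) for line in out.splitlines()]

cat = lambda text: text  # USING 'cat' simply echoes its input

rows = [("1", "2"), ("3", "4")]

# Correct pairing when ROW FORMAT DELIMITED ',' applies to the input side
# only: the writer joins with ',' and the reader falls back to '\t', so the
# whole comma-joined line lands in the single output column c, as in Hive.
good = run_transform(rows, cat, writer_delim=",", reader_delim="\t")
assert good == [["1,2"], ["3,4"]]

# Buggy pairing: the ',' format is applied on the reader side instead, so
# the tab-joined lines are never split where the user expects.
bad = run_transform(rows, cat, writer_delim="\t", reader_delim=",")
assert bad == [["1\t2"], ["3\t4"]]
```

Under this model, fixing the bug amounts to pairing inRowFormat with the record writer and outRowFormat with the record reader rather than the other way around.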
[jira] [Commented] (SPARK-41789) Make `createDataFrame` support list of Rows
[ https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653104#comment-17653104 ] Apache Spark commented on SPARK-41789:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39313

> Make `createDataFrame` support list of Rows
> -------------------------------------------
>
>                 Key: SPARK-41789
>                 URL: https://issues.apache.org/jira/browse/SPARK-41789
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
[jira] [Assigned] (SPARK-41789) Make `createDataFrame` support list of Rows
[ https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41789: Assignee: Apache Spark (was: Ruifeng Zheng)
[jira] [Assigned] (SPARK-41789) Make `createDataFrame` support list of Rows
[ https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41789: Assignee: Ruifeng Zheng (was: Apache Spark)
[jira] [Assigned] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row
[ https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41745: Assignee: (was: Apache Spark)

> SparkSession.createDataFrame does not respect the column names in the row
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41745
>                 URL: https://issues.apache.org/jira/browse/SPARK-41745
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in pyspark.sql.connect.group.GroupedData.pivot
> Failed example:
>     df1.show()
> Differences (ndiff with -expected +actual):
>     - +--+++
>     ?    ---
>     + +--++-+
>     - |course|year|earnings|
>     + |_1| _2| _3|
>     - +--+++
>     ?    ---
>     + +--++-+
>     - |dotNET|2012| 1|
>     ?    ---
>     + |dotNET|2012|1|
>     - | Java|2012| 2|
>     ?    ---
>     + | Java|2012|2|
>     - |dotNET|2012|5000|
>     + |dotNET|2012| 5000|
>     - |dotNET|2013| 48000|
>     ?    ---
>     + |dotNET|2013|48000|
>     - | Java|2013| 3|
>     ?    ---
>     + | Java|2013|3|
>     - +--+++
>     ?    ---
>     + +--++-+
>     +
> {code}
[jira] [Assigned] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row
[ https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41745: Assignee: Apache Spark
[jira] [Commented] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row
[ https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653103#comment-17653103 ] Apache Spark commented on SPARK-41745: User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39313
[jira] [Created] (SPARK-41789) Make `createDataFrame` support list of Rows
Ruifeng Zheng created SPARK-41789:
---------------------------------

             Summary: Make `createDataFrame` support list of Rows
                 Key: SPARK-41789
                 URL: https://issues.apache.org/jira/browse/SPARK-41789
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Ruifeng Zheng
            Assignee: Ruifeng Zheng
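SPARK-41789 and SPARK-41745 are two sides of the same behavior: `createDataFrame` should accept a list of `Row` objects and use the field names they carry, instead of falling back to positional `_1`, `_2`, ... names as in the failing `pivot` doctest above. Here is a minimal sketch of that naming rule, using `collections.namedtuple` as a stand-in for `pyspark.sql.Row`; the `infer_column_names` helper is hypothetical and is not Spark's actual implementation.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which likewise exposes its field names.
Row = namedtuple("Row", ["course", "year", "earnings"])

def infer_column_names(data):
    """Use the first row's field names when present; otherwise fall back
    to the positional _1, _2, ... naming seen in the failing doctest."""
    first = data[0]
    if hasattr(first, "_fields"):
        return list(first._fields)
    return [f"_{i + 1}" for i in range(len(first))]

# Rows carrying field names should yield those names as columns...
assert infer_column_names([Row("dotNET", 2012, 10000)]) == ["course", "year", "earnings"]
# ...while plain tuples legitimately fall back to positional names.
assert infer_column_names([("dotNET", 2012, 10000)]) == ["_1", "_2", "_3"]
```

The bug report amounts to the first branch being skipped, so named rows were treated like plain tuples.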
[jira] [Assigned] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators
[ https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41788: Assignee: Apache Spark

> Move InsertIntoStatement to basicLogicalOperators
> -------------------------------------------------
>
>                 Key: SPARK-41788
>                 URL: https://issues.apache.org/jira/browse/SPARK-41788
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Cheng Pan
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Assigned] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators
[ https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41788: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators
[ https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653097#comment-17653097 ] Apache Spark commented on SPARK-41788: User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/39312
[jira] [Created] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators
Cheng Pan created SPARK-41788:
-----------------------------

             Summary: Move InsertIntoStatement to basicLogicalOperators
                 Key: SPARK-41788
                 URL: https://issues.apache.org/jira/browse/SPARK-41788
             Project: Spark
          Issue Type: Task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Cheng Pan
[jira] [Commented] (SPARK-41442) Only update SQLMetric value if merging with valid metric
[ https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653091#comment-17653091 ] Apache Spark commented on SPARK-41442:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/39311

> Only update SQLMetric value if merging with valid metric
> --------------------------------------------------------
>
>                 Key: SPARK-41442
>                 URL: https://issues.apache.org/jira/browse/SPARK-41442
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 3.4.0
>
> We use -1 as the initial value of a SQLMetric and change it to 0 when merging with other SQLMetric instances. A SQLMetric holding -1 is treated as invalid and filtered out later.
> While developing with Spark, it is troublesome that two invalid SQLMetric instances can merge into a valid SQLMetric, because merging sets the value to 0.
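The invalid-metric rule described above can be sketched as follows. This is an illustrative Python model, not the Scala `SQLMetric` class itself; the assumption (taken from the issue text) is that `-1` marks a never-updated metric, and that `merge` should only promote a metric to valid when the other side actually carries a value.

```python
INVALID = -1  # sentinel meaning "this metric was never set"

class SQLMetric:
    def __init__(self, value=INVALID):
        self.value = value

    def is_valid(self):
        return self.value >= 0

    def merge(self, other):
        # Only fold in the other metric when it carries a real value, so
        # merging two untouched metrics no longer yields a spurious 0.
        if other.is_valid():
            if not self.is_valid():
                self.value = 0
            self.value += other.value

a, b = SQLMetric(), SQLMetric()
a.merge(b)
assert not a.is_valid()   # two invalid metrics stay invalid

a.merge(SQLMetric(5))
assert a.value == 5       # a valid input makes the result valid
```

Without the `other.is_valid()` guard, `a.merge(b)` on two fresh metrics would set the value to 0 and the metric would wrongly survive the later invalid-metric filter.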
[jira] [Resolved] (SPARK-41629) Support for protocol extensions
[ https://issues.apache.org/jira/browse/SPARK-41629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41629.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39291 [https://github.com/apache/spark/pull/39291]

> Support for protocol extensions
> -------------------------------
>
>                 Key: SPARK-41629
>                 URL: https://issues.apache.org/jira/browse/SPARK-41629
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Assignee: Martin Grund
>            Priority: Major
>             Fix For: 3.4.0
>
> Spark comes with many different extension points. Many of those simply become available through the shared classpath between Spark and the user application. To be able to support arbitrary plugins, e.g. for Delta or Iceberg, we need a way to make the Spark Connect protocol extensible and let users register their own handlers.
[jira] [Assigned] (SPARK-41629) Support for protocol extensions
[ https://issues.apache.org/jira/browse/SPARK-41629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41629: Assignee: Martin Grund
[jira] [Commented] (SPARK-41787) Upgrade silencer from 1.7.10 to 1.7.12
[ https://issues.apache.org/jira/browse/SPARK-41787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653076#comment-17653076 ] Apache Spark commented on SPARK-41787:
--------------------------------------

User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/39309

> Upgrade silencer from 1.7.10 to 1.7.12
> --------------------------------------
>
>                 Key: SPARK-41787
>                 URL: https://issues.apache.org/jira/browse/SPARK-41787
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.4.0
>            Reporter: BingKun Pan
>            Priority: Minor
>         Attachments: image-2022-12-30-16-57-32-736.png
>
> !image-2022-12-30-16-57-32-736.png!