[jira] [Commented] (SPARK-41799) Combine plan-related tests

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653272#comment-17653272
 ] 

Apache Spark commented on SPARK-41799:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39323

> Combine plan-related tests
> --
>
> Key: SPARK-41799
> URL: https://issues.apache.org/jira/browse/SPARK-41799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41799) Combine plan-related tests

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653271#comment-17653271
 ] 

Apache Spark commented on SPARK-41799:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39323

> Combine plan-related tests
> --
>
> Key: SPARK-41799
> URL: https://issues.apache.org/jira/browse/SPARK-41799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41799) Combine plan-related tests

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41799:


Assignee: (was: Apache Spark)

> Combine plan-related tests
> --
>
> Key: SPARK-41799
> URL: https://issues.apache.org/jira/browse/SPARK-41799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41799) Combine plan-related tests

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41799:


Assignee: Apache Spark

> Combine plan-related tests
> --
>
> Key: SPARK-41799
> URL: https://issues.apache.org/jira/browse/SPARK-41799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41799) Combine plan-related tests

2022-12-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41799:
-

 Summary: Combine plan-related tests
 Key: SPARK-41799
 URL: https://issues.apache.org/jira/browse/SPARK-41799
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2022-12-30 Thread Jiale He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiale He updated SPARK-41741:
-
Affects Version/s: 3.4.0
   (was: 2.4.0)

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, 
> part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem; there are two ways to work around it.
>  
> With Parquet filter pushdown enabled, a query that uses a like '***%' 
> predicate may return an error if the system default encoding is not UTF-8.
>  
> As far as I know, there are two ways to bypass this problem:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following reproduces the problem; the Parquet sample file is in the 
> attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
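For background, a standalone sketch (plain Python, no Spark required) of the
root cause described above: the same prefix string encodes to different byte
sequences under different charsets, so pushdown bytes built with the platform
default will not match UTF-8-encoded Parquet data.

{code:python}
# The matching prefix from the repro above; byte lengths differ by charset.
s = "啦啦乐乐"
print(s.encode("utf-8"))  # 12 bytes: what UTF-8 Parquet data stores
print(s.encode("gbk"))    # 8 bytes: what a GBK-default JVM would push down
{code}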



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2022-12-30 Thread Jiale He (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653269#comment-17653269
 ] 

Jiale He commented on SPARK-41741:
--

[~bjornjorgensen] done

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, 
> part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem; there are two ways to work around it.
>  
> With Parquet filter pushdown enabled, a query that uses a like '***%' 
> predicate may return an error if the system default encoding is not UTF-8.
>  
> As far as I know, there are two ways to bypass this problem:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following reproduces the problem; the Parquet sample file is in the 
> attachment.
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-00000-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp")
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41797.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39319
[https://github.com/apache/spark/pull/39319]

> Enable test for `array_repeat`
> --
>
> Key: SPARK-41797
> URL: https://issues.apache.org/jira/browse/SPARK-41797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41797:
-

Assignee: Ruifeng Zheng

> Enable test for `array_repeat`
> --
>
> Key: SPARK-41797
> URL: https://issues.apache.org/jira/browse/SPARK-41797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41798) Upgrade hive-storage-api to 2.8.1

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653257#comment-17653257
 ] 

Apache Spark commented on SPARK-41798:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39322

> Upgrade hive-storage-api to 2.8.1
> -
>
> Key: SPARK-41798
> URL: https://issues.apache.org/jira/browse/SPARK-41798
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41798) Upgrade hive-storage-api to 2.8.1

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41798:


Assignee: Apache Spark

> Upgrade hive-storage-api to 2.8.1
> -
>
> Key: SPARK-41798
> URL: https://issues.apache.org/jira/browse/SPARK-41798
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41798) Upgrade hive-storage-api to 2.8.1

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41798:


Assignee: (was: Apache Spark)

> Upgrade hive-storage-api to 2.8.1
> -
>
> Key: SPARK-41798
> URL: https://issues.apache.org/jira/browse/SPARK-41798
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41798) Upgrade hive-storage-api to 2.8.1

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653256#comment-17653256
 ] 

Apache Spark commented on SPARK-41798:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39322

> Upgrade hive-storage-api to 2.8.1
> -
>
> Key: SPARK-41798
> URL: https://issues.apache.org/jira/browse/SPARK-41798
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41773) Window.partitionBy is not respected with row_number

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41773.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39318
[https://github.com/apache/spark/pull/39318]

> Window.partitionBy is not respected with row_number 
> 
>
> Key: SPARK-41773
> URL: https://issues.apache.org/jira/browse/SPARK-41773
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in 
> pyspark.sql.connect.window.Window.orderBy
> Failed example:
> df.withColumn("row_number", row_number().over(window)).show()
> Expected:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       a|         1|
> |  1|       a|         2|
> |  1|       b|         3|
> |  2|       a|         1|
> |  2|       b|         2|
> |  3|       b|         1|
> +---+--------+----------+
> Got:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       b|         1|
> |  1|       a|         2|
> |  1|       a|         3|
> |  2|       b|         1|
> |  2|       a|         2|
> |  3|       b|         1|
> +---+--------+----------+
> {code}
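For context, a minimal reconstruction of the doctest setup (assumed from the
expected output: data with id/category columns, a window partitioned by id
and ordered by category; `spark` is an active session):

{code:python}
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "b")],
    ["id", "category"],
)
window = Window.partitionBy("id").orderBy("category")
# Correct behavior numbers rows within each id partition.
df.withColumn("row_number", F.row_number().over(window)).show()
{code}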



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41798) Upgrade hive-storage-api to 2.8.1

2022-12-30 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-41798:
-

 Summary: Upgrade hive-storage-api to 2.8.1
 Key: SPARK-41798
 URL: https://issues.apache.org/jira/browse/SPARK-41798
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41773) Window.partitionBy is not respected with row_number

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41773:


Assignee: Ruifeng Zheng

> Window.partitionBy is not respected with row_number 
> 
>
> Key: SPARK-41773
> URL: https://issues.apache.org/jira/browse/SPARK-41773
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in 
> pyspark.sql.connect.window.Window.orderBy
> Failed example:
> df.withColumn("row_number", row_number().over(window)).show()
> Expected:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       a|         1|
> |  1|       a|         2|
> |  1|       b|         3|
> |  2|       a|         1|
> |  2|       b|         2|
> |  3|       b|         1|
> +---+--------+----------+
> Got:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       b|         1|
> |  1|       a|         2|
> |  1|       a|         3|
> |  2|       b|         1|
> |  2|       a|         2|
> |  3|       b|         1|
> +---+--------+----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41383) Implement `DataFrame.cube`

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653251#comment-17653251
 ] 

Apache Spark commented on SPARK-41383:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39321

> Implement `DataFrame.cube`
> --
>
> Key: SPARK-41383
> URL: https://issues.apache.org/jira/browse/SPARK-41383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41069) Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile`

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41069.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39262
[https://github.com/apache/spark/pull/39262]

> Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile`
> 
>
> Key: SPARK-41069
> URL: https://issues.apache.org/jira/browse/SPARK-41069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41796:


Assignee: (was: Apache Spark)

> Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
> 
>
> Key: SPARK-41796
> URL: https://issues.apache.org/jira/browse/SPARK-41796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41796:


Assignee: Apache Spark

> Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
> 
>
> Key: SPARK-41796
> URL: https://issues.apache.org/jira/browse/SPARK-41796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>
> UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

2022-12-30 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-41796:

Summary: Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE  
(was: Test the error class: 
UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE)

> Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
> 
>
> Key: SPARK-41796
> URL: https://issues.apache.org/jira/browse/SPARK-41796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653245#comment-17653245
 ] 

Apache Spark commented on SPARK-41796:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39320

> Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
> 
>
> Key: SPARK-41796
> URL: https://issues.apache.org/jira/browse/SPARK-41796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41796) Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

2022-12-30 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-41796:

Description: 
UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

> Test the error class: UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
> 
>
> Key: SPARK-41796
> URL: https://issues.apache.org/jira/browse/SPARK-41796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>
> UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41797:


Assignee: Apache Spark

> Enable test for `array_repeat`
> --
>
> Key: SPARK-41797
> URL: https://issues.apache.org/jira/browse/SPARK-41797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41797:


Assignee: (was: Apache Spark)

> Enable test for `array_repeat`
> --
>
> Key: SPARK-41797
> URL: https://issues.apache.org/jira/browse/SPARK-41797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653244#comment-17653244
 ] 

Apache Spark commented on SPARK-41797:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39319

> Enable test for `array_repeat`
> --
>
> Key: SPARK-41797
> URL: https://issues.apache.org/jira/browse/SPARK-41797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41797:
-

 Summary: Enable test for `array_repeat`
 Key: SPARK-41797
 URL: https://issues.apache.org/jira/browse/SPARK-41797
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark, Tests
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41797) Enable test for `array_repeat`

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41797:
--
Parent: SPARK-41283
Issue Type: Sub-task  (was: Improvement)

> Enable test for `array_repeat`
> --
>
> Key: SPARK-41797
> URL: https://issues.apache.org/jira/browse/SPARK-41797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41796) Test the error class: UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE

2022-12-30 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-41796:
---

 Summary: Test the error class: 
UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE
 Key: SPARK-41796
 URL: https://issues.apache.org/jira/browse/SPARK-41796
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.4.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41786) Deduplicate helper functions

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41786.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39307
[https://github.com/apache/spark/pull/39307]

> Deduplicate helper functions
> 
>
> Key: SPARK-41786
> URL: https://issues.apache.org/jira/browse/SPARK-41786
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41786) Deduplicate helper functions

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41786:
-

Assignee: Ruifeng Zheng

> Deduplicate helper functions
> 
>
> Key: SPARK-41786
> URL: https://issues.apache.org/jira/browse/SPARK-41786
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41773) Window.partitionBy is not respected with row_number

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41773:


Assignee: (was: Apache Spark)

> Window.partitionBy is not respected with row_number 
> 
>
> Key: SPARK-41773
> URL: https://issues.apache.org/jira/browse/SPARK-41773
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in 
> pyspark.sql.connect.window.Window.orderBy
> Failed example:
> df.withColumn("row_number", row_number().over(window)).show()
> Expected:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       a|         1|
> |  1|       a|         2|
> |  1|       b|         3|
> |  2|       a|         1|
> |  2|       b|         2|
> |  3|       b|         1|
> +---+--------+----------+
> Got:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       b|         1|
> |  1|       a|         2|
> |  1|       a|         3|
> |  2|       b|         1|
> |  2|       a|         2|
> |  3|       b|         1|
> +---+--------+----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41773) Window.partitionBy is not respected with row_number

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653242#comment-17653242
 ] 

Apache Spark commented on SPARK-41773:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39318

> Window.partitionBy is not respected with row_number 
> 
>
> Key: SPARK-41773
> URL: https://issues.apache.org/jira/browse/SPARK-41773
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in 
> pyspark.sql.connect.window.Window.orderBy
> Failed example:
> df.withColumn("row_number", row_number().over(window)).show()
> Expected:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       a|         1|
> |  1|       a|         2|
> |  1|       b|         3|
> |  2|       a|         1|
> |  2|       b|         2|
> |  3|       b|         1|
> +---+--------+----------+
> Got:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       b|         1|
> |  1|       a|         2|
> |  1|       a|         3|
> |  2|       b|         1|
> |  2|       a|         2|
> |  3|       b|         1|
> +---+--------+----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41773) Window.partitionBy is not respected with row_number

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41773:


Assignee: Apache Spark

> Window.partitionBy is not respected with row_number 
> 
>
> Key: SPARK-41773
> URL: https://issues.apache.org/jira/browse/SPARK-41773
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/window.py", line 292, in 
> pyspark.sql.connect.window.Window.orderBy
> Failed example:
> df.withColumn("row_number", row_number().over(window)).show()
> Expected:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       a|         1|
> |  1|       a|         2|
> |  1|       b|         3|
> |  2|       a|         1|
> |  2|       b|         2|
> |  3|       b|         1|
> +---+--------+----------+
> Got:
> +---+--------+----------+
> | id|category|row_number|
> +---+--------+----------+
> |  1|       b|         1|
> |  1|       a|         2|
> |  1|       a|         3|
> |  2|       b|         1|
> |  2|       a|         2|
> |  3|       b|         1|
> +---+--------+----------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-12-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41049:
---

Assignee: Wenchen Fan

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to be stable in the event that any 
> subsequent expression is a CodegenFallback. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff0000}2028{color}|
> |8320|8320|8320|{color:#ff0000}1640{color}|
> |7937|7937|7937|{color:#ff0000}769{color}|
> |436|436|436|{color:#ff0000}8924{color}|
> |8924|8924|2827|{color:#ff0000}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.
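A PySpark sketch of the workaround described above (the original repro is
Scala; `spark` is assumed to be an active session):

{code:python}
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.range(1, 6).toDF("x")
v1 = (F.rand() * F.lit(10000)).cast(IntegerType())

# Evaluate the nondeterministic expression in an earlier select, so the
# CodegenFallback expression (to_csv) only sees a stable column reference.
stable = df.select(v1.alias("v1"))
stable.select("v1", "v1", F.to_csv(F.struct(F.col("v1").alias("a")))).collect()
{code}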



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-12-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41049.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39248
[https://github.com/apache/spark/pull/39248]

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Priority: Major
> Fix For: 3.4.0
>
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to be stable in the event that any 
> subsequent expression is a CodegenFallback. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(10000)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff0000}2028{color}|
> |8320|8320|8320|{color:#ff0000}1640{color}|
> |7937|7937|7937|{color:#ff0000}769{color}|
> |436|436|436|{color:#ff0000}8924{color}|
> |8924|8924|2827|{color:#ff0000}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41731) Implement the column accessor

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653240#comment-17653240
 ] 

Apache Spark commented on SPARK-41731:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39317

> Implement the column accessor
> -
>
> Key: SPARK-41731
> URL: https://issues.apache.org/jira/browse/SPARK-41731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41731) Implement the column accessor

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653239#comment-17653239
 ] 

Apache Spark commented on SPARK-41731:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39317

> Implement the column accessor
> -
>
> Key: SPARK-41731
> URL: https://issues.apache.org/jira/browse/SPARK-41731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41795) Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column

2022-12-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41795:


 Summary: Disable ANSI mode in 
pyspark.sql.tests.connect.test_connect_column
 Key: SPARK-41795
 URL: https://issues.apache.org/jira/browse/SPARK-41795
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Tests
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


See 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41795) Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41795:
-
Description: See  SPARK-41794  (was: See )

> Disable ANSI mode in pyspark.sql.tests.connect.test_connect_column
> --
>
> Key: SPARK-41795
> URL: https://issues.apache.org/jira/browse/SPARK-41795
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See  SPARK-41794



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41794) Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_column

2022-12-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41794:


 Summary: Reenable ANSI mode in 
pyspark.sql.tests.connect.test_connect_column
 Key: SPARK-41794
 URL: https://issues.apache.org/jira/browse/SPARK-41794
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Tests
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon
Assignee: Ruifeng Zheng


{code}
==
ERROR [0.901s]: test_column_accessor 
(pyspark.sql.tests.connect.test_connect_column.SparkConnectTests)
--
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests/connect/test_connect_column.py", 
line 744, in test_column_accessor
cdf.select(CF.col("z")[0], cdf.z[10], CF.col("z")[-10]).toPandas(),
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 949, in 
toPandas
return self._session.client.to_pandas(query)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas
return self._execute_and_fetch(req)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in 
_execute_and_fetch
self._handle_error(rpc_error)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 623, in 
_handle_error
raise SparkConnectException(status.message, info.reason) from None
pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkArrayIndexOutOfBoundsException) [INVALID_ARRAY_INDEX] 
The index 10 is out of bounds. The array has 3 elements. Use the SQL function 
`get()` to tolerate accessing element at invalid index and return NULL instead. 
If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.

==
ERROR [0.245s]: test_column_arithmetic_ops 
(pyspark.sql.tests.connect.test_connect_column.SparkConnectTests)
--
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests/connect/test_connect_column.py", 
line 799, in test_column_arithmetic_ops
cdf.select(cdf.a % cdf["b"], cdf["a"] % 2, 12 % cdf.c).toPandas(),
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 949, in 
toPandas
return self._session.client.to_pandas(query)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in to_pandas
return self._execute_and_fetch(req)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in 
_execute_and_fetch
self._handle_error(rpc_error)
  File "/.../spark/python/pyspark/sql/connect/client.py", line 623, in 
_handle_error
raise SparkConnectException(status.message, info.reason) from None
pyspark.sql.connect.client.SparkConnectException: 
(org.apache.spark.SparkArithmeticException) [DIVIDE_BY_ZERO] Division by zero. 
Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.

{code}
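The error messages point to ANSI-tolerant alternatives; a hedged sketch using
SQL expressions (`cdf` is the test DataFrame from the tracebacks above):

{code:python}
from pyspark.sql import functions as F

# get() returns NULL for an out-of-range index instead of raising
# INVALID_ARRAY_INDEX; try_divide() returns NULL instead of DIVIDE_BY_ZERO.
cdf.select(F.expr("get(z, 10)"))
cdf.select(F.expr("try_divide(a, b)"))
{code}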



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2022-12-30 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653238#comment-17653238
 ] 

Gera Shegalov commented on SPARK-41793:
---

Similarly in SQLite
{code}
.header on

create table test_table(a long, b decimal(38,2));
insert into test_table 
values
  ('9223372036854775807', '11342371013783243717493546650944543.47'),
  ('9223372036854775807', '999999999999999999999999999999999999.99');

select * from test_table;

select 
  count(1) over(
partition by a 
order by b asc
range between 10.2345 preceding and 6.7890 following) as cnt_1 
  from 
test_table;
{code}

yields

{code}
a|b
9223372036854775807|1.13423710137832e+34
9223372036854775807|1.0e+36
cnt_1
1
1
{code}

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Major
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '999999999999999999999999999999999999.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> [Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41553) Fix the documentation for num_files

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41553:


Assignee: Bjørn Jørgensen

> Fix the documentation for num_files
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> num_files has been deprecated and might be removed in a future version: 
> "Use DataFrame.spark.repartition instead."
> The num_files argument doesn't control the number of output files directly; 
> it specifies the number of partitions to repartition to.
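A sketch of the migration the deprecation message suggests (pandas-on-Spark;
the output path is a placeholder):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})
# Deprecated: psdf.to_csv("/tmp/out", num_files=4)
# Replacement: set the partition count explicitly, then write.
psdf.spark.repartition(4).to_csv("/tmp/out")
{code}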



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41553) Fix the documentation for num_files

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41553.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39098
[https://github.com/apache/spark/pull/39098]

> Fix the documentation for num_files
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> num_files has been deprecated and might be removed in a future version: 
> "Use DataFrame.spark.repartition instead."
> The num_files argument doesn't control the number of output files directly; 
> it specifies the number of partitions to repartition to.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2022-12-30 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-41793:
--
Summary: Incorrect result for window frames defined by a range clause on 
large decimals   (was: Incorrect result for window frames defined as ranges on 
large decimals )

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Major
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '999999999999999999999999999999999999.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> [Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41793) Incorrect result for window frames defined as ranges on large decimals

2022-12-30 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-41793:
-

 Summary: Incorrect result for window frames defined as ranges on 
large decimals 
 Key: SPARK-41793
 URL: https://issues.apache.org/jira/browse/SPARK-41793
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gera Shegalov


Context 
https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686

The following windowing query on a simple two-row input should produce two 
non-empty windows as a result

{code}
from pprint import pprint
data = [
  ('9223372036854775807', '11342371013783243717493546650944543.47'),
  ('9223372036854775807', '999999999999999999999999999999999999.99')
]
df1 = spark.createDataFrame(data, 'a STRING, b STRING')
df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
df2.createOrReplaceTempView('test_table')
df = sql('''
  SELECT 
COUNT(1) OVER (
  PARTITION BY a 
  ORDER BY b ASC 
  RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
) AS CNT_1 
  FROM 
test_table
  ''')
res = df.collect()
df.explain(True)
pprint(res)
{code}

Spark 3.4.0-SNAPSHOT output:
{code}
[Row(CNT_1=1), Row(CNT_1=0)]
{code}

Spark 3.3.1 output as expected:
{code}
[Row(CNT_1=1), Row(CNT_1=1)]
{code}







--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41745.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39313
[https://github.com/apache/spark/pull/39313]

> SparkSession.createDataFrame does not respect the column names in the row
> -
>
> Key: SPARK-41745
> URL: https://issues.apache.org/jira/browse/SPARK-41745
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in 
> pyspark.sql.connect.group.GroupedData.pivot
> Failed example:
> df1.show()
> Differences (ndiff with -expected +actual):
> - +------+----+--------+
> ?               ---
> + +------+----+-----+
> - |course|year|earnings|
> + |    _1|  _2|   _3|
> - +------+----+--------+
> ?               ---
> + +------+----+-----+
> - |dotNET|2012|   10000|
> ?              ---
> + |dotNET|2012|10000|
> - |  Java|2012|   20000|
> ?              ---
> + |  Java|2012|20000|
> - |dotNET|2012|    5000|
> ?               ---
> + |dotNET|2012| 5000|
> - |dotNET|2013|   48000|
> ?              ---
> + |dotNET|2013|48000|
> - |  Java|2013|   30000|
> ?              ---
> + |  Java|2013|30000|
> - +------+----+--------+
> ?               ---
> + +------+----+-----+
> + 
> {code}
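For reference, a sketch of the pattern under test, assuming the pivot doctest
data: field names on the Row objects should become the column names rather
than _1/_2/_3.

{code:python}
from pyspark.sql import Row

df1 = spark.createDataFrame([
    Row(course="dotNET", year=2012, earnings=10000),
    Row(course="Java", year=2012, earnings=20000),
    Row(course="dotNET", year=2012, earnings=5000),
    Row(course="dotNET", year=2013, earnings=48000),
    Row(course="Java", year=2013, earnings=30000),
])
df1.show()  # header should read course|year|earnings, not _1|_2|_3
{code}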



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41789) Make `createDataFrame` support list of Rows

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41789.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39313
[https://github.com/apache/spark/pull/39313]

> Make `createDataFrame` support list of Rows
> ---
>
> Key: SPARK-41789
> URL: https://issues.apache.org/jira/browse/SPARK-41789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41745:


Assignee: Ruifeng Zheng

> SparkSession.createDataFrame does not respect the column names in the row
> -
>
> Key: SPARK-41745
> URL: https://issues.apache.org/jira/browse/SPARK-41745
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in 
> pyspark.sql.connect.group.GroupedData.pivot
> Failed example:
> df1.show()
> Differences (ndiff with -expected +actual):
> - +------+----+--------+
> ?               ---
> + +------+----+-----+
> - |course|year|earnings|
> + |    _1|  _2|   _3|
> - +------+----+--------+
> ?               ---
> + +------+----+-----+
> - |dotNET|2012|   10000|
> ?              ---
> + |dotNET|2012|10000|
> - |  Java|2012|   20000|
> ?              ---
> + |  Java|2012|20000|
> - |dotNET|2012|    5000|
> ?               ---
> + |dotNET|2012| 5000|
> - |dotNET|2013|   48000|
> ?              ---
> + |dotNET|2013|48000|
> - |  Java|2013|   30000|
> ?              ---
> + |  Java|2013|30000|
> - +------+----+--------+
> ?               ---
> + +------+----+-----+
> + 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41787) Upgrade silencer from 1.7.10 to 1.7.12

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41787.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39309
[https://github.com/apache/spark/pull/39309]

> Upgrade silencer from 1.7.10 to 1.7.12
> --
>
> Key: SPARK-41787
> URL: https://issues.apache.org/jira/browse/SPARK-41787
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2022-12-30-16-57-32-736.png
>
>
> !image-2022-12-30-16-57-32-736.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41787) Upgrade silencer from 1.7.10 to 1.7.12

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41787:


Assignee: BingKun Pan

> Upgrade silencer from 1.7.10 to 1.7.12
> --
>
> Key: SPARK-41787
> URL: https://issues.apache.org/jira/browse/SPARK-41787
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Attachments: image-2022-12-30-16-57-32-736.png
>
>
> !image-2022-12-30-16-57-32-736.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41785) Implement `GroupedData.mean`

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41785.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39304
[https://github.com/apache/spark/pull/39304]

> Implement `GroupedData.mean`
> 
>
> Key: SPARK-41785
> URL: https://issues.apache.org/jira/browse/SPARK-41785
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
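
For context, the behavior the Connect implementation mirrors from classic PySpark (a minimal sketch, assuming an active `spark` session; the data is illustrative):

{code:python}
df = spark.createDataFrame(
    [("dotNET", 2012, 10000), ("Java", 2012, 20000), ("dotNET", 2013, 48000)],
    ["course", "year", "earnings"],
)

# GroupedData.mean is an alias of avg over the given numeric columns.
df.groupBy("course").mean("earnings").show()
{code}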




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41784) Add missing `__rmod__`

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41784:


Assignee: Ruifeng Zheng

> Add missing `__rmod__`
> --
>
> Key: SPARK-41784
> URL: https://issues.apache.org/jira/browse/SPARK-41784
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>
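
For context, `__rmod__` is what lets a plain Python value appear on the left-hand side of `%` against a Column (a minimal sketch, assuming an active `spark` session):

{code:python}
df = spark.range(1, 5)  # single column `id` with values 1..4

# df.id % 3 dispatches to Column.__mod__; 10 % df.id needs Column.__rmod__,
# because the int on the left cannot handle a Column operand itself.
df.select((df.id % 3).alias("mod"), (10 % df.id).alias("rmod")).show()
{code}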




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41785) Implement `GroupedData.mean`

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41785:


Assignee: Ruifeng Zheng

> Implement `GroupedData.mean`
> 
>
> Key: SPARK-41785
> URL: https://issues.apache.org/jira/browse/SPARK-41785
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41784) Add missing `__rmod__`

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41784.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39303
[https://github.com/apache/spark/pull/39303]

> Add missing `__rmod__`
> --
>
> Key: SPARK-41784
> URL: https://issues.apache.org/jira/browse/SPARK-41784
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41770) eqNullSafe does not support None as its argument

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41770:


Assignee: Ruifeng Zheng

> eqNullSafe does not support None as its argument
> 
>
> Key: SPARK-41770
> URL: https://issues.apache.org/jira/browse/SPARK-41770
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> **
> File "/.../spark/python/pyspark/sql/connect/column.py", line 90, in 
> pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
> df1.select(
> df1['value'] == 'foo',
> df1['value'].eqNullSafe('foo'),
> df1['value'].eqNullSafe(None)
> ).show()
> Exception raised:
> Traceback (most recent call last):
>   File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 
> 1336, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 
> 4, in <module>
> df1['value'].eqNullSafe(None)
>   File 
> "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 78, 
> in wrapped
> return scalar_function(name, self, other)
>   File 
> "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, 
> in scalar_function
> return Column(UnresolvedFunction(op, [arg._expr for arg in args]))
>   File 
> "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, 
> in <listcomp>
> return Column(UnresolvedFunction(op, [arg._expr for arg in args]))
> AttributeError: 'NoneType' object has no attribute '_expr'
> {code}
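
A standalone sketch of the kind of guard that addresses this (the names below are illustrative stand-ins, not the actual Spark Connect internals):

{code:python}
class LiteralExpression:
    """Illustrative stand-in for a literal expression node."""

    def __init__(self, value):
        self.value = value


def to_expr(arg):
    # Plain Python values (including None) don't carry `_expr`; wrap them as
    # literals instead of assuming every operand is already a Column.
    return arg._expr if hasattr(arg, "_expr") else LiteralExpression(arg)
{code}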



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41770) eqNullSafe does not support None as its argument

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41770.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39302
[https://github.com/apache/spark/pull/39302]

> eqNullSafe does not support None as its argument
> 
>
> Key: SPARK-41770
> URL: https://issues.apache.org/jira/browse/SPARK-41770
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> {code}
> **
> File "/.../spark/python/pyspark/sql/connect/column.py", line 90, in 
> pyspark.sql.connect.column.Column.eqNullSafe
> Failed example:
> df1.select(
> df1['value'] == 'foo',
> df1['value'].eqNullSafe('foo'),
> df1['value'].eqNullSafe(None)
> ).show()
> Exception raised:
> Traceback (most recent call last):
>   File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line 
> 1336, in __run
> exec(compile(example.source, filename, "single",
>   File "", line 
> 4, in <module>
> df1['value'].eqNullSafe(None)
>   File 
> "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 78, 
> in wrapped
> return scalar_function(name, self, other)
>   File 
> "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, 
> in scalar_function
> return Column(UnresolvedFunction(op, [arg._expr for arg in args]))
>   File 
> "/.../workspace/forked/spark/python/pyspark/sql/connect/column.py", line 95, 
> in <listcomp>
> return Column(UnresolvedFunction(op, [arg._expr for arg in args]))
> AttributeError: 'NoneType' object has no attribute '_expr'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41783) Make column op support None

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41783:


Assignee: Ruifeng Zheng

> Make column op support None
> ---
>
> Key: SPARK-41783
> URL: https://issues.apache.org/jira/browse/SPARK-41783
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
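
For reference, classic PySpark already coerces a None operand to a NULL literal (a minimal sketch, assuming an active `spark` session):

{code:python}
df = spark.createDataFrame([("foo",), (None,)], ["value"])

# `== None` builds `(value = NULL)`, which evaluates to NULL for every row
# in SQL; eqNullSafe (`<=>`) is the null-safe comparison.
df.select(df["value"] == None, df["value"].eqNullSafe(None)).show()
{code}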




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41783) Make column op support None

2022-12-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41783.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39302
[https://github.com/apache/spark/pull/39302]

> Make column op support None
> ---
>
> Key: SPARK-41783
> URL: https://issues.apache.org/jira/browse/SPARK-41783
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653210#comment-17653210
 ] 

Apache Spark commented on SPARK-41792:
--

User 'mridulm' has created a pull request for this issue:
https://github.com/apache/spark/pull/39316

> Shuffle merge finalization removes the wrong finalization state from the DB
> ---
>
> Key: SPARK-41792
> URL: https://issues.apache.org/jira/browse/SPARK-41792
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> During `finalizeShuffleMerge` in external shuffle service, if the 
> finalization request is for a newer shuffle merge id, then we clean up the 
> existing (older) shuffle details and add the newer entry (for which we got no 
> pushed blocks) to the DB.
> Unfortunately, when cleaning up from the DB, we are using the incorrect 
> AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of 
> the existing entry.
> Proposed Fix:
> {code}
> diff --git 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
>  
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> index 816d1082850..551104d0eba 100644
> --- 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> +++ 
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements 
> MergedShuffleFileManager {
>  } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) {
>// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId 
> then return
>// empty MergeStatuses but cleanup the older shuffleMergeId files.
> +  AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new 
> AppAttemptShuffleMergeId(
> +  msg.appId, msg.appAttemptId, msg.shuffleId, 
> mergePartitionsInfo.shuffleMergeId);
>submitCleanupTask(() ->
>closeAndDeleteOutdatedPartitions(
> -  appAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
> +  currentAppAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
>  } else {
>// This block covers:
>//  1. finalization of determinate stage
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41792:


Assignee: Apache Spark

> Shuffle merge finalization removes the wrong finalization state from the DB
> ---
>
> Key: SPARK-41792
> URL: https://issues.apache.org/jira/browse/SPARK-41792
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Mridul Muralidharan
>Assignee: Apache Spark
>Priority: Minor
>
> During `finalizeShuffleMerge` in external shuffle service, if the 
> finalization request is for a newer shuffle merge id, then we clean up the 
> existing (older) shuffle details and add the newer entry (for which we got no 
> pushed blocks) to the DB.
> Unfortunately, when cleaning up from the DB, we are using the incorrect 
> AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of 
> the existing entry.
> Proposed Fix:
> {code}
> diff --git 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
>  
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> index 816d1082850..551104d0eba 100644
> --- 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> +++ 
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements 
> MergedShuffleFileManager {
>  } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) {
>// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId 
> then return
>// empty MergeStatuses but cleanup the older shuffleMergeId files.
> +  AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new 
> AppAttemptShuffleMergeId(
> +  msg.appId, msg.appAttemptId, msg.shuffleId, 
> mergePartitionsInfo.shuffleMergeId);
>submitCleanupTask(() ->
>closeAndDeleteOutdatedPartitions(
> -  appAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
> +  currentAppAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
>  } else {
>// This block covers:
>//  1. finalization of determinate stage
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653208#comment-17653208
 ] 

Apache Spark commented on SPARK-41792:
--

User 'mridulm' has created a pull request for this issue:
https://github.com/apache/spark/pull/39316

> Shuffle merge finalization removes the wrong finalization state from the DB
> ---
>
> Key: SPARK-41792
> URL: https://issues.apache.org/jira/browse/SPARK-41792
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> During `finalizeShuffleMerge` in external shuffle service, if the 
> finalization request is for a newer shuffle merge id, then we clean up the 
> existing (older) shuffle details and add the newer entry (for which we got no 
> pushed blocks) to the DB.
> Unfortunately, when cleaning up from the DB, we are using the incorrect 
> AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of 
> the existing entry.
> Proposed Fix:
> {code}
> diff --git 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
>  
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> index 816d1082850..551104d0eba 100644
> --- 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> +++ 
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements 
> MergedShuffleFileManager {
>  } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) {
>// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId 
> then return
>// empty MergeStatuses but cleanup the older shuffleMergeId files.
> +  AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new 
> AppAttemptShuffleMergeId(
> +  msg.appId, msg.appAttemptId, msg.shuffleId, 
> mergePartitionsInfo.shuffleMergeId);
>submitCleanupTask(() ->
>closeAndDeleteOutdatedPartitions(
> -  appAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
> +  currentAppAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
>  } else {
>// This block covers:
>//  1. finalization of determinate stage
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41792:


Assignee: (was: Apache Spark)

> Shuffle merge finalization removes the wrong finalization state from the DB
> ---
>
> Key: SPARK-41792
> URL: https://issues.apache.org/jira/browse/SPARK-41792
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> During `finalizeShuffleMerge` in external shuffle service, if the 
> finalization request is for a newer shuffle merge id, then we clean up the 
> existing (older) shuffle details and add the newer entry (for which we got no 
> pushed blocks) to the DB.
> Unfortunately, when cleaning up from the DB, we are using the incorrect 
> AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of 
> the existing entry.
> Proposed Fix:
> {code}
> diff --git 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
>  
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> index 816d1082850..551104d0eba 100644
> --- 
> a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> +++ 
> b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
> @@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements 
> MergedShuffleFileManager {
>  } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) {
>// If no blocks pushed for the finalizeShuffleMerge shuffleMergeId 
> then return
>// empty MergeStatuses but cleanup the older shuffleMergeId files.
> +  AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new 
> AppAttemptShuffleMergeId(
> +  msg.appId, msg.appAttemptId, msg.shuffleId, 
> mergePartitionsInfo.shuffleMergeId);
>submitCleanupTask(() ->
>closeAndDeleteOutdatedPartitions(
> -  appAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
> +  currentAppAttemptShuffleMergeId, 
> mergePartitionsInfo.shuffleMergePartitions));
>  } else {
>// This block covers:
>//  1. finalization of determinate stage
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41792) Shuffle merge finalization removes the wrong finalization state from the DB

2022-12-30 Thread Mridul Muralidharan (Jira)
Mridul Muralidharan created SPARK-41792:
---

 Summary: Shuffle merge finalization removes the wrong finalization 
state from the DB
 Key: SPARK-41792
 URL: https://issues.apache.org/jira/browse/SPARK-41792
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.3.0, 3.4.0
Reporter: Mridul Muralidharan


During `finalizeShuffleMerge` in external shuffle service, if the finalization 
request is for a newer shuffle merge id, then we clean up the existing (older) 
shuffle details and add the newer entry (for which we got no pushed blocks) to 
the DB.

Unfortunately, when cleaning up from the DB, we are using the incorrect 
AppAttemptShuffleMergeId - we remove the latest shuffle merge id instead of the 
existing entry.

Proposed Fix:

{code}
diff --git 
a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
 
b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
index 816d1082850..551104d0eba 100644
--- 
a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
+++ 
b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
@@ -653,9 +653,11 @@ public class RemoteBlockPushResolver implements 
MergedShuffleFileManager {
 } else if (msg.shuffleMergeId > mergePartitionsInfo.shuffleMergeId) {
   // If no blocks pushed for the finalizeShuffleMerge shuffleMergeId 
then return
   // empty MergeStatuses but cleanup the older shuffleMergeId files.
+  AppAttemptShuffleMergeId currentAppAttemptShuffleMergeId = new 
AppAttemptShuffleMergeId(
+  msg.appId, msg.appAttemptId, msg.shuffleId, 
mergePartitionsInfo.shuffleMergeId);
   submitCleanupTask(() ->
   closeAndDeleteOutdatedPartitions(
-  appAttemptShuffleMergeId, 
mergePartitionsInfo.shuffleMergePartitions));
+  currentAppAttemptShuffleMergeId, 
mergePartitionsInfo.shuffleMergePartitions));
 } else {
   // This block covers:
   //  1. finalization of determinate stage
{code}
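
To make the mix-up concrete, a toy sketch (plain Python, not Spark's actual classes): the cleanup key must carry the merge id of the partitions actually being deleted.

{code:python}
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeKey:
    app_id: str
    attempt_id: int
    shuffle_id: int
    merge_id: int


def cleanup_key(request_merge_id: int, existing_merge_id: int) -> MergeKey:
    # The DB entry being removed belongs to the OLDER, existing merge id;
    # keying the delete by the request's newer id (the bug) would remove the
    # entry that is about to be written instead.
    return MergeKey("app-1", 0, 7, existing_merge_id)


print(cleanup_key(request_merge_id=2, existing_merge_id=1))
{code}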



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41423) Protobuf serializer for StageDataWrapper

2022-12-30 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41423:
--

Assignee: BingKun Pan

> Protobuf serializer for StageDataWrapper
> 
>
> Key: SPARK-41423
> URL: https://issues.apache.org/jira/browse/SPARK-41423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: BingKun Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41423) Protobuf serializer for StageDataWrapper

2022-12-30 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41423.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39192
[https://github.com/apache/spark/pull/39192]

> Protobuf serializer for StageDataWrapper
> 
>
> Key: SPARK-41423
> URL: https://issues.apache.org/jira/browse/SPARK-41423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41754) Add simple developer guides for UI protobuf serializer

2022-12-30 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41754.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39270
[https://github.com/apache/spark/pull/39270]

> Add simple developer guides for UI protobuf serializer
> --
>
> Key: SPARK-41754
> URL: https://issues.apache.org/jira/browse/SPARK-41754
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8

2022-12-30 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653185#comment-17653185
 ] 

Bjørn Jørgensen commented on SPARK-41741:
-

[~jlelehe] can you change Affects Version/s: from 2.4.0 to 3.4.0 ? 

> [SQL] ParquetFilters StringStartsWith push down matching string do not use 
> UTF-8
> 
>
> Key: SPARK-41741
> URL: https://issues.apache.org/jira/browse/SPARK-41741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jiale He
>Priority: Major
> Attachments: image-2022-12-28-18-00-00-861.png, 
> image-2022-12-28-18-00-21-586.png, 
> part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet
>
>
> Hello ~
>  
> I found a problem, and there are two ways to work around it.
>  
> The issue is in Parquet filter pushdown: when a query uses a LIKE '***%' 
> predicate and the system default encoding is not UTF-8, it may cause an 
> error.
>  
> There are two ways to bypass this problem, as far as I know:
> 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8"
> 2. spark.sql.parquet.filterPushdown.string.startsWith=false
>  
> The following is the information to reproduce this problem
> The parquet sample file is in the attachment
> {code:java}
> spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp”)
> spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code}
> !image-2022-12-28-18-00-00-861.png|width=879,height=430!
>  
>   !image-2022-12-28-18-00-21-586.png|width=799,height=731!
>  
> I think the correct code should be:
> {code:java}
> private val strToBinary = 
> Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
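
The effect is easy to reproduce outside Spark (a small sketch; GBK stands in for any non-UTF-8 JVM default charset):

{code:python}
s = "啦啦乐乐"

utf8 = s.encode("utf-8")  # 12 bytes: how Parquet stores the string
gbk = s.encode("gbk")     # 8 bytes: what v.getBytes() yields on a GBK-default JVM
print(len(utf8), len(gbk))

# A startsWith prefix built from the GBK bytes can never match the UTF-8
# binary stored in the file, so the pushed-down filter wrongly prunes rows.
print(utf8.startswith(gbk))  # False
{code}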



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41553) Fix the documentation for num_files

2022-12-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-41553:

Description: 

num_files has been deprecated and might be removed in a future version; use 
DataFrame.spark.repartition instead.

The num_files argument doesn't manage the number of files directly; it works 
by specifying the partition number.

  was:
Functions have this signature. 

 
def to_json(
(..)
num_files: Optional[int] = None,
 
 
.. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
writes
multiple `part-...` files in the directory when `path` is specified.
This behavior was inherited from Apache Spark. The number of files can
be controlled by `num_files`.
 
 
 
if num_files is not None:
warnings.warn(
"`num_files` has been deprecated and might be removed in a future version. "
"Use `DataFrame.spark.repartition` instead.",
FutureWarning,
)
 
 
I will change num_files to repartition


> Fix the documentation for num_files
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> num_files has been deprecated and might be removed in a future version; use 
> DataFrame.spark.repartition instead.
> The num_files argument doesn't manage the number of files directly; it works 
> by specifying the partition number.
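
For reference, the replacement pattern (a sketch, assuming pandas-on-Spark is available and /tmp/out is writable):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Deprecated: psdf.to_json("/tmp/out", num_files=1)
# Preferred: control the partition (and hence file) count explicitly.
psdf.spark.repartition(1).to_json("/tmp/out")
{code}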



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41553) Fix the documentation for num_files

2022-12-30 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-41553:

Summary: Fix the documentation for num_files  (was: Change num_files to 
repartition)

> Fix the documentation for num_files
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Functions have this signature. 
>  
> def to_json(
> (..)
> num_files: Optional[int] = None,
>  
>  
> .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
> writes
> multiple `part-...` files in the directory when `path` is specified.
> This behavior was inherited from Apache Spark. The number of files can
> be controlled by `num_files`.
>  
>  
>  
> if num_files is not None:
> warnings.warn(
> "`num_files` has been deprecated and might be removed in a future version. "
> "Use `DataFrame.spark.repartition` instead.",
> FutureWarning,
> )
>  
>  
> I will change num_files to repartition



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41791:


Assignee: (was: Apache Spark)

> Create distinct metadata attributes for metadata that is constant or file and 
> metadata that is generated during the scan
> 
>
> Key: SPARK-41791
> URL: https://issues.apache.org/jira/browse/SPARK-41791
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.1
>Reporter: Jan-Ole Sasse
>Priority: Major
> Fix For: 3.4.0
>
>
> There are two types of metadata in Spark:
>  * Metadata that is constant per file (file_name, file_size, ...)
>  * Metadata that is not constant (currently only row_index)
> The two types are generated differently
>  * File constant metadata is appended to the output after scan
>  * non-constant metadata is generated during the scan
> The proposal here is to create different metadata attributes to distinguish 
> those different types throughout the code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653143#comment-17653143
 ] 

Apache Spark commented on SPARK-41791:
--

User 'olaky' has created a pull request for this issue:
https://github.com/apache/spark/pull/39314

> Create distinct metadata attributes for metadata that is constant or file and 
> metadata that is generated during the scan
> 
>
> Key: SPARK-41791
> URL: https://issues.apache.org/jira/browse/SPARK-41791
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.1
>Reporter: Jan-Ole Sasse
>Priority: Major
> Fix For: 3.4.0
>
>
> There are two types of metadata in Spark:
>  * Metadata that is constant per file (file_name, file_size, ...)
>  * Metadata that is not constant (currently only row_index)
> The two types are generated differently
>  * File constant metadata is appended to the output after scan
>  * non-constant metadata is generated during the scan
> The proposal here is to create different metadata attributes to distinguish 
> those different types throughout the code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41791:


Assignee: Apache Spark

> Create distinct metadata attributes for metadata that is constant or file and 
> metadata that is generated during the scan
> 
>
> Key: SPARK-41791
> URL: https://issues.apache.org/jira/browse/SPARK-41791
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.1
>Reporter: Jan-Ole Sasse
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> There are two types of metadata in Spark:
>  * Metadata that is constant per file (file_name, file_size, ...)
>  * Metadata that is not constant (currently only row_index)
> The two types are generated differently
>  * File constant metadata is appended to the output after scan
>  * non-constant metadata is generated during the scan
> The proposal here is to create different metadata attributes to distinguish 
> those different types throughout the code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653142#comment-17653142
 ] 

Apache Spark commented on SPARK-41791:
--

User 'olaky' has created a pull request for this issue:
https://github.com/apache/spark/pull/39314

> Create distinct metadata attributes for metadata that is constant or file and 
> metadata that is generated during the scan
> 
>
> Key: SPARK-41791
> URL: https://issues.apache.org/jira/browse/SPARK-41791
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.3.1
>Reporter: Jan-Ole Sasse
>Priority: Major
> Fix For: 3.4.0
>
>
> There are two types of metadata in Spark:
>  * Metadata that is constant per file (file_name, file_size, ...)
>  * Metadata that is not constant (currently only row_index)
> The two types are generated differently
>  * File constant metadata is appended to the output after scan
>  * non-constant metadata is generated during the scan
> The proposal here is to create different metadata attributes to distinguish 
> those different types throughout the code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41791) Create distinct metadata attributes for metadata that is constant or file and metadata that is generated during the scan

2022-12-30 Thread Jan-Ole Sasse (Jira)
Jan-Ole Sasse created SPARK-41791:
-

 Summary: Create distinct metadata attributes for metadata that is 
constant or file and metadata that is generated during the scan
 Key: SPARK-41791
 URL: https://issues.apache.org/jira/browse/SPARK-41791
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.3.1
Reporter: Jan-Ole Sasse
 Fix For: 3.4.0


There are two types of metadata in Spark:
 * Metadata that is constant per file (file_name, file_size, ...)
 * Metadata that is not constant (currently only row_index)

The two types are generated differently
 * File constant metadata is appended to the output after scan
 * non-constant metadata is generated during the scan

The proposal here is to create different metadata attributes to distinguish 
those different types throughout the code
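
For context, both kinds already surface through the hidden `_metadata` column (a sketch, assuming an active `spark` session and an existing Parquet path):

{code:python}
df = spark.read.parquet("/tmp/data")

# file_name and file_size are constant for every row read from a given file,
# so they can be appended after the scan; a per-row index cannot, and has to
# be generated while the file is being scanned.
df.select("_metadata.file_name", "_metadata.file_size").show()
{code}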



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41790) Set TRANSFORM reader and writer's format correctly

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653137#comment-17653137
 ] 

Apache Spark commented on SPARK-41790:
--

User 'mattshma' has created a pull request for this issue:
https://github.com/apache/spark/pull/39315

> Set TRANSFORM reader and writer's format correctly
> --
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only 
> the reader or only the writer; the reason is that the wrong format is used 
> to feed/fetch data to/from the running script. In theory, the writer uses 
> inFormat to feed input data into the running script and the reader uses 
> outFormat to read the output from the running script, but inFormat and 
> outFormat are currently set to the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 'cat'
>          >   AS (c)
>          > FROM t1;
> c
> spark-sql> SELECT TRANSFORM(a, b)
>          >   USING 'cat'
>          >   AS (c)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          > FROM t1;
> c
> 1    23    4{code}
>  
> The same sql in hive:
> {code:java}
> hive> SELECT TRANSFORM(a, b)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     >   USING 'cat'
>     >   AS (c)
>     > FROM t1;
> c
> 1,2
> 3,4
> hive> SELECT TRANSFORM(a, b)
>     >   USING 'cat'
>     >   AS (c)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     > FROM t1;
> c
> 1    2
> 3    4 {code}
>  
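
A toy model of the pairing the description calls for (plain Python, not Spark's code): the writer side must use the input row format and the reader side the output row format, otherwise a 'cat' round-trip garbles rows exactly as above.

{code:python}
def feed(rows, in_delim):
    # Writer: serialize input rows into the script's stdin.
    return [in_delim.join(r) for r in rows]


def fetch(lines, out_delim):
    # Reader: parse the script's stdout back into rows.
    return [line.split(out_delim) for line in lines]


rows = [["1", "2"], ["3", "4"]]
piped = feed(rows, ",")    # FIELDS TERMINATED BY ',' on the writer side
print(fetch(piped, ","))   # matched formats:    [['1', '2'], ['3', '4']]
print(fetch(piped, "\t"))  # mismatched reader:  [['1,2'], ['3,4']]
{code}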



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41790) Set TRANSFORM reader and writer's format correctly

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41790:


Assignee: (was: Apache Spark)

> Set TRANSFORM reader and writer's format correctly
> --
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only 
> the reader or only the writer; the reason is that the wrong format is used 
> to feed/fetch data to/from the running script. In theory, the writer uses 
> inFormat to feed input data into the running script and the reader uses 
> outFormat to read the output from the running script, but inFormat and 
> outFormat are currently set to the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 'cat'
>          >   AS (c)
>          > FROM t1;
> c
> spark-sql> SELECT TRANSFORM(a, b)
>          >   USING 'cat'
>          >   AS (c)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          > FROM t1;
> c
> 1    23    4{code}
>  
> The same sql in hive:
> {code:java}
> hive> SELECT TRANSFORM(a, b)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     >   USING 'cat'
>     >   AS (c)
>     > FROM t1;
> c
> 1,2
> 3,4
> hive> SELECT TRANSFORM(a, b)
>     >   USING 'cat'
>     >   AS (c)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     > FROM t1;
> c
> 1    2
> 3    4 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41790) Set TRANSFORM reader and writer's format correctly

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653136#comment-17653136
 ] 

Apache Spark commented on SPARK-41790:
--

User 'mattshma' has created a pull request for this issue:
https://github.com/apache/spark/pull/39315

> Set TRANSFORM reader and writer's format correctly
> --
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only 
> the reader or only the writer; the reason is that the wrong format is used 
> to feed/fetch data to/from the running script. In theory, the writer uses 
> inFormat to feed input data into the running script and the reader uses 
> outFormat to read the output from the running script, but inFormat and 
> outFormat are currently set to the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 'cat'
>          >   AS (c)
>          > FROM t1;
> c
> spark-sql> SELECT TRANSFORM(a, b)
>          >   USING 'cat'
>          >   AS (c)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          > FROM t1;
> c
> 1    23    4{code}
>  
> The same sql in hive:
> {code:java}
> hive> SELECT TRANSFORM(a, b)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     >   USING 'cat'
>     >   AS (c)
>     > FROM t1;
> c
> 1,2
> 3,4
> hive> SELECT TRANSFORM(a, b)
>     >   USING 'cat'
>     >   AS (c)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     > FROM t1;
> c
> 1    2
> 3    4 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41790) Set TRANSFORM reader and writer's format correctly

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41790:


Assignee: Apache Spark

> Set TRANSFORM reader and writer's format correctly
> --
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Assignee: Apache Spark
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only 
> the reader or only the writer; the reason is that the wrong format is used 
> to feed/fetch data to/from the running script. In theory, the writer uses 
> inFormat to feed input data into the running script and the reader uses 
> outFormat to read the output from the running script, but inFormat and 
> outFormat are currently set to the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 'cat'
>          >   AS (c)
>          > FROM t1;
> c
> spark-sql> SELECT TRANSFORM(a, b)
>          >   USING 'cat'
>          >   AS (c)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          > FROM t1;
> c
> 1    23    4{code}
>  
> The same sql in hive:
> {code:java}
> hive> SELECT TRANSFORM(a, b)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     >   USING 'cat'
>     >   AS (c)
>     > FROM t1;
> c
> 1,2
> 3,4
> hive> SELECT TRANSFORM(a, b)
>     >   USING 'cat'
>     >   AS (c)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     > FROM t1;
> c
> 1    2
> 3    4 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41790) Set TRANSFORM reader and writer's format correctly

2022-12-30 Thread mattshma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mattshma updated SPARK-41790:
-
Summary: Set TRANSFORM reader and writer's format correctly  (was: 
Transform will get wrong date when only specify reader or writer 's row format 
delimited)

> Set TRANSFORM reader and writer's format correctly
> --
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only 
> the reader or only the writer; the reason is that the wrong format is used 
> to feed/fetch data to/from the running script. In theory, the writer uses 
> inFormat to feed input data into the running script and the reader uses 
> outFormat to read the output from the running script, but inFormat and 
> outFormat are currently set to the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 'cat'
>          >   AS (c)
>          > FROM t1;
> c
> spark-sql> SELECT TRANSFORM(a, b)
>          >   USING 'cat'
>          >   AS (c)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          > FROM t1;
> c
> 1    23    4{code}
>  
> The same sql in hive:
> {code:java}
> hive> SELECT TRANSFORM(a, b)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     >   USING 'cat'
>     >   AS (c)
>     > FROM t1;
> c
> 1,2
> 3,4
> hive> SELECT TRANSFORM(a, b)
>     >   USING 'cat'
>     >   AS (c)
>     >   ROW FORMAT DELIMITED
>     >   FIELDS TERMINATED BY ','
>     > FROM t1;
> c
> 1    2
> 3    4 {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41790) Transform will get wrong date when only specify reader or writer 's row format delimited

2022-12-30 Thread mattshma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mattshma updated SPARK-41790:
-
Description: 
We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only the 
reader or only the writer; the reason is that the wrong format is used to 
feed/fetch data to/from the running script. In theory, the writer uses inFormat 
to feed input data into the running script and the reader uses outFormat to 
read the output from the running script, but inFormat and outFormat are 
currently set to the wrong values in the following code:
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
inRowFormat, "hive.script.recordreader",
"org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
outRowFormat, "hive.script.recordwriter",
"org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

Example SQL:
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

The same sql in hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
 

  was:
We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only the 
reader or only the writer; the reason is that the wrong format is used to 
feed/fetch data to/from the running script: the writer uses inFormat to feed 
input data into the running script and the reader uses outFormat to read the 
output from the running script. But inFormat and outFormat are currently set 
to the wrong values because of the following code:

 
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
inRowFormat, "hive.script.recordreader",
"org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
outRowFormat, "hive.script.recordwriter",
"org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

Example SQL:
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

The same sql in hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
 


> Transform will get wrong date when only specify reader or writer 's row 
> format delimited
> 
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only 
> the reader or only the writer; the reason is that the wrong format is used 
> to feed/fetch data to/from the running script. In theory, the writer uses 
> inFormat to feed input data into the running script and the reader uses 
> outFormat to read the output from the running script, but inFormat and 
> outFormat are currently set to the wrong values in the following code:
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 

[jira] [Updated] (SPARK-41790) Transform will get wrong date when only specify reader or writer 's row format delimited

2022-12-30 Thread mattshma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mattshma updated SPARK-41790:
-
Description: 
We'll get wrong data when TRANSFORM specifies ROW FORMAT DELIMITED for only the 
reader or only the writer; the reason is that the wrong format is used to 
feed/fetch data to/from the running script: the writer uses inFormat to feed 
input data into the running script and the reader uses outFormat to read the 
output from the running script. But inFormat and outFormat are currently set 
to the wrong values because of the following code:

 
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
inRowFormat, "hive.script.recordreader",
"org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
outRowFormat, "hive.script.recordwriter",
"org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

Example SQL:
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

The same SQL in Hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
 

  was:
We'll get wrong data when TRANSFORM specifies the row format delimited for only 
the reader or only the writer, because the wrong format is used to feed data to, 
and fetch data from, the running script: the writer uses inFormat to feed input 
data into the running script and the reader uses outFormat to read the output 
from the running script. But inFormat and outFormat are currently set to the 
wrong values by the following code:

 
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
inRowFormat, "hive.script.recordreader",
"org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
outRowFormat, "hive.script.recordwriter",
"org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

 

Example SQL:

 
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

 

In Hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
 


> Transform will get wrong data when only specifying the reader's or writer's 
> row format delimited
> 
>
> Key: SPARK-41790
> URL: https://issues.apache.org/jira/browse/SPARK-41790
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: mattshma
>Priority: Major
>
> We'll get wrong data when TRANSFORM specifies the row format delimited for 
> only the reader or only the writer, because the wrong format is used to feed 
> data to, and fetch data from, the running script: the writer uses inFormat to 
> feed input data into the running script and the reader uses outFormat to read 
> the output from the running script. But inFormat and outFormat are currently 
> set to the wrong values by the following code:
>  
> {code:java}
> val (inFormat, inSerdeClass, inSerdeProps, reader) =
>   format(
> inRowFormat, "hive.script.recordreader",
> "org.apache.hadoop.hive.ql.exec.TextRecordReader")
> val (outFormat, outSerdeClass, outSerdeProps, writer) =
>   format(
> outRowFormat, "hive.script.recordwriter",
> "org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
>  
> Example SQL:
> {code:java}
> spark-sql> CREATE TABLE t1 (a string, b string); 
> spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
> spark-sql> SELECT TRANSFORM(a, b)
>          >   ROW FORMAT DELIMITED
>          >   FIELDS TERMINATED BY ','
>          >   USING 'cat'
>        

[jira] [Created] (SPARK-41790) Transform will get wrong data when only specifying the reader's or writer's row format delimited

2022-12-30 Thread mattshma (Jira)
mattshma created SPARK-41790:


 Summary: Transform will get wrong data when only specifying the reader's 
or writer's row format delimited
 Key: SPARK-41790
 URL: https://issues.apache.org/jira/browse/SPARK-41790
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: mattshma


We'll get wrong data when TRANSFORM specifies the row format delimited for only 
the reader or only the writer, because the wrong format is used to feed data to, 
and fetch data from, the running script: the writer uses inFormat to feed input 
data into the running script and the reader uses outFormat to read the output 
from the running script. But inFormat and outFormat are currently set to the 
wrong values by the following code:

 
{code:java}
val (inFormat, inSerdeClass, inSerdeProps, reader) =
  format(
inRowFormat, "hive.script.recordreader",
"org.apache.hadoop.hive.ql.exec.TextRecordReader")

val (outFormat, outSerdeClass, outSerdeProps, writer) =
  format(
outRowFormat, "hive.script.recordwriter",
"org.apache.hadoop.hive.ql.exec.TextRecordWriter") {code}
 

 

Example SQL:

 
{code:java}
spark-sql> CREATE TABLE t1 (a string, b string); 

spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");

spark-sql> SELECT TRANSFORM(a, b)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         >   USING 'cat'
         >   AS (c)
         > FROM t1;
c

spark-sql> SELECT TRANSFORM(a, b)
         >   USING 'cat'
         >   AS (c)
         >   ROW FORMAT DELIMITED
         >   FIELDS TERMINATED BY ','
         > FROM t1;
c
1    23    4{code}
 

 

In Hive:
{code:java}
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   USING 'cat'
    >   AS (c)
    > FROM t1;
c
1,2
3,4

hive> SELECT TRANSFORM(a, b)
    >   USING 'cat'
    >   AS (c)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    > FROM t1;
c
1    2
3    4 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41789) Make `createDataFrame` support list of Rows

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653104#comment-17653104
 ] 

Apache Spark commented on SPARK-41789:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39313

> Make `createDataFrame` support list of Rows
> ---
>
> Key: SPARK-41789
> URL: https://issues.apache.org/jira/browse/SPARK-41789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41789) Make `createDataFrame` support list of Rows

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653105#comment-17653105
 ] 

Apache Spark commented on SPARK-41789:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39313

> Make `createDataFrame` support list of Rows
> ---
>
> Key: SPARK-41789
> URL: https://issues.apache.org/jira/browse/SPARK-41789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41789) Make `createDataFrame` support list of Rows

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41789:


Assignee: Apache Spark  (was: Ruifeng Zheng)

> Make `createDataFrame` support list of Rows
> ---
>
> Key: SPARK-41789
> URL: https://issues.apache.org/jira/browse/SPARK-41789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41789) Make `createDataFrame` support list of Rows

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41789:


Assignee: Ruifeng Zheng  (was: Apache Spark)

> Make `createDataFrame` support list of Rows
> ---
>
> Key: SPARK-41789
> URL: https://issues.apache.org/jira/browse/SPARK-41789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41745:


Assignee: (was: Apache Spark)

> SparkSession.createDataFrame does not respect the column names in the row
> -
>
> Key: SPARK-41745
> URL: https://issues.apache.org/jira/browse/SPARK-41745
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in 
> pyspark.sql.connect.group.GroupedData.pivot
> Failed example:
> df1.show()
> Differences (ndiff with -expected +actual):
> - +--+++
> ?   ---
> + +--++-+
> - |course|year|earnings|
> + |_1|  _2|   _3|
> - +--+++
> ?   ---
> + +--++-+
> - |dotNET|2012|   1|
> ?  ---
> + |dotNET|2012|1|
> - |  Java|2012|   2|
> ?  ---
> + |  Java|2012|2|
> - |dotNET|2012|5000|
> ?   ---
> + |dotNET|2012| 5000|
> - |dotNET|2013|   48000|
> ?  ---
> + |dotNET|2013|48000|
> - |  Java|2013|   3|
> ?  ---
> + |  Java|2013|3|
> - +--+++
> ?   ---
> + +--++-+
> + 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41745:


Assignee: Apache Spark

> SparkSession.createDataFrame does not respect the column names in the row
> -
>
> Key: SPARK-41745
> URL: https://issues.apache.org/jira/browse/SPARK-41745
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in 
> pyspark.sql.connect.group.GroupedData.pivot
> Failed example:
> df1.show()
> Differences (ndiff with -expected +actual):
> - +--+++
> ?   ---
> + +--++-+
> - |course|year|earnings|
> + |_1|  _2|   _3|
> - +--+++
> ?   ---
> + +--++-+
> - |dotNET|2012|   1|
> ?  ---
> + |dotNET|2012|1|
> - |  Java|2012|   2|
> ?  ---
> + |  Java|2012|2|
> - |dotNET|2012|5000|
> ?   ---
> + |dotNET|2012| 5000|
> - |dotNET|2013|   48000|
> ?  ---
> + |dotNET|2013|48000|
> - |  Java|2013|   3|
> ?  ---
> + |  Java|2013|3|
> - +--+++
> ?   ---
> + +--++-+
> + 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41745) SparkSession.createDataFrame does not respect the column names in the row

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653103#comment-17653103
 ] 

Apache Spark commented on SPARK-41745:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/39313

> SparkSession.createDataFrame does not respect the column names in the row
> -
>
> Key: SPARK-41745
> URL: https://issues.apache.org/jira/browse/SPARK-41745
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> File "/.../spark/python/pyspark/sql/connect/group.py", line 173, in 
> pyspark.sql.connect.group.GroupedData.pivot
> Failed example:
> df1.show()
> Differences (ndiff with -expected +actual):
> - +--+++
> ?   ---
> + +--++-+
> - |course|year|earnings|
> + |_1|  _2|   _3|
> - +--+++
> ?   ---
> + +--++-+
> - |dotNET|2012|   1|
> ?  ---
> + |dotNET|2012|1|
> - |  Java|2012|   2|
> ?  ---
> + |  Java|2012|2|
> - |dotNET|2012|5000|
> ?   ---
> + |dotNET|2012| 5000|
> - |dotNET|2013|   48000|
> ?  ---
> + |dotNET|2013|48000|
> - |  Java|2013|   3|
> ?  ---
> + |  Java|2013|3|
> - +--+++
> ?   ---
> + +--++-+
> + 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41789) Make `createDataFrame` support list of Rows

2022-12-30 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41789:
-

 Summary: Make `createDataFrame` support list of Rows
 Key: SPARK-41789
 URL: https://issues.apache.org/jira/browse/SPARK-41789
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
Assignee: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41788:


Assignee: Apache Spark

> Move InsertIntoStatement to basicLogicalOperators
> -
>
> Key: SPARK-41788
> URL: https://issues.apache.org/jira/browse/SPARK-41788
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators

2022-12-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41788:


Assignee: (was: Apache Spark)

> Move InsertIntoStatement to basicLogicalOperators
> -
>
> Key: SPARK-41788
> URL: https://issues.apache.org/jira/browse/SPARK-41788
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653097#comment-17653097
 ] 

Apache Spark commented on SPARK-41788:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/39312

> Move InsertIntoStatement to basicLogicalOperators
> -
>
> Key: SPARK-41788
> URL: https://issues.apache.org/jira/browse/SPARK-41788
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653098#comment-17653098
 ] 

Apache Spark commented on SPARK-41788:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/39312

> Move InsertIntoStatement to basicLogicalOperators
> -
>
> Key: SPARK-41788
> URL: https://issues.apache.org/jira/browse/SPARK-41788
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41788) Move InsertIntoStatement to basicLogicalOperators

2022-12-30 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-41788:
-

 Summary: Move InsertIntoStatement to basicLogicalOperators
 Key: SPARK-41788
 URL: https://issues.apache.org/jira/browse/SPARK-41788
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653091#comment-17653091
 ] 

Apache Spark commented on SPARK-41442:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/39311

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.4.0
>
>
> We use -1 as the initial value of SQLMetric and change it to 0 while merging 
> with other SQLMetric instances. A SQLMetric whose value is still -1 is treated 
> as invalid and filtered out later.
> While developing with Spark, it is troublesome that two invalid SQLMetric 
> instances merge into a valid one, because merging sets the value to 0.
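
A minimal standalone sketch of the guarded merge this describes, runnable in a 
Scala REPL. ToyMetric is a simplified stand-in for illustration, not the actual 
SQLMetric class in org.apache.spark.sql.execution.metric:
{code:java}
// Simplified model of the invalid-metric problem (illustrative only).
class ToyMetric(initValue: Long = -1L) {
  private var _value: Long = initValue

  def value: Long = _value
  def isValid: Boolean = _value >= 0

  // Guarded merge: only adopt a value from a valid metric, so merging
  // two invalid (-1) metrics does not produce a "valid" 0.
  def merge(other: ToyMetric): Unit = {
    if (other.isValid) {
      if (!isValid) _value = 0L
      _value += other.value
    }
  }
}

val a = new ToyMetric()
a.merge(new ToyMetric())
assert(!a.isValid)  // stays invalid instead of becoming 0

a.merge(new ToyMetric(5L))
assert(a.isValid && a.value == 5L)  // valid input still merges normally {code}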



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41629) Support for protocol extensions

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41629.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39291
[https://github.com/apache/spark/pull/39291]

> Support for protocol extensions
> ---
>
> Key: SPARK-41629
> URL: https://issues.apache.org/jira/browse/SPARK-41629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark comes with many different extension points. Many of those simply become 
> available through the shared classpath between Spark and the user application. 
> To support arbitrary plugins, e.g. for Delta or Iceberg, we need a way to make 
> the Spark Connect protocol extensible and to let users register their own 
> handlers.
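
For a sense of what such a handler hook could look like, here is a minimal 
sketch assuming the protocol carries an opaque protobuf `Any` payload alongside 
regular relations. The trait name and signature are illustrative assumptions, 
not a statement of the API merged in the PR:
{code:java}
import com.google.protobuf.{Any => ProtoAny}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical extension hook (illustrative assumption): each registered
// plugin is offered the raw protobuf payload and may claim it.
trait RelationPlugin {
  // Some(plan) if this plugin recognizes the payload type; None lets the
  // server try the next registered plugin.
  def transform(relation: ProtoAny): Option[LogicalPlan]
} {code}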



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41629) Support for protocol extensions

2022-12-30 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41629:
-

Assignee: Martin Grund

> Support for protocol extensions
> ---
>
> Key: SPARK-41629
> URL: https://issues.apache.org/jira/browse/SPARK-41629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>
> Spark comes with many different extension points. Many of those simply become 
> available through the shared classpath between Spark and the user application. 
> To support arbitrary plugins, e.g. for Delta or Iceberg, we need a way to make 
> the Spark Connect protocol extensible and to let users register their own 
> handlers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41787) Upgrade silencer from 1.7.10 to 1.7.12

2022-12-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653076#comment-17653076
 ] 

Apache Spark commented on SPARK-41787:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39309

> Upgrade silencer from 1.7.10 to 1.7.12
> --
>
> Key: SPARK-41787
> URL: https://issues.apache.org/jira/browse/SPARK-41787
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Attachments: image-2022-12-30-16-57-32-736.png
>
>
> !image-2022-12-30-16-57-32-736.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


