[jira] [Updated] (SPARK-48187) Run `docs` only in PR builders and Java 21 Daily CI

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48187:
---
Labels: pull-request-available  (was: )

> Run `docs` only in PR builders and Java 21 Daily CI
> ---
>
> Key: SPARK-48187
> URL: https://issues.apache.org/jira/browse/SPARK-48187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48138) Disable a flaky `SparkSessionE2ESuite.interrupt tag` test

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48138:
--
Fix Version/s: 3.5.2

> Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
> -
>
> Key: SPARK-48138
> URL: https://issues.apache.org/jira/browse/SPARK-48138
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 
> (Master, 5/5)
> - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 
> (Master, 5/4)






[jira] [Updated] (SPARK-48139) Re-enable `SparkSessionE2ESuite.interrupt tag`

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48139:
--
Affects Version/s: 3.5.2

> Re-enable `SparkSessionE2ESuite.interrupt tag`
> --
>
> Key: SPARK-48139
> URL: https://issues.apache.org/jira/browse/SPARK-48139
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 4.0.0, 3.5.2
>Reporter: Dongjoon Hyun
>Priority: Blocker
>







[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48037:
--
Fix Version/s: 3.5.2

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>







[jira] [Updated] (SPARK-48160) XPath expressions (all collations)

2024-05-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48160:
-
Component/s: SQL
 (was: Spark Core)

> XPath expressions (all collations)
> --
>
> Key: SPARK-48160
> URL: https://issues.apache.org/jira/browse/SPARK-48160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-48158) XML expressions (all collations)

2024-05-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48158:
-
Component/s: SQL
 (was: Spark Core)

> XML expressions (all collations)
> 
>
> Key: SPARK-48158
> URL: https://issues.apache.org/jira/browse/SPARK-48158
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for *XML* built-in string functions in Spark 
> ({*}XmlToStructs{*}, {*}SchemaOfXml{*}, {*}StructsToXml{*}). First confirm 
> the expected behaviour of these functions when given collated strings, and 
> then move on to implementation and testing. You will find these expressions 
> in the *xmlExpressions.scala* file, and they should mostly be pass-through 
> functions. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how these functions should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementations of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *XML* expressions so that 
> they support all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Ascii, Chr, Base64, 
> UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
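The "pass-through" expectation above can be illustrated outside Spark with a minimal sketch (plain Python with hypothetical helper names, not the actual Spark expressions): parsing XML into a struct never compares strings, so the collation of the input cannot change the parse result; only comparisons consult the collator.

```python
import xml.etree.ElementTree as ET

def xml_to_struct(xml_str):
    # Parsing never compares string values against each other, so the
    # input's collation is irrelevant here -- the expression is pass-through.
    root = ET.fromstring(xml_str)
    return {child.tag: child.text for child in root}

def collation_key(s, case_insensitive):
    # Stand-in for an ICU collator: only comparisons/grouping use it.
    return s.lower() if case_insensitive else s

struct = xml_to_struct("<row><name>Alice</name></row>")
assert struct == {"name": "Alice"}
# Equality of *values* under a hypothetical case-insensitive collation:
assert collation_key(struct["name"], True) == collation_key("ALICE", True)
```

The E2E tests would then only need to verify that collated string inputs and outputs flow through unchanged, and that any comparisons on the results respect the collation.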






[jira] [Updated] (SPARK-48159) Datetime expressions (all collations)

2024-05-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48159:
-
Component/s: SQL
 (was: Spark Core)

> Datetime expressions (all collations)
> -
>
> Key: SPARK-48159
> URL: https://issues.apache.org/jira/browse/SPARK-48159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-48157) CSV expressions (all collations)

2024-05-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48157:
-
Component/s: SQL
 (was: Spark Core)

> CSV expressions (all collations)
> 
>
> Key: SPARK-48157
> URL: https://issues.apache.org/jira/browse/SPARK-48157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for *CSV* built-in string functions in Spark 
> ({*}CsvToStructs{*}, {*}SchemaOfCsv{*}, {*}StructsToCsv{*}). First confirm 
> the expected behaviour of these functions when given collated strings, and 
> then move on to implementation and testing. You will find these expressions 
> in the *csvExpressions.scala* file, and they should mostly be pass-through 
> functions. Implement the corresponding E2E SQL tests 
> (CollationSQLExpressionsSuite) to reflect how these functions should be used 
> with collation in SparkSQL, and feel free to use your chosen Spark SQL editor 
> to experiment with the existing functions to learn more about how they work. 
> In addition, look into the possible use-cases and implementations of similar 
> functions within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *CSV* expressions so that 
> they support all collation types currently supported in Spark. To understand 
> what changes were introduced in order to enable full collation support for 
> other existing functions in Spark, take a look at the Spark PRs and Jira 
> tickets for completed tasks in this parent (for example: Ascii, Chr, Base64, 
> UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
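As with the XML expressions, "pass-through" can be sketched outside Spark (plain Python with a hypothetical helper, not the actual Spark expression): splitting a CSV line into fields never compares cell values, so the input's collation cannot affect the parsed struct.

```python
import csv
import io

def csv_to_struct(header, line):
    # Tokenizing a CSV line involves no string comparisons between values,
    # so collation plays no role in parsing -- the expression is pass-through.
    row = next(csv.reader(io.StringIO(line)))
    return dict(zip(header, row))

assert csv_to_struct(["a", "b"], "1,x") == {"a": "1", "b": "x"}
```

Collation only matters once the parsed values are compared, grouped, or joined downstream.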






[jira] [Updated] (SPARK-48161) JSON expressions (all collations)

2024-05-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48161:
-
Component/s: SQL
 (was: Spark Core)

> JSON expressions (all collations)
> -
>
> Key: SPARK-48161
> URL: https://issues.apache.org/jira/browse/SPARK-48161
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-48162) Miscellaneous expressions (all collations)

2024-05-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48162:
-
Component/s: SQL
 (was: Spark Core)

> Miscellaneous expressions (all collations)
> --
>
> Key: SPARK-48162
> URL: https://issues.apache.org/jira/browse/SPARK-48162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Created] (SPARK-48186) Add support for AbstractMapType

2024-05-07 Thread Jira
Uroš Bojanić created SPARK-48186:


 Summary: Add support for AbstractMapType
 Key: SPARK-48186
 URL: https://issues.apache.org/jira/browse/SPARK-48186
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić









[jira] [Resolved] (SPARK-48183) Update error contribution guide to respect new error class file

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48183.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46455
[https://github.com/apache/spark/pull/46455]

> Update error contribution guide to respect new error class file
> ---
>
> Key: SPARK-48183
> URL: https://issues.apache.org/jira/browse/SPARK-48183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We moved the error class definitions from .py to .json, but the documentation 
> still describes the old behavior. We should update it.






[jira] [Assigned] (SPARK-48183) Update error contribution guide to respect new error class file

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48183:
-

Assignee: Haejoon Lee

> Update error contribution guide to respect new error class file
> ---
>
> Key: SPARK-48183
> URL: https://issues.apache.org/jira/browse/SPARK-48183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We moved the error class definitions from .py to .json, but the documentation 
> still describes the old behavior. We should update it.






[jira] [Resolved] (SPARK-47914) Do not display the splits parameter in Rang

2024-05-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47914.
--
Resolution: Fixed

Issue resolved by pull request 46136
[https://github.com/apache/spark/pull/46136]

> Do not display the splits parameter in Rang
> ---
>
> Key: SPARK-47914
> URL: https://issues.apache.org/jira/browse/SPARK-47914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: guihuawen
>Assignee: guihuawen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [SQL]
> explain extended select * from range(0, 4);
> plan
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedTableValuedFunction [range], [0, 4]
>  
> == Analyzed Logical Plan ==
> id: bigint
> Project [id#11L|#11L]
> +- Range (0, 4, step=1, splits=None)
>  
> == Optimized Logical Plan ==
> Range (0, 4, step=1, splits=None)
>  
> == Physical Plan ==
> *(1) Range (0, 4, step=1, splits=1)
>  
> The splits parameter is only set during the physical execution phase, yet it 
> is also displayed as None in the logical plans, which is not very 
> user-friendly. Showing it only in the physical plan would help users.
>  
>  






[jira] [Assigned] (SPARK-47914) Do not display the splits parameter in Rang

2024-05-07 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47914:


Assignee: guihuawen

> Do not display the splits parameter in Rang
> ---
>
> Key: SPARK-47914
> URL: https://issues.apache.org/jira/browse/SPARK-47914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: guihuawen
>Assignee: guihuawen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [SQL]
> explain extended select * from range(0, 4);
> plan
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedTableValuedFunction [range], [0, 4]
>  
> == Analyzed Logical Plan ==
> id: bigint
> Project [id#11L|#11L]
> +- Range (0, 4, step=1, splits=None)
>  
> == Optimized Logical Plan ==
> Range (0, 4, step=1, splits=None)
>  
> == Physical Plan ==
> *(1) Range (0, 4, step=1, splits=1)
>  
> The splits parameter is only set during the physical execution phase, yet it 
> is also displayed as None in the logical plans, which is not very 
> user-friendly. Showing it only in the physical plan would help users.
>  
>  






[jira] [Updated] (SPARK-48185) Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo'

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48185:
---
Labels: pull-request-available  (was: )

> Fix 'symbolic reference class is not accessible: class 
> sun.util.calendar.ZoneInfo'
> --
>
> Key: SPARK-48185
> URL: https://issues.apache.org/jira/browse/SPARK-48185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48185) Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo'

2024-05-07 Thread Kent Yao (Jira)
Kent Yao created SPARK-48185:


 Summary: Fix 'symbolic reference class is not accessible: class 
sun.util.calendar.ZoneInfo'
 Key: SPARK-48185
 URL: https://issues.apache.org/jira/browse/SPARK-48185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao









[jira] [Updated] (SPARK-48183) Update error contribution guide to respect new error class file

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48183:
---
Labels: pull-request-available  (was: )

> Update error contribution guide to respect new error class file
> ---
>
> Key: SPARK-48183
> URL: https://issues.apache.org/jira/browse/SPARK-48183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We moved the error class definitions from .py to .json, but the documentation 
> still describes the old behavior. We should update it.






[jira] [Created] (SPARK-48183) Update error contribution guide to respect new error class file

2024-05-07 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-48183:
---

 Summary: Update error contribution guide to respect new error 
class file
 Key: SPARK-48183
 URL: https://issues.apache.org/jira/browse/SPARK-48183
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


We moved the error class definitions from .py to .json, but the documentation 
still describes the old behavior. We should update it.






[jira] [Updated] (SPARK-47365) Add toArrowTable() DataFrame method to PySpark

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47365:
-
Description: 
Over in the Apache Arrow community, we hear from a lot of users who want to 
return the contents of a PySpark DataFrame as a [PyArrow 
Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
Currently the only documented way to do this is:

*PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*

This adds significant overhead compared to going direct from PySpark DataFrame 
to PyArrow Table. Since [PySpark already goes through PyArrow to convert to 
pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
 would it be possible to publicly expose a (possibly experimental) 
*toArrowTable()* method of the Spark DataFrame class?

  was:
Over in the Apache Arrow community, we hear from a lot of users who want to 
return the contents of a PySpark DataFrame as a [PyArrow 
Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
Currently the only documented way to do this is:

*PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*

This adds significant overhead compared to going direct from PySpark DataFrame 
to PyArrow Table. Since [PySpark already goes through PyArrow to convert to 
pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
 would it be possible to publicly expose an experimental *_toArrowTable()* 
method of the Spark DataFrame class?


> Add toArrowTable() DataFrame method to PySpark
> --
>
> Key: SPARK-47365
> URL: https://issues.apache.org/jira/browse/SPARK-47365
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>
> Over in the Apache Arrow community, we hear from a lot of users who want to 
> return the contents of a PySpark DataFrame as a [PyArrow 
> Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
> Currently the only documented way to do this is:
> *PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*
> This adds significant overhead compared to going direct from PySpark 
> DataFrame to PyArrow Table. Since [PySpark already goes through PyArrow to 
> convert to 
> pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
>  would it be possible to publicly expose a (possibly experimental) 
> *toArrowTable()* method of the Spark DataFrame class?






[jira] [Resolved] (SPARK-48126) Make spark.log.structuredLogging.enabled effective

2024-05-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-48126.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46452
[https://github.com/apache/spark/pull/46452]

> Make spark.log.structuredLogging.enabled effective
> --
>
> Key: SPARK-48126
> URL: https://issues.apache.org/jira/browse/SPARK-48126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, the Spark conf spark.log.structuredLogging.enabled is not taking 
> effect. We need to fix it.






[jira] [Updated] (SPARK-47365) Add toArrowTable() DataFrame method to PySpark

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47365:
-
Summary: Add toArrowTable() DataFrame method to PySpark  (was: Add 
_toArrowTable() DataFrame method to PySpark)

> Add toArrowTable() DataFrame method to PySpark
> --
>
> Key: SPARK-47365
> URL: https://issues.apache.org/jira/browse/SPARK-47365
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>
> Over in the Apache Arrow community, we hear from a lot of users who want to 
> return the contents of a PySpark DataFrame as a [PyArrow 
> Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
> Currently the only documented way to do this is:
> *PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*
> This adds significant overhead compared to going direct from PySpark 
> DataFrame to PyArrow Table. Since [PySpark already goes through PyArrow to 
> convert to 
> pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
>  would it be possible to publicly expose an experimental *_toArrowTable()* 
> method of the Spark DataFrame class?






[jira] [Assigned] (SPARK-48045) Pandas API groupby with multi-agg-relabel ignores as_index=False

2024-05-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48045:


Assignee: Saidatt Sinai Amonkar

> Pandas API groupby with multi-agg-relabel ignores as_index=False
> 
>
> Key: SPARK-48045
> URL: https://issues.apache.org/jira/browse/SPARK-48045
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.5.1
> Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
>Reporter: Paul George
>Assignee: Saidatt Sinai Amonkar
>Priority: Minor
>  Labels: pull-request-available
>
> A Pandas API DataFrame groupby with as_index=False and a multilevel 
> relabeling, such as
> {code:java}
> from pyspark import pandas as ps
> ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", 
> as_index=False).agg(b_max=("b", "max")){code}
> fails to include group keys in the resulting DataFrame. This diverges from 
> expected behavior as well as from the behavior of native Pandas, e.g.
> *actual*
> {code:java}
>    b_max
> 0      1 {code}
> *expected*
> {code:java}
>    a  b_max
> 0  0      1 {code}
>  
> A possible fix is to prepend groupby key columns to {{*order*}} and 
> {{*columns*}} before filtering here:  
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
>  
>  
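The possible fix described above can be sketched in plain Python (a hypothetical helper, not the actual pyspark.pandas internals): with as_index=False, the group-key columns must be prepended to the selected output columns, otherwise the keys are dropped, as in the *actual* output shown above.

```python
def output_columns(group_keys, agg_columns, as_index):
    # With as_index=True the keys become the index, so only the aggregated
    # columns are selected; with as_index=False the keys must survive as
    # regular columns, hence the prepend.
    if as_index:
        return list(agg_columns)
    return list(group_keys) + list(agg_columns)

# Mirrors the ticket's example: groupby("a", ...).agg(b_max=("b", "max"))
assert output_columns(["a"], ["b_max"], as_index=True) == ["b_max"]
assert output_columns(["a"], ["b_max"], as_index=False) == ["a", "b_max"]
```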






[jira] [Resolved] (SPARK-48045) Pandas API groupby with multi-agg-relabel ignores as_index=False

2024-05-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48045.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46391
[https://github.com/apache/spark/pull/46391]

> Pandas API groupby with multi-agg-relabel ignores as_index=False
> 
>
> Key: SPARK-48045
> URL: https://issues.apache.org/jira/browse/SPARK-48045
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.5.1
> Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
>Reporter: Paul George
>Assignee: Saidatt Sinai Amonkar
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> A Pandas API DataFrame groupby with as_index=False and a multilevel 
> relabeling, such as
> {code:java}
> from pyspark import pandas as ps
> ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", 
> as_index=False).agg(b_max=("b", "max")){code}
> fails to include group keys in the resulting DataFrame. This diverges from 
> expected behavior as well as from the behavior of native Pandas, e.g.
> *actual*
> {code:java}
>    b_max
> 0      1 {code}
> *expected*
> {code:java}
>    a  b_max
> 0  0      1 {code}
>  
> A possible fix is to prepend groupby key columns to {{*order*}} and 
> {{*columns*}} before filtering here:  
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328]
>  
>  






[jira] [Resolved] (SPARK-48152) Make spark-profiler as a part of release and publish to maven central repo

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48152.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46402
[https://github.com/apache/spark/pull/46402]

> Make spark-profiler as a part of release and publish to maven central repo
> --
>
> Key: SPARK-48152
> URL: https://issues.apache.org/jira/browse/SPARK-48152
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48152) Make spark-profiler as a part of release and publish to maven central repo

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48152:
-

Assignee: BingKun Pan

> Make spark-profiler as a part of release and publish to maven central repo
> --
>
> Key: SPARK-48152
> URL: https://issues.apache.org/jira/browse/SPARK-48152
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47960) Support Chaining Stateful Operators in TransformWithState

2024-05-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47960:


Assignee: Bhuwan Sahni

> Support Chaining Stateful Operators in TransformWithState
> -
>
> Key: SPARK-47960
> URL: https://issues.apache.org/jira/browse/SPARK-47960
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This issue tracks adding support to chain stateful operators after the 
> Arbitrary State API, transformWithState. In order to support chaining, we 
> need to allow the user to specify the new eventTimeColumn in the output from 
> StatefulProcessor. Any watermark evaluation expressions downstream after 
> transformWithState would use the user-specified eventTimeColumn.



--



[jira] [Resolved] (SPARK-47960) Support Chaining Stateful Operators in TransformWithState

2024-05-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47960.
--
Resolution: Fixed

Issue resolved by pull request 45376
[https://github.com/apache/spark/pull/45376]

> Support Chaining Stateful Operators in TransformWithState
> -
>
> Key: SPARK-47960
> URL: https://issues.apache.org/jira/browse/SPARK-47960
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This issue tracks adding support to chain stateful operators after the 
> Arbitrary State API, transformWithState. In order to support chaining, we 
> need to allow the user to specify the new eventTimeColumn in the output from 
> StatefulProcessor. Any watermark evaluation expressions downstream after 
> transformWithState would use the user-specified eventTimeColumn.



--



[jira] [Updated] (SPARK-48126) Make spark.log.structuredLogging.enabled effective

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48126:
---
Labels: pull-request-available  (was: )

> Make spark.log.structuredLogging.enabled effective
> --
>
> Key: SPARK-48126
> URL: https://issues.apache.org/jira/browse/SPARK-48126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the Spark conf spark.log.structuredLogging.enabled is not taking 
> effect. We need to fix it.



--



[jira] [Commented] (SPARK-48180) Analyzer bug with multiple ORDER BY items for input table argument

2024-05-07 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844471#comment-17844471
 ] 

Daniel commented on SPARK-48180:


Fix here: [https://github.com/apache/spark/pull/46451]

> Analyzer bug with multiple ORDER BY items for input table argument
> --
>
> Key: SPARK-48180
> URL: https://issues.apache.org/jira/browse/SPARK-48180
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce:
>  
> {{from pyspark.sql.functions import udtf}}
> {{@udtf(returnType="a: int, b: int")}}
> {{class tvf:}}
> {{  def eval(self, *args):}}
> {{    yield 1, 2}}
>  
> {{SELECT * FROM tvf(}}
> {{  TABLE(}}
> {{    SELECT 1 AS device_id, 2 AS data_ds}}
> {{    )}}
> {{    WITH SINGLE PARTITION}}
> {{    ORDER BY device_id, data_ds}}
> {{ )}}
> {{[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] 
> Unsupported subquery expression: Table arguments are used in a function where 
> they are not supported:}}
> {{'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], 
> false}}
> {{   +- Project [1 AS device_id#336, 2 AS data_ds#337]}}
> {{      +- OneRowRelation}}



--



[jira] [Updated] (SPARK-48126) Make spark.log.structuredLogging.enabled effective

2024-05-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-48126:
---
Summary: Make spark.log.structuredLogging.enabled effective  (was: Make 
spark.log.structuredLogging.enabled effecitve)

> Make spark.log.structuredLogging.enabled effective
> --
>
> Key: SPARK-48126
> URL: https://issues.apache.org/jira/browse/SPARK-48126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the Spark conf spark.log.structuredLogging.enabled is not taking 
> effect. We need to fix it.



--



[jira] [Updated] (SPARK-48180) Analyzer bug with multiple ORDER BY items for input table argument

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48180:
---
Labels: pull-request-available  (was: )

> Analyzer bug with multiple ORDER BY items for input table argument
> --
>
> Key: SPARK-48180
> URL: https://issues.apache.org/jira/browse/SPARK-48180
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce:
>  
> {{from pyspark.sql.functions import udtf}}
> {{@udtf(returnType="a: int, b: int")}}
> {{class tvf:}}
> {{  def eval(self, *args):}}
> {{    yield 1, 2}}
>  
> {{SELECT * FROM tvf(}}
> {{  TABLE(}}
> {{    SELECT 1 AS device_id, 2 AS data_ds}}
> {{    )}}
> {{    WITH SINGLE PARTITION}}
> {{    ORDER BY device_id, data_ds}}
> {{ )}}
> {{[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] 
> Unsupported subquery expression: Table arguments are used in a function where 
> they are not supported:}}
> {{'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], 
> false}}
> {{   +- Project [1 AS device_id#336, 2 AS data_ds#337]}}
> {{      +- OneRowRelation}}



--



[jira] [Updated] (SPARK-48182) SQL (java side): Migrate `error/warn/info` with variables to structured logging framework

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48182:
---
Labels: pull-request-available  (was: )

> SQL (java side): Migrate `error/warn/info` with variables to structured 
> logging framework
> -
>
> Key: SPARK-48182
> URL: https://issues.apache.org/jira/browse/SPARK-48182
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>




--



[jira] [Created] (SPARK-48182) SQL (java side): Migrate `error/warn/info` with variables to structured logging framework

2024-05-07 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48182:
---

 Summary: SQL (java side): Migrate `error/warn/info` with variables 
to structured logging framework
 Key: SPARK-48182
 URL: https://issues.apache.org/jira/browse/SPARK-48182
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--



[jira] [Updated] (SPARK-48152) Make spark-profiler as a part of release and publish to maven central repo

2024-05-07 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-48152:

Summary: Make spark-profiler as a part of release and publish to maven 
central repo  (was: Publish the module `spark-profiler` to `maven central 
repository`)

> Make spark-profiler as a part of release and publish to maven central repo
> --
>
> Key: SPARK-48152
> URL: https://issues.apache.org/jira/browse/SPARK-48152
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--



[jira] [Resolved] (SPARK-48178) Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48178.
---
Fix Version/s: 3.5.2
   Resolution: Fixed

Issue resolved by pull request 46449
[https://github.com/apache/spark/pull/46449]

> Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed
> --
>
> Key: SPARK-48178
> URL: https://issues.apache.org/jira/browse/SPARK-48178
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2
>
>




--



[jira] [Commented] (SPARK-48180) Analyzer bug with multiple ORDER BY items for input table argument

2024-05-07 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844467#comment-17844467
 ] 

Daniel commented on SPARK-48180:


The bug is that parentheses are required around the two arguments in {{ORDER BY 
device_id, data_ds}}. Otherwise the SQL analyzer cannot tell the difference 
between ordering by an additional table column and passing another expression 
argument to the TVF.

It could help to improve the error message here to make this requirement more 
explicit.
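
Per the note above, the workaround is to wrap the sort keys in parentheses; a 
sketch of the adjusted query from the reproduction (not verified against a live 
session):

```sql
SELECT * FROM tvf(
  TABLE(
    SELECT 1 AS device_id, 2 AS data_ds
  )
  WITH SINGLE PARTITION
  ORDER BY (device_id, data_ds)
)
```

This keeps both sort keys inside a single parenthesized list, so the analyzer 
can attribute them to the table argument rather than treating `data_ds` as a 
separate expression argument to the TVF.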

> Analyzer bug with multiple ORDER BY items for input table argument
> --
>
> Key: SPARK-48180
> URL: https://issues.apache.org/jira/browse/SPARK-48180
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Daniel
>Priority: Major
>
> Steps to reproduce:
>  
> {{from pyspark.sql.functions import udtf}}
> {{@udtf(returnType="a: int, b: int")}}
> {{class tvf:}}
> {{  def eval(self, *args):}}
> {{    yield 1, 2}}
>  
> {{SELECT * FROM tvf(}}
> {{  TABLE(}}
> {{    SELECT 1 AS device_id, 2 AS data_ds}}
> {{    )}}
> {{    WITH SINGLE PARTITION}}
> {{    ORDER BY device_id, data_ds}}
> {{ )}}
> {{[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] 
> Unsupported subquery expression: Table arguments are used in a function where 
> they are not supported:}}
> {{'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], 
> false}}
> {{   +- Project [1 AS device_id#336, 2 AS data_ds#337]}}
> {{      +- OneRowRelation}}



--



[jira] [Assigned] (SPARK-48178) Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48178:
-

Assignee: Dongjoon Hyun

> Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed
> --
>
> Key: SPARK-48178
> URL: https://issues.apache.org/jira/browse/SPARK-48178
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Created] (SPARK-48181) Unify StreamingPythonRunner and PythonPlannerRunner

2024-05-07 Thread Wei Liu (Jira)
Wei Liu created SPARK-48181:
---

 Summary: Unify StreamingPythonRunner and PythonPlannerRunner
 Key: SPARK-48181
 URL: https://issues.apache.org/jira/browse/SPARK-48181
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SS
Affects Versions: 4.0.0
Reporter: Wei Liu


We should unify the two driver-side Python runners for PySpark. To do this, we 
should migrate off StreamingPythonRunner and enhance PythonPlannerRunner with 
streaming support (a multiple read/write loop)
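
To make "multiple read/write loop" concrete, here is a schematic of a 
length-prefixed message loop of the kind a streaming runner repeats per 
micro-batch. The framing format (4-byte big-endian length plus UTF-8 payload) 
and the STOP sentinel are illustrative assumptions, not Spark's actual wire 
protocol:

```python
import io
import struct

def write_msg(out, payload: str) -> None:
    # Frame a message: 4-byte big-endian length, then UTF-8 payload.
    data = payload.encode("utf-8")
    out.write(struct.pack(">i", len(data)))
    out.write(data)

def read_msg(inp) -> str:
    # Read one framed message back.
    (n,) = struct.unpack(">i", inp.read(4))
    return inp.read(n).decode("utf-8")

def serve(inp, out, handler) -> None:
    # Multiple read/write iterations over one connection, instead of a
    # single request/response exchange ("STOP" is a hypothetical sentinel).
    while True:
        msg = read_msg(inp)
        if msg == "STOP":
            break
        write_msg(out, handler(msg))

# Drive the loop in-memory with two requests followed by STOP.
req = io.BytesIO()
for m in ["a", "b", "STOP"]:
    write_msg(req, m)
req.seek(0)
resp = io.BytesIO()
serve(req, resp, lambda m: m.upper())
resp.seek(0)
print(read_msg(resp), read_msg(resp))  # A B
```

A non-streaming planner runner would perform the read/handle/write sequence 
exactly once; the enhancement described above amounts to keeping that loop open.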



--



[jira] [Updated] (SPARK-48178) Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48178:
--
Summary: Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed 
 (was: Run `build/scala-211/java-11-17` jobs of branch-3.5 only if needed)

> Run `build/scala-213/java-11-17` jobs of branch-3.5 only if needed
> --
>
> Key: SPARK-48178
> URL: https://issues.apache.org/jira/browse/SPARK-48178
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Resolved] (SPARK-48179) Pin `nbsphinx` to `0.9.3`

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48179.
---
Fix Version/s: 3.5.2
   Resolution: Fixed

Issue resolved by pull request 46448
[https://github.com/apache/spark/pull/46448]

>  Pin `nbsphinx` to `0.9.3`
> --
>
> Key: SPARK-48179
> URL: https://issues.apache.org/jira/browse/SPARK-48179
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2
>
>




--



[jira] [Assigned] (SPARK-48179) Pin `nbsphinx` to `0.9.3`

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48179:
-

Assignee: Dongjoon Hyun

>  Pin `nbsphinx` to `0.9.3`
> --
>
> Key: SPARK-48179
> URL: https://issues.apache.org/jira/browse/SPARK-48179
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Updated] (SPARK-48179) Pin `nbsphinx` to `0.9.3`

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48179:
---
Labels: pull-request-available  (was: )

>  Pin `nbsphinx` to `0.9.3`
> --
>
> Key: SPARK-48179
> URL: https://issues.apache.org/jira/browse/SPARK-48179
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Created] (SPARK-48180) Analyzer bug with multiple ORDER BY items for input table argument

2024-05-07 Thread Daniel (Jira)
Daniel created SPARK-48180:
--

 Summary: Analyzer bug with multiple ORDER BY items for input table 
argument
 Key: SPARK-48180
 URL: https://issues.apache.org/jira/browse/SPARK-48180
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Reporter: Daniel


Steps to reproduce:

 

{{from pyspark.sql.functions import udtf}}

{{@udtf(returnType="a: int, b: int")}}
{{class tvf:}}
{{  def eval(self, *args):}}
{{    yield 1, 2}}

 

{{SELECT * FROM tvf(}}
{{  TABLE(}}
{{    SELECT 1 AS device_id, 2 AS data_ds}}
{{    )}}
{{    WITH SINGLE PARTITION}}
{{    ORDER BY device_id, data_ds}}
{{ )}}


{{[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] 
Unsupported subquery expression: Table arguments are used in a function where 
they are not supported:}}
{{'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], 
false}}
{{   +- Project [1 AS device_id#336, 2 AS data_ds#337]}}
{{      +- OneRowRelation}}
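
Stripped of the Spark machinery, the generator at the top of the reproduction 
can be exercised directly; a plain-Python sketch (the @udtf decorator and the 
SQL invocation require a live Spark session and are omitted here):

```python
# Plain-Python stand-in for the UDTF body in the reproduction above.
class tvf:
    def eval(self, *args):
        # One fixed row matching the declared "a: int, b: int" schema.
        yield 1, 2

rows = list(tvf().eval())
print(rows)  # [(1, 2)]
```

The failure itself is purely on the SQL analysis side, so the Python half of 
the reproduction behaves the same regardless of how the ORDER BY clause is 
written.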



--



[jira] [Created] (SPARK-48179) Pin `nbsphinx` to `0.9.3`

2024-05-07 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48179:
-

 Summary:  Pin `nbsphinx` to `0.9.3`
 Key: SPARK-48179
 URL: https://issues.apache.org/jira/browse/SPARK-48179
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.5.2
Reporter: Dongjoon Hyun






--



[jira] [Updated] (SPARK-48178) Run `build/scala-211/java-11-17` jobs of branch-3.5 only if needed

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48178:
---
Labels: pull-request-available  (was: )

> Run `build/scala-211/java-11-17` jobs of branch-3.5 only if needed
> --
>
> Key: SPARK-48178
> URL: https://issues.apache.org/jira/browse/SPARK-48178
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.5.2
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Created] (SPARK-48178) Run `build/scala-211/java-11-17` jobs of branch-3.5 only if needed

2024-05-07 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48178:
-

 Summary: Run `build/scala-211/java-11-17` jobs of branch-3.5 only 
if needed
 Key: SPARK-48178
 URL: https://issues.apache.org/jira/browse/SPARK-48178
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.5.2
Reporter: Dongjoon Hyun






--



[jira] [Updated] (SPARK-48177) Upgrade `Parquet` to 1.14.0

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48177:
--
Summary: Upgrade `Parquet` to 1.14.0  (was: Bump Parquet to 1.14.0)

> Upgrade `Parquet` to 1.14.0
> ---
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Updated] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48177:
--
Affects Version/s: 4.0.0
   (was: 3.5.2)

> Bump Parquet to 1.14.0
> --
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--



[jira] [Assigned] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48177:
-

Assignee: Fokko Driesprong

> Bump Parquet to 1.14.0
> --
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Updated] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48177:
--
Parent: SPARK-44111
Issue Type: Sub-task  (was: Improvement)

> Bump Parquet to 1.14.0
> --
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Updated] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48177:
--
Fix Version/s: (was: 4.0.0)

> Bump Parquet to 1.14.0
> --
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>




--



[jira] [Updated] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48177:
---
Labels: pull-request-available  (was: )

> Bump Parquet to 1.14.0
> --
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.2
>Reporter: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--



[jira] [Created] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-48177:


 Summary: Bump Parquet to 1.14.0
 Key: SPARK-48177
 URL: https://issues.apache.org/jira/browse/SPARK-48177
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.2
Reporter: Fokko Driesprong
 Fix For: 4.0.0






--



[jira] [Resolved] (SPARK-48134) Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework

2024-05-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-48134.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46390
[https://github.com/apache/spark/pull/46390]

> Spark core (java side): Migrate `error/warn/info` with variables to 
> structured logging framework
> 
>
> Key: SPARK-48134
> URL: https://issues.apache.org/jira/browse/SPARK-48134
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--



[jira] [Assigned] (SPARK-48134) Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework

2024-05-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-48134:
--

Assignee: BingKun Pan

> Spark core (java side): Migrate `error/warn/info` with variables to 
> structured logging framework
> 
>
> Key: SPARK-48134
> URL: https://issues.apache.org/jira/browse/SPARK-48134
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>




--



[jira] [Created] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition

2024-05-07 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48176:


 Summary: Fix name of FIELD_ALREADY_EXISTS error condition
 Key: SPARK-48176
 URL: https://issues.apache.org/jira/browse/SPARK-48176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--



[jira] [Resolved] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48037.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46273
[https://github.com/apache/spark/pull/46273]

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0
>
>




--



[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48037:
---
Labels: correctness pull-request-available  (was: correctness)

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Blocker
>  Labels: correctness, pull-request-available
>




--



[jira] [Commented] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844388#comment-17844388
 ] 

Dongjoon Hyun commented on SPARK-48037:
---

Thank you, [~dzcxzl]. I raised the priority to `Blocker` for all future 
releases and added a label, `correctness`.

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>




--



[jira] [Assigned] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48037:
-

Assignee: dzcxzl

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Blocker
>  Labels: correctness
>







[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48037:
--
Affects Version/s: 3.4.3
   3.5.1
   4.0.0

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>







[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48037:
--
Target Version/s: 4.0.0, 3.5.2, 3.4.4

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.4.3
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>







[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48037:
--
Labels: correctness  (was: pull-request-available)

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Priority: Major
>  Labels: correctness
>







[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48037:
--
Priority: Blocker  (was: Major)

> SortShuffleWriter lacks shuffle write related metrics resulting in 
> potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>







[jira] [Updated] (SPARK-48146) Fix error with aggregate function in With child

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48146:
---
Labels: pull-request-available  (was: )

> Fix error with aggregate function in With child
> ---
>
> Key: SPARK-48146
> URL: https://issues.apache.org/jira/browse/SPARK-48146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> Right now, if we have an aggregate function in the child of a With 
> expression, we fail an assertion. However, queries like this used to work:
> {code:sql}
> select
> id between cast(max(id between 1 and 2) as int) and id
> from range(10)
> group by id
> {code}






[jira] [Resolved] (SPARK-41547) Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_functions

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41547.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46432
[https://github.com/apache/spark/pull/46432]

> Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_functions
> --
>
> Key: SPARK-41547
> URL: https://issues.apache.org/jira/browse/SPARK-41547
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> See https://issues.apache.org/jira/browse/SPARK-41548
> We should fix the tests.






[jira] [Resolved] (SPARK-48169) Use lazy BadRecordException cause for StaxXmlParser and JacksonParser

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48169.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46438
[https://github.com/apache/spark/pull/46438]

> Use lazy BadRecordException cause for StaxXmlParser and JacksonParser
> -
>
> Key: SPARK-48169
> URL: https://issues.apache.org/jira/browse/SPARK-48169
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> For now, since https://issues.apache.org/jira/browse/SPARK-48143, the old 
> constructor is used






[jira] [Created] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-07 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-48175:
-

 Summary: Store collation information in metadata and not in type 
for SER/DE
 Key: SPARK-48175
 URL: https://issues.apache.org/jira/browse/SPARK-48175
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic


Changing serialization and deserialization of collated strings so that the 
collation information is put in the metadata of the enclosing struct field - 
and then read back from there during parsing.






[jira] [Resolved] (SPARK-48165) Update `ap-loader` to 3.0-9

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48165.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46427
[https://github.com/apache/spark/pull/46427]

> Update `ap-loader` to 3.0-9
> ---
>
> Key: SPARK-48165
> URL: https://issues.apache.org/jira/browse/SPARK-48165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-48173) CheckAnalysis should see the entire query plan

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48173.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46439
[https://github.com/apache/spark/pull/46439]

> CheckAnalysis should see the entire query plan
> -
>
> Key: SPARK-48173
> URL: https://issues.apache.org/jira/browse/SPARK-48173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48173) CheckAnalysis should see the entire query plan

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48173:
-

Assignee: Wenchen Fan

> CheckAnalysis should see the entire query plan
> -
>
> Key: SPARK-48173
> URL: https://issues.apache.org/jira/browse/SPARK-48173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47297) Format expressions (all collations)

2024-05-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47297:
---

Assignee: Uroš Bojanić

> Format expressions (all collations)
> ---
>
> Key: SPARK-47297
> URL: https://issues.apache.org/jira/browse/SPARK-47297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47297) Format expressions (all collations)

2024-05-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47297.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46423
[https://github.com/apache/spark/pull/46423]

> Format expressions (all collations)
> ---
>
> Key: SPARK-47297
> URL: https://issues.apache.org/jira/browse/SPARK-47297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48171) Clean up the use of deprecated APIs related to `o.rocksdb.Logger`

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48171:
-

Assignee: Yang Jie

> Clean up the use of deprecated APIs related to `o.rocksdb.Logger`
> -
>
> Key: SPARK-48171
> URL: https://issues.apache.org/jira/browse/SPARK-48171
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * AbstractLogger constructor.
>  *
>  * Important: the log level set within
>  * the {@link org.rocksdb.Options} instance will be used as
>  * maximum log level of RocksDB.
>  *
>  * @param options {@link org.rocksdb.Options} instance.
>  *
>  * @deprecated Use {@link Logger#Logger(InfoLogLevel)} instead, e.g. {@code 
> new
>  * Logger(options.infoLogLevel())}.
>  */
> @Deprecated
> public Logger(final Options options) {
>   this(options.infoLogLevel());
> } {code}






[jira] [Resolved] (SPARK-48171) Clean up the use of deprecated APIs related to `o.rocksdb.Logger`

2024-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48171.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46436
[https://github.com/apache/spark/pull/46436]

> Clean up the use of deprecated APIs related to `o.rocksdb.Logger`
> -
>
> Key: SPARK-48171
> URL: https://issues.apache.org/jira/browse/SPARK-48171
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> /**
>  * AbstractLogger constructor.
>  *
>  * Important: the log level set within
>  * the {@link org.rocksdb.Options} instance will be used as
>  * maximum log level of RocksDB.
>  *
>  * @param options {@link org.rocksdb.Options} instance.
>  *
>  * @deprecated Use {@link Logger#Logger(InfoLogLevel)} instead, e.g. {@code 
> new
>  * Logger(options.infoLogLevel())}.
>  */
> @Deprecated
> public Logger(final Options options) {
>   this(options.infoLogLevel());
> } {code}






[jira] [Updated] (SPARK-47465) Remove experimental tag from toArrowTable() PySpark DataFrame method

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47465:
-
Description: 
As a follow-up to SPARK-47365:

What is needed to consider making the *toArrowTable()* PySpark DataFrame 
method non-experimental?

What can the Apache Arrow developers do to help with this?

  was:
As a follow-up to SPARK-47365:

What is needed to consider making the *toArrow()* PySpark DataFrame 
non-experimental?

What can the Apache Arrow developers do to help with this?


> Remove experimental tag from toArrowTable() PySpark DataFrame method
> 
>
> Key: SPARK-47465
> URL: https://issues.apache.org/jira/browse/SPARK-47465
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>
> As a follow-up to SPARK-47365:
> What is needed to consider making the *toArrowTable()* PySpark DataFrame 
> method non-experimental?
> What can the Apache Arrow developers do to help with this?






[jira] [Updated] (SPARK-47466) Add PySpark DataFrame method to return iterator of PyArrow RecordBatches

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47466:
-
Description: 
As a follow-up to SPARK-47365:

*toArrowTable()* is useful when the data is relatively small. For larger data, 
the best way to return the contents of a PySpark DataFrame in Arrow format is 
to return an iterator of [PyArrow 
RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].

  was:
As a follow-up to SPARK-47365:

*toArrow()* is useful when the data is relatively small. For larger data, the 
best way to return the contents of a PySpark DataFrame in Arrow format is to 
return an iterator of [PyArrow 
RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].
 


> Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
> 
>
> Key: SPARK-47466
> URL: https://issues.apache.org/jira/browse/SPARK-47466
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>
> As a follow-up to SPARK-47365:
> *toArrowTable()* is useful when the data is relatively small. For larger 
> data, the best way to return the contents of a PySpark DataFrame in Arrow 
> format is to return an iterator of [PyArrow 
> RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].






[jira] [Updated] (SPARK-47465) Remove experimental tag from toArrowTable() PySpark DataFrame method

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47465:
-
Summary: Remove experimental tag from toArrowTable() PySpark DataFrame 
method  (was: Remove experimental tag from toArrow() PySpark DataFrame method)

> Remove experimental tag from toArrowTable() PySpark DataFrame method
> 
>
> Key: SPARK-47465
> URL: https://issues.apache.org/jira/browse/SPARK-47465
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>
> As a follow-up to SPARK-47365:
> What is needed to consider making the *toArrow()* PySpark DataFrame 
> non-experimental?
> What can the Apache Arrow developers do to help with this?






[jira] [Updated] (SPARK-47365) Add _toArrowTable() DataFrame method to PySpark

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47365:
-
Summary: Add _toArrowTable() DataFrame method to PySpark  (was: Add 
_toArrow() DataFrame method to PySpark)

> Add _toArrowTable() DataFrame method to PySpark
> ---
>
> Key: SPARK-47365
> URL: https://issues.apache.org/jira/browse/SPARK-47365
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>
> Over in the Apache Arrow community, we hear from a lot of users who want to 
> return the contents of a PySpark DataFrame as a [PyArrow 
> Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
> Currently the only documented way to do this is:
> *PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*
> This adds significant overhead compared to going direct from PySpark 
> DataFrame to PyArrow Table. Since [PySpark already goes through PyArrow to 
> convert to 
> pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
>  would it be possible to publicly expose an experimental *_toArrow()* method 
> of the Spark DataFrame class?






[jira] [Updated] (SPARK-47365) Add _toArrowTable() DataFrame method to PySpark

2024-05-07 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-47365:
-
Description: 
Over in the Apache Arrow community, we hear from a lot of users who want to 
return the contents of a PySpark DataFrame as a [PyArrow 
Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
Currently the only documented way to do this is:

*PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*

This adds significant overhead compared to going direct from PySpark DataFrame 
to PyArrow Table. Since [PySpark already goes through PyArrow to convert to 
pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
 would it be possible to publicly expose an experimental *_toArrowTable()* 
method of the Spark DataFrame class?

  was:
Over in the Apache Arrow community, we hear from a lot of users who want to 
return the contents of a PySpark DataFrame as a [PyArrow 
Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
Currently the only documented way to do this is:

*PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*

This adds significant overhead compared to going direct from PySpark DataFrame 
to PyArrow Table. Since [PySpark already goes through PyArrow to convert to 
pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
 would it be possible to publicly expose an experimental *_toArrow()* method of 
the Spark DataFrame class?


> Add _toArrowTable() DataFrame method to PySpark
> ---
>
> Key: SPARK-47365
> URL: https://issues.apache.org/jira/browse/SPARK-47365
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
>
> Over in the Apache Arrow community, we hear from a lot of users who want to 
> return the contents of a PySpark DataFrame as a [PyArrow 
> Table|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html]. 
> Currently the only documented way to do this is:
> *PySpark DataFrame* --> *pandas DataFrame* --> *PyArrow Table*
> This adds significant overhead compared to going direct from PySpark 
> DataFrame to PyArrow Table. Since [PySpark already goes through PyArrow to 
> convert to 
> pandas|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html],
>  would it be possible to publicly expose an experimental *_toArrowTable()* 
> method of the Spark DataFrame class?






[jira] [Updated] (SPARK-48173) CheckAnalysis should see the entire query plan

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48173:
---
Labels: pull-request-available  (was: )

> CheckAnalysis should see the entire query plan
> -
>
> Key: SPARK-48173
> URL: https://issues.apache.org/jira/browse/SPARK-48173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48173) CheckAnalysis should see the entire query plan

2024-05-07 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-48173:
---

 Summary: CheckAnalysis should see the entire query plan
 Key: SPARK-48173
 URL: https://issues.apache.org/jira/browse/SPARK-48173
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan









[jira] [Updated] (SPARK-48169) Use lazy BadRecordException cause for StaxXmlParser and JacksonParser

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48169:
---
Labels: pull-request-available  (was: )

> Use lazy BadRecordException cause for StaxXmlParser and JacksonParser
> -
>
> Key: SPARK-48169
> URL: https://issues.apache.org/jira/browse/SPARK-48169
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
>
> For now, since https://issues.apache.org/jira/browse/SPARK-48143, the old 
> constructor is used






[jira] [Assigned] (SPARK-48143) UnivocityParser is slow when parsing partially-malformed CSV in PERMISSIVE mode

2024-05-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48143:
---

Assignee: Vladimir Golubev

> UnivocityParser is slow when parsing partially-malformed CSV in PERMISSIVE 
> mode
> ---
>
> Key: SPARK-48143
> URL: https://issues.apache.org/jira/browse/SPARK-48143
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
>
> Parsing partially-malformed CSV in permissive mode is slow due to heavy 
> exception construction






[jira] [Resolved] (SPARK-48143) UnivocityParser is slow when parsing partially-malformed CSV in PERMISSIVE mode

2024-05-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48143.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46400
[https://github.com/apache/spark/pull/46400]

> UnivocityParser is slow when parsing partially-malformed CSV in PERMISSIVE 
> mode
> ---
>
> Key: SPARK-48143
> URL: https://issues.apache.org/jira/browse/SPARK-48143
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Assignee: Vladimir Golubev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Parsing partially-malformed CSV in permissive mode is slow due to heavy 
> exception construction
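The lazy-cause idea behind the fix can be sketched in Python. This is an illustrative analogue only, not Spark's actual change (which lives in the Scala `BadRecordException` used by the parsers); the class and property names below are assumptions for the sketch.

```python
class BadRecord(Exception):
    """Sketch of an exception whose expensive cause is built lazily:
    construction is deferred until somebody actually inspects it."""

    def __init__(self, record, cause_factory):
        super().__init__(f"malformed record: {record!r}")
        self._cause_factory = cause_factory
        self._lazy_cause = None

    @property
    def lazy_cause(self):
        if self._lazy_cause is None:
            self._lazy_cause = self._cause_factory()  # heavy work runs here
        return self._lazy_cause


built = []

def expensive_cause():
    built.append(1)  # record that the heavy construction actually ran
    return ValueError("detailed parse failure with full stack trace")

err = BadRecord("a,b,", expensive_cause)
# In PERMISSIVE mode the cause is usually never inspected, so the
# expensive object is never constructed:
assert built == []
# Only when a caller asks for the cause is it built, exactly once:
assert isinstance(err.lazy_cause, ValueError)
assert built == [1]
```

Deferring the cause this way means rows that are merely recorded as malformed never pay the cost of building the detailed exception.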






[jira] [Updated] (SPARK-48172) Fix escaping issue for mysql

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48172:
---
Labels: pull-request-available  (was: )

> Fix escaping issue for mysql
> 
>
> Key: SPARK-48172
> URL: https://issues.apache.org/jira/browse/SPARK-48172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48172) Fix escaping issue for mysql

2024-05-07 Thread Mihailo Milosevic (Jira)
Mihailo Milosevic created SPARK-48172:
-

 Summary: Fix escaping issue for mysql
 Key: SPARK-48172
 URL: https://issues.apache.org/jira/browse/SPARK-48172
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Mihailo Milosevic









[jira] [Updated] (SPARK-48171) Clean up the use of deprecated APIs related to `o.rocksdb.Logger`

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48171:
---
Labels: pull-request-available  (was: )

> Clean up the use of deprecated APIs related to `o.rocksdb.Logger`
> -
>
> Key: SPARK-48171
> URL: https://issues.apache.org/jira/browse/SPARK-48171
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> /**
>  * AbstractLogger constructor.
>  *
>  * Important: the log level set within
>  * the {@link org.rocksdb.Options} instance will be used as
>  * maximum log level of RocksDB.
>  *
>  * @param options {@link org.rocksdb.Options} instance.
>  *
>  * @deprecated Use {@link Logger#Logger(InfoLogLevel)} instead, e.g. {@code 
> new
>  * Logger(options.infoLogLevel())}.
>  */
> @Deprecated
> public Logger(final Options options) {
>   this(options.infoLogLevel());
> } {code}






[jira] [Created] (SPARK-48171) Clean up the use of deprecated APIs related to `o.rocksdb.Logger`

2024-05-07 Thread Yang Jie (Jira)
Yang Jie created SPARK-48171:


 Summary: Clean up the use of deprecated APIs related to 
`o.rocksdb.Logger`
 Key: SPARK-48171
 URL: https://issues.apache.org/jira/browse/SPARK-48171
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie


{code:java}
/**
 * AbstractLogger constructor.
 *
 * Important: the log level set within
 * the {@link org.rocksdb.Options} instance will be used as
 * maximum log level of RocksDB.
 *
 * @param options {@link org.rocksdb.Options} instance.
 *
 * @deprecated Use {@link Logger#Logger(InfoLogLevel)} instead, e.g. {@code new
 * Logger(options.infoLogLevel())}.
 */
@Deprecated
public Logger(final Options options) {
  this(options.infoLogLevel());
} {code}






[jira] [Updated] (SPARK-48000) Hash join support for strings with collation

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48000:
---
Labels: pull-request-available  (was: )

> Hash join support for strings with collation
> 
>
> Key: SPARK-48000
> URL: https://issues.apache.org/jira/browse/SPARK-48000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




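The hash-join-under-collation idea can be sketched without Spark: normalize both join keys under the collation before hashing, so rows that are collation-equal land in the same bucket. A minimal plain-Python sketch, where `casefold` stands in for a case-insensitive collation (Spark's actual collation machinery is richer and lives in the JVM):

```python
from collections import defaultdict

def hash_join(left, right, key=lambda s: s.casefold()):
    """Build side: index `right` rows by their normalized (collated) key.
    Probe side: look up each `left` row by the same normalized key, so rows
    that are equal under the collation join even when their bytes differ."""
    table = defaultdict(list)
    for r in right:
        table[key(r)].append(r)
    return [(l, r) for l in left for r in table.get(key(l), [])]

left = ["Spark", "Flink"]
right = ["SPARK", "spark", "Beam"]
print(hash_join(left, right))
# [('Spark', 'SPARK'), ('Spark', 'spark')]
```

The essential point is that the *same* key function is applied on both build and probe sides; hashing the raw bytes instead would silently miss collation-equal matches.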



[jira] [Commented] (SPARK-48123) Provide a constant table schema for querying structured logs

2024-05-07 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844245#comment-17844245
 ] 

Steve Loughran commented on SPARK-48123:


This doesn't handle nested stack traces. I seem to have my comments here 
ignored, so let me repeat:

* Deeply nested exceptions are common, especially those coming from networks, 
where we have to translate things like AWS SDK errors into meaningful, 
well-known exceptions.
* These consist of a chain of exceptions, each with its own message and stack 
trace.
* Any log format which fails to anticipate or support these is inadequate for 
diagnosing a large portion of the stack traces a failing app will generate, 
thus destroying its utility value.

Has a decision been made to ignore my requirements?
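The chained-exception structure described above can be walked generically. A minimal Python sketch (the helper name `flatten_causes` is illustrative, not an API proposed in this thread) showing that each link in the chain carries its own type and message, which is what a structured log schema would need to capture:

```python
def flatten_causes(exc):
    """Walk an exception chain via __cause__/__context__, collecting each
    link's class name and message, outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))                      # guard against cycles
        chain.append((type(exc).__name__, str(exc)))
        exc = exc.__cause__ or exc.__context__
    return chain

# Simulate a translated network error wrapped into a higher-level exception.
try:
    try:
        raise ConnectionResetError("connection reset by peer")
    except ConnectionResetError as low:
        raise RuntimeError("translated AWS SDK error") from low
except RuntimeError as mid:
    caught = mid

for name, msg in flatten_causes(caught):
    print(f"{name}: {msg}")
```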

> Provide a constant table schema for querying structured logs
> 
>
> Key: SPARK-48123
> URL: https://issues.apache.org/jira/browse/SPARK-48123
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Provide a table schema LOG_SCHEMA so that users can load structured logs
> with the following:
> ```
> spark.read.schema(LOG_SCHEMA).json(logPath)
> ```



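Outside a live Spark session, the idea of a constant schema for structured logs can be sketched in plain Python. The field names below (`ts`, `level`, `msg`, `context`) are assumptions for illustration only, not the actual columns of Spark's LOG_SCHEMA:

```python
import json

# Hypothetical stand-in for a constant log schema: field name -> expected type.
LOG_SCHEMA = {"ts": str, "level": str, "msg": str, "context": dict}

def parse_log_line(line):
    """Parse one JSON log line against the fixed schema: keep only schema
    fields, type-check the ones present, and fill missing fields with None."""
    raw = json.loads(line)
    row = {}
    for field, typ in LOG_SCHEMA.items():
        value = raw.get(field)
        if value is not None and not isinstance(value, typ):
            raise TypeError(f"{field}: expected {typ.__name__}")
        row[field] = value
    return row

line = '{"ts": "2024-05-07T12:00:00", "level": "ERROR", "msg": "boom"}'
print(parse_log_line(line))
```

The benefit of a shared constant is exactly what the issue asks for: every reader of the logs uses one schema definition rather than re-deriving it per query.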



[jira] [Updated] (SPARK-41794) Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_column

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41794:
---
Labels: pull-request-available  (was: )

> Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_column
> ---
>
> Key: SPARK-41794
> URL: https://issues.apache.org/jira/browse/SPARK-41794
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> ==
> ERROR [0.901s]: test_column_accessor 
> (pyspark.sql.tests.connect.test_connect_column.SparkConnectTests)
> --
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests/connect/test_connect_column.py", 
> line 744, in test_column_accessor
> cdf.select(CF.col("z")[0], cdf.z[10], CF.col("z")[-10]).toPandas(),
>   File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 949, in 
> toPandas
> return self._session.client.to_pandas(query)
>   File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in 
> to_pandas
> return self._execute_and_fetch(req)
>   File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in 
> _execute_and_fetch
> self._handle_error(rpc_error)
>   File "/.../spark/python/pyspark/sql/connect/client.py", line 623, in 
> _handle_error
> raise SparkConnectException(status.message, info.reason) from None
> pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkArrayIndexOutOfBoundsException) [INVALID_ARRAY_INDEX] 
> The index 10 is out of bounds. The array has 3 elements. Use the SQL function 
> `get()` to tolerate accessing element at invalid index and return NULL 
> instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this 
> error.
> ==
> ERROR [0.245s]: test_column_arithmetic_ops 
> (pyspark.sql.tests.connect.test_connect_column.SparkConnectTests)
> --
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/tests/connect/test_connect_column.py", 
> line 799, in test_column_arithmetic_ops
> cdf.select(cdf.a % cdf["b"], cdf["a"] % 2, 12 % cdf.c).toPandas(),
>   File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 949, in 
> toPandas
> return self._session.client.to_pandas(query)
>   File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in 
> to_pandas
> return self._execute_and_fetch(req)
>   File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in 
> _execute_and_fetch
> self._handle_error(rpc_error)
>   File "/.../spark/python/pyspark/sql/connect/client.py", line 623, in 
> _handle_error
> raise SparkConnectException(status.message, info.reason) from None
> pyspark.sql.connect.client.SparkConnectException: 
> (org.apache.spark.SparkArithmeticException) [DIVIDE_BY_ZERO] Division by 
> zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. 
> If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}



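The error message in the traceback above points at SQL `get()` as the tolerant alternative to bracket indexing. A minimal Python analogue of that semantics (out-of-bounds returns None in place of SQL NULL rather than raising; hedged to the documented behavior, which may differ in edge cases):

```python
def get(arr, index):
    """Mimic Spark SQL's get(array, index): return the 0-based element,
    or None when the index is out of bounds (negative indices included),
    instead of raising INVALID_ARRAY_INDEX under ANSI mode."""
    if 0 <= index < len(arr):
        return arr[index]
    return None

z = [10, 20, 30]
print(get(z, 0))    # 10, like z[0]
print(get(z, 10))   # None instead of an out-of-bounds error
print(get(z, -10))  # None
```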



[jira] [Resolved] (SPARK-48086) Different Arrow versions in client and server

2024-05-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48086.
--
Fix Version/s: 3.5.2
   Resolution: Fixed

Issue resolved by pull request 46431
[https://github.com/apache/spark/pull/46431]

> Different Arrow versions in client and server 
> --
>
> Key: SPARK-48086
> URL: https://issues.apache.org/jira/browse/SPARK-48086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2
>
>
> {code}
> ==
> FAIL [1.071s]: test_pandas_udf_arrow_overflow 
> (pyspark.sql.tests.connect.test_parity_pandas_udf.PandasUDFParityTests.test_pandas_udf_arrow_overflow)
> --
> pyspark.errors.exceptions.connect.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 302, in _create_array
> return pa.Array.from_pandas(
>^
>   File "pyarrow/array.pxi", line 1054, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Integer value 128 not in range: -128 to 127
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1834, in main
> process()
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1826, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 531, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>
> 
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 104, in dump_stream
> for batch in iterator:
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 525, in init_stream_yield_batches
> batch = self._create_batch(series)
> ^^
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 511, in _create_batch
> arrs.append(self._create_array(s, t, arrow_cast=self._arrow_cast))
> ^
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 330, in _create_array
> raise PySparkValueError(error_msg % (series.dtype, series.na...
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_udf.py",
>  line 299, in test_pandas_udf_arrow_overflow
> with self.assertRaisesRegex(
> AssertionError: "Exception thrown when converting pandas.Series" does not 
> match "
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 302, in _create_array
> return pa.Array.from_pandas(
>^
>   File "pyarrow/array.pxi", line 1054, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Integer value 128 not in range: -128 to 127
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1834, in main
> process()
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1826, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> 

[jira] [Assigned] (SPARK-48086) Different Arrow versions in client and server

2024-05-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48086:


Assignee: Hyukjin Kwon

> Different Arrow versions in client and server 
> --
>
> Key: SPARK-48086
> URL: https://issues.apache.org/jira/browse/SPARK-48086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> ==
> FAIL [1.071s]: test_pandas_udf_arrow_overflow 
> (pyspark.sql.tests.connect.test_parity_pandas_udf.PandasUDFParityTests.test_pandas_udf_arrow_overflow)
> --
> pyspark.errors.exceptions.connect.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 302, in _create_array
> return pa.Array.from_pandas(
>^
>   File "pyarrow/array.pxi", line 1054, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Integer value 128 not in range: -128 to 127
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1834, in main
> process()
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1826, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 531, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>
> 
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 104, in dump_stream
> for batch in iterator:
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 525, in init_stream_yield_batches
> batch = self._create_batch(series)
> ^^
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 511, in _create_batch
> arrs.append(self._create_array(s, t, arrow_cast=self._arrow_cast))
> ^
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 330, in _create_array
> raise PySparkValueError(error_msg % (series.dtype, series.na...
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_udf.py",
>  line 299, in test_pandas_udf_arrow_overflow
> with self.assertRaisesRegex(
> AssertionError: "Exception thrown when converting pandas.Series" does not 
> match "
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 302, in _create_array
> return pa.Array.from_pandas(
>^
>   File "pyarrow/array.pxi", line 1054, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Integer value 128 not in range: -128 to 127
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1834, in main
> process()
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 1826, in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
>  line 531, in dump_stream
> Traceback (most recent call last):
>   File 
> 
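The failure above comes from Arrow refusing to store 128 in a signed 8-bit column. The same range check can be reproduced with Python's stdlib `struct`, independent of pyarrow (this only illustrates the bound, not pyarrow's conversion path):

```python
import struct

def fits_int8(value):
    """True if value fits a signed 8-bit integer (-128..127), the same
    bound the pyarrow traceback above is enforcing."""
    try:
        struct.pack("b", value)   # 'b' = signed char, exactly one byte
        return True
    except struct.error:
        return False

print(fits_int8(127))   # True
print(fits_int8(128))   # False: the overflow the test exercises
print(fits_int8(-128))  # True
```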

[jira] [Created] (SPARK-48169) Use lazy BadRecordException cause for StaxXmlParser and JacksonParser

2024-05-07 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48169:


 Summary: Use lazy BadRecordException cause for StaxXmlParser and 
JacksonParser
 Key: SPARK-48169
 URL: https://issues.apache.org/jira/browse/SPARK-48169
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


For now, since https://issues.apache.org/jira/browse/SPARK-48143, the old 
constructor is used.



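A lazy exception cause, built only if someone actually inspects it, can be sketched like this. The class name mirrors Spark's BadRecordException, but the code is an illustrative Python sketch, not Spark's Scala implementation:

```python
class BadRecordException(Exception):
    """Carries the failing record plus a zero-argument callable that builds
    the underlying cause on demand, so the (possibly expensive) cause is not
    constructed on the hot parsing path when nobody looks at it."""
    def __init__(self, record, make_cause):
        super().__init__(f"bad record: {record!r}")
        self.record = record
        self._make_cause = make_cause
        self._cause = None

    def cause(self):
        if self._cause is None:
            self._cause = self._make_cause()  # evaluated only on first access
        return self._cause

calls = []
def expensive_cause():
    calls.append(1)                 # count how often the cause is built
    return ValueError("malformed field")

err = BadRecordException("a,b,", expensive_cause)
print(len(calls))      # 0: cause not built yet
print(err.cause())
print(len(calls))      # 1: built exactly once, then cached
```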



[jira] [Updated] (SPARK-41547) Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_functions

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41547:
---
Labels: pull-request-available  (was: )

> Reenable ANSI mode in pyspark.sql.tests.connect.test_connect_functions
> --
>
> Key: SPARK-41547
> URL: https://issues.apache.org/jira/browse/SPARK-41547
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
>
> See https://issues.apache.org/jira/browse/SPARK-41548
> We should fix the tests.






[jira] [Created] (SPARK-48168) Add bitwise shifting operators support

2024-05-07 Thread Kent Yao (Jira)
Kent Yao created SPARK-48168:


 Summary: Add bitwise shifting operators support
 Key: SPARK-48168
 URL: https://issues.apache.org/jira/browse/SPARK-48168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






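Spark already exposes shiftleft/shiftright/shiftrightunsigned as SQL functions; operator syntax would map onto the same semantics. A plain-Python sketch of the three operations on 32-bit integers (the 0xFFFFFFFF mask emulates Java's `int` wrapping, which Spark follows for IntegerType; this is an illustration, not Spark code):

```python
MASK32 = 0xFFFFFFFF

def shiftleft(x, n):
    """x << n on a 32-bit signed int, wrapping like Java's int."""
    r = (x << n) & MASK32
    return r - (1 << 32) if r >= (1 << 31) else r

def shiftright(x, n):
    """Arithmetic >>: sign-extending, which is Python's native behavior."""
    return x >> n

def shiftrightunsigned(x, n):
    """Logical >>>: zero-filling shift of the 32-bit two's-complement bits."""
    return (x & MASK32) >> n

print(shiftleft(1, 3))            # 8
print(shiftright(-8, 2))          # -2 (sign bit preserved)
print(shiftrightunsigned(-8, 2))  # 1073741822 (zeros shifted in)
```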



[jira] [Assigned] (SPARK-47267) Hash functions should respect collation

2024-05-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47267:
---

Assignee: Uroš Bojanić

> Hash functions should respect collation
> ---
>
> Key: SPARK-47267
> URL: https://issues.apache.org/jira/browse/SPARK-47267
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> All functions in the `hash_funcs` group should respect collation.



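One way for a hash function to respect collation is to normalize the string under that collation before hashing, so collation-equal strings produce equal hashes. A hedged Python sketch using `casefold` as a stand-in for the case-insensitive UTF8_LCASE collation (real engines derive binary collation keys, e.g. via ICU; Spark's actual approach may differ):

```python
import hashlib

def collation_key(s, collation):
    """Map a string to a canonical form under the collation.
    A toy: only two collations, and casefold approximates case-insensitivity."""
    if collation == "UTF8_BINARY":
        return s
    if collation == "UTF8_LCASE":   # case-insensitive collation
        return s.casefold()
    raise ValueError(f"unknown collation: {collation}")

def collated_hash(s, collation):
    """Hash the collation key, not the raw bytes, so collation-equal
    strings hash equally."""
    return hashlib.md5(collation_key(s, collation).encode("utf-8")).hexdigest()

# Equal under the case-insensitive collation -> equal hashes:
print(collated_hash("Spark", "UTF8_LCASE") == collated_hash("SPARK", "UTF8_LCASE"))
# Distinct under the binary collation:
print(collated_hash("Spark", "UTF8_BINARY") == collated_hash("SPARK", "UTF8_BINARY"))
```

This matters for anything keyed on the hash, such as hash aggregation and hash joins, which is why the issue groups all of `hash_funcs` together.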



[jira] [Updated] (SPARK-47267) Hash functions should respect collation

2024-05-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47267:
---
Labels: pull-request-available  (was: )

> Hash functions should respect collation
> ---
>
> Key: SPARK-47267
> URL: https://issues.apache.org/jira/browse/SPARK-47267
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>
> All functions in the `hash_funcs` group should respect collation.





