[jira] [Created] (SPARK-48935) Restrictions on `collationId` should be added to the constructor of `StringType`

2024-07-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48935:
---

 Summary: Restrictions on `collationId` should be added to the 
constructor of `StringType`
 Key: SPARK-48935
 URL: https://issues.apache.org/jira/browse/SPARK-48935
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan
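
The ticket has no description yet. As a concrete illustration, here is a minimal Scala 
sketch of the kind of constructor guard the summary suggests; the registry object, the 
bound, and the class shape are illustrative assumptions, not Spark's actual internals:

{code:scala}
// Minimal sketch of the proposed restriction, assuming collation IDs are
// dense non-negative integers with a known upper bound.
object CollationRegistry {
  // Hypothetical count of registered collations.
  val CollationCount: Int = 4 // e.g. UTF8_BINARY, UTF8_LCASE, UNICODE, UNICODE_CI
}

class StringType(val collationId: Int) {
  // Fail fast on construction instead of surfacing a bad ID deep in execution.
  require(
    collationId >= 0 && collationId < CollationRegistry.CollationCount,
    s"Invalid collationId: $collationId")
}
{code}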









[jira] [Updated] (SPARK-48935) Restrictions on `collationId` should be added to the constructor of `StringType`

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48935:
---
Labels: pull-request-available  (was: )

> Restrictions on `collationId` should be added to the constructor of `StringType`
> --
>
> Key: SPARK-48935
> URL: https://issues.apache.org/jira/browse/SPARK-48935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48935) Restrictions on `collationId` should be added to the constructor of `StringType`

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48935:
--

Assignee: Apache Spark

> Restrictions on `collationId` should be added to the constructor of `StringType`
> --
>
> Key: SPARK-48935
> URL: https://issues.apache.org/jira/browse/SPARK-48935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48829) Upgrade `RoaringBitmap` to 1.2.0

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48829:
--

Assignee: (was: Apache Spark)

>  Upgrade `RoaringBitmap` to 1.2.0
> -
>
> Key: SPARK-48829
> URL: https://issues.apache.org/jira/browse/SPARK-48829
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48388:
--

Assignee: Apache Spark

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> By the SQL standard, SET is used to assign variable values in SQL scripts.
> On our end, SET is configured to work with certain Hive configs, so the 
> grammar is somewhat muddled; for that reason it was decided to use SET VAR 
> instead of SET to work with SQL variables.
> This deviates from the standard; we should figure out a way to use SET for 
> SQL variables and forbid setting Hive configs from SQL scripts.
>  
> For more details, the design doc can be found in the parent Jira item.
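
As an illustration of the split described above, here is a hedged Scala sketch using the 
SQL session-variable syntax from recent Spark releases; how SET should behave inside SQL 
scripts is exactly what this ticket revisits:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of today's split: SET VAR targets SQL variables, while plain SET
// targets configuration entries. The config key shown is just an example.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("DECLARE VARIABLE c INT DEFAULT 10")
spark.sql("SET VAR c = c - 1")                   // SQL variable assignment
spark.sql("SET spark.sql.shuffle.partitions=10") // config assignment, not a variable
spark.sql("SELECT c").show()                     // prints 9
{code}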






[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48388:
--

Assignee: (was: Apache Spark)

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> By the SQL standard, SET is used to assign variable values in SQL scripts.
> On our end, SET is configured to work with certain Hive configs, so the 
> grammar is somewhat muddled; for that reason it was decided to use SET VAR 
> instead of SET to work with SQL variables.
> This deviates from the standard; we should figure out a way to use SET for 
> SQL variables and forbid setting Hive configs from SQL scripts.
>  
> For more details, the design doc can be found in the parent Jira item.






[jira] [Commented] (SPARK-48292) Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status

2024-07-18 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866966#comment-17866966
 ] 

Steve Loughran commented on SPARK-48292:


What happens if a TA is authorized to commit, but doesn't return? A network 
partition can trigger this. The output file may appear consistent with the 
committed task after a second task is told to commit its TA, but the 
partitioned TA may commit later. The core MapReduce commit protocols say 
"exactly one of the TAs shall have its output committed" but don't guarantee 
it is the second one.

> Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage 
> when committed file not consistent with task status
> --
>
> Key: SPARK-48292
> URL: https://issues.apache.org/jira/browse/SPARK-48292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: L. C. Hsieh
>Assignee: angerszhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> When a task attempt fails but is authorized to do the task commit, 
> OutputCommitCoordinator fails the stage with a reason message saying that 
> the task commit succeeded, but the driver actually never knows whether a 
> task commit succeeded or not. We should update the reason message to make 
> it less confusing.
> See https://github.com/apache/spark/pull/36564#discussion_r1598660630






[jira] [Created] (SPARK-48936) Make spark-shell work with Spark Connect

2024-07-18 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48936:


 Summary: Make spark-shell work with Spark Connect
 Key: SPARK-48936
 URL: https://issues.apache.org/jira/browse/SPARK-48936
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


`bin/pyspark --remote` works, but `bin/spark-shell --remote` does not. We 
should make it work.
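
For context, here is a hedged sketch of the equivalent Scala-side connection with the 
Spark Connect client; the endpoint is an example, and the builder methods are assumed 
to match the Connect client documentation:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed Connect-client usage: attach to a Spark Connect server instead of
// creating a local driver, which is what `--remote` would wire up for the shell.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002") // example endpoint
  .getOrCreate()

spark.range(5).show()
{code}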






[jira] [Updated] (SPARK-48936) Make spark-shell work with Spark Connect

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48936:
---
Labels: pull-request-available  (was: )

> Make spark-shell work with Spark Connect
> --
>
> Key: SPARK-48936
> URL: https://issues.apache.org/jira/browse/SPARK-48936
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> `bin/pyspark --remote` works, but `bin/spark-shell --remote` does not. We 
> should make it work.






[jira] [Created] (SPARK-48937) Fix collation support for the StringToMap expression

2024-07-18 Thread Jira
Uroš Bojanić created SPARK-48937:


 Summary: Fix collation support for the StringToMap expression
 Key: SPARK-48937
 URL: https://issues.apache.org/jira/browse/SPARK-48937
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić









[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression

2024-07-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48937:
-
Description: 
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.

 
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: ).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
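
To make the desired behaviour concrete, here is a hedged, self-contained sketch of 
lowercase-collation-aware splitting; plain Java regex stands in for Spark's collation 
machinery, and this is not the actual implementation:

{code:scala}
import java.util.regex.Pattern

// Stand-in for UTF8_LCASE semantics: match delimiters case-insensitively
// instead of byte-for-byte.
def splitLcase(s: String, delim: String): Array[String] =
  Pattern.compile(Pattern.quote(delim),
    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).split(s, -1)

def strToMapLcase(s: String, pairDelim: String, kvDelim: String): Map[String, String] =
  splitLcase(s, pairDelim).map { pair =>
    val Array(k, v) = splitLcase(pair, kvDelim)
    k -> v
  }.toMap

// Example 2 above, under a lowercase collation:
// strToMapLcase("ay1xby2xcy3", "X", "Y") == Map("a" -> "1", "b" -> "2", "c" -> "3")
{code}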

> Fix collation support for the StringToMap expression
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
>  
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
> collation. The correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: ).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].




[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression

2024-07-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48937:
-
Description: 
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE collation. The 
correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

  was:
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> Fix collation support for the StringToMap expression
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing.

[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression

2024-07-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48937:
-
Description: 
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.

 
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

  was:
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.

 
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: ).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> Fix collation support for the StringToMap expression
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing.

[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression

2024-07-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48937:
-
Description: 
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

  was:
Enable collation support for the *StringToMap* built-in string function in 
Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
function is when given collated strings, and then move on to implementation 
and testing. You will find this expression in the *complexTypeCreator.scala* 
file. However, this expression is currently implemented as a pass-through 
function, which is wrong because it doesn't provide appropriate collation 
awareness for non-default delimiters.

 

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query will give the correct result, regardless of the collation.

 
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

 

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query will give the *incorrect* result, under UTF8_LCASE or UNICODE_CI 
collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
 

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
reflect how this function should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementations of similar functions within other 
open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringToMap* expression so 
that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
StringTypeBinaryLcase). To understand what changes were introduced in order to 
enable full collation support for other existing functions in Spark, take a 
look at the related Spark PRs and Jira tickets for completed tasks in this 
parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class. Also, refer to the Unicode Technical 
Standard for string 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> Fix collation support for the StringToMap expression
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing.

[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48937:
-
Summary: Fix collation support for the StringToMap expression (binary & 
lowercase collation only)  (was: Fix collation support for the StringToMap 
expression)

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867003#comment-17867003
 ] 

Uroš Bojanić commented on SPARK-48937:
--

[~psyren99] Here is an open ticket within the collation effort; please let me 
know whether you want to take this one.

The scope of this ticket is relatively small (~1 day of work), so a reasonable 
deadline for finishing this task would be 6 days from now (Wednesday 24 Jul 
2024).

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Updated] (SPARK-48338) Sql Scripting support for Spark SQL

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48338:
---
Labels: pull-request-available  (was: )

> Sql Scripting support for Spark SQL
> ---
>
> Key: SPARK-48338
> URL: https://issues.apache.org/jira/browse/SPARK-48338
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - 
> OSS.pdf
>
>
> Design doc for this feature is in attachment.
> High-level example of a SQL script:
> ```
> BEGIN
>   DECLARE c INT = 10;
>   WHILE c > 0 DO
> INSERT INTO tscript VALUES (c);
> SET c = c - 1;
>   END WHILE;
> END
> ```
> High-level motivation behind this feature:
> SQL Scripting gives customers the ability to develop complex ETL and analysis 
> entirely in SQL. Until now, customers have had to write verbose SQL 
> statements or combine SQL + Python to efficiently write business logic. 
> Coming from another system, customers have to choose whether or not they want 
> to migrate to pyspark. Some customers end up not using Spark because of this 
> gap. SQL Scripting is a key milestone towards enabling SQL practitioners to 
> write sophisticated queries, without the need to use pyspark. Further, SQL 
> Scripting is a necessary step towards support for SQL Stored Procedures, and 
> along with SQL Variables (released) and Temp Tables (in progress), will allow 
> for more seamless data warehouse migrations.






[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList

2024-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48791:
--
Issue Type: Bug  (was: Improvement)

> Perf regression due to accumulator registration overhead using 
> CopyOnWriteArrayList
> ---
>
> Key: SPARK-48791
> URL: https://issues.apache.org/jira/browse/SPARK-48791
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> We noticed a query performance regression and located the root cause: the 
> overhead introduced when registering accumulators using CopyOnWriteArrayList.
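
As a generic micro-sketch of why this hurts (not Spark code): CopyOnWriteArrayList 
copies its entire backing array on every add, so registering n accumulators one by 
one does O(n^2) copying in total:

{code:scala}
import java.util.concurrent.CopyOnWriteArrayList
import scala.collection.mutable.ArrayBuffer

// Each add below copies the whole backing array: quadratic work overall.
val cow = new CopyOnWriteArrayList[java.lang.Long]()
(1L to 50000L).foreach(i => cow.add(i))

// An append-friendly buffer (plus external synchronization if shared) keeps
// each registration amortized O(1).
val buf = ArrayBuffer.empty[Long]
(1L to 50000L).foreach(i => buf += i)
{code}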






[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList

2024-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48791:
--
Fix Version/s: 3.5.3

> Perf regression due to accumulator registration overhead using 
> CopyOnWriteArrayList
> ---
>
> Key: SPARK-48791
> URL: https://issues.apache.org/jira/browse/SPARK-48791
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> We noticed a query performance regression and located the root cause: the 
> overhead introduced when registering accumulators using CopyOnWriteArrayList.






[jira] [Commented] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList

2024-07-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867038#comment-17867038
 ] 

Dongjoon Hyun commented on SPARK-48791:
---

I added a fix version, 3.5.3, for now because this change arrives after the RC1 tagging.

> Perf regression due to accumulator registration overhead using 
> CopyOnWriteArrayList
> ---
>
> Key: SPARK-48791
> URL: https://issues.apache.org/jira/browse/SPARK-48791
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> We noticed a query performance regression and located the root cause: the 
> overhead introduced when registering accumulators using CopyOnWriteArrayList.






[jira] [Resolved] (SPARK-48890) Add Streaming related fields to log4j ThreadContext

2024-07-18 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-48890.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47340
[https://github.com/apache/spark/pull/47340]

> Add Streaming related fields to log4j ThreadContext
> ---
>
> Key: SPARK-48890
> URL: https://issues.apache.org/jira/browse/SPARK-48890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Some special information is needed for structured streaming queries. 
> Specifically, each query has a query_id and run_id. Also, if using 
> MicroBatchExecution (the default), there is a batch_id.
>  
> A (query_id, run_id, batch_id) tuple identifies the microbatch the streaming 
> query runs. Adding these fields to a ThreadContext would help, especially 
> when there are multiple queries running.
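
A hedged sketch of the idea; the field names come from the description, the 
ThreadContext calls are the standard log4j2 API, and the IDs here are stand-ins:

{code:scala}
import org.apache.logging.log4j.ThreadContext

val queryId = java.util.UUID.randomUUID().toString // stand-in identifiers
val runId   = java.util.UUID.randomUUID().toString
val batchId = 42L

// Tag every log line emitted on this thread with the streaming identifiers,
// so concurrent queries can be told apart; clean up afterwards.
ThreadContext.put("query_id", queryId)
ThreadContext.put("run_id", runId)
ThreadContext.put("batch_id", batchId.toString)
try {
  // ... run the microbatch; log4j output here carries the three fields
} finally {
  ThreadContext.remove("query_id")
  ThreadContext.remove("run_id")
  ThreadContext.remove("batch_id")
}
{code}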






[jira] [Assigned] (SPARK-48890) Add Streaming related fields to log4j ThreadContext

2024-07-18 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-48890:
--

Assignee: Wei Liu

> Add Streaming related fields to log4j ThreadContext
> ---
>
> Key: SPARK-48890
> URL: https://issues.apache.org/jira/browse/SPARK-48890
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>
> Some special information is needed for structured streaming queries. 
> Specifically, each query has a query_id and run_id. Also, if using 
> MicroBatchExecution (the default), there is a batch_id.
>  
> A (query_id, run_id, batch_id) tuple identifies the microbatch the streaming 
> query runs. Adding these fields to a ThreadContext would help, especially 
> when there are multiple queries running.






[jira] [Updated] (SPARK-48929) View fails with internal error after upgrade causes expected syntax error.

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48929:
---
Labels: pull-request-available  (was: )

> View fails with internal error after upgrade causes expected syntax error.
> --
>
> Key: SPARK-48929
> URL: https://issues.apache.org/jira/browse/SPARK-48929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> On older Spark:
> CREATE VIEW v AS SELECT 1 ! IN (2);
> SELECT * FROM v;
> => true
> Upgrade to Spark 4:
> SELECT * FROM v;
> => Internal error
> This makes the problem hard to debug.
> Rather than assuming that a failure to parse a view's text is an internal 
> error, we should assume something like an upgrade broke it and expose the 
> actual error.






[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-18 Thread psyren99 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867106#comment-17867106
 ] 

psyren99 commented on SPARK-48937:
--

Yes, I do it

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Comment Edited] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-18 Thread psyren99 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867106#comment-17867106
 ] 

psyren99 edited comment on SPARK-48937 at 7/18/24 8:30 PM:
---

[~uros-db] Yes, I do it


was (Author: JIRAUSER306095):
Yes, I do it

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Comment Edited] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-18 Thread psyren99 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867106#comment-17867106
 ] 

psyren99 edited comment on SPARK-48937 at 7/18/24 8:31 PM:
---

[~uros-db] Yes, I'll do it


was (Author: JIRAUSER306095):
[~uros-db] Yes, I do it

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].






[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)

2024-07-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867108#comment-17867108
 ] 

Uroš Bojanić commented on SPARK-48937:
--

Ack. Feel free to ping me for review when you open a PR for this and let me 
know if you have any questions. Happy coding!

> Fix collation support for the StringToMap expression (binary & lowercase 
> collation only)
> 
>
> Key: SPARK-48937
> URL: https://issues.apache.org/jira/browse/SPARK-48937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringToMap* built-in string function in 
> Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this 
> function is when given collated strings, and then move on to implementation 
> and testing. You will find this expression in the *complexTypeCreator.scala* 
> file. However, this expression is currently implemented as a pass-through 
> function, which is wrong because it doesn't provide appropriate collation 
> awareness for non-default delimiters.
>  
> Example 1.
> {code:java}
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
> This query will give the correct result, regardless of the collation.
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Example 2.
> {code:java}
> SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
> This query will give the *incorrect* result, under UTF8_LCASE collation. The 
> correct result should be:
> {code:java}
> {"a":"1","b":"2","c":"3"}{code}
>  
> Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to 
> reflect how this function should be used with collation in SparkSQL, and feel 
> free to use your chosen Spark SQL Editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions within other 
> open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringToMap* expression so 
> that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. 
> StringTypeBinaryLcase). To understand what changes were introduced in order 
> to enable full collation support for other existing functions in Spark, take 
> a look at the related Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and the 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for string 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48495) Document planned approach to shredding

2024-07-18 Thread Russell Spitzer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867126#comment-17867126
 ] 

Russell Spitzer commented on SPARK-48495:
-

This was merged with a bug in the table markdown; I've added a small fix PR here:

https://github.com/apache/spark/pull/47407

> Document planned approach to shredding
> --
>
> Key: SPARK-48495
> URL: https://issues.apache.org/jira/browse/SPARK-48495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Assignee: David Cashman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48495) Document planned approach to shredding

2024-07-18 Thread Russell Spitzer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867126#comment-17867126
 ] 

Russell Spitzer edited comment on SPARK-48495 at 7/18/24 10:05 PM:
---

This was merged with a bug in the table markdown; I've added a small fix PR here:

https://github.com/apache/spark/pull/47407

[~David Cashman] + [~gurwls223]


was (Author: rspitzer):
This was merged with a bug in the table markdown, I've added a small fix PR here

https://github.com/apache/spark/pull/47407

> Document planned approach to shredding
> --
>
> Key: SPARK-48495
> URL: https://issues.apache.org/jira/browse/SPARK-48495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Assignee: David Cashman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48921) ScalaUDF in subquery should run through analyzer

2024-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48921.
---
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 47406
[https://github.com/apache/spark/pull/47406]

> ScalaUDF in subquery should run through analyzer
> 
>
> Key: SPARK-48921
> URL: https://issues.apache.org/jira/browse/SPARK-48921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> We got a customer issue that a `MergeInto` query on an Iceberg table worked 
> earlier but fails after upgrading to Spark 3.4.
> The error looks like:
> {code:java}
> Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> nullable on unresolved object
> upcast(getcolumnbyordinal(0, StringType), StringType, - root class: 
> java.lang.String).toString.
> {code}
> The source table of the `MergeInto` uses a `ScalaUDF`. The error happens when 
> Spark invokes the deserializer of the `ScalaUDF`'s input encoder while the 
> deserializer is not resolved yet.
> The encoders of a `ScalaUDF` are resolved by the rule `ResolveEncodersInUDF`, 
> which is applied at the end of the analysis phase.
> While rewriting the `MergeInto` to a `ReplaceData` query, Spark creates an 
> `Exists` subquery whose plan contains the `ScalaUDF`. Note that the `ScalaUDF` 
> is already resolved by the analyzer at this point.
> Then, the `ResolveSubquery` rule only resolves a subquery plan if it is not 
> resolved yet. Because the subquery containing the `ScalaUDF` is already 
> resolved, the rule skips it, so `ResolveEncodersInUDF` is never applied to it. 
> The analyzed `ReplaceData` query therefore contains a `ScalaUDF` with 
> unresolved encoders, which causes the error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48921) ScalaUDF in subquery should run through analyzer

2024-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48921:
--
Fix Version/s: 3.5.3
   (was: 3.5.2)

> ScalaUDF in subquery should run through analyzer
> 
>
> Key: SPARK-48921
> URL: https://issues.apache.org/jira/browse/SPARK-48921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> We got a customer issue that a `MergeInto` query on an Iceberg table worked 
> earlier but fails after upgrading to Spark 3.4.
> The error looks like:
> {code:java}
> Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> nullable on unresolved object
> upcast(getcolumnbyordinal(0, StringType), StringType, - root class: 
> java.lang.String).toString.
> {code}
> The source table of the `MergeInto` uses a `ScalaUDF`. The error happens when 
> Spark invokes the deserializer of the `ScalaUDF`'s input encoder while the 
> deserializer is not resolved yet.
> The encoders of a `ScalaUDF` are resolved by the rule `ResolveEncodersInUDF`, 
> which is applied at the end of the analysis phase.
> While rewriting the `MergeInto` to a `ReplaceData` query, Spark creates an 
> `Exists` subquery whose plan contains the `ScalaUDF`. Note that the `ScalaUDF` 
> is already resolved by the analyzer at this point.
> Then, the `ResolveSubquery` rule only resolves a subquery plan if it is not 
> resolved yet. Because the subquery containing the `ScalaUDF` is already 
> resolved, the rule skips it, so `ResolveEncodersInUDF` is never applied to it. 
> The analyzed `ReplaceData` query therefore contains a `ScalaUDF` with 
> unresolved encoders, which causes the error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46446) Correctness bug in correlated subquery with OFFSET

2024-07-18 Thread Andy Lam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867132#comment-17867132
 ] 

Andy Lam commented on SPARK-46446:
--

[~cloud_fan] Could we unresolve this ticket or create a new one? Decorrelation 
of subqueries with correlation under LIMIT with OFFSET hasn't been fixed, just 
disabled.

> Correctness bug in correlated subquery with OFFSET
> --
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
> introduced recently when support for correlation under OFFSET was enabled 
> without being handled correctly. (So we went from unsupported, where the 
> query throws an error, to wrong results.)
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS
>  
> It's easy to repro with a query like
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 
> 2){code}
> Correct result: empty set, see postgres: 
> [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] 
> Spark result: Array([2,2])
>  
> The 
> [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
>  where it was introduced added a test for it, but the golden file results for 
> the test were actually incorrect and we didn't notice. (The bug was initially 
> found by https://github.com/apache/spark/pull/44084)
> I'll work on both:
>  * Adding support for offset in DecorrelateInnerQuery (the transformation is 
> into a filter on a row_number window function, similar to limit; see the 
> sketch below).
>  * Adding a feature flag to enable/disable offset-in-subquery support
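> A hedged sketch of that rewrite shape on the repro tables above (illustrative 
> only, not the actual planner output):
> {code:python}
> # LIMIT 1 OFFSET 2 per correlation key becomes a row_number() filter:
> # keep rows with offset < rn <= offset + limit, partitioned by the key.
> spark.sql("""
>   SELECT x.* FROM x
>   WHERE EXISTS (
>     SELECT 1 FROM (
>       SELECT y1, row_number() OVER (PARTITION BY y1 ORDER BY y1) AS rn FROM y
>     ) w
>     WHERE w.y1 = x.x1 AND w.rn > 2 AND w.rn <= 3
>   )
> """).show()
> # expected: empty result, matching PostgreSQL
> {code}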



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48938) Improve error message when registering UDTFs

2024-07-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-48938:


 Summary: Improve error message when registering UDTFs
 Key: SPARK-48938
 URL: https://issues.apache.org/jira/browse/SPARK-48938
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Improve the error message when registering Python UDTFs
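
For context, a minimal registration flow looks like the hedged sketch below 
(the UDTF itself is illustrative); the improved message would presumably be 
raised when the object passed to register() is not a valid UDTF:
{code:python}
from pyspark.sql.functions import udtf

@udtf(returnType="word: string")
class SplitWords:
    def eval(self, text: str):
        # emit one row per whitespace-separated word
        for word in text.split(" "):
            yield (word,)

spark.udtf.register("split_words", SplitWords)
spark.sql("SELECT * FROM split_words('hello world')").show()
{code}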



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48938) Improve error message when registering UDTFs

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48938:
---
Labels: pull-request-available  (was: )

> Improve error message when registering UDTFs
> 
>
> Key: SPARK-48938
> URL: https://issues.apache.org/jira/browse/SPARK-48938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Improve the error message when registering Python UDTFs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48939) Support recursive reference of Avro schema

2024-07-18 Thread Yuchen Liu (Jira)
Yuchen Liu created SPARK-48939:
--

 Summary: Support recursive reference of Avro schema
 Key: SPARK-48939
 URL: https://issues.apache.org/jira/browse/SPARK-48939
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 4.0.0
Reporter: Yuchen Liu


We should support reading Avro messages with recursive references in the schema. 
A recursive reference denotes the case where the type of a field has already 
been defined in one of its parent nodes. A simple example is:

 
{code:java}
{
  "type": "record",
  "name": "LongList",
  "fields" : [
{"name": "value", "type": "long"},
{"name": "next", "type": ["null", "LongList"]}
  ]
}
{code}
This is written in Avro Schema DSL and represents a linked list data structure.
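
A hedged sketch of how such data might be consumed once supported. The 
recursiveFieldMaxDepth option below is hypothetical; it only mirrors the 
recursive.fields.max.depth option of the spark-protobuf connector:
{code:python}
from pyspark.sql.avro.functions import from_avro

long_list_schema = """
{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
"""

# hypothetical option: bound the recursion by unrolling at most 3 levels
raw = spark.read.format("binaryFile").load("/tmp/longlist-messages")
parsed = raw.select(
    from_avro("content", long_list_schema,
              {"recursiveFieldMaxDepth": "3"}).alias("list"))
parsed.printSchema()
{code}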



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48940) Upgrade `Arrow` to 17.0.0

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48940:
---
Labels: pull-request-available  (was: )

> Upgrade `Arrow` to 17.0.0
> -
>
> Key: SPARK-48940
> URL: https://issues.apache.org/jira/browse/SPARK-48940
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState

2024-07-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-48934:


Assignee: Siying Dong

> Python datetime types converted incorrectly for setting timeout in 
> applyInPandasWithState
> -
>
> Key: SPARK-48934
> URL: https://issues.apache.org/jira/browse/SPARK-48934
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>
> In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a 
> datetime.datetime value, it doesn't function as expected.
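> A hedged sketch of the call shape that misbehaves (handler and names are 
> illustrative):
> {code:python}
> import datetime
> from pyspark.sql.streaming.state import GroupState
> 
> def handle(key, pdf_iter, state: GroupState):
>     # Passing a datetime here is converted incorrectly, so the timeout
>     # does not fire when expected; passing epoch milliseconds is the
>     # documented form.
>     state.setTimeoutTimestamp(datetime.datetime(2024, 7, 18, 12, 0, 0))
>     yield from pdf_iter
> {code}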



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState

2024-07-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48934.
--
Fix Version/s: 4.0.0
   3.4.4
   3.5.3
   Resolution: Fixed

Issue resolved via [https://github.com/apache/spark/pull/47398]

 

> Python datetime types converted incorrectly for setting timeout in 
> applyInPandasWithState
> -
>
> Key: SPARK-48934
> URL: https://issues.apache.org/jira/browse/SPARK-48934
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Siying Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>
> In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a 
> datetime.datetime value, it doesn't function as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48388:
---

Assignee: David Milicevic

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> By the SQL standard, SET is used to set variable values in SQL scripts.
> On our end, SET is configured to work with some Hive configs, so the grammar 
> is a bit messed up; for that reason it was decided to use SET VAR instead of 
> SET to work with SQL variables (see the sketch below).
> This is not standard-compliant, and we should figure out a way to use SET for 
> SQL variables while forbidding the setting of Hive configs from SQL scripts.
>  
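> A hedged sketch of the two forms, assuming the Spark 3.5+ session-variable 
> support (DECLARE / SET VAR); the last line shows the desired standard 
> behaviour, not the current one:
> {code:python}
> spark.sql("DECLARE VARIABLE v INT DEFAULT 0")
> spark.sql("SET VAR v = 1")  # current required form for SQL variables
> spark.sql("SET v = 1")      # standard form: today this sets a config named
>                             # 'v'; in scripts it should target the variable
> {code}
>  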
> For more details, the design doc can be found in the parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48388.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47272
[https://github.com/apache/spark/pull/47272]

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> By the SQL standard, SET is used to set variable values in SQL scripts.
> On our end, SET is configured to work with some Hive configs, so the grammar 
> is a bit messed up; for that reason it was decided to use SET VAR instead of 
> SET to work with SQL variables.
> This is not standard-compliant, and we should figure out a way to use SET for 
> SQL variables while forbidding the setting of Hive configs from SQL scripts.
>  
> For more details, the design doc can be found in the parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState

2024-07-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-48934:
-
Fix Version/s: (was: 3.4.4)

> Python datetime types converted incorrectly for setting timeout in 
> applyInPandasWithState
> -
>
> Key: SPARK-48934
> URL: https://issues.apache.org/jira/browse/SPARK-48934
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a 
> datetime.datetime value, it doesn't function as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48941) PySparkML: Replace RDD read / write API invocation with Dataframe read / write API

2024-07-18 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-48941:
--

 Summary: PySparkML: Replace RDD read / write API invocation with 
Dataframe read / write API 
 Key: SPARK-48941
 URL: https://issues.apache.org/jira/browse/SPARK-48941
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Weichen Xu


PySparkML: Replace RDD read / write API invocation with Dataframe read / write 
API 
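
A hedged sketch of the direction, using ML metadata persistence as the example 
(paths and payload are illustrative; the real helpers live in pyspark.ml.util):
{code:python}
metadata_json = '{"class": "pyspark.ml.feature.Binarizer"}'  # illustrative

# before: metadata written through the RDD API
spark.sparkContext.parallelize([metadata_json], 1) \
    .saveAsTextFile("/tmp/model/metadata-rdd")

# after: the same payload through the DataFrame writer
spark.createDataFrame([(metadata_json,)], ["value"]).coalesce(1) \
    .write.mode("overwrite").text("/tmp/model/metadata-df")
{code}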



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48941) PySparkML: Replace RDD read / write API invocation with Dataframe read / write API

2024-07-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48941:
---
Labels: pull-request-available  (was: )

> PySparkML: Replace RDD read / write API invocation with Dataframe read / 
> write API 
> ---
>
> Key: SPARK-48941
> URL: https://issues.apache.org/jira/browse/SPARK-48941
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Weichen Xu
>Priority: Major
>  Labels: pull-request-available
>
> PySparkML: Replace RDD read / write API invocation with Dataframe read / 
> write API 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48933) Upgrade `protobuf-java` to `3.25.3`

2024-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48933.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47397
[https://github.com/apache/spark/pull/47397]

> Upgrade `protobuf-java` to `3.25.3`
> ---
>
> Key: SPARK-48933
> URL: https://issues.apache.org/jira/browse/SPARK-48933
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList

2024-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48791:
--
Fix Version/s: 3.4.4

> Perf regression due to accumulator registration overhead using 
> CopyOnWriteArrayList
> ---
>
> Key: SPARK-48791
> URL: https://issues.apache.org/jira/browse/SPARK-48791
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>
> We noticed a query perf regression and located the root cause: the overhead 
> introduced when registering accumulators using CopyOnWriteArrayList.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48942) Reading parquet with Array of Structs of UDTs throws Exception

2024-07-18 Thread James Willis (Jira)
James Willis created SPARK-48942:


 Summary: Reading parquet with Array of Structs of UDTs throws 
Exception
 Key: SPARK-48942
 URL: https://issues.apache.org/jira/browse/SPARK-48942
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 3.4.0
Reporter: James Willis


When reading Parquet files that have a column that is an Array of Structs where 
the Struct contains a UDT, the read fails. We have repro'd this in 3.4.0 and 
3.5.1.

I see there might be some related work on supporting UDTs in Parquet: 
https://issues.apache.org/jira/browse/SPARK-39086

 

I'm trying to create a UDT to repro this without Sedona.

 

I discovered this when using Apache Sedona. Here is my minimal reproducible 
example:
{code:java}
sedona.sql("""
SELECT ARRAY(STRUCT(ST_POINT(1.0, 1.1)))
""").write.mode("overwrite").format("parquet").save(test_read_write_path)

df = sedona.read.format("parquet").load(test_read_write_path)
df.show()
{code}
This gives the following stack trace:
{code:java}
24/07/19 03:31:23 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 24) 
(10.0.133.128 executor 1): java.lang.IllegalArgumentException: Spark type: 
StructType(StructField(col1,BinaryType,true)) doesn't match the type: 
StructType(StructField(col1,org.apache.spark.sql.sedona_sql.UDT.GeometryUDT@5b138d06,true))
 in column vector
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:69)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:116)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:272)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:292)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:286)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:228)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:290)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
at java.base/java.lang.Thread.run(Unknown Source)

24/07/19 03:31:23 WARN TaskSetManager: Lost task 0.1 in stage 22.0 (TID 25) 
(10.0.183.100 executor 5): java.lang.IllegalArgumentException: Spark type: 
StructType(StructField(col1,BinaryType,true)) doesn't match the type: 
StructType(StructField(col1,org.apache.spark.sql.sedona_sql.UDT.GeometryUDT@239780ef,true))
 in column vector
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:69)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:116)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(Vector

[jira] [Commented] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState

2024-07-18 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867173#comment-17867173
 ] 

Kent Yao commented on SPARK-48934:
--

Collected this to 3.5.2

> Python datetime types converted incorrectly for setting timeout in 
> applyInPandasWithState
> -
>
> Key: SPARK-48934
> URL: https://issues.apache.org/jira/browse/SPARK-48934
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a 
> datetime.datetime value, it doesn't function as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList

2024-07-18 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867175#comment-17867175
 ] 

Kent Yao commented on SPARK-48791:
--

Collected to 3.5.2

> Perf regression due to accumulator registration overhead using 
> CopyOnWriteArrayList
> ---
>
> Key: SPARK-48791
> URL: https://issues.apache.org/jira/browse/SPARK-48791
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> We noticed a query perf regression and located the root cause: the overhead 
> introduced when registering accumulators using CopyOnWriteArrayList.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList

2024-07-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48791:
-
Fix Version/s: 3.5.2
   (was: 3.5.3)

> Perf regression due to accumulator registration overhead using 
> CopyOnWriteArrayList
> ---
>
> Key: SPARK-48791
> URL: https://issues.apache.org/jira/browse/SPARK-48791
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.4
>
>
> We noticed a query perf regression and located the root cause: the overhead 
> introduced when registering accumulators using CopyOnWriteArrayList.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48921) ScalaUDF in subquery should run through analyzer

2024-07-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-48921:
-
Fix Version/s: 3.5.2
   (was: 3.5.3)

> ScalaUDF in subquery should run through analyzer
> 
>
> Key: SPARK-48921
> URL: https://issues.apache.org/jira/browse/SPARK-48921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.4.3
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> We got a customer issue that a `MergeInto` query on an Iceberg table worked 
> earlier but fails after upgrading to Spark 3.4.
> The error looks like:
> {code:java}
> Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> nullable on unresolved object
> upcast(getcolumnbyordinal(0, StringType), StringType, - root class: 
> java.lang.String).toString.
> {code}
> The source table of the `MergeInto` uses a `ScalaUDF`. The error happens when 
> Spark invokes the deserializer of the `ScalaUDF`'s input encoder while the 
> deserializer is not resolved yet.
> The encoders of a `ScalaUDF` are resolved by the rule `ResolveEncodersInUDF`, 
> which is applied at the end of the analysis phase.
> While rewriting the `MergeInto` to a `ReplaceData` query, Spark creates an 
> `Exists` subquery whose plan contains the `ScalaUDF`. Note that the `ScalaUDF` 
> is already resolved by the analyzer at this point.
> Then, the `ResolveSubquery` rule only resolves a subquery plan if it is not 
> resolved yet. Because the subquery containing the `ScalaUDF` is already 
> resolved, the rule skips it, so `ResolveEncodersInUDF` is never applied to it. 
> The analyzed `ReplaceData` query therefore contains a `ScalaUDF` with 
> unresolved encoders, which causes the error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org