[jira] [Created] (SPARK-48935) Restrictions on `collationId` should be added to the constructor of `StringType`
BingKun Pan created SPARK-48935: --- Summary: Restrictions on `collationId` should be added to the constructor of `StringType` Key: SPARK-48935 URL: https://issues.apache.org/jira/browse/SPARK-48935 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
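The ticket carries no description, so here is a minimal sketch of the kind of constructor-side guard the summary asks for. It is hypothetical: `CollationRegistry` and `numCollations` are illustrative stand-ins, not Spark's actual collation bookkeeping.

{code:scala}
// Hypothetical sketch: validate collationId at construction time instead of
// accepting arbitrary ints. Names are illustrative, not Spark's internals.
object CollationRegistry {
  // Assumption: the registry knows how many collations are defined.
  val numCollations: Int = 4
  def isValid(id: Int): Boolean = id >= 0 && id < numCollations
}

class StringType private (val collationId: Int) {
  require(CollationRegistry.isValid(collationId),
    s"collationId must be in [0, ${CollationRegistry.numCollations}), got $collationId")
}

object StringType {
  def apply(collationId: Int): StringType = new StringType(collationId)
}

// StringType(0) -> ok; StringType(99) -> IllegalArgumentException
{code}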
[jira] [Updated] (SPARK-48935) Restrictions on `collationId` should be added to the constructor of `StringType`
[ https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48935: --- Labels: pull-request-available (was: ) > Restrictions on `collationId` should be added to the constructor of `StringType` > -- > > Key: SPARK-48935 > URL: https://issues.apache.org/jira/browse/SPARK-48935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48935) Restrictions on `collationId` should be added to the constructor of `StringType`
[ https://issues.apache.org/jira/browse/SPARK-48935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48935: -- Assignee: Apache Spark > Restrictions on `collationId` should be added to the constructor of `StringType` > -- > > Key: SPARK-48935 > URL: https://issues.apache.org/jira/browse/SPARK-48935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48829) Upgrade `RoaringBitmap` to 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-48829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48829: -- Assignee: (was: Apache Spark) > Upgrade `RoaringBitmap` to 1.2.0 > - > > Key: SPARK-48829 > URL: https://issues.apache.org/jira/browse/SPARK-48829 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts
[ https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48388: -- Assignee: Apache Spark > [M0] Fix SET behavior for scripts > - > > Key: SPARK-48388 > URL: https://issues.apache.org/jira/browse/SPARK-48388 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > By the standard, SET is used to set variable values in SQL scripts. > On our end, SET is configured to work with some Hive configs, so the grammar > is a bit messed up; for that reason it was decided to use SET VAR instead > of SET to work with SQL variables. > This does not follow the standard, and we should figure out a way to use > SET for SQL variables and forbid setting Hive configs from SQL scripts. > > For more details, the design doc can be found in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
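To illustrate the split described above, a sketch using Spark's SQL session-variable syntax; it assumes a SparkSession named `spark` (as in spark-shell):

{code:scala}
// SQL variables currently go through SET VAR, while bare SET reaches configs.
spark.sql("DECLARE VARIABLE c INT DEFAULT 10")
spark.sql("SET VAR c = c - 1")                   // assigns the SQL variable
spark.sql("SELECT c").show()                     // 9
spark.sql("SET spark.sql.shuffle.partitions=8")  // bare SET targets a config
{code}

The ticket's goal is to let the bare `SET c = c - 1` form work for variables inside scripts, which today is reserved for configs.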
[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts
[ https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48388: -- Assignee: (was: Apache Spark) > [M0] Fix SET behavior for scripts > - > > Key: SPARK-48388 > URL: https://issues.apache.org/jira/browse/SPARK-48388 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Priority: Major > Labels: pull-request-available > > By the standard, SET is used to set variable values in SQL scripts. > On our end, SET is configured to work with some Hive configs, so the grammar > is a bit messed up; for that reason it was decided to use SET VAR instead > of SET to work with SQL variables. > This does not follow the standard, and we should figure out a way to use > SET for SQL variables and forbid setting Hive configs from SQL scripts. > > For more details, the design doc can be found in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48292) Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status
[ https://issues.apache.org/jira/browse/SPARK-48292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866966#comment-17866966 ] Steve Loughran commented on SPARK-48292: What happens if a TA is authorized to commit but doesn't return, as a network partition can trigger this? The output file may appear consistent with the committed task after a second task is told to commit its TA, but the partitioned TA may commit later. The core MapReduce commit protocols say "exactly one of the TAs shall have its output committed" but don't guarantee it is the second one. > Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage > when committed file not consistent with task status > -- > > Key: SPARK-48292 > URL: https://issues.apache.org/jira/browse/SPARK-48292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: L. C. Hsieh >Assignee: angerszhu >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > When a task attempt fails but is authorized to do the task commit, > OutputCommitCoordinator will fail the stage with a reason message saying that > the task commit succeeded, but actually the driver never knows whether a task > commit is successful or not. We should update the reason message to make it > less confusing. > See https://github.com/apache/spark/pull/36564#discussion_r1598660630 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
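A toy sketch of the bookkeeping under discussion (hypothetical, not Spark's actual OutputCommitCoordinator): an authorized attempt that never reports failure keeps its grant, which is exactly why the partitioned-TA scenario above is awkward.

{code:scala}
import scala.collection.mutable

// Hypothetical coordinator: at most one authorized attempt per partition.
class ToyCommitCoordinator {
  private val authorized = mutable.Map[Int, Int]() // partition -> attempt id

  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    authorized.get(partition) match {
      case Some(winner) => winner == attempt // a silent (partitioned) winner keeps the grant
      case None         => authorized(partition) = attempt; true
    }
  }

  // A second attempt can only be authorized after the first is *known* to have
  // failed; a network-partitioned attempt never reaches this path, so the
  // protocol cannot promise that the second attempt is the one that commits.
  def attemptFailed(partition: Int, attempt: Int): Unit = synchronized {
    if (authorized.get(partition).contains(attempt)) authorized.remove(partition)
  }
}
{code}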
[jira] [Created] (SPARK-48936) Make spark-shell work with Spark Connect
Hyukjin Kwon created SPARK-48936: Summary: Make spark-shell work with Spark Connect Key: SPARK-48936 URL: https://issues.apache.org/jira/browse/SPARK-48936 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Hyukjin Kwon `bin/pyspark --remote` works but `bin/spark-shell --remote` does not work. We should make it work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
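For comparison, the Spark Connect Scala client can already be driven programmatically; presumably the goal is for spark-shell to set this up automatically. A sketch only; the endpoint address is an assumption:

{code:scala}
// Hypothetical: roughly what `bin/spark-shell --remote sc://localhost:15002`
// would amount to — a session built against a Spark Connect endpoint.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://localhost:15002") // connection string is illustrative
  .getOrCreate()

spark.range(5).show()
{code}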
[jira] [Updated] (SPARK-48936) Make spark-shell work with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48936: --- Labels: pull-request-available (was: ) > Make spark-shell work with Spark Connect > -- > > Key: SPARK-48936 > URL: https://issues.apache.org/jira/browse/SPARK-48936 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > > `bin/pyspark --remote` works but `bin/spark-shell --remote` does not work. We > should make it work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48937) Fix collation support for the StringToMap expression
Uroš Bojanić created SPARK-48937: Summary: Fix collation support for the StringToMap expression Key: SPARK-48937 URL: https://issues.apache.org/jira/browse/SPARK-48937 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48937: - Description: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: ). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. > Fix collation support for the StringToMap expression > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI > collation. The correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: ). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
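To make the gap in Example 2 concrete, a self-contained sketch of the expected UTF8_LCASE semantics. This is a hypothetical helper, not Spark's collation machinery; it approximates lowercase collation with a case-insensitive regex split:

{code:scala}
import java.util.regex.Pattern

// Hypothetical: UTF8_LCASE-style splitting compares delimiters
// case-insensitively instead of byte-for-byte.
def splitLcase(input: String, delim: String): Seq[String] =
  input.split("(?i)" + Pattern.quote(delim)).toSeq

def strToMapLcase(text: String, pairDelim: String, kvDelim: String): Map[String, String] =
  splitLcase(text, pairDelim).map { pair =>
    splitLcase(pair, kvDelim) match {
      case Seq(k, v) => k -> v
      case _         => sys.error(s"malformed pair: $pair")
    }
  }.toMap

// strToMapLcase("ay1xby2xcy3", "X", "Y")
//   == Map("a" -> "1", "b" -> "2", "c" -> "3"), matching Example 2's expectation.
{code}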
[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48937: - Description: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. was: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. > Fix collation support for the StringToMap expression > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and testing.
[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48937: - Description: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. was: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: ). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. > Fix collation support for the StringToMap expression > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and testing.
[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48937: - Description: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. was: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm what the expected behaviour for this function is when given collated strings, and then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, this expression is currently implemented as a pass-through function, which is wrong because it doesn't provide appropriate collation awareness for non-default delimiters. Example 1. {code:java} SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} This query will give the correct result, regardless of the collation. {code:java} {"a":"1","b":"2","c":"3"}{code} Example 2. {code:java} SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} This query will give the *incorrect* result under UTF8_LCASE or UNICODE_CI collation. The correct result should be: {code:java} {"a":"1","b":"2","c":"3"}{code} Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementation of similar functions within other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. The goal for this Jira ticket is to implement the *StringToMap* expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. > Fix collation support for the StringToMap expression > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and testing.
[jira] [Updated] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-48937: - Summary: Fix collation support for the StringToMap expression (binary & lowercase collation only) (was: Fix collation support for the StringToMap expression) > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867003#comment-17867003 ] Uroš Bojanić commented on SPARK-48937: -- [~psyren99] Here is an open ticket within the collation effort; please let me know whether you want to take this one. The scope of this ticket is relatively small (~1 day of work), so a reasonable deadline for finishing this task would be 6 days from now (Wednesday 24 Jul 2024). > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48338) Sql Scripting support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-48338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48338: --- Labels: pull-request-available (was: ) > Sql Scripting support for Spark SQL > --- > > Key: SPARK-48338 > URL: https://issues.apache.org/jira/browse/SPARK-48338 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Attachments: Sql Scripting - OSS.odt, [Design Doc] Sql Scripting - > OSS.pdf > > > The design doc for this feature is attached. > High-level example of a SQL script: > ``` > BEGIN > DECLARE c INT = 10; > WHILE c > 0 DO > INSERT INTO tscript VALUES (c); > SET c = c - 1; > END WHILE; > END > ``` > High-level motivation behind this feature: > SQL Scripting gives customers the ability to develop complex ETL and analysis > entirely in SQL. Until now, customers have had to write verbose SQL > statements or combine SQL + Python to efficiently write business logic. > Coming from another system, customers have to choose whether or not they want > to migrate to pyspark. Some customers end up not using Spark because of this > gap. SQL Scripting is a key milestone towards enabling SQL practitioners to > write sophisticated queries, without the need to use pyspark. Further, SQL > Scripting is a necessary step towards support for SQL Stored Procedures, and > along with SQL Variables (released) and Temp Tables (in progress), will allow > for more seamless data warehouse migrations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
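Assuming scripts will be submitted like any other statement (a sketch only; the exact entry point isn't specified in this thread, and `spark` is a SparkSession as in spark-shell), the example script above would run as:

{code:scala}
// Hypothetical: submit the whole script as a single statement.
spark.sql("""
  BEGIN
    DECLARE c INT = 10;
    WHILE c > 0 DO
      INSERT INTO tscript VALUES (c);
      SET c = c - 1;
    END WHILE;
  END
""")
{code}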
[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList
[ https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48791: -- Issue Type: Bug (was: Improvement) > Perf regression due to accumulator registration overhead using > CopyOnWriteArrayList > --- > > Key: SPARK-48791 > URL: https://issues.apache.org/jira/browse/SPARK-48791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > We noticed a query perf regression and located the root cause: the overhead > introduced when registering accumulators using CopyOnWriteArrayList. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList
[ https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48791: -- Fix Version/s: 3.5.3 > Perf regression due to accumulator registration overhead using > CopyOnWriteArrayList > --- > > Key: SPARK-48791 > URL: https://issues.apache.org/jira/browse/SPARK-48791 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > We noticed a query perf regression and located the root cause: the overhead > introduced when registering accumulators using CopyOnWriteArrayList. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList
[ https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867038#comment-17867038 ] Dongjoon Hyun commented on SPARK-48791: --- I added a fix version, 3.5.3, for now because it arrives after the RC1 tagging. > Perf regression due to accumulator registration overhead using > CopyOnWriteArrayList > --- > > Key: SPARK-48791 > URL: https://issues.apache.org/jira/browse/SPARK-48791 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > We noticed a query perf regression and located the root cause: the overhead > introduced when registering accumulators using CopyOnWriteArrayList. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
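For context on the root cause (a sketch, not Spark code): CopyOnWriteArrayList copies its entire backing array on every add, so registering n accumulators one by one costs O(n²) copying overall, which a plain ArrayList avoids. A minimal single-threaded comparison:

{code:scala}
import java.util.concurrent.CopyOnWriteArrayList

def timed(label: String)(body: => Unit): Unit = {
  val t0 = System.nanoTime()
  body
  println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
}

val n = 100000
timed("CopyOnWriteArrayList") { // each add() copies the full array: O(n^2) total
  val xs = new CopyOnWriteArrayList[Integer]()
  var i = 0; while (i < n) { xs.add(i); i += 1 }
}
timed("ArrayList") {            // amortized O(1) appends
  val xs = new java.util.ArrayList[Integer]()
  var i = 0; while (i < n) { xs.add(i); i += 1 }
}
{code}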
[jira] [Resolved] (SPARK-48890) Add Streaming related fields to log4j ThreadContext
[ https://issues.apache.org/jira/browse/SPARK-48890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-48890. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47340 [https://github.com/apache/spark/pull/47340] > Add Streaming related fields to log4j ThreadContext > --- > > Key: SPARK-48890 > URL: https://issues.apache.org/jira/browse/SPARK-48890 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > There is some special information needed for structured streaming queries. > Specifically, each query has a query_id and run_id. Also, if using > MicroBatchExecution (the default), there is a batch_id. > > A (query_id, run_id, batch_id) tuple identifies the microbatch that a streaming > query runs. Adding these fields to the ThreadContext would help, especially > when there are multiple queries running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
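A sketch of what such tagging looks like with log4j2's ThreadContext API. The field names come from the description; the wrapper itself is illustrative, not the merged patch:

{code:scala}
import org.apache.logging.log4j.ThreadContext

// Tag all log lines emitted by `body` with the microbatch identity.
def withStreamingLogContext[T](queryId: String, runId: String, batchId: Long)(body: => T): T = {
  ThreadContext.put("query_id", queryId)
  ThreadContext.put("run_id", runId)
  ThreadContext.put("batch_id", batchId.toString)
  try body
  finally {
    // Remove the fields so a reused thread doesn't leak another query's identity.
    ThreadContext.remove("query_id")
    ThreadContext.remove("run_id")
    ThreadContext.remove("batch_id")
  }
}
{code}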
[jira] [Assigned] (SPARK-48890) Add Streaming related fields to log4j ThreadContext
[ https://issues.apache.org/jira/browse/SPARK-48890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-48890: -- Assignee: Wei Liu > Add Streaming related fields to log4j ThreadContext > --- > > Key: SPARK-48890 > URL: https://issues.apache.org/jira/browse/SPARK-48890 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SS >Affects Versions: 4.0.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Labels: pull-request-available > > There is some special information needed for structured streaming queries. > Specifically, each query has a query_id and run_id. Also, if using > MicroBatchExecution (the default), there is a batch_id. > > A (query_id, run_id, batch_id) tuple identifies the microbatch that a streaming > query runs. Adding these fields to the ThreadContext would help, especially > when there are multiple queries running. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48929) View fails with internal error after upgrade causes expected syntax error.
[ https://issues.apache.org/jira/browse/SPARK-48929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48929: --- Labels: pull-request-available (was: ) > View fails with internal error after upgrade causes expected syntax error. > -- > > Key: SPARK-48929 > URL: https://issues.apache.org/jira/browse/SPARK-48929 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > On older Spark: > CREATE VIEW v AS SELECT 1 ! IN (2); > SELECT * FROM v; > => true > Upgrade to Spark 4 > SELECT * FROM v; > Internal error > This makes it hard to debug the problem. > Rather than assuming that failure to parse a view's text is an internal error, > we should assume that something like an upgrade broke it, and expose the actual > error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
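A repro sketch of the scenario (hypothetical session, spanning two Spark versions; `spark` is a SparkSession):

{code:scala}
// On the older Spark, the parser accepted `!` as NOT here,
// so 1 ! IN (2) meant 1 NOT IN (2):
spark.sql("CREATE VIEW v AS SELECT 1 ! IN (2)")
spark.sql("SELECT * FROM v").show() // true

// After upgrading to Spark 4, re-parsing the stored view text fails.
// The user should see the underlying syntax error, not an internal error:
spark.sql("SELECT * FROM v").show()
{code}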
[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867106#comment-17867106 ] psyren99 commented on SPARK-48937: -- Yes, I do it > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867106#comment-17867106 ] psyren99 edited comment on SPARK-48937 at 7/18/24 8:30 PM: --- [~uros-db] Yes, I do it was (Author: JIRAUSER306095): Yes, I do it > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867106#comment-17867106 ] psyren99 edited comment on SPARK-48937 at 7/18/24 8:31 PM: --- [~uros-db] Yes, I'll do it was (Author: JIRAUSER306095): [~uros-db] Yes, I do it > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48937) Fix collation support for the StringToMap expression (binary & lowercase collation only)
[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867108#comment-17867108 ] Uroš Bojanić commented on SPARK-48937: -- Ack. Feel free to ping me for review when you open a PR for this, and let me know if you have any questions. Happy coding! > Fix collation support for the StringToMap expression (binary & lowercase > collation only) > > > Key: SPARK-48937 > URL: https://issues.apache.org/jira/browse/SPARK-48937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringToMap* built-in string function in Spark > ({*}str_to_map{*}). First confirm what the expected behaviour for this > function is when given collated strings, and then move on to implementation and > testing. You will find this expression in the *complexTypeCreator.scala* > file. However, this expression is currently implemented as a pass-through > function, which is wrong because it doesn't provide appropriate collation > awareness for non-default delimiters. > > Example 1. > {code:java} > SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code} > This query will give the correct result, regardless of the collation. > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Example 2. > {code:java} > SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code} > This query will give the *incorrect* result under UTF8_LCASE collation. The > correct result should be: > {code:java} > {"a":"1","b":"2","c":"3"}{code} > > Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to > reflect how this function should be used with collation in Spark SQL, and feel > free to use your chosen Spark SQL editor to experiment with the existing > functions to learn more about how they work. In addition, look into the > possible use cases and implementation of similar functions within other > open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringToMap* expression so > that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. > StringTypeBinaryLcase). To understand what changes were introduced in order > to enable full collation support for other existing functions in Spark, take > a look at the related Spark PRs and Jira tickets for completed tasks in this > parent (for example: https://issues.apache.org/jira/browse/SPARK-47414). > > Read more about ICU [Collation Concepts|http://example.com/] and the > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for string > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48495) Document planned approach to shredding
[ https://issues.apache.org/jira/browse/SPARK-48495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867126#comment-17867126 ] Russell Spitzer commented on SPARK-48495: - This was merged with a bug in the table markdown; I've added a small fix PR here: https://github.com/apache/spark/pull/47407 > Document planned approach to shredding > -- > > Key: SPARK-48495 > URL: https://issues.apache.org/jira/browse/SPARK-48495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: David Cashman >Assignee: David Cashman >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-48495) Document planned approach to shredding
[ https://issues.apache.org/jira/browse/SPARK-48495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867126#comment-17867126 ] Russell Spitzer edited comment on SPARK-48495 at 7/18/24 10:05 PM: --- This was merged with a bug in the table markdown; I've added a small fix PR here: https://github.com/apache/spark/pull/47407 [~David Cashman] + [~gurwls223] was (Author: rspitzer): This was merged with a bug in the table markdown; I've added a small fix PR here: https://github.com/apache/spark/pull/47407 > Document planned approach to shredding > -- > > Key: SPARK-48495 > URL: https://issues.apache.org/jira/browse/SPARK-48495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: David Cashman >Assignee: David Cashman >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48921) ScalaUDF in subquery should run through analyzer
[ https://issues.apache.org/jira/browse/SPARK-48921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48921. --- Fix Version/s: 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 47406 [https://github.com/apache/spark/pull/47406] > ScalaUDF in subquery should run through analyzer > > > Key: SPARK-48921 > URL: https://issues.apache.org/jira/browse/SPARK-48921 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Labels: pull-request-available > Fix For: 3.5.2, 4.0.0 > > > We got a customer issue where a `MergeInto` query on an Iceberg table worked > on earlier versions but fails after upgrading to Spark 3.4. > The error looks like > ``` > Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > nullable on unresolved object > upcast(getcolumnbyordinal(0, StringType), StringType, - root class: > java.lang.String).toString. > ``` > The source table of `MergeInto` uses `ScalaUDF`. The error happens when Spark > invokes the deserializer of the input encoder of the `ScalaUDF` while the > deserializer is not resolved yet. > The encoders of `ScalaUDF` are resolved by the rule `ResolveEncodersInUDF`, > which is applied at the end of the analysis phase. > During the rewrite of `MergeInto` to a `ReplaceData` query, Spark creates an > `Exists` subquery and the `ScalaUDF` is part of the plan of the subquery. Note > that the `ScalaUDF` is already resolved by the analyzer. > Then, the `ResolveSubquery` rule resolves the subquery plan if it is not > resolved yet. Because the subquery containing the `ScalaUDF` is already > resolved, the rule skips it, so `ResolveEncodersInUDF` is never applied to it. > So the analyzed `ReplaceData` query contains a `ScalaUDF` with unresolved > encoders, which causes the error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
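For illustration, a simplified sketch of the failing shape: a ScalaUDF inside an EXISTS subquery. The table names and UDF here are hypothetical, and the customer case went through the MergeInto-to-ReplaceData rewrite, so this standalone snippet may not trigger the bug by itself.

{code:java}
// spark-shell sketch: the UDF's encoders must be resolved by
// ResolveEncodersInUDF; if the analyzer skips an already-resolved subquery
// plan, decoding fails at runtime with the UnresolvedException above.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val normalize = udf((s: String) => s.trim.toLowerCase)  // hypothetical UDF
spark.udf.register("normalize", normalize)
Seq("a", "b").toDF("x1").createOrReplaceTempView("src")
Seq(" A ").toDF("y1").createOrReplaceTempView("tgt")
spark.sql(
  "SELECT * FROM src WHERE EXISTS (SELECT 1 FROM tgt WHERE normalize(tgt.y1) = src.x1)"
).show()
{code}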
[jira] [Updated] (SPARK-48921) ScalaUDF in subquery should run through analyzer
[ https://issues.apache.org/jira/browse/SPARK-48921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48921: -- Fix Version/s: 3.5.3 (was: 3.5.2) > ScalaUDF in subquery should run through analyzer > > > Key: SPARK-48921 > URL: https://issues.apache.org/jira/browse/SPARK-48921 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > We got a customer issue where a `MergeInto` query on an Iceberg table worked > on earlier versions but fails after upgrading to Spark 3.4. > The error looks like > ``` > Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > nullable on unresolved object > upcast(getcolumnbyordinal(0, StringType), StringType, - root class: > java.lang.String).toString. > ``` > The source table of `MergeInto` uses `ScalaUDF`. The error happens when Spark > invokes the deserializer of the input encoder of the `ScalaUDF` while the > deserializer is not resolved yet. > The encoders of `ScalaUDF` are resolved by the rule `ResolveEncodersInUDF`, > which is applied at the end of the analysis phase. > During the rewrite of `MergeInto` to a `ReplaceData` query, Spark creates an > `Exists` subquery and the `ScalaUDF` is part of the plan of the subquery. Note > that the `ScalaUDF` is already resolved by the analyzer. > Then, the `ResolveSubquery` rule resolves the subquery plan if it is not > resolved yet. Because the subquery containing the `ScalaUDF` is already > resolved, the rule skips it, so `ResolveEncodersInUDF` is never applied to it. > So the analyzed `ReplaceData` query contains a `ScalaUDF` with unresolved > encoders, which causes the error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867132#comment-17867132 ] Andy Lam commented on SPARK-46446: -- [~cloud_fan] Could we unresolve this ticket or create a new one? Decorrelation of subqueries with correlation under LIMIT with OFFSET hasn't been fixed, just disabled. > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > was not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) > It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden-file results for > the test were actually incorrect and we didn't notice. (The bug was initially > found by https://github.com/apache/spark/pull/44084) > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
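For intuition about the first work item (rewriting a correlated LIMIT/OFFSET into a row_number filter), here is a conceptual sketch against the repro tables above. This is not the actual DecorrelateInnerQuery output, and an explicit ORDER BY is added because LIMIT/OFFSET without one is nondeterministic.

{code:java}
// Per y1 group, the row with row_number 3 corresponds to LIMIT 1 OFFSET 2.
// No group in y has a third row, so the result is empty, matching PostgreSQL.
spark.sql("""
  SELECT x.* FROM x
  WHERE EXISTS (
    SELECT 1 FROM (
      SELECT y1, row_number() OVER (PARTITION BY y1 ORDER BY y2) AS rn FROM y
    ) ranked
    WHERE ranked.y1 = x.x1 AND ranked.rn > 2 AND ranked.rn <= 3
  )
""").show()
{code}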
[jira] [Created] (SPARK-48938) Improve error message when registering UDTFs
Allison Wang created SPARK-48938: Summary: Improve error message when registering UDTFs Key: SPARK-48938 URL: https://issues.apache.org/jira/browse/SPARK-48938 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Improve the error message when registering Python UDTFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48938) Improve error message when registering UDTFs
[ https://issues.apache.org/jira/browse/SPARK-48938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48938: --- Labels: pull-request-available (was: ) > Improve error message when registering UDTFs > > > Key: SPARK-48938 > URL: https://issues.apache.org/jira/browse/SPARK-48938 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Improve the error message when registering Python UDTFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48939) Support recursive reference of Avro schema
Yuchen Liu created SPARK-48939: -- Summary: Support recursive reference of Avro schema Key: SPARK-48939 URL: https://issues.apache.org/jira/browse/SPARK-48939 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 4.0.0 Reporter: Yuchen Liu We should support reading Avro messages with recursive references in the schema. A recursive reference denotes the case where the type of a field has already been defined in one of its parent nodes. A simple example is: {code:java} { "type": "record", "name": "LongList", "fields" : [ {"name": "value", "type": "long"}, {"name": "next", "type": ["null", "LongList"]} ] } {code} This is written in the Avro schema DSL and represents a linked-list data structure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
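For intuition, the LongList record above is self-referential; a rough Scala analogue (illustration only, not a Spark API) is:

{code:java}
// The Avro `next` field refers back to the record being defined,
// i.e. a singly linked list:
case class LongList(value: Long, next: Option[LongList])

val list = LongList(1L, Some(LongList(2L, None)))  // 1 -> 2
{code}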
[jira] [Updated] (SPARK-48940) Upgrade `Arrow` to 17.0.0
[ https://issues.apache.org/jira/browse/SPARK-48940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48940: --- Labels: pull-request-available (was: ) > Upgrade `Arrow` to 17.0.0 > - > > Key: SPARK-48940 > URL: https://issues.apache.org/jira/browse/SPARK-48940 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState
[ https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48934: Assignee: Siying Dong > Python datetime types converted incorrectly for setting timeout in > applyInPandasWithState > - > > Key: SPARK-48934 > URL: https://issues.apache.org/jira/browse/SPARK-48934 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > > In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a > datetime.datetime value, it doesn't function as expected. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState
[ https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48934. -- Fix Version/s: 4.0.0 3.4.4 3.5.3 Resolution: Fixed Issue resolved via [https://github.com/apache/spark/pull/47398] > Python datetime types converted incorrectly for setting timeout in > applyInPandasWithState > - > > Key: SPARK-48934 > URL: https://issues.apache.org/jira/browse/SPARK-48934 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > > In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a > datetime.datetime value, it doesn't function as expected. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48388) [M0] Fix SET behavior for scripts
[ https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48388: --- Assignee: David Milicevic > [M0] Fix SET behavior for scripts > - > > Key: SPARK-48388 > URL: https://issues.apache.org/jira/browse/SPARK-48388 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Assignee: David Milicevic >Priority: Major > Labels: pull-request-available > > By the standard, SET is used to set variable values in SQL scripts. > On our end, SET is configured to work with some Hive configs, so the grammar > is a bit messed up; for that reason it was decided to use SET VAR instead > of SET to work with SQL variables. > This is not standard, and we should figure out a way to use > SET for SQL variables while forbidding the setting of Hive configs from SQL scripts. > > For more details, see the design doc in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48388) [M0] Fix SET behavior for scripts
[ https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48388. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47272 [https://github.com/apache/spark/pull/47272] > [M0] Fix SET behavior for scripts > - > > Key: SPARK-48388 > URL: https://issues.apache.org/jira/browse/SPARK-48388 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: David Milicevic >Assignee: David Milicevic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > By the standard, SET is used to set variable values in SQL scripts. > On our end, SET is configured to work with some Hive configs, so the grammar > is a bit messed up; for that reason it was decided to use SET VAR instead > of SET to work with SQL variables. > This is not standard, and we should figure out a way to use > SET for SQL variables while forbidding the setting of Hive configs from SQL scripts. > > For more details, see the design doc in the parent Jira item. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
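As context for the description above, the current non-standard split can be sketched as follows, assuming the Spark 4.0 session-variable syntax (this shows today's behaviour, not the scripting behaviour the ticket introduces):

{code:java}
// spark-shell sketch: SQL variables need DECLARE / SET VAR, while plain
// SET targets configs. The ticket's goal is to let plain SET target
// variables inside SQL scripts and forbid Hive config assignment there.
spark.sql("DECLARE VARIABLE v INT DEFAULT 0")
spark.sql("SET VAR v = 1")                         // session variable
spark.sql("SET spark.sql.shuffle.partitions=200")  // config, not a variable
{code}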
[jira] [Updated] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState
[ https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-48934: - Fix Version/s: (was: 3.4.4) > Python datetime types converted incorrectly for setting timeout in > applyInPandasWithState > - > > Key: SPARK-48934 > URL: https://issues.apache.org/jira/browse/SPARK-48934 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a > datetime.datetime value, it doesn't function as expected. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48941) PySparkML: Replace RDD read / write API invocation with Dataframe read / write API
Weichen Xu created SPARK-48941: -- Summary: PySparkML: Replace RDD read / write API invocation with Dataframe read / write API Key: SPARK-48941 URL: https://issues.apache.org/jira/browse/SPARK-48941 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Weichen Xu PySparkML: Replace RDD read / write API invocation with Dataframe read / write API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48941) PySparkML: Replace RDD read / write API invocation with Dataframe read / write API
[ https://issues.apache.org/jira/browse/SPARK-48941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48941: --- Labels: pull-request-available (was: ) > PySparkML: Replace RDD read / write API invocation with Dataframe read / > write API > --- > > Key: SPARK-48941 > URL: https://issues.apache.org/jira/browse/SPARK-48941 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Weichen Xu >Priority: Major > Labels: pull-request-available > > PySparkML: Replace RDD read / write API invocation with Dataframe read / > write API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48933) Upgrade `protobuf-java` to `3.25.3`
[ https://issues.apache.org/jira/browse/SPARK-48933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48933. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47397 [https://github.com/apache/spark/pull/47397] > Upgrade `protobuf-java` to `3.25.3` > --- > > Key: SPARK-48933 > URL: https://issues.apache.org/jira/browse/SPARK-48933 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList
[ https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48791: -- Fix Version/s: 3.4.4 > Perf regression due to accumulator registration overhead using > CopyOnWriteArrayList > --- > > Key: SPARK-48791 > URL: https://issues.apache.org/jira/browse/SPARK-48791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > > We noticed a query perf regression and located the root cause: the overhead > introduced when registering accumulators using CopyOnWriteArrayList. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
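For intuition about the overhead, a generic illustration of the data structure's cost (not Spark's accumulator code):

{code:java}
import java.util.concurrent.CopyOnWriteArrayList

// CopyOnWriteArrayList copies its whole backing array on every add(), so
// registering n entries costs O(n^2) total work; with many accumulators
// this becomes a measurable regression.
val registry = new CopyOnWriteArrayList[String]()
(1 to 100000).foreach(i => registry.add(s"acc-$i"))
{code}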
[jira] [Created] (SPARK-48942) Reading parquet with Array of Structs of UDTs throws Exception
James Willis created SPARK-48942: Summary: Reading parquet with Array of Structs of UDTs throws Exception Key: SPARK-48942 URL: https://issues.apache.org/jira/browse/SPARK-48942 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 3.4.0 Reporter: James Willis When reading parquet files that have a column that is an Array of Structs where the Struct contains a UDT, the read fails. We have repro'd this in 3.4.0 and 3.5.1. I see there might be some related work on supporting UDTs in Parquet: https://issues.apache.org/jira/browse/SPARK-39086 I discovered this when using Apache Sedona; I'm trying to create a UDT to repro this without Sedona. Here is my minimal reproducible example: {code:java} sedona.sql(""" SELECT ARRAY(STRUCT(ST_POINT(1.0, 1.1))) """).write.mode("overwrite").format("parquet").save(test_read_write_path) df = sedona.read.format("parquet").load(test_read_write_path) df.show() {code} This gives the following stack trace: {code:java} 24/07/19 03:31:23 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 24) (10.0.133.128 executor 1): java.lang.IllegalArgumentException: Spark type: StructType(StructField(col1,BinaryType,true)) doesn't match the type: StructType(StructField(col1,org.apache.spark.sql.sedona_sql.UDT.GeometryUDT@5b138d06,true)) in column vector at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:69) at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:116) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:272) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:292) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:286) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:228) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:290) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) 24/07/19 03:31:23 WARN TaskSetManager: Lost task 0.1 in stage 22.0 (TID 25) (10.0.183.100 executor 5): java.lang.IllegalArgumentException: Spark type: StructType(StructField(col1,BinaryType,true)) doesn't match the type: StructType(StructField(col1,org.apache.spark.sql.sedona_sql.UDT.GeometryUDT@239780ef,true)) in column vector at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:69) at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:116) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(Vector
[jira] [Commented] (SPARK-48934) Python datetime types converted incorrectly for setting timeout in applyInPandasWithState
[ https://issues.apache.org/jira/browse/SPARK-48934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867173#comment-17867173 ] Kent Yao commented on SPARK-48934: -- Collected this to 3.5.2 > Python datetime types converted incorrectly for setting timeout in > applyInPandasWithState > - > > Key: SPARK-48934 > URL: https://issues.apache.org/jira/browse/SPARK-48934 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > > In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a > datetime.datetime value, it doesn't function as expected. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList
[ https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867175#comment-17867175 ] Kent Yao commented on SPARK-48791: -- Collected to 3.5.2 > Perf regression due to accumulator registration overhead using > CopyOnWriteArrayList > --- > > Key: SPARK-48791 > URL: https://issues.apache.org/jira/browse/SPARK-48791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > We noticed a query perf regression and located the root cause: the overhead > introduced when registering accumulators using CopyOnWriteArrayList. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48791) Perf regression due to accumulator registration overhead using CopyOnWriteArrayList
[ https://issues.apache.org/jira/browse/SPARK-48791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48791: - Fix Version/s: 3.5.2 (was: 3.5.3) > Perf regression due to accumulator registration overhead using > CopyOnWriteArrayList > --- > > Key: SPARK-48791 > URL: https://issues.apache.org/jira/browse/SPARK-48791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > We noticed a query perf regression and located the root cause: the overhead > introduced when registering accumulators using CopyOnWriteArrayList. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48921) ScalaUDF in subquery should run through analyzer
[ https://issues.apache.org/jira/browse/SPARK-48921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48921: - Fix Version/s: 3.5.2 (was: 3.5.3) > ScalaUDF in subquery should run through analyzer > > > Key: SPARK-48921 > URL: https://issues.apache.org/jira/browse/SPARK-48921 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > > We got a customer issue where a `MergeInto` query on an Iceberg table worked > on earlier versions but fails after upgrading to Spark 3.4. > The error looks like > ``` > Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > nullable on unresolved object > upcast(getcolumnbyordinal(0, StringType), StringType, - root class: > java.lang.String).toString. > ``` > The source table of `MergeInto` uses `ScalaUDF`. The error happens when Spark > invokes the deserializer of the input encoder of the `ScalaUDF` while the > deserializer is not resolved yet. > The encoders of `ScalaUDF` are resolved by the rule `ResolveEncodersInUDF`, > which is applied at the end of the analysis phase. > During the rewrite of `MergeInto` to a `ReplaceData` query, Spark creates an > `Exists` subquery and the `ScalaUDF` is part of the plan of the subquery. Note > that the `ScalaUDF` is already resolved by the analyzer. > Then, the `ResolveSubquery` rule resolves the subquery plan if it is not > resolved yet. Because the subquery containing the `ScalaUDF` is already > resolved, the rule skips it, so `ResolveEncodersInUDF` is never applied to it. > So the analyzed `ReplaceData` query contains a `ScalaUDF` with unresolved > encoders, which causes the error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org