[jira] [Created] (SPARK-48556) Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION
Nikola Mandic created SPARK-48556: - Summary: Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION Key: SPARK-48556 URL: https://issues.apache.org/jira/browse/SPARK-48556 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic The following sequence of queries produces an UNSUPPORTED_GROUPING_EXPRESSION error: {code:java} create table t1(a int, b int) using parquet; select grouping(a), dummy from t1 group by a with rollup; {code} However, the appropriate error should instead point the user to the invalid dummy column name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
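The intended resolution order can be illustrated with a minimal sketch (a hypothetical helper, not Spark's actual analyzer): unresolved select-list names should be reported before grouping-expression checks run, so the error points at dummy rather than at grouping(a).

```python
# Hypothetical sketch, not Spark code: detect unresolved select-list names
# before applying grouping-expression checks, so the reported error points
# at the unknown column instead of UNSUPPORTED_GROUPING_EXPRESSION.
def first_unresolved(select_names, table_columns):
    """Return the first select-list name not present in the table, or None."""
    for name in select_names:
        if name not in table_columns:
            return name
    return None
```

For the query above, checking the plain column names first would flag dummy before any grouping analysis runs.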
[jira] [Created] (SPARK-48430) Fix map value extraction when map contains collated strings
Nikola Mandic created SPARK-48430: - Summary: Fix map value extraction when map contains collated strings Key: SPARK-48430 URL: https://issues.apache.org/jira/browse/SPARK-48430 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic The following queries return unexpected results: {code:java} select collation(map('a', 'b' collate utf8_binary_lcase)['a']); select collation(element_at(map('a', 'b' collate utf8_binary_lcase), 'a'));{code} Both return UTF8_BINARY instead of UTF8_BINARY_LCASE.
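The expected behavior can be sketched in miniature (an illustration of the semantics, not Spark internals): the value extracted from a map should keep the collation of the map's value type instead of falling back to the default.

```python
# Illustrative sketch, not Spark internals: a string value tagged with its
# collation. Extraction from a map should preserve the tag instead of
# falling back to the default UTF8_BINARY.
class CollatedStr:
    def __init__(self, value, collation="UTF8_BINARY"):
        self.value = value
        self.collation = collation

def element_at(m, key):
    # Correct behavior: return the stored value unchanged, tag included.
    return m[key]

m = {"a": CollatedStr("b", "UTF8_BINARY_LCASE")}
```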
[jira] [Updated] (SPARK-48413) ALTER COLUMN with collation
[ https://issues.apache.org/jira/browse/SPARK-48413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-48413: -- Epic Link: SPARK-46830 > ALTER COLUMN with collation > --- > > Key: SPARK-48413 > URL: https://issues.apache.org/jira/browse/SPARK-48413 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > Add support for changing the collation of a column with the ALTER COLUMN command.
[jira] [Created] (SPARK-48413) ALTER COLUMN with collation
Nikola Mandic created SPARK-48413: - Summary: ALTER COLUMN with collation Key: SPARK-48413 URL: https://issues.apache.org/jira/browse/SPARK-48413 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic Add support for changing the collation of a column with the ALTER COLUMN command.
[jira] [Created] (SPARK-48273) Late rewrite of PlanWithUnresolvedIdentifier
Nikola Mandic created SPARK-48273: - Summary: Late rewrite of PlanWithUnresolvedIdentifier Key: SPARK-48273 URL: https://issues.apache.org/jira/browse/SPARK-48273 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic PlanWithUnresolvedIdentifier is rewritten late in analysis, which causes rules like SubstituteUnresolvedOrdinals to miss the new plan. This causes the following queries to fail: {code:java} create temporary view identifier('v1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1); -- cache table identifier('t1') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1); -- create table identifier('t2') as (select my_col from (values (1), (2), (1) as (my_col)) group by 1); insert into identifier('t2') select my_col from (values (3) as (my_col)) group by 1; {code}
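The rule that gets skipped can be modeled with a small sketch (hypothetical, not the actual SubstituteUnresolvedOrdinals implementation): GROUP BY ordinals are replaced by the corresponding select-list expressions, and a plan produced only after this rule has already run never receives the substitution.

```python
# Simplified model of ordinal substitution (hypothetical, not Spark code):
# GROUP BY 1 is rewritten to the first select-list expression. A plan
# materialized after this rule has run keeps the raw, unresolved ordinal.
def substitute_ordinals(group_by, select_list):
    out = []
    for item in group_by:
        if isinstance(item, int):
            out.append(select_list[item - 1])  # ordinals are 1-based
        else:
            out.append(item)
    return out
```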
[jira] [Updated] (SPARK-46841) Language support for collations
[ https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-46841: -- Description: Languages and localization for collations are supported by the ICU library. The collation naming format is as follows: {code:java} <2-letter language code>[_<4-letter script>][_<3-letter country code>][_specifier_specifier...]{code} The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant we introduce a golden file with the locale table, which should cause a CI failure on any silent changes. Currently supported optional specifiers: * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels * AS/AI - accent sensitivity; default is accent-sensitive; supported by configuring ICU collation levels * /LCASE/UCASE - case conversion performed prior to comparisons; supported by internal implementation relying on ICU locale-aware conversions Users can use collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory. was: Languages and localization for collations are supported by the ICU library. The collation naming format is as follows: {code:java} <2-letter language code>__[_specifier_specifier...]{code} The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.
Currently supported optional specifiers: * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels * AS/AI - accent sensitivity; default is accent-sensitive; supported by configuring ICU collation levels * /LCASE/UCASE - case conversion performed prior to comparisons; supported by internal implementation relying on ICU locale-aware conversions Users can use collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory. > Language support for collations > --- > > Key: SPARK-46841 > URL: https://issues.apache.org/jira/browse/SPARK-46841 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > > Languages and localization for collations are supported by the ICU library. > The collation naming format is as follows: > {code:java} > <2-letter language code>[_<4-letter script>][_<3-letter country > code>][_specifier_specifier...]{code} > The locale specifier consists of the first part of the collation name (language + > script + country). Locale specifiers need to be stable across ICU versions; > to keep existing ids and names invariant we introduce a golden file with the locale > table, which should cause a CI failure on any silent changes. > Currently supported optional specifiers: > * CS/CI - case sensitivity, default is case-sensitive; supported by > configuring ICU collation levels > * AS/AI - accent sensitivity; default is accent-sensitive; supported by > configuring ICU collation levels > * /LCASE/UCASE - case conversion performed prior to > comparisons; supported by internal implementation relying on ICU locale-aware > conversions > Users can use collation specifiers in any order, except for the locale, which is > mandatory and must come first.
There is a one-to-one mapping between collation > ids and collation names defined in CollationFactory.
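The naming scheme described above can be sketched as a small parser (an illustration under the stated format, not the CollationFactory implementation; the specifier set is assumed from the description):

```python
# Illustrative parser for the collation naming scheme described above
# (assumption: parts are underscore-separated, the locale comes first,
# and trailing parts drawn from the known specifier set are specifiers).
def parse_collation_name(name):
    parts = name.split("_")
    known = {"CS", "CI", "AS", "AI", "LCASE", "UCASE"}
    i = len(parts)
    # Walk back over trailing specifiers; what remains is the locale.
    while i > 0 and parts[i - 1].upper() in known:
        i -= 1
    locale = "_".join(parts[:i])
    specifiers = [p.upper() for p in parts[i:]]
    return locale, specifiers
```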
[jira] [Updated] (SPARK-46841) Language support for collations
[ https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-46841: -- Description: Languages and localization for collations are supported by the ICU library. The collation naming format is as follows: {code:java} <2-letter language code>__[_specifier_specifier...]{code} The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant we introduce a golden file with the locale table, which should cause a CI failure on any silent changes. Currently supported optional specifiers: * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels * AS/AI - accent sensitivity; default is accent-sensitive; supported by configuring ICU collation levels * /LCASE/UCASE - case conversion performed prior to comparisons; supported by internal implementation relying on ICU locale-aware conversions Users can use collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory. was: Languages and localization for collations are supported by the ICU library. The collation naming format is as follows: {code:java} <2-letter language code>__<3-letter country code>[_specifier_specifier...]{code} The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.
Currently supported optional specifiers: * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels * AS/AI - accent sensitivity; default is accent-sensitive; supported by configuring ICU collation levels * /LCASE/UCASE - case conversion performed prior to comparisons; supported by internal implementation relying on ICU locale-aware conversions Users can use collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory. > Language support for collations > --- > > Key: SPARK-46841 > URL: https://issues.apache.org/jira/browse/SPARK-46841 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > > Languages and localization for collations are supported by the ICU library. > The collation naming format is as follows: > {code:java} > <2-letter language code>__ country code>[_specifier_specifier...]{code} > The locale specifier consists of the first part of the collation name (language + > script + country). Locale specifiers need to be stable across ICU versions; > to keep existing ids and names invariant we introduce a golden file with the locale > table, which should cause a CI failure on any silent changes. > Currently supported optional specifiers: > * CS/CI - case sensitivity, default is case-sensitive; supported by > configuring ICU collation levels > * AS/AI - accent sensitivity; default is accent-sensitive; supported by > configuring ICU collation levels > * /LCASE/UCASE - case conversion performed prior to > comparisons; supported by internal implementation relying on ICU locale-aware > conversions > Users can use collation specifiers in any order, except for the locale, which is > mandatory and must come first. There is a one-to-one mapping between collation > ids and collation names defined in CollationFactory.
[jira] [Updated] (SPARK-46841) Language support for collations
[ https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-46841: -- Component/s: SQL Description: Languages and localization for collations are supported by the ICU library. The collation naming format is as follows: {code:java} <2-letter language code>__<3-letter country code>[_specifier_specifier...]{code} The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant we introduce a golden file with the locale table, which should cause a CI failure on any silent changes. Currently supported optional specifiers: * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels * AS/AI - accent sensitivity; default is accent-sensitive; supported by configuring ICU collation levels * /LCASE/UCASE - case conversion performed prior to comparisons; supported by internal implementation relying on ICU locale-aware conversions Users can use collation specifiers in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory. > Language support for collations > --- > > Key: SPARK-46841 > URL: https://issues.apache.org/jira/browse/SPARK-46841 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > > Languages and localization for collations are supported by the ICU library. > The collation naming format is as follows: > {code:java} > <2-letter language code>__<3-letter country > code>[_specifier_specifier...]{code} > The locale specifier consists of the first part of the collation name (language + > script + country).
Locale specifiers need to be stable across ICU versions; > to keep existing ids and names invariant we introduce a golden file with the locale > table, which should cause a CI failure on any silent changes. > Currently supported optional specifiers: > * CS/CI - case sensitivity, default is case-sensitive; supported by > configuring ICU collation levels > * AS/AI - accent sensitivity; default is accent-sensitive; supported by > configuring ICU collation levels > * /LCASE/UCASE - case conversion performed prior to > comparisons; supported by internal implementation relying on ICU locale-aware > conversions > Users can use collation specifiers in any order, except for the locale, which is > mandatory and must come first. There is a one-to-one mapping between collation > ids and collation names defined in CollationFactory.
[jira] [Created] (SPARK-47874) Multiple bugs with map operations in combination with collations
Nikola Mandic created SPARK-47874: - Summary: Multiple bugs with map operations in combination with collations Key: SPARK-47874 URL: https://issues.apache.org/jira/browse/SPARK-47874 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic The following two queries produce different results (the first succeeds, the second throws an exception): {code:java} select map('a', 1, 'A' collate utf8_binary_lcase, 2); -- success select map('a' collate utf8_binary_lcase, 1, 'A', 2); -- exception{code} The following query returns 1: {code:java} select cast(map('a', 1, 'A', 2) as map)['A' collate utf8_binary_lcase]; -- 1{code}
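The expected key semantics can be illustrated with a minimal sketch (not Spark's map implementation): under UTF8_BINARY_LCASE, keys compare equal after lowercasing, so 'a' and 'A' address the same entry regardless of which argument carries the collation.

```python
# Sketch of UTF8_BINARY_LCASE key semantics (an illustration of the intended
# behavior, not Spark's map implementation): keys are compared after
# lowercasing, so 'a' and 'A' denote the same entry; a later insertion
# with an equivalent key overwrites the earlier value.
class LcaseMap:
    def __init__(self):
        self._d = {}

    def put(self, key, value):
        self._d[key.lower()] = value

    def get(self, key):
        return self._d[key.lower()]
```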
[jira] [Commented] (SPARK-46841) Language support for collations
[ https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837668#comment-17837668 ] Nikola Mandic commented on SPARK-46841: --- Working on it. > Language support for collations > --- > > Key: SPARK-46841 > URL: https://issues.apache.org/jira/browse/SPARK-46841 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major >
[jira] [Updated] (SPARK-47832) Fix problematic test in TPC-DS Collations test when ANSI flag is set
[ https://issues.apache.org/jira/browse/SPARK-47832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47832: -- Summary: Fix problematic test in TPC-DS Collations test when ANSI flag is set (was: Skip problematic test in TPC-DS Collations test when ANSI flag is set) > Fix problematic test in TPC-DS Collations test when ANSI flag is set > > > Key: SPARK-47832 > URL: https://issues.apache.org/jira/browse/SPARK-47832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Fix For: 4.0.0 > > > "Build / ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" CI is broken by TPC-DS > collations test. Error: > {code:java} > [info] - q35-v2.7 *** FAILED *** (2 seconds, 695 milliseconds) > 3489[info] java.lang.Exception: Expected "[null f d 0 > 1 0.0 0 0 2 1 2.0 2 2 2 > 1 2.0 2 2 > 3490 > ... > null m m 4 1 4.0 4 4 1 1 > 1.0 1 1 3 1 3.0 3 3]", but got > "[org.apache.spark.sparkexception > 3589[info] { > 3590[info] "errorclass" : "_legacy_error_temp_2250", > 3591[info] "messageparameters" : { > 3592[info] "analyzetblmsg" : " or analyze these tables through: analyze > table `spark_catalog`.`tpcds_utf8`.`customer_demographics` compute > statistics;.", > 3593[info] "autobroadcastjointhreshold" : > "spark.sql.autobroadcastjointhreshold", > 3594[info] "drivermemory" : "spark.driver.memory" > 3595[info] } > 3596[info] }]" > 3597[info] Error using configs: > 3598[info] at > org.apache.spark.sql.TPCDSCollationQueryTestSuite.$anonfun$runQuery$1(TPCDSCollationQueryTestSuite.scala:228) > 3599 > ... {code} >
[jira] [Created] (SPARK-47832) Skip problematic test in TPC-DS Collations test when ANSI flag is set
Nikola Mandic created SPARK-47832: - Summary: Skip problematic test in TPC-DS Collations test when ANSI flag is set Key: SPARK-47832 URL: https://issues.apache.org/jira/browse/SPARK-47832 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic Fix For: 4.0.0 "Build / ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" CI is broken by TPC-DS collations test. Error: {code:java} [info] - q35-v2.7 *** FAILED *** (2 seconds, 695 milliseconds) 3489[info] java.lang.Exception: Expected "[null f d 0 1 0.0 0 0 2 1 2.0 2 2 2 1 2.0 2 2 3490 ... nullm m 4 1 4.0 4 4 1 1 1.0 1 1 3 1 3.0 3 3]", but got "[org.apache.spark.sparkexception 3589[info] { 3590[info] "errorclass" : "_legacy_error_temp_2250", 3591[info] "messageparameters" : { 3592[info] "analyzetblmsg" : " or analyze these tables through: analyze table `spark_catalog`.`tpcds_utf8`.`customer_demographics` compute statistics;.", 3593[info] "autobroadcastjointhreshold" : "spark.sql.autobroadcastjointhreshold", 3594[info] "drivermemory" : "spark.driver.memory" 3595[info] } 3596[info] }]" 3597[info] Error using configs: 3598[info] at org.apache.spark.sql.TPCDSCollationQueryTestSuite.$anonfun$runQuery$1(TPCDSCollationQueryTestSuite.scala:228) 3599 ... {code}
[jira] [Updated] (SPARK-47408) TBD
[ https://issues.apache.org/jira/browse/SPARK-47408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47408: -- Summary: TBD (was: Luhncheck (all collations)) > TBD > --- > > Key: SPARK-47408 > URL: https://issues.apache.org/jira/browse/SPARK-47408 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] [Updated] (SPARK-47414) TBD
[ https://issues.apache.org/jira/browse/SPARK-47414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47414: -- Summary: TBD (was: Length, BitLength, OctetLength (all collations)) > TBD > --- > > Key: SPARK-47414 > URL: https://issues.apache.org/jira/browse/SPARK-47414 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] [Updated] (SPARK-47416) TBD
[ https://issues.apache.org/jira/browse/SPARK-47416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47416: -- Summary: TBD (was: SoundEx (all collations)) > TBD > --- > > Key: SPARK-47416 > URL: https://issues.apache.org/jira/browse/SPARK-47416 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] (SPARK-47420) TBD
[ https://issues.apache.org/jira/browse/SPARK-47420 ] Nikola Mandic deleted comment on SPARK-47420: --- was (Author: JIRAUSER304340): Working on it. > TBD > --- > > Key: SPARK-47420 > URL: https://issues.apache.org/jira/browse/SPARK-47420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] [Updated] (SPARK-47420) TBD
[ https://issues.apache.org/jira/browse/SPARK-47420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47420: -- Summary: TBD (was: FormatNumber, Sentences (all collations)) > TBD > --- > > Key: SPARK-47420 > URL: https://issues.apache.org/jira/browse/SPARK-47420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] [Updated] (SPARK-47417) Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47417: -- Summary: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences (all collations) (was: Ascii, Chr, Base64, UnBase64 (all collations)) > Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, > FormatNumber, Sentences (all collations) > -- > > Key: SPARK-47417 > URL: https://issues.apache.org/jira/browse/SPARK-47417 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-47418) TBD
[ https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47418: -- Summary: TBD (was: Decode, StringDecode, Encode, ToBinary (all collations)) > TBD > --- > > Key: SPARK-47418 > URL: https://issues.apache.org/jira/browse/SPARK-47418 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] [Commented] (SPARK-47416) SoundEx (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835328#comment-17835328 ] Nikola Mandic commented on SPARK-47416: --- Working on it. > SoundEx (all collations) > > > Key: SPARK-47416 > URL: https://issues.apache.org/jira/browse/SPARK-47416 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major >
[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations
[ https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47617: -- Description: As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible. We can utilize the TPC-DS testing infrastructure already present in Spark. The idea is to vary TPC-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TPC-DS queries for certain batches of collations. For example, when comparing query runs on a table whose columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase. Introduce a new query suite which tests the described behavior with available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, randomized case for fuzzy testing). was: As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible. We can utilize the TCP-DS testing infrastructure already present in Spark. The idea is to vary TCP-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TCP-DS queries for certain batches of collations. For example, when comparing query runs on a table whose columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase. Introduce a new query suite which tests the described behavior with available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, randomized case for fuzzy testing).
> Add TPC-DS testing infrastructure for collations > > > Key: SPARK-47617 > URL: https://issues.apache.org/jira/browse/SPARK-47617 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > As collation support grows across all SQL features and new collation types > are added, we need a reliable testing model covering as many standard > SQL capabilities as possible. > We can utilize the TPC-DS testing infrastructure already present in Spark. The > idea is to vary TPC-DS table string columns by adding multiple collations > with different ordering rules and case sensitivity, producing new tables. > These tables should yield the same results against predefined TPC-DS queries > for certain batches of collations. For example, when comparing query runs on > a table whose columns are first collated as UTF8_BINARY and then as > UTF8_BINARY_LCASE, we should get the same results after converting to > lowercase. > Introduce a new query suite which tests the described behavior with available > collations (utf8_binary and unicode) combined with case conversions > (lowercase, uppercase, randomized case for fuzzy testing).
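The comparison the suite performs can be sketched as follows (a hypothetical harness, not the actual TPCDSCollationQueryTestSuite): results from the UTF8_BINARY run and the UTF8_BINARY_LCASE run should agree once string values are lowercased.

```python
# Hypothetical comparison harness for the idea described above (not the
# actual TPCDSCollationQueryTestSuite): two query runs over differently
# collated copies of a table should produce identical result sets after
# lowercasing all string values.
def normalize(rows):
    # Lowercase string fields and sort rows so ordering differences
    # between the two runs do not cause false mismatches.
    return sorted(
        tuple(v.lower() if isinstance(v, str) else v for v in row)
        for row in rows
    )

def results_match(binary_rows, lcase_rows):
    return normalize(binary_rows) == normalize(lcase_rows)
```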
[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations
[ https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47617: -- Summary: Add TPC-DS testing infrastructure for collations (was: Add TCP-DS testing infrastructure for collations) > Add TPC-DS testing infrastructure for collations > > > Key: SPARK-47617 > URL: https://issues.apache.org/jira/browse/SPARK-47617 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > As collation support grows across all SQL features and new collation types > are added, we need a reliable testing model covering as many standard > SQL capabilities as possible. > We can utilize the TCP-DS testing infrastructure already present in Spark. The > idea is to vary TCP-DS table string columns by adding multiple collations > with different ordering rules and case sensitivity, producing new tables. > These tables should yield the same results against predefined TCP-DS queries > for certain batches of collations. For example, when comparing query runs on > a table whose columns are first collated as UTF8_BINARY and then as > UTF8_BINARY_LCASE, we should get the same results after converting to > lowercase. > Introduce a new query suite which tests the described behavior with available > collations (utf8_binary and unicode) combined with case conversions > (lowercase, uppercase, randomized case for fuzzy testing).
[jira] [Created] (SPARK-47617) Add TCP-DS testing infrastructure for collations
Nikola Mandic created SPARK-47617: - Summary: Add TCP-DS testing infrastructure for collations Key: SPARK-47617 URL: https://issues.apache.org/jira/browse/SPARK-47617 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible. We can utilize the TCP-DS testing infrastructure already present in Spark. The idea is to vary TCP-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TCP-DS queries for certain batches of collations. For example, when comparing query runs on a table whose columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase. Introduce a new query suite which tests the described behavior with available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, randomized case for fuzzy testing).
[jira] [Updated] (SPARK-47483) Add support for aggregation and join operations on arrays of collated strings
[ https://issues.apache.org/jira/browse/SPARK-47483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47483: -- Epic Link: SPARK-46830 > Add support for aggregation and join operations on arrays of collated strings > - > > Key: SPARK-47483 > URL: https://issues.apache.org/jira/browse/SPARK-47483 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > Example of aggregation sequence: > {code:java} > create table t(a array) using parquet; > insert into t(a) values(array('a' collate utf8_binary_lcase)); > insert into t(a) values(array('A' collate utf8_binary_lcase)); > select distinct a from t; {code} > Example of join sequence: > {code:java} > create table l(a array) using parquet; > create table r(a array) using parquet; > insert into l(a) values(array('a' collate utf8_binary_lcase)); > insert into r(a) values(array('A' collate utf8_binary_lcase)); > select * from l join r where l.a = r.a; {code} > Both runs should yield one row since the arrays are considered equal.
[jira] [Created] (SPARK-47483) Add support for aggregation and join operations on arrays of collated strings
Nikola Mandic created SPARK-47483: - Summary: Add support for aggregation and join operations on arrays of collated strings Key: SPARK-47483 URL: https://issues.apache.org/jira/browse/SPARK-47483 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic Example of aggregation sequence: {code:java} create table t(a array) using parquet; insert into t(a) values(array('a' collate utf8_binary_lcase)); insert into t(a) values(array('A' collate utf8_binary_lcase)); select distinct a from t; {code} Example of join sequence: {code:java} create table l(a array) using parquet; create table r(a array) using parquet; insert into l(a) values(array('a' collate utf8_binary_lcase)); insert into r(a) values(array('A' collate utf8_binary_lcase)); select * from l join r where l.a = r.a; {code} Both runs should yield one row since the arrays are considered equal.
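The expected semantics of the two examples above can be modeled in plain Python (a sketch of the intended behavior, not Spark code; `lcase_array_key` is a hypothetical helper): arrays of UTF8_BINARY_LCASE strings compare equal when their lowercase forms match, so both the distinct and the join keep exactly one row:

```python
def lcase_array_key(arr):
    # Model equality of array<string collate utf8_binary_lcase>:
    # an array's comparison key is the tuple of lowercased elements.
    return tuple(s.lower() for s in arr)

# Aggregation: ['a'] and ['A'] collapse to a single distinct group.
t = [["a"], ["A"]]
distinct = {lcase_array_key(a): a for a in t}
assert len(distinct) == 1

# Join: l.a = r.a matches case-insensitively, producing one joined row.
l, r = [["a"]], [["A"]]
joined = [(x, y) for x in l for y in r
          if lcase_array_key(x) == lcase_array_key(y)]
assert len(joined) == 1
```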
[jira] [Created] (SPARK-47422) Support collated strings in array operations
Nikola Mandic created SPARK-47422: - Summary: Support collated strings in array operations Key: SPARK-47422 URL: https://issues.apache.org/jira/browse/SPARK-47422 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: Nikola Mandic Collations need to be properly supported in the following array operations but currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept. Example query: {code:java} select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate utf8_binary_lcase){code} We would expect the result of this query to be true.
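The expected behavior of the example query can be modeled in plain Python (a sketch, not Spark's implementation; `array_contains_lcase` is a hypothetical helper): under UTF8_BINARY_LCASE, membership tests should use case-insensitive comparison rather than binary equality:

```python
def array_contains_lcase(arr, value):
    # Model array_contains under UTF8_BINARY_LCASE: the membership
    # test compares lowercase forms, not raw bytes.
    return any(s.lower() == value.lower() for s in arr)

# Mirrors the Jira example: 'AAA' should be found in ['aaa'].
assert array_contains_lcase(["aaa"], "AAA") is True
assert array_contains_lcase(["aaa"], "bbb") is False
```

The other listed operations (ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept) would need the same collation-aware element comparison.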
[jira] [Updated] (SPARK-47211) Fix ignored PySpark Connect string collation
[ https://issues.apache.org/jira/browse/SPARK-47211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47211: -- Component/s: Connect > Fix ignored PySpark Connect string collation > > > Key: SPARK-47211 > URL: https://issues.apache.org/jira/browse/SPARK-47211 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Fix For: 4.0.0 > > > When using Connect with PySpark, the string collation silently gets dropped: > {code:java} > Client connected to the Spark Connect server at localhost > SparkSession available as 'spark'. > >>> spark.sql("select 'abc' collate 'UNICODE'") > DataFrame[collate(abc): string] > >>> from pyspark.sql.types import StructType, StringType, StructField > >>> spark.createDataFrame([], StructType([StructField('id', StringType(2))])) > DataFrame[id: string] > {code} > Instead of the "string" type in the dataframe, we should see "string COLLATE > 'UNICODE'".
[jira] [Created] (SPARK-47211) Fix ignored PySpark Connect string collation
Nikola Mandic created SPARK-47211: - Summary: Fix ignored PySpark Connect string collation Key: SPARK-47211 URL: https://issues.apache.org/jira/browse/SPARK-47211 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 4.0.0 Reporter: Nikola Mandic Fix For: 4.0.0 When using Connect with PySpark, the string collation silently gets dropped: {code:java} Client connected to the Spark Connect server at localhost SparkSession available as 'spark'. >>> spark.sql("select 'abc' collate 'UNICODE'") DataFrame[collate(abc): string] >>> from pyspark.sql.types import StructType, StringType, StructField >>> spark.createDataFrame([], StructType([StructField('id', StringType(2))])) DataFrame[id: string] {code} Instead of the "string" type in the dataframe, we should see "string COLLATE 'UNICODE'".
[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47144: -- Epic Link: SPARK-46830 > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Fix For: 4.0.0 > > > The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur.
[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47144: -- Component/s: SQL > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Fix For: 4.0.0 > > > The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur.
[jira] [Created] (SPARK-47144) Fix Spark Connect collation issue
Nikola Mandic created SPARK-47144: - Summary: Fix Spark Connect collation issue Key: SPARK-47144 URL: https://issues.apache.org/jira/browse/SPARK-47144 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 4.0.0 Reporter: Nikola Mandic Fix For: 4.0.0 The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when connecting to the server using Spark Connect: {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support convert string(UCS_BASIC_LCASE) to connect proto types.{code} When using the default collation "UCS_BASIC", the error does not occur.
[jira] [Commented] (SPARK-42328) Assign name to _LEGACY_ERROR_TEMP_1175
[ https://issues.apache.org/jira/browse/SPARK-42328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818849#comment-17818849 ] Nikola Mandic commented on SPARK-42328: --- [~maxgekk] Yes, thank you. > Assign name to _LEGACY_ERROR_TEMP_1175 > -- > > Key: SPARK-42328 > URL: https://issues.apache.org/jira/browse/SPARK-42328 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >