[jira] [Created] (SPARK-48556) Incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION

2024-06-06 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-48556:
-

 Summary: Incorrect error message pointing to 
UNSUPPORTED_GROUPING_EXPRESSION
 Key: SPARK-48556
 URL: https://issues.apache.org/jira/browse/SPARK-48556
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


The following sequence of queries produces the UNSUPPORTED_GROUPING_EXPRESSION error:
{code:java}
create table t1(a int, b int) using parquet;
select grouping(a), dummy from t1 group by a with rollup; {code}
However, the error should instead point the user to the invalid column name "dummy".






[jira] [Created] (SPARK-48430) Fix map value extraction when map contains collated strings

2024-05-27 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-48430:
-

 Summary: Fix map value extraction when map contains collated 
strings
 Key: SPARK-48430
 URL: https://issues.apache.org/jira/browse/SPARK-48430
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


The following queries return unexpected results:
{code:java}
select collation(map('a', 'b' collate utf8_binary_lcase)['a']);
select collation(element_at(map('a', 'b' collate utf8_binary_lcase), 
'a'));{code}
Both return UTF8_BINARY instead of UTF8_BINARY_LCASE.






[jira] [Updated] (SPARK-48413) ALTER COLUMN with collation

2024-05-24 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-48413:
--
Epic Link: SPARK-46830

> ALTER COLUMN with collation
> ---
>
> Key: SPARK-48413
> URL: https://issues.apache.org/jira/browse/SPARK-48413
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> Add support for changing the collation of a column with the ALTER COLUMN command.






[jira] [Created] (SPARK-48413) ALTER COLUMN with collation

2024-05-24 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-48413:
-

 Summary: ALTER COLUMN with collation
 Key: SPARK-48413
 URL: https://issues.apache.org/jira/browse/SPARK-48413
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


Add support for changing the collation of a column with the ALTER COLUMN command.
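
A minimal sketch of the kind of statement this would enable (the exact syntax is an assumption and is not specified in this ticket):
{code:java}
-- Hypothetical usage; the final syntax may differ from this sketch.
create table t(c string) using parquet;
alter table t alter column c type string collate utf8_binary_lcase;
{code}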






[jira] [Created] (SPARK-48273) Late rewrite of PlanWithUnresolvedIdentifier

2024-05-14 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-48273:
-

 Summary: Late rewrite of PlanWithUnresolvedIdentifier
 Key: SPARK-48273
 URL: https://issues.apache.org/jira/browse/SPARK-48273
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


PlanWithUnresolvedIdentifier is rewritten late in analysis, which causes rules like SubstituteUnresolvedOrdinals to miss the new plan. As a result, the following queries fail:

{code:java}
create temporary view identifier('v1') as (select my_col from (values (1), (2), 
(1) as (my_col)) group by 1);
--
cache table identifier('t1') as (select my_col from (values (1), (2), (1) as 
(my_col)) group by 1); 
--
create table identifier('t2') as (select my_col from (values (1), (2), (1) 
as (my_col)) group by 1);
insert into identifier('t2') select my_col from (values (3) as (my_col)) group 
by 1; {code}






[jira] [Updated] (SPARK-46841) Language support for collations

2024-04-24 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-46841:
--
Description: 
Languages and localization for collations are supported by the ICU library. The collation naming format is as follows:
{code:java}
<2-letter language code>[_<4-letter script>][_<3-letter country code>][_specifier_specifier...]{code}
The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.

Currently supported optional specifiers:
 * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
 * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
 * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions

Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.
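
To make the naming format concrete, here are a few names that would match the pattern above (illustrative only, assuming ICU-style language/script/country codes; the actual supported set is defined in CollationFactory):
{code:java}
-- Illustrative names matching the pattern; not a list of supported collations.
sr                  -- 2-letter language code only
sr_Cyrl             -- language + 4-letter script
sr_Cyrl_SRB         -- language + script + 3-letter country code
sr_Cyrl_SRB_CI_AI   -- locale followed by case- and accent-insensitivity specifiers
{code}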

  was:
Languages and localization for collations are supported by the ICU library. The collation naming format is as follows:
{code:java}
<2-letter language code>_<4-letter script>_<3-letter country code>[_specifier_specifier...]{code}
The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.

Currently supported optional specifiers:
 * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
 * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
 * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions

Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.


> Language support for collations
> ---
>
> Key: SPARK-46841
> URL: https://issues.apache.org/jira/browse/SPARK-46841
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>
> Languages and localization for collations are supported by the ICU library. 
> The collation naming format is as follows:
> {code:java}
> <2-letter language code>[_<4-letter script>][_<3-letter country code>][_specifier_specifier...]{code}
> The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.
> Currently supported optional specifiers:
>  * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
>  * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
>  * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions
> Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.






[jira] [Updated] (SPARK-46841) Language support for collations

2024-04-24 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-46841:
--
Description: 
Languages and localization for collations are supported by the ICU library. The collation naming format is as follows:
{code:java}
<2-letter language code>_<4-letter script>_<3-letter country code>[_specifier_specifier...]{code}
The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.

Currently supported optional specifiers:
 * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
 * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
 * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions

Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.

  was:
Languages and localization for collations are supported by the ICU library. The collation naming format is as follows:
{code:java}
<2-letter language code>_<4-letter script>_<3-letter country code>[_specifier_specifier...]{code}
The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.

Currently supported optional specifiers:
 * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
 * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
 * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions

Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.


> Language support for collations
> ---
>
> Key: SPARK-46841
> URL: https://issues.apache.org/jira/browse/SPARK-46841
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>
> Languages and localization for collations are supported by the ICU library. 
> The collation naming format is as follows:
> {code:java}
> <2-letter language code>_<4-letter script>_<3-letter country code>[_specifier_specifier...]{code}
> The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.
> Currently supported optional specifiers:
>  * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
>  * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
>  * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions
> Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.






[jira] [Updated] (SPARK-46841) Language support for collations

2024-04-24 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-46841:
--
Component/s: SQL
Description: 
Languages and localization for collations are supported by the ICU library. The collation naming format is as follows:
{code:java}
<2-letter language code>_<4-letter script>_<3-letter country code>[_specifier_specifier...]{code}
The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.

Currently supported optional specifiers:
 * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
 * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
 * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions

Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.

> Language support for collations
> ---
>
> Key: SPARK-46841
> URL: https://issues.apache.org/jira/browse/SPARK-46841
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>
> Languages and localization for collations are supported by the ICU library. 
> The collation naming format is as follows:
> {code:java}
> <2-letter language code>_<4-letter script>_<3-letter country code>[_specifier_specifier...]{code}
> The locale specifier consists of the first part of the collation name (language + script + country). Locale specifiers need to be stable across ICU versions; to keep existing ids and names invariant, we introduce a golden file with the locale table, which should cause a CI failure on any silent changes.
> Currently supported optional specifiers:
>  * CS/CI - case sensitivity, default is case-sensitive; supported by configuring ICU collation levels
>  * AS/AI - accent sensitivity, default is accent-sensitive; supported by configuring ICU collation levels
>  * /LCASE/UCASE - case conversion performed prior to comparisons; supported by an internal implementation relying on ICU locale-aware conversions
> Collation specifiers can be used in any order, except for the locale, which is mandatory and must come first. There is a one-to-one mapping between collation ids and collation names defined in CollationFactory.






[jira] [Created] (SPARK-47874) Multiple bugs with map operations in combination with collations

2024-04-16 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47874:
-

 Summary: Multiple bugs with map operations in combination with 
collations
 Key: SPARK-47874
 URL: https://issues.apache.org/jira/browse/SPARK-47874
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


The following two queries behave inconsistently (the first succeeds, the second throws an exception):
{code:java}
select map('a', 1, 'A' collate utf8_binary_lcase, 2); -- success
select map('a' collate utf8_binary_lcase, 1, 'A', 2); -- exception{code}
The following query returns 1:
{code:java}
select cast(map('a', 1, 'A', 2) as map<string collate utf8_binary_lcase, int>)['A' collate utf8_binary_lcase]; -- 1{code}






[jira] [Commented] (SPARK-46841) Language support for collations

2024-04-16 Thread Nikola Mandic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837668#comment-17837668
 ] 

Nikola Mandic commented on SPARK-46841:
---

Working on it.

> Language support for collations
> ---
>
> Key: SPARK-46841
> URL: https://issues.apache.org/jira/browse/SPARK-46841
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>







[jira] [Updated] (SPARK-47832) Fix problematic test in TPC-DS Collations test when ANSI flag is set

2024-04-12 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47832:
--
Summary: Fix problematic test in TPC-DS Collations test when ANSI flag is 
set  (was: Skip problematic test in TPC-DS Collations test when ANSI flag is 
set)

> Fix problematic test in TPC-DS Collations test when ANSI flag is set
> 
>
> Key: SPARK-47832
> URL: https://issues.apache.org/jira/browse/SPARK-47832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> "Build / ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" CI is broken by TPC-DS 
> collations test. Error:
> {code:java}
> [info] - q35-v2.7 *** FAILED *** (2 seconds, 695 milliseconds)
> 3489[info]   java.lang.Exception: Expected "[null f   d   0   
> 1   0.0 0   0   2   1   2.0 2   2   2 
>   1   2.0 2   2
> 3490
> ...
> null  m   m   4   1   4.0 4   4   1   1   
> 1.0 1   1   3   1   3.0 3   3]", but got 
> "[org.apache.spark.sparkexception
> 3589[info] {
> 3590[info]   "errorclass" : "_legacy_error_temp_2250",
> 3591[info]   "messageparameters" : {
> 3592[info] "analyzetblmsg" : " or analyze these tables through: analyze 
> table `spark_catalog`.`tpcds_utf8`.`customer_demographics` compute 
> statistics;.",
> 3593[info] "autobroadcastjointhreshold" : 
> "spark.sql.autobroadcastjointhreshold",
> 3594[info] "drivermemory" : "spark.driver.memory"
> 3595[info]   }
> 3596[info] }]"
> 3597[info] Error using configs:
> 3598[info]   at 
> org.apache.spark.sql.TPCDSCollationQueryTestSuite.$anonfun$runQuery$1(TPCDSCollationQueryTestSuite.scala:228)
> 3599
> ... {code}
>  






[jira] [Created] (SPARK-47832) Skip problematic test in TPC-DS Collations test when ANSI flag is set

2024-04-12 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47832:
-

 Summary: Skip problematic test in TPC-DS Collations test when ANSI 
flag is set
 Key: SPARK-47832
 URL: https://issues.apache.org/jira/browse/SPARK-47832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic
 Fix For: 4.0.0


"Build / ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" CI is broken by TPC-DS 
collations test. Error:
{code:java}
[info] - q35-v2.7 *** FAILED *** (2 seconds, 695 milliseconds)
3489[info]   java.lang.Exception: Expected "[null   f   d   0   
1   0.0 0   0   2   1   2.0 2   2   2   
1   2.0 2   2
3490
...
nullm   m   4   1   4.0 4   4   1   1   
1.0 1   1   3   1   3.0 3   3]", but got 
"[org.apache.spark.sparkexception
3589[info] {
3590[info]   "errorclass" : "_legacy_error_temp_2250",
3591[info]   "messageparameters" : {
3592[info] "analyzetblmsg" : " or analyze these tables through: analyze 
table `spark_catalog`.`tpcds_utf8`.`customer_demographics` compute 
statistics;.",
3593[info] "autobroadcastjointhreshold" : 
"spark.sql.autobroadcastjointhreshold",
3594[info] "drivermemory" : "spark.driver.memory"
3595[info]   }
3596[info] }]"
3597[info] Error using configs:
3598[info]   at 
org.apache.spark.sql.TPCDSCollationQueryTestSuite.$anonfun$runQuery$1(TPCDSCollationQueryTestSuite.scala:228)
3599
... {code}
 






[jira] [Updated] (SPARK-47408) TBD

2024-04-11 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47408:
--
Summary: TBD  (was: Luhncheck (all collations))

> TBD
> ---
>
> Key: SPARK-47408
> URL: https://issues.apache.org/jira/browse/SPARK-47408
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-47414) TBD

2024-04-11 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47414:
--
Summary: TBD  (was: Length, BitLength, OctetLength (all collations))

> TBD
> ---
>
> Key: SPARK-47414
> URL: https://issues.apache.org/jira/browse/SPARK-47414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-47416) TBD

2024-04-11 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47416:
--
Summary: TBD  (was: SoundEx (all collations))

> TBD
> ---
>
> Key: SPARK-47416
> URL: https://issues.apache.org/jira/browse/SPARK-47416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] (SPARK-47420) TBD

2024-04-10 Thread Nikola Mandic (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47420 ]


Nikola Mandic deleted comment on SPARK-47420:
---

was (Author: JIRAUSER304340):
Working on it.

> TBD
> ---
>
> Key: SPARK-47420
> URL: https://issues.apache.org/jira/browse/SPARK-47420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-47420) TBD

2024-04-10 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47420:
--
Summary: TBD  (was: FormatNumber, Sentences (all collations))

> TBD
> ---
>
> Key: SPARK-47420
> URL: https://issues.apache.org/jira/browse/SPARK-47420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-47417) Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences (all collations)

2024-04-10 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47417:
--
Summary: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, 
ToBinary, FormatNumber, Sentences (all collations)  (was: Ascii, Chr, Base64, 
UnBase64 (all collations))

> Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, 
> FormatNumber, Sentences (all collations)
> --
>
> Key: SPARK-47417
> URL: https://issues.apache.org/jira/browse/SPARK-47417
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47418) TBD

2024-04-10 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47418:
--
Summary: TBD  (was: Decode, StringDecode, Encode, ToBinary (all collations))

> TBD
> ---
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Commented] (SPARK-47416) SoundEx (all collations)

2024-04-09 Thread Nikola Mandic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835328#comment-17835328
 ] 

Nikola Mandic commented on SPARK-47416:
---

Working on it.

> SoundEx (all collations)
> 
>
> Key: SPARK-47416
> URL: https://issues.apache.org/jira/browse/SPARK-47416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>







[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-03-27 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47617:
--
Description: 
As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible.

We can utilize the TPC-DS testing infrastructure already present in Spark. The idea is to vary the TPC-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TPC-DS queries for certain batches of collations. For example, when comparing query runs on a table where columns are collated first as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase.

Introduce a new query suite which tests the described behavior with the available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, and randomized case for fuzzy testing).
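
As a rough illustration of the comparison idea (a sketch only; the table and column names below are assumptions, not the suite's actual schema):
{code:java}
-- Sketch: two collated variants of the same string column should agree once lowercased.
create table customer_utf8_binary(c_name string collate utf8_binary) using parquet;
create table customer_utf8_lcase(c_name string collate utf8_binary_lcase) using parquet;
select lower(c_name) from customer_utf8_binary order by 1;
select lower(c_name) from customer_utf8_lcase order by 1;
{code}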

  was:
As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible.

We can utilize the TCP-DS testing infrastructure already present in Spark. The idea is to vary the TCP-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TCP-DS queries for certain batches of collations. For example, when comparing query runs on a table where columns are collated first as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase.

Introduce a new query suite which tests the described behavior with the available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, and randomized case for fuzzy testing).


> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible.
> We can utilize the TPC-DS testing infrastructure already present in Spark. The idea is to vary the TPC-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TPC-DS queries for certain batches of collations. For example, when comparing query runs on a table where columns are collated first as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase.
> Introduce a new query suite which tests the described behavior with the available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, and randomized case for fuzzy testing).






[jira] [Updated] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-03-27 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47617:
--
Summary: Add TPC-DS testing infrastructure for collations  (was: Add TCP-DS 
testing infrastructure for collations)

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible.
> We can utilize the TCP-DS testing infrastructure already present in Spark. The idea is to vary the TCP-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TCP-DS queries for certain batches of collations. For example, when comparing query runs on a table where columns are collated first as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase.
> Introduce a new query suite which tests the described behavior with the available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, and randomized case for fuzzy testing).






[jira] [Created] (SPARK-47617) Add TCP-DS testing infrastructure for collations

2024-03-27 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47617:
-

 Summary: Add TCP-DS testing infrastructure for collations
 Key: SPARK-47617
 URL: https://issues.apache.org/jira/browse/SPARK-47617
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


As collation support grows across all SQL features and new collation types are added, we need a reliable testing model covering as many standard SQL capabilities as possible.

We can utilize the TCP-DS testing infrastructure already present in Spark. The idea is to vary the TCP-DS table string columns by adding multiple collations with different ordering rules and case sensitivity, producing new tables. These tables should yield the same results against predefined TCP-DS queries for certain batches of collations. For example, when comparing query runs on a table where columns are collated first as UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results after converting to lowercase.

Introduce a new query suite which tests the described behavior with the available collations (utf8_binary and unicode) combined with case conversions (lowercase, uppercase, and randomized case for fuzzy testing).






[jira] [Updated] (SPARK-47483) Add support for aggregation and join operations on arrays of collated strings

2024-03-20 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47483:
--
Epic Link: SPARK-46830

> Add support for aggregation and join operations on arrays of collated strings
> -
>
> Key: SPARK-47483
> URL: https://issues.apache.org/jira/browse/SPARK-47483
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>
> Example of aggregation sequence:
> {code:java}
> create table t(a array<string collate utf8_binary_lcase>) using parquet;
> insert into t(a) values(array('a' collate utf8_binary_lcase));
> insert into t(a) values(array('A' collate utf8_binary_lcase));
> select distinct a from t; {code}
> Example of join sequence:
> {code:java}
> create table l(a array<string collate utf8_binary_lcase>) using parquet;
> create table r(a array<string collate utf8_binary_lcase>) using parquet;
> insert into l(a) values(array('a' collate utf8_binary_lcase));
> insert into r(a) values(array('A' collate utf8_binary_lcase));
> select * from l join r where l.a = r.a; {code}
> Both runs should yield one row since the arrays are considered equal.






[jira] [Created] (SPARK-47483) Add support for aggregation and join operations on arrays of collated strings

2024-03-20 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47483:
-

 Summary: Add support for aggregation and join operations on arrays 
of collated strings
 Key: SPARK-47483
 URL: https://issues.apache.org/jira/browse/SPARK-47483
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


Example of aggregation sequence:
{code:java}
create table t(a array<string collate utf8_binary_lcase>) using parquet;


insert into t(a) values(array('a' collate utf8_binary_lcase));
insert into t(a) values(array('A' collate utf8_binary_lcase));


select distinct a from t; {code}
Example of join sequence:
{code:java}
create table l(a array<string collate utf8_binary_lcase>) using parquet;
create table r(a array<string collate utf8_binary_lcase>) using parquet;


insert into l(a) values(array('a' collate utf8_binary_lcase));
insert into r(a) values(array('A' collate utf8_binary_lcase));


select * from l join r where l.a = r.a; {code}
Both runs should yield one row since the arrays are considered equal.






[jira] [Created] (SPARK-47422) Support collated strings in array operations

2024-03-15 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47422:
-

 Summary: Support collated strings in array operations
 Key: SPARK-47422
 URL: https://issues.apache.org/jira/browse/SPARK-47422
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic


Collations need to be properly supported in the following array operations, which currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept. Example query:

{code:java}
select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate 
utf8_binary_lcase){code}
We would expect the result of this query to be true.
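
For context, analogous queries for the other listed operations look as follows (a sketch; the expected results are extrapolated from the array_contains example above and are not stated in this ticket):
{code:java}
-- Sketch only: expected results are an assumption based on lowercase-collation semantics.
select arrays_overlap(array('aaa' collate utf8_binary_lcase), array('AAA' collate utf8_binary_lcase));   -- expected: true
select array_distinct(array('aaa' collate utf8_binary_lcase, 'AAA' collate utf8_binary_lcase));          -- expected: a single element
select array_intersect(array('aaa' collate utf8_binary_lcase), array('AAA' collate utf8_binary_lcase));  -- expected: a single element
select array_except(array('aaa' collate utf8_binary_lcase), array('AAA' collate utf8_binary_lcase));     -- expected: empty array
{code}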






[jira] [Updated] (SPARK-47211) Fix ignored PySpark Connect string collation

2024-02-28 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47211:
--
Component/s: Connect

> Fix ignored PySpark Connect string collation
> 
>
> Key: SPARK-47211
> URL: https://issues.apache.org/jira/browse/SPARK-47211
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> When using Connect with PySpark, string collation silently gets dropped:
> {code:java}
> Client connected to the Spark Connect server at localhost
> SparkSession available as 'spark'.
> >>> spark.sql("select 'abc' collate 'UNICODE'")
> DataFrame[collate(abc): string]
> >>> from pyspark.sql.types import StructType, StringType, StructField
> >>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
> DataFrame[id: string]
> {code}
> Instead of the "string" type in the dataframe schema, we should be seeing "string COLLATE 'UNICODE'".






[jira] [Created] (SPARK-47211) Fix ignored PySpark Connect string collation

2024-02-28 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47211:
-

 Summary: Fix ignored PySpark Connect string collation
 Key: SPARK-47211
 URL: https://issues.apache.org/jira/browse/SPARK-47211
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nikola Mandic
 Fix For: 4.0.0


When using Connect with PySpark, string collation silently gets dropped:
{code:java}
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> spark.sql("select 'abc' collate 'UNICODE'")
DataFrame[collate(abc): string]
>>> from pyspark.sql.types import StructType, StringType, StructField
>>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
DataFrame[id: string]
{code}
Instead of the "string" type in the dataframe schema, we should be seeing "string COLLATE 'UNICODE'".






[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47144:
--
Epic Link: SPARK-46830

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when 
> connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.






[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47144:
--
Component/s: SQL

> Fix Spark Connect collation issue
> -
>
> Key: SPARK-47144
> URL: https://issues.apache.org/jira/browse/SPARK-47144
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when 
> connecting to the server using Spark Connect:
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support 
> convert string(UCS_BASIC_LCASE) to connect proto types.{code}
> When using the default collation "UCS_BASIC", the error does not occur.






[jira] [Created] (SPARK-47144) Fix Spark Connect collation issue

2024-02-23 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47144:
-

 Summary: Fix Spark Connect collation issue
 Key: SPARK-47144
 URL: https://issues.apache.org/jira/browse/SPARK-47144
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 4.0.0
Reporter: Nikola Mandic
 Fix For: 4.0.0


The collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" fails when 
connecting to the server using Spark Connect:
{code:java}
pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
(org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support convert 
string(UCS_BASIC_LCASE) to connect proto types.{code}
When using the default collation "UCS_BASIC", the error does not occur.






[jira] [Commented] (SPARK-42328) Assign name to _LEGACY_ERROR_TEMP_1175

2024-02-20 Thread Nikola Mandic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818849#comment-17818849
 ] 

Nikola Mandic commented on SPARK-42328:
---

[~maxgekk] Yes, thank you.

> Assign name to _LEGACY_ERROR_TEMP_1175
> --
>
> Key: SPARK-42328
> URL: https://issues.apache.org/jira/browse/SPARK-42328
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>



