[jira] [Updated] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-48222:
-
Component/s: Documentation

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>







[jira] [Created] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48222:


 Summary: Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
 Key: SPARK-48222
 URL: https://issues.apache.org/jira/browse/SPARK-48222
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition

2024-05-07 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48176:


 Summary: Fix name of FIELD_ALREADY_EXISTS error condition
 Key: SPARK-48176
 URL: https://issues.apache.org/jira/browse/SPARK-48176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-48107) Exclude tests from Python distribution

2024-05-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48107:


 Summary: Exclude tests from Python distribution
 Key: SPARK-48107
 URL: https://issues.apache.org/jira/browse/SPARK-48107
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Commented] (SPARK-47429) Rename errorClass to errorCondition and subClass to subCondition

2024-05-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842694#comment-17842694
 ] 

Nicholas Chammas commented on SPARK-47429:
--

I think one intermediate step we can take here is to mark the existing fields 
as deprecated, indicating that they will be renamed. That way, if we don't 
complete the renaming before the 4.0 release, we at least have the deprecation 
in place.

In addition to deprecating the existing fields, we can also add the renamed 
fields and simply have them redirect to the originals.
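
To make that concrete, here is a minimal sketch of what such a redirect could 
look like. It is purely illustrative; the class and member names below are 
hypothetical and not taken from the Spark codebase.

{code:scala}
// Hypothetical sketch: keep the old name as a deprecated forwarder so existing
// callers keep compiling while new code uses the SQL-standard-aligned name.
class ExampleError(
    val errorCondition: String,                  // new name
    val messageParameters: Map[String, String]) {

  @deprecated("Use errorCondition instead", "4.0.0")
  def errorClass: String = errorCondition        // old name redirects to the new one
}

object ExampleError {
  // Old-style factory kept for source compatibility, marked deprecated.
  @deprecated("Use the errorCondition-based constructor instead", "4.0.0")
  def fromErrorClass(errorClass: String, params: Map[String, String]): ExampleError =
    new ExampleError(errorClass, params)
}
{code}

Callers that still reference {{errorClass}} would then get a compiler 
deprecation warning rather than a hard break.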

I will build a list of the classes, class attributes, methods, and method 
parameters that will need this kind of update. Note that this list will be 
much, much smaller than the thousands of uses that BingKun highlighted, since I 
am just focusing on the declarations.

cc [~cloud_fan] 

> Rename errorClass to errorCondition and subClass to subCondition
> 
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
> Attachments: image-2024-04-18-09-26-04-493.png
>
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.






[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition and subClass to subCondition

2024-05-01 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47429:
-
Summary: Rename errorClass to errorCondition and subClass to subCondition  
(was: Rename errorClass to errorCondition)

> Rename errorClass to errorCondition and subClass to subCondition
> 
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
> Attachments: image-2024-04-18-09-26-04-493.png
>
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.






[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition

2024-04-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47429:
-
Description: 
We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This ticket covers renaming {{subClass}} as well.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.

  was:
We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.


> Rename errorClass to errorCondition
> ---
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.






[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2024-04-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837292#comment-17837292
 ] 

Nicholas Chammas commented on SPARK-28024:
--

[~cloud_fan] - Given the updated descriptions for Cases 2, 3, and 4, do you 
still consider there to be a problem here? Or shall we just consider this an 
acceptable difference between how Spark and Postgres handle these cases?

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836706#comment-17836706
 ] 

Nicholas Chammas commented on SPARK-28024:
--

I've just retried cases 2-4 on master with ANSI mode enabled, and Spark's 
behavior appears to be the same as when I last checked it in February.

I also ran those same cases against PostgreSQL 16. I couldn't replicate the 
output for Case 4, and I believe there was a mistake in the original 
description of that case where the sign was flipped. So I've adjusted the sign 
accordingly and shown Spark and Postgres's behavior side-by-side.

Here is the original Case 4 with the negative sign:

{code:sql}
spark-sql (default)> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200); 
0.
{code}
 
So I don't think there is a problem there. With a positive sign, the behavior 
is different as shown in the ticket description above.
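
For what it's worth, plain JVM double arithmetic produces the same two results, 
which is presumably (my assumption, not verified against Spark's implementation) 
all that Spark is surfacing here:

{code:scala}
scala> math.exp(-1.2345678901234E200)
val res0: Double = 0.0

scala> math.exp(1.2345678901234E200)
val res1: Double = Infinity
{code}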

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity
 postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity
 postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>  postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
For example
Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0
{code}

Case 4:
{code:sql}
spark-sql> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Created] (SPARK-47429) Rename errorClass to errorCondition

2024-03-16 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47429:


 Summary: Rename errorClass to errorCondition
 Key: SPARK-47429
 URL: https://issues.apache.org/jira/browse/SPARK-47429
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.






[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823713#comment-17823713
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - Friendly ping.

Any thoughts on how to resolve the inconsistent error terminology?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>  **** ARRAY
>  **** MAP
>  **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> h1. Option 3: "Error Class" and "State Class"
>  * SQL state class: 42
>  * SQL state sub-class: K01
>  * SQL state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling 

[jira] [Created] (SPARK-47271) Explain importance of statistics on SQL performance tuning page

2024-03-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47271:


 Summary: Explain importance of statistics on SQL performance 
tuning page
 Key: SPARK-47271
 URL: https://issues.apache.org/jira/browse/SPARK-47271
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-47252) Clarify that pivot may trigger an eager computation

2024-03-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47252:


 Summary: Clarify that pivot may trigger an eager computation
 Key: SPARK-47252
 URL: https://issues.apache.org/jira/browse/SPARK-47252
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-47216) Refine layout of SQL performance tuning page

2024-02-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47216:


 Summary: Refine layout of SQL performance tuning page
 Key: SPARK-47216
 URL: https://issues.apache.org/jira/browse/SPARK-47216
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Commented] (SPARK-47190) Add support for checkpointing to Spark Connect

2024-02-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821286#comment-17821286
 ] 

Nicholas Chammas commented on SPARK-47190:
--

[~gurwls223] - Is there some design reason we do _not_ want to support 
checkpointing in Spark Connect? Or is it just a matter of someone taking the 
time to implement support?

If the latter, do we do so via a new method directly on {{SparkSession}}, or 
shall we somehow expose a limited version of {{spark.sparkContext}} so users 
can call the existing {{setCheckpointDir()}} method?
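
For reference, here is roughly what this looks like with the classic API today, 
assuming {{spark}} is a regular (non-Connect) {{SparkSession}} and the 
checkpoint path is just a placeholder; the first call is the one that has no 
Spark Connect equivalent.

{code:scala}
// Classic Spark: the checkpoint directory is configured on the SparkContext,
// which Spark Connect does not expose.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// Once the directory is set, individual DataFrames can be checkpointed to
// truncate their lineage.
val df = spark.range(1000).toDF("id")
val checkpointed = df.checkpoint()  // eager by default; checkpoint(eager = false) is lazy
{code}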

> Add support for checkpointing to Spark Connect
> --
>
> Key: SPARK-47190
> URL: https://issues.apache.org/jira/browse/SPARK-47190
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{sparkContext}} that underlies a given {{SparkSession}} is not 
> accessible over Spark Connect. This means you cannot call 
> {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot 
> checkpoint a DataFrame.
> We should add support for this somehow to Spark Connect.






[jira] [Created] (SPARK-47190) Add support for checkpointing to Spark Connect

2024-02-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47190:


 Summary: Add support for checkpointing to Spark Connect
 Key: SPARK-47190
 URL: https://issues.apache.org/jira/browse/SPARK-47190
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


The {{sparkContext}} that underlies a given {{SparkSession}} is not accessible 
over Spark Connect. This means you cannot call 
{{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot 
checkpoint a DataFrame.

We should add support for this somehow to Spark Connect.






[jira] [Created] (SPARK-47189) Tweak column error names and text

2024-02-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47189:


 Summary: Tweak column error names and text
 Key: SPARK-47189
 URL: https://issues.apache.org/jira/browse/SPARK-47189
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-47180) Migrate CSV parsing off of Univocity

2024-02-26 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47180:


 Summary: Migrate CSV parsing off of Univocity
 Key: SPARK-47180
 URL: https://issues.apache.org/jira/browse/SPARK-47180
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


Univocity appears to be unmaintained.

As of February 2024:
 * The last release was [more than 3 years 
ago|https://github.com/uniVocity/univocity-parsers/releases].
 * The last commit to {{master}} was [almost 3 years 
ago|https://github.com/uniVocity/univocity-parsers/commits/master/].
 * The website is 
[down|https://github.com/uniVocity/univocity-parsers/issues/506].
 * There are 
[multiple|https://github.com/uniVocity/univocity-parsers/issues/494] 
[open|https://github.com/uniVocity/univocity-parsers/issues/495] 
[bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker 
with no indication that anyone cares.

It's not urgent, but we should consider migrating to an actively maintained CSV 
library in the JVM ecosystem.

There are a bunch of libraries [listed here on this Maven 
Repository|https://mvnrepository.com/open-source/csv-libraries].

[jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text]
 looks interesting. I know we already use FasterXML to parse JSON. Perhaps we 
should use them to parse CSV as well.
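
For a rough sense of what that could look like, here is an untested sketch of 
reading a small CSV with jackson-dataformat-csv (part of 
jackson-dataformats-text). The CSV content and column names are made up, and 
the exact API calls should be double-checked against the library.

{code:scala}
import com.fasterxml.jackson.dataformat.csv.{CsvMapper, CsvSchema}
import scala.jdk.CollectionConverters._

val mapper = new CsvMapper()
// Treat the first line as a header and map each row to column name -> value.
val schema: CsvSchema = CsvSchema.emptySchema().withHeader()

val csv = "name,age\nalice,34\nbob,27\n"
val rows = mapper
  .readerFor(classOf[java.util.Map[String, String]])
  .`with`(schema)
  .readValues[java.util.Map[String, String]](csv)
  .readAll()
  .asScala

rows.foreach(row => println(row.get("name") + " is " + row.get("age")))
{code}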

I'm guessing we chose Univocity back in the day because it was the fastest CSV 
library on the JVM. However, the last performance benchmark comparing it to 
others was [from February 
2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018],
 so this may no longer be true.






[jira] [Updated] (SPARK-47082) Out of bounds error message is incorrect

2024-02-17 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47082:
-
Summary: Out of bounds error message is incorrect  (was: Out of bounds 
error message flips the bounds)

> Out of bounds error message is incorrect
> 
>
> Key: SPARK-47082
> URL: https://issues.apache.org/jira/browse/SPARK-47082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>







[jira] [Created] (SPARK-47082) Out of bounds error message flips the bounds

2024-02-17 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47082:


 Summary: Out of bounds error message flips the bounds
 Key: SPARK-47082
 URL: https://issues.apache.org/jira/browse/SPARK-47082
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Resolved] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning

2024-02-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-47024.
--
Resolution: Not A Problem

Resolving this as "Not A Problem".

I mean, it _is_ a problem, but it's a basic problem with floats, and I don't 
think there is anything practical that can be done about it in Spark.

> Sum of floats/doubles may be incorrect depending on partitioning
> 
>
> Key: SPARK-47024
> URL: https://issues.apache.org/jira/browse/SPARK-47024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 3.3.4
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
> SUM_EXAMPLE = [
> (1.0,),
> (0.0,),
> (1.0,),
> (9007199254740992.0,),
> ]
> spark = (
> SparkSession.builder
> .config("spark.log.level", "ERROR")
> .getOrCreate()
> )
> def compare_sums(data, num_partitions):
> df = spark.createDataFrame(data, "val double").coalesce(1)
> result1 = df.agg(sum(col("val"))).collect()[0][0]
> df = spark.createDataFrame(data, "val double").repartition(num_partitions)
> result2 = df.agg(sum(col("val"))).collect()[0][0]
> assert result1 == result2, f"{result1}, {result2}"
> if __name__ == "__main__":
> print(compare_sums(SUM_EXAMPLE, 2))
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so tried setting 
> all of these to {{{}false{}}}:
>  * {{spark.sql.codegen.wholeStage}}
>  * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
>  * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}
> But this did not change the behavior.
> Somehow, the partitioning of the data affects the computed sum.






[jira] [Updated] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning

2024-02-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47024:
-
Description: 
I found this problem using 
[Hypothesis|https://hypothesis.readthedocs.io/en/latest/].

Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 
(and probably all prior versions as well):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

SUM_EXAMPLE = [
    (1.0,),
    (0.0,),
    (1.0,),
    (9007199254740992.0,),
]

spark = (
    SparkSession.builder
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)


def compare_sums(data, num_partitions):
    df = spark.createDataFrame(data, "val double").coalesce(1)
    result1 = df.agg(sum(col("val"))).collect()[0][0]
    df = spark.createDataFrame(data, "val double").repartition(num_partitions)
    result2 = df.agg(sum(col("val"))).collect()[0][0]
    assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
    print(compare_sums(SUM_EXAMPLE, 2))
{code}
This fails as follows:
{code:python}
AssertionError: 9007199254740994.0, 9007199254740992.0
{code}
I suspected some kind of problem related to code generation, so tried setting 
all of these to {{{}false{}}}:
 * {{spark.sql.codegen.wholeStage}}
 * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
 * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}

But this did not change the behavior.

Somehow, the partitioning of the data affects the computed sum.

  was:Will fill in the details shortly.

Summary: Sum of floats/doubles may be incorrect depending on 
partitioning  (was: Sum is incorrect (exact cause currently unknown))

Sadly, I think this is a case where we may not be able to do anything. The 
problem appears to be a classic case of floating point arithmetic going wrong.
{code:scala}
scala> 9007199254740992.0 + 1.0
val res0: Double = 9.007199254740992E15

scala> 9007199254740992.0 + 2.0
val res1: Double = 9.007199254740994E15
{code}
Notice how adding {{1.0}} did not change the large value, whereas adding 
{{2.0}} did. That is because 9007199254740992 is 2^53, the point beyond which a 
double can no longer represent every integer exactly: 2^53 + 1 rounds back down 
to 2^53, while 2^53 + 2 is exactly representable.

So what I believe is happening is that, depending on the order in which the 
rows happen to be added, we either hit or do not hit this corner case.

In other words, if the aggregation goes like this:
{code:java}
(1.0 + 1.0) + (0.0 + 9007199254740992.0)
2.0 + 9007199254740992.0
9007199254740994.0
{code}
Then there is no problem.

However, if we are unlucky and it goes like this:
{code:java}
(1.0 + 0.0) + (1.0 + 9007199254740992.0)
1.0 + 9007199254740992.0
9007199254740992.0
{code}
Then we get the incorrect result shown in the description above.

This violates what I believe should be an invariant in Spark: that declarative 
aggregates like {{sum}} do not compute different results depending on accidents 
of row order or partitioning.

However, given that this is a basic problem of floating point arithmetic, I 
doubt we can really do anything here.

Note that there are many such "special" numbers that have this problem, not 
just 9007199254740992.0:
{code:scala}
scala> 1.7168917017330176e+16 + 1.0
val res2: Double = 1.7168917017330176E16

scala> 1.7168917017330176e+16 + 2.0
val res3: Double = 1.7168917017330178E16
{code}
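
To see the order dependence without Spark in the picture at all, here is a 
small illustrative snippet (not from the ticket) that folds the same four 
values from the reproduction in two different orders:

{code:scala}
val vals = Seq(1.0, 0.0, 1.0, 9007199254740992.0)

// Small values first: they accumulate to 2.0 before the big value is added.
val smallFirst = vals.foldLeft(0.0)(_ + _)                  // 9.007199254740994E15

// Big value first: each subsequent +1.0 is lost to rounding.
val largeFirst = vals.sortBy(v => -v).foldLeft(0.0)(_ + _)  // 9.007199254740992E15

println(smallFirst == largeFirst)                           // false
{code}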

> Sum of floats/doubles may be incorrect depending on partitioning
> 
>
> Key: SPARK-47024
> URL: https://issues.apache.org/jira/browse/SPARK-47024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 3.3.4
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
> SUM_EXAMPLE = [
> (1.0,),
> (0.0,),
> (1.0,),
> (9007199254740992.0,),
> ]
> spark = (
> SparkSession.builder
> .config("spark.log.level", "ERROR")
> .getOrCreate()
> )
> def compare_sums(data, num_partitions):
> df = spark.createDataFrame(data, "val double").coalesce(1)
> result1 = df.agg(sum(col("val"))).collect()[0][0]
> df = spark.createDataFrame(data, "val double").repartition(num_partitions)
> result2 = df.agg(sum(col("val"))).collect()[0][0]
> assert result1 == result2, f"{result1}, {result2}"
> if __name__ == "__main__":
> print(compare_sums(SUM_EXAMPLE, 2))
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so tried setting 
> all of these to {{{}false{}}}:
>  * 

[jira] [Created] (SPARK-47024) Sum is incorrect (exact cause currently unknown)

2024-02-12 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47024:


 Summary: Sum is incorrect (exact cause currently unknown)
 Key: SPARK-47024
 URL: https://issues.apache.org/jira/browse/SPARK-47024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.4, 3.5.0, 3.4.2
Reporter: Nicholas Chammas


Will fill in the details shortly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46992:
-
Labels: correctness  (was: )

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Denis Tarima
>Priority: Critical
>  Labels: correctness
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814913#comment-17814913
 ] 

Nicholas Chammas commented on SPARK-46992:
--

I can confirm the behavior described above is still present on {{master}} at 
[{{5d5b3a5}}|https://github.com/apache/spark/commit/5d5b3a54b7b5fb4308fe40da696ba805c72983fc].

Adding the {{correctness}} label.
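
Not a fix, but for anyone hitting this in the meantime: a hash-based filter gives a deterministic pseudo-sample that cannot change with partitioning, caching, or AQE. A rough PySpark sketch (the salt 123 and the 40% threshold are arbitrary, mirroring the reproduction):
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 5).sort("id")  # PySpark rendering of the reproduction's data

# Keep roughly 40% of rows based on a hash of the row's key plus a fixed salt,
# so the kept rows depend only on the data, never on partitioning or caching.
df_stable = df.where((F.abs(F.xxhash64("id", F.lit(123))) % 100) < 40)
df_stable.show()
{code}
This does not address the cached count/collect mismatch itself, but it does remove the partitioning dependence from the sampling step.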

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Denis Tarima
>Priority: Critical
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-02-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814406#comment-17814406
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - What are your 
thoughts on the 3 proposed options?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> h1. Option 3: "Error Class" and "State Class"
>  * SQL state class: 42
>  * SQL state sub-class: K01
>  * SQL state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "state class" 

[jira] [Commented] (SPARK-40549) PYSPARK: Observation computes the wrong results when using `corr` function

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813780#comment-17813780
 ] 

Nicholas Chammas commented on SPARK-40549:
--

I think this is just a consequence of floating point arithmetic being imprecise.
{code:python}
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
... 
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0} {code}
Unfortunately, {{corr}} seems to convert to float internally, so even if you 
give it decimals you will get a similar result:
{code:python}
>>> from decimal import Decimal
>>> import pyspark.sql.functions as F
>>> 
>>> df = spark.createDataFrame(
...     [(Decimal(i), Decimal(i * 10)) for i in range(10)],
...     schema="id decimal, id2 decimal",
... )
>>> 
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
... 
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0} {code}

I don't think there is anything that can be done here.
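
That said, if a deterministic check is needed on the caller's side, one workaround (a sketch only, not something I have tested exhaustively) is to round the correlation to a tolerance before comparing, instead of relying on exact float equality as in the report's {{eqNullSafe(1.0)}}:
{code:python}
from pyspark.sql import Observation, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(float(i), float(i * 10)) for i in range(10)],
    schema="id double, id2 double",
)

o = Observation("corr_check")
# Round to 9 decimal places so float noise in the last bits cannot flip the result.
df_o = df.observe(o, (F.round(F.corr("id", "id2"), 9) == 1.0).alias("corr_is_one"))
df_o.count()
print(o.get)  # expected: {'corr_is_one': True}
{code}
If NULL-safety matters, {{eqNullSafe}} on the rounded value works the same way.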

> PYSPARK: Observation computes the wrong results when using `corr` function 
> ---
>
> Key: SPARK-40549
> URL: https://issues.apache.org/jira/browse/SPARK-40549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
> Environment: {code:java}
> // lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 22.04.1 LTS
> Release:        22.04
> Codename:       jammy {code}
> {code:java}
>  // python -V
> python 3.10.4
> {code}
> {code:java}
>  // lshw -class cpu
> *-cpu                             
> description: CPU        product: AMD Ryzen 9 3900X 12-Core Processor        
> vendor: Advanced Micro Devices [AMD]        physical id: f        bus info: 
> cpu@0        version: 23.113.0        serial: Unknown        slot: AM4        
> size: 2194MHz        capacity: 4672MHz        width: 64 bits        clock: 
> 100MHz        capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae 
> mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht 
> syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl 
> nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma 
> cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy 
> svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit 
> wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 
> cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm 
> rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves 
> cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr 
> rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean 
> flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif 
> v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es cpufreq      
>   configuration: cores=12 enabledcores=12 microcode=141561875 threads=24
> {code}
>Reporter: Herminio Vazquez
>Priority: Major
>  Labels: correctness
>
> Minimalistic description of the odd computation results.
> When creating a new `Observation` object and computing a simple correlation 
> function between 2 columns, the results appear to be non-deterministic.
> {code:java}
> # Init
> from pyspark.sql import SparkSession, Observation
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(float(i), float(i*10),) for i in range(10)], 
> schema="id double, id2 double")
> for i in range(10):
>     o = Observation(f"test_{i}")
>     df_o = df.observe(o, F.corr("id", "id2").eqNullSafe(1.0))
>     df_o.count()
> print(o.get)
> # Results
> {'(corr(id, id2) <=> 1.0)': False}
> {'(corr(id, id2) <=> 1.0)': False}
> {'(corr(id, id2) <=> 1.0)': False}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': False}{code}
>  



--
This message 

[jira] [Commented] (SPARK-45786) Inaccurate Decimal multiplication and division results

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813766#comment-17813766
 ] 

Nicholas Chammas commented on SPARK-45786:
--

[~kazuyukitanimura] - I'm just curious: How did you find this bug? Was it 
something you stumbled on by accident or did you search for it using something 
like a fuzzer?

> Inaccurate Decimal multiplication and division results
> --
>
> Key: SPARK-45786
> URL: https://issues.apache.org/jira/browse/SPARK-45786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.3, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> Decimal multiplication and division results may be inaccurate due to rounding 
> issues.
> h2. Multiplication:
> {code:scala}
> scala> sql("select  -14120025096157587712113961295153.858047 * 
> -0.4652").show(truncate=false)
> ++
>   
> |(-14120025096157587712113961295153.858047 * -0.4652)|
> ++
> |6568635674732509803675414794505.574764  |
> ++
> {code}
> The correct answer is
> {quote}6568635674732509803675414794505.574763
> {quote}
> Please note that the last digit is 3 instead of 4 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652"))
> val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
> {code}
> Since the fractional part .574763 is followed by 4644, it should not be 
> rounded up.
> h2. Division:
> {code:scala}
> scala> sql("select -0.172787979 / 
> 533704665545018957788294905796.5").show(truncate=false)
> +-+
> |(-0.172787979 / 533704665545018957788294905796.5)|
> +-+
> |-3.237521E-31|
> +-+
> {code}
> The correct answer is
> {quote}-3.237520E-31
> {quote}
> Please note that the last digit is 0 instead of 1 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"),
>  100, java.math.RoundingMode.DOWN)
> val res22: java.math.BigDecimal = 
> -3.237520489418037889998826491401059986665344697406144511563561222578738E-31
> {code}
> Since the fractional part .237520 is followed by 4894..., it should not be 
> rounded up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38167) CSV parsing error when using escape='"'

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813741#comment-17813741
 ] 

Nicholas Chammas commented on SPARK-38167:
--

[~marnixvandenbroek] - Could you link to the bug report you filed with 
Univocity?

cc [~maxgekk] - I believe you have hit some parsing bugs in Univocity recently.

> CSV parsing error when using escape='"' 
> 
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
> cluster.
>Reporter: Marnix van den Broek
>Priority: Major
>  Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
>  # reading a comma separated, double-quote quoted CSV file using the csv 
> reader options _escape='"'_ and {_}header=True{_},
>  # with a row containing a quoted empty field
>  # followed by a quoted field starting with a comma and followed by one or 
> more characters
> selecting columns from the dataframe at or after the field described in 3) 
> gives incorrect and inconsistent results
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
>  
> {code:java}
> col1,col2
> "",",a"
> {code}
>  
> using the CSV reader options escape='"' (unnecessary for the example, 
> necessary for the files I'm processing) and header=True, I expect the 
> following result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>  
> +++
> |col1|col2|
> +++
> |null|  ,a|
> +++   {code}
>  
>  Spark does yield this result, so far so good. However, when I select col2 
> from the dataframe, Spark yields an incorrect result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>  
> ++
> |col2|
> ++
> |  a"|
> ++{code}
>  
> If you run this example with more columns in the file, and more commas in the 
> field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
> the right, causing unexpected and incorrect results. The inconsistency 
> between both methods surprised me, as it implies the parsing is evaluated 
> differently between both methods. 
> I expect the bug to be located in the quote-balancing and un-escaping methods 
> of the csv parser, but I can't find where that code is located in the code 
> base. I'd be happy to take a look at it if anyone can point me where it is. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Affects Version/s: (was: 3.5.0)

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813733#comment-17813733
 ] 

Nicholas Chammas commented on SPARK-42399:
--

This issue does indeed appear to be resolved on {{master}} when ANSI mode is 
enabled:
{code:java}
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS 
>>> result").show(truncate=False)
++
|result              |
++
|18446744073709551615|
++
>>> spark.conf.set("spark.sql.ansi.enabled", "true")
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS 
>>> result").show(truncate=False)
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.ArithmeticException: [ARITHMETIC_OVERFLOW] 
Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error. SQLSTATE: 22003
== SQL (line 1, position 8) ==
SELECT CONV('', 
16, 10) AS result
       

 {code}
However, there is still a silent overflow when ANSI mode is disabled. The error 
message suggests this is intended behavior.

cc [~gengliang] and [~gurwls223], who resolved SPARK-42427.
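
For anyone who needs to keep ANSI mode off, a possible user-side guard (only a sketch, and only for base-16 inputs): an unsigned 64-bit value fits in at most 16 hex digits, so inputs longer than that after stripping leading zeros will overflow and can be nulled out up front. The {{df}} and {{hex_str}} names below are just for illustration:
{code:python}
import pyspark.sql.functions as F

# Strip leading zeros and compare against the 16-hex-digit limit of an
# unsigned 64-bit value before trusting conv()'s output.
significant = F.regexp_replace(F.upper(F.col("hex_str")), "^0+", "")
df_guarded = df.withColumn(
    "dec",
    F.when(F.length(significant) <= 16, F.conv("hex_str", 16, 10)),
)
{code}
Rows that fail the guard come back as NULL in {{dec}}, which is at least explicit rather than a silently saturated value.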

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Affects Version/s: 3.5.0

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Labels: correctness pull-request-available  (was: pull-request-available)

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-02-01 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It aligns most closely to the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a "class" to a "category" is low impact and may 
not show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms do not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

h1. Option 3: "Error Class" and "State Class"
 * SQL state class: 42
 * SQL state sub-class: K01
 * SQL state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a "class" to a "state class" is low impact and 
may not show up in user-facing documentation at all. (See my side note below.)

Cons:
 * "State class" vs. "Error class" is a bit confusing.
 * These terms do not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

—

Side note: In any case, I believe talking about "42" and "K01" – regardless of 
what we end up calling them – in front of users is not helpful. I don't think 
anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we 

[jira] [Created] (SPARK-46935) Consolidate error documentation

2024-01-31 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46935:


 Summary: Consolidate error documentation
 Key: SPARK-46935
 URL: https://issues.apache.org/jira/browse/SPARK-46935
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-31 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a class to a category is low impact and may not 
show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms may not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

h1. Option 3: "Error Class" and "State Class"
 * SQL state class: 42
 * SQL state sub-class: K01
 * SQL state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

—

Side note: In any case, I believe talking about "42" and "K01" – regardless of 
what we end up calling them – in front of users is not helpful. I don't think 
anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we talk about these concepts in the 
user-facing documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY

[jira] [Updated] (SPARK-46923) Limit width of config tables in documentation and style them consistently

2024-01-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46923:
-
Summary: Limit width of config tables in documentation and style them 
consistently  (was: Style config tables in documentation consistently)

> Limit width of config tables in documentation and style them consistently
> -
>
> Key: SPARK-46923
> URL: https://issues.apache.org/jira/browse/SPARK-46923
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46923) Style config tables in documentation consistently

2024-01-30 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46923:


 Summary: Style config tables in documentation consistently
 Key: SPARK-46923
 URL: https://issues.apache.org/jira/browse/SPARK-46923
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-29 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811923#comment-17811923
 ] 

Nicholas Chammas commented on SPARK-46810:
--

I think Option 3 is a good compromise that lets us continue calling 
{{INCOMPLETE_TYPE_DEFINITION}} an "error class", which perhaps would be the 
least disruptive to Spark developers.

However, for the record, the SQL standard only seems to use the term "class" in 
the context of the 5-character SQLSTATE. Otherwise, the standard uses the term 
"condition" or "exception condition".

I don't have a copy of the SQL 2016 standard handy, and it's not available for 
sale on ISO's website; the only option appears to be purchasing [the SQL 2023 
standard for ~$220|https://www.iso.org/standard/76583.html].

However, there is a copy of the [SQL 1992 standard available 
publicly|https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]. 

Table 23 on page 619 is relevant:

{code}
 Table_23-SQLSTATE_class_and_subclass_values

 _Condition__Class_Subcondition___Subclass

| ambiguous cursor name| 3C  | (no subclass)| 000  |
|  | |  |  |
|  | |  |  |
| cardinality violation| 21  | (no subclass)| 000  |
|  | |  |  |
| connection exception | 08  | (no subclass)| 000  |
|  | |  |  |
|  | | connection does not exist| 003  |
|  | | connection failure   | 006  |
|  | |  |  |
|  | | connection name in use   | 002  |
|  | |  |  |
|  | | SQL-client unable to establish SQL-connection | 001  |
...
{code}

I think this maps closest to Option 1, but again if we want to go with Option 3 
I think that's reasonable too. But in the case of Option 3 we should then 
retire [our use of the term "error 
condition"|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html] so 
that we don't use multiple terms to refer to the same thing.

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: 

[jira] [Created] (SPARK-46894) Move PySpark error conditions into standalone JSON file

2024-01-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46894:


 Summary: Move PySpark error conditions into standalone JSON file
 Key: SPARK-46894
 URL: https://issues.apache.org/jira/browse/SPARK-46894
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811627#comment-17811627
 ] 

Nicholas Chammas commented on SPARK-46810:
--

Thanks for sharing the relevant quote, [~srielau].

1. So just to be clear, you are saying you prefer Option 1. Is that correct? I 
will update the PR accordingly.

2. Is there anyone else we need buy-in from before moving forward? [~maxgekk], 
perhaps?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It may also match the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a class to a category is low impact and may 
> not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms may not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> —
> Side note: In either case, I believe talking about "42" and "K01" – 
> regardless of what we end up calling them – in front of users is not helpful. 
> I don't think anybody cares what "42" by itself means, or what 

[jira] [Comment Edited] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811470#comment-17811470
 ] 

Nicholas Chammas edited comment on SPARK-46810 at 1/27/24 5:00 AM:
---

[~srielau] - What do you think of the problem and proposed solutions described 
above?

I am partial to Option 1, but certainly either solution will need buy-in from 
whoever cares about how we manage and document errors.

Also, you mentioned [on the 
PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL 
standard uses specific terms. Could you link to or quote the relevant parts?


was (Author: nchammas):
[~srielau] - What do you think of the problem and proposed solutions described 
above?

Also, you mentioned [on the 
PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL 
standard uses specific terms. Could you link to or quote the relevant parts?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It may also match the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a class to a category is low impact and may 
> not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a class to a category is low impact and may not 
show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms may not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

—

Side note: In either case, I believe talking about "42" and "K01" – regardless 
of what we end up calling them – in front of users is not helpful. I don't 
think anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we talk about these concepts in the 
user-facing documentation.
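
For concreteness, here is a rough, hand-written sketch (not taken verbatim from the repo) of how the pieces above could be pulled out of {{error-classes.json}} and printed using the Option 1 names. The file path and the {{subClass}}/{{sqlState}} field names are my assumptions about the current layout:

{code:python}
# Hypothetical illustration of the Option 1 terminology, assuming the JSON
# file keeps a "sqlState" string and a "subClass" map per top-level entry.
import json

with open("common/utils/src/main/resources/error/error-classes.json") as f:
    errors = json.load(f)

condition = "INCOMPLETE_TYPE_DEFINITION"
entry = errors[condition]
state = entry.get("sqlState", "")                 # e.g. "42K01"

print("error class:         ", state[:2])         # "42"
print("error sub-class:     ", state[2:])         # "K01"
print("error state:         ", state)             # "42K01"
print("error condition:     ", condition)         # "INCOMPLETE_TYPE_DEFINITION"
print("error sub-conditions:", sorted(entry.get("subClass", {})))  # ARRAY, MAP, STRUCT
{code}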

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 

[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811470#comment-17811470
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~srielau] - What do you think of the problem and proposed solutions described 
above?

Also, you mentioned [on the 
PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL 
standard uses specific terms. Could you link to or quote the relevant parts?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>  **** ARRAY
>  **** MAP
>  **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It may also match the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a class to a category is low impact and may 
> not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms may not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md#L0-L1]
>  in user-facing documentation.
> —
> Side note: In either case, I believe talking about "42" and "K01" – 
> regardless of what we end up calling them – in front of users is not helpful. 
> I don't think anybody cares what "42" by 

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a class to a category is low impact and may not 
show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms may not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md#L0-L1]
 in user-facing documentation.

—

Side note: In either case, I believe talking about "42" and "K01" – regardless 
of what we end up calling them – in front of users is not helpful. I don't 
think anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we talk about these concepts in the 
user-facing documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide in the 
documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

We should not use "error condition" if one of the above terms more accurately 
describes what we are talking about.

Side note: With this terminology, I believe talking about error categories and 
sub-categories in front of users is not helpful. I don't think anybody cares 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the user-facing 
documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 

[jira] [Created] (SPARK-46863) Clean up custom.css

2024-01-25 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46863:


 Summary: Clean up custom.css
 Key: SPARK-46863
 URL: https://issues.apache.org/jira/browse/SPARK-46863
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46825) Build Spark only once when building docs

2024-01-23 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46825:


 Summary: Build Spark only once when building docs
 Key: SPARK-46825
 URL: https://issues.apache.org/jira/browse/SPARK-46825
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46819) Port error class data to automation-friendly format

2024-01-23 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46819:


 Summary: Port error class data to automation-friendly format
 Key: SPARK-46819
 URL: https://issues.apache.org/jira/browse/SPARK-46819
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


As described in SPARK-46810, we have several types of error data captured in 
our code and documentation.

Unfortunately, a good chunk of this data is in a Markdown table that is not 
friendly to automation (e.g. to generate documentation, or run tests).

[https://github.com/apache/spark/blob/d1fbc4c7191aafadada1a6f7c217bf89f6cae49f/common/utils/src/main/resources/error/README.md#L121]

We should migrate this error data to an automation-friendly format.
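
Purely as an illustration (the field names below are hypothetical, not a concrete proposal), the idea is to keep one structured record per SQLSTATE so that documentation and tests can consume the data directly instead of parsing a Markdown table:

{code:python}
# Hypothetical structured records plus a tiny generator that docs or tests
# could use; the exact fields and target format are still to be decided.
ERROR_STATES = [
    {"sqlState": "42K01", "class": "42", "subClass": "K01",
     "errorClass": "INCOMPLETE_TYPE_DEFINITION"},
    # ... one record per SQLSTATE ...
]

def render_doc_rows(states):
    """Yield Markdown table rows for the docs from the structured data."""
    for s in states:
        yield f"| {s['sqlState']} | {s['errorClass']} |"

for row in render_doc_rows(ERROR_STATES):
    print(row)
{code}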

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I personally like the terminology "error condition", but as we are already 
using "error class" very heavily throughout the codebase to refer to something 
like INCOMPLETE_TYPE_DEFINITION, I don't think it's practical to change at this 
point.

To rationalize the different terms we are using, I propose the following 
terminology, which we should use consistently throughout our code and 
documentation:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

We should not use "error condition" if one of the above terms more accurately 
describes what we are talking about.

Side note: With this terminology, I believe talking about error categories and 
sub-categories in front of users is not helpful. I don't think anybody cares 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the user-facing 
documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the 

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the user-facing 
documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what 42 by itself means, or what K01 by itself means. Accordingly, we should 
limit how much we talk about these concepts.


> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> 

[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809804#comment-17809804
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~itholic] [~gurwls223] - What do you think?

cc also [~karenfeng], who I see in git blame as the original contributor of 
error classes.

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>  **** ARRAY
>  **** MAP
>  **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
> I propose the following terminology, which we should use consistently 
> throughout our code and documentation:
>  * Error class: 42
>  * Error subclass: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-conditions: ARRAY, MAP, STRUCT
> Side note: With this terminology, I believe talking about error classes and 
> subclasses in front of users is not helpful. I don't think anybody cares 
> about what 42 by itself means, or what K01 by itself means. Accordingly, we 
> should limit how much we talk about these concepts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46810:


 Summary: Clarify error class terminology
 Key: SPARK-46810
 URL: https://issues.apache.org/jira/browse/SPARK-46810
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what 42 by itself means, or what K01 by itself means. Accordingly, we should 
limit how much we talk about these concepts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46807) Include automation notice in SQL error class documents

2024-01-22 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46807:


 Summary: Include automation notice in SQL error class documents
 Key: SPARK-46807
 URL: https://issues.apache.org/jira/browse/SPARK-46807
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46775) Fix formatting of Kinesis docs

2024-01-19 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46775:


 Summary: Fix formatting of Kinesis docs
 Key: SPARK-46775
 URL: https://issues.apache.org/jira/browse/SPARK-46775
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46764) Reorganize Ruby script to build API docs

2024-01-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46764:


 Summary: Reorganize Ruby script to build API docs
 Key: SPARK-46764
 URL: https://issues.apache.org/jira/browse/SPARK-46764
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806954#comment-17806954
 ] 

Nicholas Chammas commented on SPARK-45599:
--

Using [Hypothesis|https://github.com/HypothesisWorks/hypothesis], I've managed 
to shrink the provided test case from 373 elements down to 14:

{code:python}
from math import nan
from pyspark.sql import SparkSession

HYPOTHESIS_EXAMPLE = [
(0.0,),
(2.0,),
(153.0,),
(168.0,),
(3252411229536261.0,),
(7.205759403792794e+16,),
(1.7976931348623157e+308,),
(0.25,),
(nan,),
(nan,),
(-0.0,),
(-128.0,),
(nan,),
(nan,),
]

spark = (
SparkSession.builder
.config("spark.log.level", "ERROR")
.getOrCreate()
)


def compare_percentiles(data, slices):
rdd = spark.sparkContext.parallelize(data, numSlices=1)
df = spark.createDataFrame(rdd, "val double")
result1 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

rdd = spark.sparkContext.parallelize(data, numSlices=slices)
df = spark.createDataFrame(rdd, "val double")
result2 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
{code}

Running this test fails as follows:

{code:python}
Traceback (most recent call last):  
  File ".../SPARK-45599.py", line 41, in 
compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
  File ".../SPARK-45599.py", line 37, in compare_percentiles
assert result1 == result2, f"{result1}, {result2}"
   ^^
AssertionError: 0.050044, -0.0
{code}
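
For anyone digging into the OpenHashMap angle, here is a minimal, Spark-independent sketch of why mixing {{-0.0}} and {{0.0}} is hazardous for any hash map that keys on the raw bits of a double. This is only my guess at the mechanism, not a confirmed diagnosis: the two values compare equal, yet their bit patterns differ, so a bits-based hash or equality check can split their counts into two separate buckets.

{code:python}
# -0.0 and 0.0 compare equal but do not share a bit pattern.
import struct

pos, neg = 0.0, -0.0
print(pos == neg)                    # True: equal under ==
print(struct.pack(">d", pos).hex())  # 0000000000000000
print(struct.pack(">d", neg).hex())  # 8000000000000000 (only the sign bit differs)
{code}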

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. In Python/PySpark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> 

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806150#comment-17806150
 ] 

Nicholas Chammas commented on SPARK-45599:
--

cc [~dongjoon] - This is an old correctness bug with a concise reproduction.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. In Python/PySpark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), 

[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-45599:
-
Labels: correctness  (was: data-corruption)

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. In Python/PySpark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), 

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806148#comment-17806148
 ] 

Nicholas Chammas commented on SPARK-45599:
--

I can confirm that this bug is still present on {{master}} at commit 
[a3266b411723310ec10fc1843ddababc15249db0|https://github.com/apache/spark/tree/a3266b411723310ec10fc1843ddababc15249db0].

With {{numSlices=4}} I get {{-5.924228780007003E136}} and with {{numSlices=1}} 
I get {{-4.739483957565084E136}}.

Updating the label on this issue. I will also ping some committers to bring 
this bug to their attention, as correctness bugs are taken very seriously.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. In Python/PySpark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> 

[jira] [Updated] (SPARK-46395) Assign Spark configs to groups for use in documentation

2024-01-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46395:
-
Summary: Assign Spark configs to groups for use in documentation  (was: 
Automatically generate SQL configuration tables for documentation)

> Assign Spark configs to groups for use in documentation
> ---
>
> Key: SPARK-46395
> URL: https://issues.apache.org/jira/browse/SPARK-46395
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46668) Parallelize Sphinx build of Python API docs

2024-01-10 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46668:


 Summary: Parallelize Sphinx build of Python API docs
 Key: SPARK-46668
 URL: https://issues.apache.org/jira/browse/SPARK-46668
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46658) Loosen Ruby dependency specs for doc build

2024-01-10 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46658:


 Summary: Loosen Ruby dependency specs for doc build
 Key: SPARK-46658
 URL: https://issues.apache.org/jira/browse/SPARK-46658
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46437) Enable conditional includes in Jekyll documentation

2024-01-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46437:
-
Component/s: (was: SQL)

> Enable conditional includes in Jekyll documentation
> ---
>
> Key: SPARK-46437
> URL: https://issues.apache.org/jira/browse/SPARK-46437
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46437) Enable conditional includes in Jekyll documentation

2024-01-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46437:
-
Summary: Enable conditional includes in Jekyll documentation  (was: Remove 
unnecessary cruft from SQL built-in functions docs)

> Enable conditional includes in Jekyll documentation
> ---
>
> Key: SPARK-46437
> URL: https://issues.apache.org/jira/browse/SPARK-46437
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46626) Bump jekyll version to support Ruby 3.3

2024-01-08 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46626:


 Summary: Bump jekyll version to support Ruby 3.3
 Key: SPARK-46626
 URL: https://issues.apache.org/jira/browse/SPARK-46626
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46449) Add ability to create databases/schemas via Catalog API

2023-12-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46449:
-
Summary: Add ability to create databases/schemas via Catalog API  (was: Add 
ability to create databases via Catalog API)

> Add ability to create databases/schemas via Catalog API
> ---
>
> Key: SPARK-46449
> URL: https://issues.apache.org/jira/browse/SPARK-46449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As of Spark 3.5, the only way to create a database is via SQL. The Catalog 
> API should offer an equivalent.
> Perhaps something like:
> {code:python}
> spark.catalog.createDatabase(
> name: str,
> existsOk: bool = False,
> comment: str = None,
> location: str = None,
> properties: dict = None,
> )
> {code}
> If {{schema}} is the preferred terminology, then we should use that instead 
> of {{database}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46449) Add ability to create databases via Catalog API

2023-12-28 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46449:
-
Description: 
As of Spark 3.5, the only way to create a database is via SQL. The Catalog API 
should offer an equivalent.

Perhaps something like:
{code:python}
spark.catalog.createDatabase(
name: str,
existsOk: bool = False,
comment: str = None,
location: str = None,
properties: dict = None,
)
{code}

If {{schema}} is the preferred terminology, then we should use that instead of 
{{database}}.

  was:
As of Spark 3.5, the only way to create a database is via SQL. The Catalog API 
should offer an equivalent.

Perhaps something like:
{code:python}
spark.catalog.createDatabase(
name: str,
existsOk: bool = False,
comment: str = None,
location: str = None,
properties: dict = None,
)
{code}


> Add ability to create databases via Catalog API
> ---
>
> Key: SPARK-46449
> URL: https://issues.apache.org/jira/browse/SPARK-46449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As of Spark 3.5, the only way to create a database is via SQL. The Catalog 
> API should offer an equivalent.
> Perhaps something like:
> {code:python}
> spark.catalog.createDatabase(
> name: str,
> existsOk: bool = False,
> comment: str = None,
> location: str = None,
> properties: dict = None,
> )
> {code}
> If {{schema}} is the preferred terminology, then we should use that instead 
> of {{database}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46449) Add ability to create databases via Catalog API

2023-12-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46449:


 Summary: Add ability to create databases via Catalog API
 Key: SPARK-46449
 URL: https://issues.apache.org/jira/browse/SPARK-46449
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas


As of Spark 3.5, the only way to create a database is via SQL. The Catalog API 
should offer an equivalent.

Perhaps something like:
{code:python}
spark.catalog.createDatabase(
name: str,
existsOk: bool = False,
comment: str = None,
location: str = None,
properties: dict = None,
)
{code}
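
As a usage sketch (the database name, comment, and location below are invented for illustration; {{createDatabase}} is the proposed API and does not exist yet):

{code:python}
# Today (Spark 3.5): the only way to create a database is via SQL.
spark.sql(
    "CREATE DATABASE IF NOT EXISTS sales "
    "COMMENT 'Sales data' "
    "LOCATION '/tmp/warehouse/sales'"
)

# Hypothetical equivalent with the proposed Catalog API (not yet implemented):
# spark.catalog.createDatabase(
#     "sales",
#     existsOk=True,
#     comment="Sales data",
#     location="/tmp/warehouse/sales",
# )
{code}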



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46437) Remove unnecessary cruft from SQL built-in functions docs

2023-12-17 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46437:


 Summary: Remove unnecessary cruft from SQL built-in functions docs
 Key: SPARK-46437
 URL: https://issues.apache.org/jira/browse/SPARK-46437
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46395) Automatically generate SQL configuration tables for documentation

2023-12-13 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46395:


 Summary: Automatically generate SQL configuration tables for 
documentation
 Key: SPARK-46395
 URL: https://issues.apache.org/jira/browse/SPARK-46395
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2023-12-10 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795162#comment-17795162
 ] 

Nicholas Chammas commented on SPARK-45599:
--

Per the [contributing guide|https://spark.apache.org/contributing.html], I 
suggest the {{correctness}} label instead of {{{}data-corruption{}}}.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> 
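
The crux of the data above is that it mixes {{-0.0}} and {{0.0}}. As a minimal illustration (my own sketch, not taken from the report) of why that mix is hazardous for a hash map that works on the raw bit patterns of doubles:

{code:python}
import struct

# -0.0 and 0.0 compare equal as doubles...
print(-0.0 == 0.0)  # True

# ...but their bit patterns differ, so a map that hashes or compares the raw
# bits can treat them as two distinct keys.
print(struct.pack(">d", -0.0).hex())  # 8000000000000000
print(struct.pack(">d", 0.0).hex())   # 0000000000000000
{code}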

[jira] [Created] (SPARK-46357) Replace use of setConf with conf.set in docs

2023-12-10 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46357:


 Summary: Replace use of setConf with conf.set in docs
 Key: SPARK-46357
 URL: https://issues.apache.org/jira/browse/SPARK-46357
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.5.0
Reporter: Nicholas Chammas
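
There is no description, but the title points at a docs-wide substitution roughly like the following (a sketch; it assumes the docs examples in question use the legacy {{SQLContext.setConf}} form):

{code:python}
# Older style shown in some docs examples (SQLContext is legacy):
# spark.sqlContext.setConf("spark.sql.shuffle.partitions", "200")

# Preferred style:
spark.conf.set("spark.sql.shuffle.partitions", "200")
{code}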






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37571) decouple amplab jenkins from spark website, builds and tests

2023-12-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793347#comment-17793347
 ] 

Nicholas Chammas commented on SPARK-37571:
--

Since we've 
[retired|https://lists.apache.org/thread/5n59fs22rtytflbz4sz1pz32ozzfbkrx] the 
venerable Jenkins infrastructure, I suppose we can close this issue.

> decouple amplab jenkins from spark website, builds and tests
> 
>
> Key: SPARK-37571
> URL: https://issues.apache.org/jira/browse/SPARK-37571
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
> Attachments: audit.txt, spark-repo-to-be-audited.txt
>
>
> we will be turning off jenkins on dec 23rd, and we need to decouple the build 
> infra from jenkins, as well as remove any amplab jenkins-specific docs on the 
> website, scripts and infra setup.
> i'll be creating > 1 PRs for this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37647) Expose percentile function in Scala/Python APIs

2023-12-05 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-37647.
--
Resolution: Fixed

It looks like this got added as part of Spark 3.5: 
[https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.percentile.html]
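
For reference, a quick usage sketch of the function as exposed in Python (the example data is made up):

{code:python}
from pyspark.sql import functions as F

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (10.0,)], ["x"])
df.select(F.percentile("x", 0.5).alias("median")).show()
{code}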

> Expose percentile function in Scala/Python APIs
> ---
>
> Key: SPARK-37647
> URL: https://issues.apache.org/jira/browse/SPARK-37647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SQL offers a percentile function (exact, not approximate) that is not 
> available directly in the Scala or Python DataFrame APIs.
> While it is possible to invoke SQL functions from Scala or Python via 
> {{{}expr(){}}}, I think most users expect function parity across Scala, 
> Python, and SQL. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-17 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787268#comment-17787268
 ] 

Nicholas Chammas commented on SPARK-45390:
--

Ah, are you referring to [PySpark's Python 
dependencies|https://github.com/apache/spark/blob/4520f3b2da01badb506488b6ff2899babd8c709e/python/setup.py#L310-L330]
 not supporting Python 3.12?

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.
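
Assuming the {{distutils}} usage being removed is version comparison (the description does not say which APIs are involved), the {{packaging}}-based replacement looks roughly like this:

{code:python}
# Before (distutils, removed in Python 3.12):
# from distutils.version import LooseVersion
# LooseVersion("3.12.0") >= LooseVersion("3.10.0")

# After (packaging):
from packaging.version import Version

print(Version("3.12.0") >= Version("3.10.0"))  # True
{code}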



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45390) Remove `distutils` usage

2023-11-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786401#comment-17786401
 ] 

Nicholas Chammas commented on SPARK-45390:
--

{quote}We don't promise to support all future unreleased Python versions
{quote}
"all future unreleased versions" is a tall ask that no-one is making. :) 

The relevant circumstances here are that a) Python 3.12 is already out and the 
backwards-incompatible changes are known and [very 
limited|https://docs.python.org/3/whatsnew/3.12.html], and b) Spark 4.0 may be 
a disruptive change and so many people may remain on Spark 3.5 for longer than 
usual.

If we expect 3.5 -> 4.0 to be an easy migration, then backporting a fix like 
this to 3.5 is not as important.
{quote}we need much more validation because all Python package ecosystem should 
work there without any issues
{quote}
I'm not sure what you mean here.

Anyway, I suppose we could just wait and see. Maybe I'm wrong, but I suspect 
many users will find it surprising that Spark 3.5 doesn't work on Python 3.12, 
especially if this is the only (or close to the only) fix required.

> Remove `distutils` usage
> 
>
> Key: SPARK-45390
> URL: https://issues.apache.org/jira/browse/SPARK-45390
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in 
> Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} 
> package.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-08-31 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598403#comment-17598403
 ] 

Nicholas Chammas commented on SPARK-31001:
--

Thanks for sharing these details. This is very helpful.

Yeah, this seems like an "unofficial" answer to the original problem. It is 
helpful nonetheless, but as you said it will take a separate effort to 
formalize and document this. I agree that a formal solution will probably not 
use an option named with leading underscores.

> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.
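
A sketch of the gap being described (the table name and schema are invented; the {{partitionBy}} argument in the second snippet is hypothetical):

{code:python}
# SQL supports partitioning at table-creation time:
spark.sql("""
    CREATE TABLE events (id BIGINT, day DATE)
    USING parquet
    PARTITIONED BY (day)
""")

# The Catalog API currently has no partitioning parameter; something like a
# hypothetical `partitionBy=` argument would close the gap:
# spark.catalog.createTable("events", source="parquet", schema=..., partitionBy=["day"])
{code}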



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-08-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598115#comment-17598115
 ] 

Nicholas Chammas commented on SPARK-31001:
--

What's {{{}__partition_columns{}}}? Is that something specific to Delta, or are 
you saying it's a hidden feature of Spark?

> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39630) Allow all Reader or Writer settings to be provided as options

2022-06-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created an issue

 Spark / SPARK-39630
 Allow all Reader or Writer settings to be provided as options

Issue Type: Improvement
Affects Versions: 3.3.0
Assignee: Unassigned
Components: SQL
Created: 28/Jun/22 21:03
Priority: Minor
Reporter: Nicholas Chammas

Almost all Reader or Writer settings can be provided via individual calls to `.option()` or by providing a map to `.options()`. There are notable exceptions, though, like:
* read/write format
* write mode
* write partitionBy, bucketBy, and sortBy

These settings must be provided via dedicated method calls. Why not make it so that all settings can be provided as options? Is there a design reason not to do this? Any given DataFrameReader or DataFrameWriter (along with the streaming equivalents) should be able to "export" all of its settings as a map of options, and then in turn be reconstituted entirely from that map of options.

{code:python}
reader1 = spark.read.option("format", "parquet").option("path", "/data")
options = reader1.getOptions()  # getOptions() is the proposed API, not an existing method
reader2 = spark.read.options(options)

# reader1 and reader2 are configured identically
data1 = reader1.load()
data2 = reader2.load()
data1.collect() == data2.collect()
{code}

 

[jira] [Created] (SPARK-39582) "Since " docs on array_agg are incorrect

2022-06-24 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-39582:


 Summary: "Since " docs on array_agg are incorrect
 Key: SPARK-39582
 URL: https://issues.apache.org/jira/browse/SPARK-39582
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Nicholas Chammas


[https://spark.apache.org/docs/latest/api/sql/#array_agg]

The docs currently say "Since: 2.0.0", but `array_agg` was added in Spark 3.3.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37219) support AS OF syntax

2022-05-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537589#comment-17537589
 ] 

Nicholas Chammas commented on SPARK-37219:
--

This change will enable not just Delta, but also Iceberg to use the {{AS OF}} 
syntax, correct?

By the way, could an admin please delete the spam comments just above (and 
perhaps also ban the user if that's all they comment on here)?

> support AS OF syntax
> 
>
> Key: SPARK-37219
> URL: https://issues.apache.org/jira/browse/SPARK-37219
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel
> Delta Lake time travel allows user to query an older snapshot of a Delta 
> table. To query an older version of a table, user needs to specify a version 
> or timestamp in a SELECT statement using AS OF syntax as the follows
> SELECT * FROM default.people10m VERSION AS OF 0;
> SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58';
> This ticket is opened to add AS OF syntax in Spark



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2022-05-10 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31001:
-
Description: 
There doesn't appear to be a way to create a partitioned table using the 
Catalog interface.

In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.

  was:There doesn't appear to be a way to create a partitioned table using the 
Catalog interface.


> Add ability to create a partitioned table via catalog.createTable()
> ---
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the 
> Catalog interface.
> In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528233#comment-17528233
 ] 

Nicholas Chammas edited comment on SPARK-37222 at 4/26/22 3:44 PM:
---

I've found a helpful log setting that causes Spark to print out detailed 
information about how exactly a plan is transformed during optimization:
{code:java}
spark.conf.set("spark.sql.planChangeLog.level", "warn") {code}
Here's the log generated by enabling this setting and running Shawn's example: 
[^plan-log.log]

To confirm what Shawn noted in his comment above, it looks like the chain of 
events that results in a loop is as follows:
 # ColumnPruning
 # FoldablePropagation __
 # RemoveNoopOperators
 # PushDownLeftSemiAntiJoin
 # ColumnPruning
 # CollapseProject
 # __

What seems to be the problem is that ColumnPruning inserts some Project 
operators which are then removed successively by CollapseProject, 
RemoveNoopOperators, and PushDownLeftSemiAntiJoin.

These rules go back and forth, undoing each other's work, until 
{{spark.sql.optimizer.maxIterations}} is exhausted.


was (Author: nchammas):
I've found a helpful log setting that causes Spark to print out detailed 
information about how exactly a plan is transformed during optimization:
{code:java}
spark.conf.set("spark.sql.planChangeLog.level", "warn") {code}
Here's the log generated by enabling this setting and running Shawn's example: 
[^plan-log.log]

To confirm what Shawn noted in his comment above, it looks like the chain of 
events that results in a loop is as follows:
 # PushDownLeftSemiAntiJoin
 # ColumnPruning
 # CollapseProject
 # FoldablePropagation
 # RemoveNoopOperators
 # 

What seems to be the problem is that:
 * ColumnPruning inserts a couple of Project operators which are then removed 
by CollapseProject.
 * CollapseProject in turn pushes up the left anti-join which is then pushed 
down again by PushDownLeftSemiAntiJoin.

These three rules go back and forth, undoing each other's work, until 
{{spark.sql.optimizer.maxIterations}} is exhausted.

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
> Attachments: plan-log.log
>
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project 

[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528233#comment-17528233
 ] 

Nicholas Chammas commented on SPARK-37222:
--

I've found a helpful log setting that causes Spark to print out detailed 
information about how exactly a plan is transformed during optimization:
{code:java}
spark.conf.set("spark.sql.planChangeLog.level", "warn") {code}
Here's the log generated by enabling this setting and running Shawn's example: 
[^plan-log.log]

To confirm what Shawn noted in his comment above, it looks like the chain of 
events that results in a loop is as follows:
 # PushDownLeftSemiAntiJoin
 # ColumnPruning
 # CollapseProject
 # FoldablePropagation
 # RemoveNoopOperators
 # 

What seems to be the problem is that:
 * ColumnPruning inserts a couple of Project operators which are then removed 
by CollapseProject.
 * CollapseProject in turn pushes up the left anti-join which is then pushed 
down again by PushDownLeftSemiAntiJoin.

These three rules go back and forth, undoing each other's work, until 
{{spark.sql.optimizer.maxIterations}} is exhausted.

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
> Attachments: plan-log.log
>
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n: Long)
> case class 

[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37222:
-
Attachment: plan-log.log

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
> Attachments: plan-log.log
>
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> // Exiting paste mode, now interpreting.
> java.lang.RuntimeException: Max iterations (100) reached for batch Operator 
> Optimization before Inferring Filters, please set 
> 'spark.sql.optimizer.maxIterations' to a larger value.
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> 

[jira] [Updated] (SPARK-37696) Optimizer exceeds max iterations

2022-04-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37696:
-
Affects Version/s: 3.2.1

> Optimizer exceeds max iterations
> 
>
> Key: SPARK-37696
> URL: https://issues.apache.org/jira/browse/SPARK-37696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Denis Tarima
>Priority: Minor
>
> A specific scenario causing Spark's failure in tests and a warning in 
> production:
> 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) 
> reached for batch Operator Optimization before Inferring Filters, please set 
> 'spark.sql.optimizer.maxIterations' to a larger value.
> 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) 
> reached for batch Operator Optimization after Inferring Filters, please set 
> 'spark.sql.optimizer.maxIterations' to a larger value.
>  
> To reproduce run the following commands in `spark-shell`:
> {{// define case class for a struct type in an array}}
> {{case class S(v: Int, v2: Int)}}
>  
> {{// prepare a table with an array of structs}}
> {{Seq((10, Seq(S(1, 2)))).toDF("i", "data").write.saveAsTable("tbl")}}
>  
> {{// select using SQL and join with a dataset using "left_anti"}}
> {{spark.sql("select i, data[size(data) - 1].v from 
> tbl").join(Seq(10).toDF("i"), Seq("i"), "left_anti").show()}}
>  
> The following conditions are required:
>  # Having additional `v2` field in `S`
>  # Using `{{{}data[size(data) - 1]{}}}` instead of `{{{}element_at(data, 
> -1){}}}`
>  # Using `{{{}left_anti{}}}` in join operation
>  
> The same behavior was observed in `master` branch and `3.1.1`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-25 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527527#comment-17527527
 ] 

Nicholas Chammas commented on SPARK-37222:
--

Thanks for the detailed report, [~ssmith]. I am hitting this issue as well on 
Spark 3.2.1, and your minimal test case also reproduces the issue for me.

How did you break down the optimization into its individual steps like that? 
That was very helpful.

I was able to use your breakdown to work around the issue by excluding 
{{{}PushDownLeftSemiAntiJoin{}}}:
{code:java}
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin"
){code}
If I run that before running the problematic query (including your test case), 
it seems to work around the issue.

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 

[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures

2022-04-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37222:
-
Affects Version/s: 3.2.1

> Max iterations reached in Operator Optimization w/left_anti or left_semi join 
> and nested structures
> ---
>
> Key: SPARK-37222
> URL: https://issues.apache.org/jira/browse/SPARK-37222
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.2, 3.2.0, 3.2.1
> Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and 
> with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, 
> 2021.
> The problem does not occur with Spark 3.0.1.
>  
>Reporter: Shawn Smith
>Priority: Major
>
> The query optimizer never reaches a fixed point when optimizing the query 
> below. This manifests as a warning:
> > WARN: Max iterations (100) reached for batch Operator Optimization before 
> > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a 
> > larger value.
> But the suggested fix won't help. The actual problem is that the optimizer 
> fails to make progress on each iteration and gets stuck in a loop.
> In practice, Spark logs a warning but continues on and appears to execute the 
> query successfully, albeit perhaps sub-optimally.
> To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and 
> 3.2.0 but not 3.0.1 it will throw an exception:
> {noformat}
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> {noformat}
> Looking at the query plan as the optimizer iterates in 
> {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations:
> {noformat}
> Project [id#2, _gen_alias_108#108L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_108#108L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> And here's the plan after one more iteration. You can see that all that has 
> changed is new aliases for the column in the nested column: 
> "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}".
> {noformat}
> Project [id#2, _gen_alias_109#109L AS nested.n#28L]
> +- Join LeftAnti, (id#2 = id#18)
>:- Project [id#2, nested#3.n AS _gen_alias_109#109L]
>:  +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>:+- LocalTableScan , [id#2, nested#3]
>+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- LocalTableScan , [id#18]
> {noformat}
> The optimizer continues looping and tweaking the alias until it hits the max 
> iteration count and bails out.
> Here's an example that includes a stack trace:
> {noformat}
> $ bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.2.0
>   /_/
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class Nested(b: Boolean, n: Long)
> case class Table(id: String, nested: Nested)
> case class Identifier(id: String)
> locally {
>   System.setProperty("spark.testing", "true") // Fail instead of logging a 
> warning
>   val df = List.empty[Table].toDS.cache()
>   val ids = List.empty[Identifier].toDS.cache()
>   df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi"
> .select('id, 'nested("n"))
> .explain()
> }
> // Exiting paste mode, now interpreting.
> java.lang.RuntimeException: Max iterations (100) reached for batch Operator 
> Optimization before Inferring Filters, please set 
> 'spark.sql.optimizer.maxIterations' to a larger value.
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
>   

[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2021-12-20 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462805#comment-17462805
 ] 

Nicholas Chammas commented on SPARK-5997:
-

[~tenstriker] - I believe in your case you should be able to set 
{{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle 
the data but it will still split up your output across multiple files.
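
A quick sketch of that suggestion ({{df}}, the record limit, and the output path are placeholders):

{code:python}
# Cap how many records go into each output file; Spark splits the files at
# write time without performing a shuffle.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
df.write.parquet("/tmp/output")
{code}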

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-5997) Increase partition count without performing a shuffle

2021-12-20 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-5997 ]


Nicholas Chammas deleted comment on SPARK-5997:
-

was (Author: nchammas):
[~tenstriker] - I believe in your case you should be able to set 
{{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle 
the data but it will still split up your output across multiple files.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-20 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462718#comment-17462718
 ] 

Nicholas Chammas commented on SPARK-24853:
--

I would expect something like that to yield an {{AnalysisException}}. Would 
that address your concern, or are you suggesting that it might be difficult to 
catch that sort of problem cleanly?

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column type instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 
> columns with the same name as a result of a join and I want to rename one of 
> the fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459601#comment-17459601
 ] 

Nicholas Chammas commented on SPARK-24853:
--

Assuming we are talking about the example I provided: Yes, {{col("count")}} 
would still be ambiguous.

I don't know if Spark would know to catch that problem. But note that the 
current behavior of {{.withColumnRenamed('count', ...)}} renames all columns 
named "count", which is just incorrect.

So allowing {{col("count")}} will either be just as incorrect as the current 
behavior, or it will be an improvement in that Spark may complain that the 
column reference is ambiguous. I'd have to try it to confirm the behavior.

Of course, the main improvement offered by {{Column}} references is that users 
can do something like {{.withColumnRenamed(left_counts['count'], ...)}} and get 
the correct behavior.
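
To make the scenario concrete, here is a minimal sketch; the DataFrame names 
are hypothetical, and the Column-based overload shown at the end is the 
*proposed* API, not something that exists today:
{code:python}
# Both inputs have a column named "count", so the joined result carries two
# "count" columns.
left_counts = spark.createDataFrame([(1, 10)], ["id", "count"])
right_counts = spark.createDataFrame([(1, 20)], ["id", "count"])
joined = left_counts.join(right_counts, "id")

# Current behavior: this renames every column named "count", as described above.
joined.withColumnRenamed("count", "left_count")

# Proposed Column-based overload, which would disambiguate the reference:
# joined.withColumnRenamed(left_counts["count"], "left_count")
{code}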

I didn't follow what you are getting at regarding {{from_json}}, but does that 
address your concern?

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column type instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 
> columns with the same name as a result of a join and I want to rename one of 
> the fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-25150.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

It looks like Spark 3.1.2 exhibits a different sort of broken behavior:
{code:java}
pyspark.sql.utils.AnalysisException: Column State#38 are ambiguous. It's 
probably because you joined several Datasets together, and some of these 
Datasets are the same. This column points to one of the Datasets but Spark is 
unable to figure out which one. Please alias the Datasets with different names 
via `Dataset.as` before joining them, and specify the column using qualified 
name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set 
spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check. {code}
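For reference, the aliasing workaround that message suggests looks roughly like 
this in PySpark (a sketch against a hypothetical DataFrame {{df}} with an 
{{id}} column):
{code:python}
from pyspark.sql import functions as F

# Give each side of the self-join its own alias so column references are
# unambiguous.
a = df.alias("a")
b = df.alias("b")
result = a.join(b, F.col("a.id") > F.col("b.id"))
{code}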
I don't think the join in {{zombie-analysis.py}} is ambiguous, and since this 
now works fine in Spark 3.2.0, that's what I'm going to mark as the "Fix 
Version" for this issue.

The fix must have made it in somewhere between Spark 3.1.2 and 3.2.0.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0
>
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


