[jira] [Updated] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
[ https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-48222:
-------------------------------------
    Component/s: Documentation

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -----------------------------------------------------
>
>                 Key: SPARK-48222
>                 URL: https://issues.apache.org/jira/browse/SPARK-48222
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Documentation
>    Affects Versions: 4.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
Nicholas Chammas created SPARK-48222:
-------------------------------------

             Summary: Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
                 Key: SPARK-48222
                 URL: https://issues.apache.org/jira/browse/SPARK-48222
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 4.0.0
            Reporter: Nicholas Chammas
[jira] [Created] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition
Nicholas Chammas created SPARK-48176:
-------------------------------------

             Summary: Fix name of FIELD_ALREADY_EXISTS error condition
                 Key: SPARK-48176
                 URL: https://issues.apache.org/jira/browse/SPARK-48176
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Nicholas Chammas
[jira] [Created] (SPARK-48107) Exclude tests from Python distribution
Nicholas Chammas created SPARK-48107:
-------------------------------------

             Summary: Exclude tests from Python distribution
                 Key: SPARK-48107
                 URL: https://issues.apache.org/jira/browse/SPARK-48107
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.0.0
            Reporter: Nicholas Chammas
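A minimal sketch of one common way to exclude test packages from a Python distribution, using setuptools' {{find_packages}} with exclude patterns. The package names ({{mypkg}} and friends) are made up for illustration; this is not necessarily how the PySpark build does it.

```python
# Hypothetical sketch: filtering test packages out of a distribution with
# setuptools' find_packages() exclude patterns. Package names are invented.
import os
import tempfile

from setuptools import find_packages

# Build a throwaway package tree: mypkg, mypkg.sql, and two tests packages.
root = tempfile.mkdtemp()
for pkg in ["mypkg", "mypkg/sql", "mypkg/tests", "mypkg/sql/tests"]:
    os.makedirs(os.path.join(root, pkg), exist_ok=True)
    open(os.path.join(root, pkg, "__init__.py"), "w").close()

# The exclude patterns are matched against dotted package names.
found = find_packages(
    where=root,
    exclude=["tests", "tests.*", "*.tests", "*.tests.*"],
)
print(sorted(found))  # ['mypkg', 'mypkg.sql']
```

The same `exclude` list can be passed in `setup.py` or the `[tool.setuptools.packages.find]` table of `pyproject.toml`.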
[jira] [Commented] (SPARK-47429) Rename errorClass to errorCondition and subClass to subCondition
[ https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842694#comment-17842694 ]

Nicholas Chammas commented on SPARK-47429:
------------------------------------------

I think one intermediate step we can take here is to mark the existing fields as deprecated, indicating that they will be renamed. That way, if we don't complete this renaming before the 4.0 release, we at least have the deprecation in place.

In addition to deprecating the existing fields, we can add the renamed fields and simply have them redirect to the original ones.

I will build a list of the classes, class attributes, methods, and method parameters that need this kind of update. Note that this list will be much smaller than the thousands of uses that BingKun highlighted, since I am focusing only on the declarations.

cc [~cloud_fan]

> Rename errorClass to errorCondition and subClass to subCondition
> ----------------------------------------------------------------
>
>                 Key: SPARK-47429
>                 URL: https://issues.apache.org/jira/browse/SPARK-47429
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>         Attachments: image-2024-04-18-09-26-04-493.png
>
> We've agreed on the parent task to rename {{errorClass}} to align it more closely with the SQL standard, and take advantage of the opportunity to break backwards compatibility offered by the Spark version change from 3.5 to 4.0. This ticket also covers renaming {{subClass}}.
> This is a subtask so the changes are in their own PR and easier to review apart from other things.
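The "deprecate and redirect" idea in the comment above can be sketched as follows. This is a hypothetical illustration in Python, not Spark's actual Scala code: the old keyword keeps working but emits a deprecation warning and forwards its value to the new name.

```python
# Hypothetical sketch of deprecating a parameter while redirecting it to
# its renamed replacement. The class and parameter names mirror the Jira
# discussion but are invented for this example.
import warnings


class SparkThrowable:
    def __init__(self, errorCondition=None, *, errorClass=None):
        if errorClass is not None:
            warnings.warn(
                "'errorClass' is deprecated; use 'errorCondition' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            if errorCondition is None:
                # Redirect the old name to the new one.
                errorCondition = errorClass
        self.errorCondition = errorCondition


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    t = SparkThrowable(errorClass="DIVIDE_BY_ZERO")

print(t.errorCondition, len(caught))  # DIVIDE_BY_ZERO 1
```

Callers using the new name get no warning; callers using the old one keep working until the alias is removed.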
[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition and subClass to subCondition
[ https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-47429:
-------------------------------------
    Summary: Rename errorClass to errorCondition and subClass to subCondition  (was: Rename errorClass to errorCondition)

> Rename errorClass to errorCondition and subClass to subCondition
> ----------------------------------------------------------------
>
>                 Key: SPARK-47429
>                 URL: https://issues.apache.org/jira/browse/SPARK-47429
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>         Attachments: image-2024-04-18-09-26-04-493.png
>
> We've agreed on the parent task to rename {{errorClass}} to align it more closely with the SQL standard, and take advantage of the opportunity to break backwards compatibility offered by the Spark version change from 3.5 to 4.0. This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review apart from other things.
[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition
[ https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-47429:
-------------------------------------
    Description:
We've agreed on the parent task to rename {{errorClass}} to align it more closely with the SQL standard, and take advantage of the opportunity to break backwards compatibility offered by the Spark version change from 3.5 to 4.0. This ticket also covers renaming {{subClass}} as well.

This is a subtask so the changes are in their own PR and easier to review apart from other things.

  was:
We've agreed on the parent task to rename {{errorClass}} to align it more closely with the SQL standard, and take advantage of the opportunity to break backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This is a subtask so the changes are in their own PR and easier to review apart from other things.

> Rename errorClass to errorCondition
> -----------------------------------
>
>                 Key: SPARK-47429
>                 URL: https://issues.apache.org/jira/browse/SPARK-47429
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> We've agreed on the parent task to rename {{errorClass}} to align it more closely with the SQL standard, and take advantage of the opportunity to break backwards compatibility offered by the Spark version change from 3.5 to 4.0. This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review apart from other things.
[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837292#comment-17837292 ]

Nicholas Chammas commented on SPARK-28024:
------------------------------------------

[~cloud_fan] - Given the updated descriptions for Cases 2, 3, and 4, do you still consider there to be a problem here? Or shall we just consider this an acceptable difference between how Spark and Postgres handle these cases?

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
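The Case 1 results quoted above come from two's-complement wraparound at the target width (Spark's pre-ANSI integer arithmetic follows Java semantics). A small Python helper, written for this illustration, reproduces the same values:

```python
# Illustrative sketch: two's-complement wraparound at a fixed bit width,
# mimicking how non-ANSI Spark produced the Case 1 results. The wrap()
# helper is invented for this example.
def wrap(value: int, bits: int) -> int:
    """Reduce value into the signed two's-complement range of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value

# tinyint(128) wraps to -128; -128 * 2 = -256 wraps back to 0
print(wrap(wrap(128, 8) * 2, 8))       # 0
# smallint(2147483647) wraps to -1; -1 * 2 = -2
print(wrap(wrap(2147483647, 16) * 2, 16))  # -2
# int(2147483647) * 2 = 4294967294 wraps to -2
print(wrap(2147483647 * 2, 32))        # -2
# smallint(-32768) * -1 = 32768 wraps back to -32768
print(wrap(-32768 * -1, 16))           # -32768
```

With {{spark.sql.ansi.enabled=true}}, these same overflows raise errors instead of wrapping, which is why Case 1 is considered resolved.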
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-28024:
-------------------------------------
    Description:
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-28024:
-------------------------------------
    Description:
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836706#comment-17836706 ]

Nicholas Chammas commented on SPARK-28024:
------------------------------------------

I've just retried cases 2-4 on master with ANSI mode enabled, and Spark's behavior appears to be the same as when I last checked it in February. I also ran those same cases against PostgreSQL 16.

I couldn't replicate the output for Case 4, and I believe there was a mistake in the original description of that case where the sign was flipped. So I've adjusted the sign accordingly and shown Spark's and Postgres's behavior side by side.

Here is the original Case 4 with the negative sign:

{code:sql}
spark-sql (default)> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200);
0.
{code}

So I don't think there is a problem there. With a positive sign, the behavior is different, as shown in the ticket description above.

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
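Cases 2-4 above boil down to IEEE-754 underflow and overflow behavior, which can be checked outside Spark. A small sketch in Python (whose floats are 64-bit doubles, like Spark's; the `struct` round-trip simulates a 32-bit float cast):

```python
# Illustrating Cases 2-4 with plain IEEE-754 arithmetic. The to_f32 helper
# is invented here to simulate a cast to a 32-bit float.
import math
import struct

# Case 2: 1e-69 is far below float32's smallest subnormal (~1.4e-45), so a
# float32 cast flushes it to a signed zero, matching Spark's 0.0 / -0.0.
to_f32 = lambda x: struct.unpack("<f", struct.pack("<f", x))[0]
print(to_f32(1e-69), to_f32(-1e-69))        # 0.0 -0.0

# Case 3: 10e-400 is below the double range entirely, so parsing it as a
# double also underflows to a signed zero.
print(float("10e-400"), float("-10e-400"))  # 0.0 -0.0

# Case 4, negative-sign variant: exp of a huge negative number underflows
# to 0.0, which is why Spark and Postgres agree on that variant.
print(math.exp(-1.2345678901234e200))       # 0.0
```

Postgres instead raises out-of-range errors on the casts and on the positive-sign `exp`, which is the acceptable-difference question posed in the comments above.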
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-28024:
-------------------------------------
    Description:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-28024:
-------------------------------------
    Description:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-28024:
-------------------------------------
    Description:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0	-0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8
--------+--------
  1e-69 | -1e-69
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0	-0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^
{code}

Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

> Incorrect numeric values when out of range
> ------------------------------------------
>
>                 Key: SPARK-28024
>                 URL: https://issues.apache.org/jira/browse/SPARK-28024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: correctness
>         Attachments: SPARK-28024.png
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0	-0.0
>
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8
> --------+--------
>   1e-69 | -1e-69
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0	-0.0
>
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^
> {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-28024: - Description: As compared to PostgreSQL 16. Case 1: {code:sql} select tinyint(128) * tinyint(2); -- 0 select smallint(2147483647) * smallint(2); -- -2 select int(2147483647) * int(2); -- -2 SELECT smallint((-32768)) * smallint(-1); -- -32768 {code} Case 2: {code:sql} spark-sql> select cast('10e-70' as float), cast('-10e-70' as float); 0.0 -0.0 postgres=# select cast('10e-70' as float), cast('-10e-70' as float); float8 | float8 + 1e-69 | -1e-69 {code} Case 3: {code:sql} spark-sql> select cast('10e-400' as double), cast('-10e-400' as double); 0.0 -0.0 postgres=# select cast('10e-400' as double precision), cast('-10e-400' as double precision); ERROR: "10e-400" is out of range for type double precision LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ... ^ {code} Case 4: {code:sql} spark-sql (default)> select exp(1.2345678901234E200); Infinity postgres=# select exp(1.2345678901234E200); ERROR: value overflows numeric format {code} was: For example Case 1: {code:sql} select tinyint(128) * tinyint(2); -- 0 select smallint(2147483647) * smallint(2); -- -2 select int(2147483647) * int(2); -- -2 SELECT smallint((-32768)) * smallint(-1); -- -32768 {code} Case 2: {code:sql} spark-sql> select cast('10e-70' as float), cast('-10e-70' as float); 0.0 -0.0 {code} Case 3: {code:sql} spark-sql> select cast('10e-400' as double), cast('-10e-400' as double); 0.0 -0.0 {code} Case 4: {code:sql} spark-sql> select exp(-1.2345678901234E200); 0.0 postgres=# select exp(-1.2345678901234E200); ERROR: value overflows numeric format {code} > Incorrect numeric values when out of range > -- > > Key: SPARK-28024 > URL: https://issues.apache.org/jira/browse/SPARK-28024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Major > 
Labels: correctness > Attachments: SPARK-28024.png > > > As compared to PostgreSQL 16. > Case 1: > {code:sql} > select tinyint(128) * tinyint(2); -- 0 > select smallint(2147483647) * smallint(2); -- -2 > select int(2147483647) * int(2); -- -2 > SELECT smallint((-32768)) * smallint(-1); -- -32768 > {code} > Case 2: > {code:sql} > spark-sql> select cast('10e-70' as float), cast('-10e-70' as float); > 0.0 -0.0 > postgres=# select cast('10e-70' as float), cast('-10e-70' as float); > float8 | float8 > + > 1e-69 | -1e-69 {code} > Case 3: > {code:sql} > spark-sql> select cast('10e-400' as double), cast('-10e-400' as double); > 0.0 -0.0 > postgres=# select cast('10e-400' as double precision), cast('-10e-400' as > double precision); > ERROR: "10e-400" is out of range for type double precision > LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ... > ^ {code} > Case 4: > {code:sql} > spark-sql (default)> select exp(1.2345678901234E200); > Infinity > postgres=# select exp(1.2345678901234E200); > ERROR: value overflows numeric format {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47429) Rename errorClass to errorCondition
Nicholas Chammas created SPARK-47429: Summary: Rename errorClass to errorCondition Key: SPARK-47429 URL: https://issues.apache.org/jira/browse/SPARK-47429 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas We've agreed on the parent task to rename {{errorClass}} to {{errorCondition}}, aligning it more closely with the SQL standard and taking advantage of the opportunity to break backwards compatibility offered by the Spark version change from 3.5 to 4.0. This is a subtask so that the changes land in their own PR and are easier to review apart from other work.
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823713#comment-17823713 ] Nicholas Chammas commented on SPARK-46810: -- [~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - Friendly ping. Any thoughts on how to resolve the inconsistent error terminology? > Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". 
But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? > * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" > One solution is to use the following terms: > * Error class: 42 > * Error sub-class: K01 > * Error state: 42K01 > * Error condition: INCOMPLETE_TYPE_DEFINITION > * Error sub-condition: ARRAY, MAP, STRUCT > Pros: > * This terminology seems (to me at least) the most natural and intuitive. > * It aligns most closely to the SQL standard. > Cons: > * We use {{errorClass}} [all over our > codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] > – literally in thousands of places – to refer to strings like > INCOMPLETE_TYPE_DEFINITION. > ** It's probably not practical to update all these usages to say > {{errorCondition}} instead, so if we go with this approach there will be a > divide between the terminology we use in user-facing documentation vs. what > the code base uses. > ** We can perhaps rename the existing {{error-classes.json}} to > {{error-conditions.json}} but clarify the reason for this divide between code > and user docs in the documentation for {{ErrorClassesJsonReader}} . > h1. 
Option 2: 42 becomes an "Error Category" > Another approach is to use the following terminology: > * Error category: 42 > * Error sub-category: K01 > * Error state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The change from calling "42" a "class" to a "category" is low impact and > may not show up in user-facing documentation at all. (See my side note below.) > Cons: > * These terms do not align with the SQL standard. > * We will have to retire the term "error condition", which we have [already > used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] > in user-facing documentation. > h1. Option 3: "Error Class" and "State Class" > * SQL state class: 42 > * SQL state sub-class: K01 > * SQL state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The chang
[jira] [Created] (SPARK-47271) Explain importance of statistics on SQL performance tuning page
Nicholas Chammas created SPARK-47271: Summary: Explain importance of statistics on SQL performance tuning page Key: SPARK-47271 URL: https://issues.apache.org/jira/browse/SPARK-47271 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-47252) Clarify that pivot may trigger an eager computation
Nicholas Chammas created SPARK-47252: Summary: Clarify that pivot may trigger an eager computation Key: SPARK-47252 URL: https://issues.apache.org/jira/browse/SPARK-47252 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-47216) Refine layout of SQL performance tuning page
Nicholas Chammas created SPARK-47216: Summary: Refine layout of SQL performance tuning page Key: SPARK-47216 URL: https://issues.apache.org/jira/browse/SPARK-47216 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Commented] (SPARK-47190) Add support for checkpointing to Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-47190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821286#comment-17821286 ] Nicholas Chammas commented on SPARK-47190: -- [~gurwls223] - Is there some design reason we do _not_ want to support checkpointing in Spark Connect? Or is it just a matter of someone taking the time to implement support? If the latter, do we do so via a new method directly on {{SparkSession}}, or shall we somehow expose a limited version of {{spark.sparkContext}} so users can call the existing {{setCheckpointDir()}} method? > Add support for checkpointing to Spark Connect > -- > > Key: SPARK-47190 > URL: https://issues.apache.org/jira/browse/SPARK-47190 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > The {{sparkContext}} that underlies a given {{SparkSession}} is not > accessible over Spark Connect. This means you cannot call > {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot > checkpoint a DataFrame. > We should add support for this somehow to Spark Connect.
[jira] [Created] (SPARK-47190) Add support for checkpointing to Spark Connect
Nicholas Chammas created SPARK-47190: Summary: Add support for checkpointing to Spark Connect Key: SPARK-47190 URL: https://issues.apache.org/jira/browse/SPARK-47190 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Nicholas Chammas The {{sparkContext}} that underlies a given {{SparkSession}} is not accessible over Spark Connect. This means you cannot call {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot checkpoint a DataFrame. We should add support for this somehow to Spark Connect.
[jira] [Created] (SPARK-47189) Tweak column error names and text
Nicholas Chammas created SPARK-47189: Summary: Tweak column error names and text Key: SPARK-47189 URL: https://issues.apache.org/jira/browse/SPARK-47189 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-47180) Migrate CSV parsing off of Univocity
Nicholas Chammas created SPARK-47180: Summary: Migrate CSV parsing off of Univocity Key: SPARK-47180 URL: https://issues.apache.org/jira/browse/SPARK-47180 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas Univocity appears to be unmaintained. As of February 2024: * The last release was [more than 3 years ago|https://github.com/uniVocity/univocity-parsers/releases]. * The last commit to {{master}} was [almost 3 years ago|https://github.com/uniVocity/univocity-parsers/commits/master/]. * The website is [down|https://github.com/uniVocity/univocity-parsers/issues/506]. * There are [multiple|https://github.com/uniVocity/univocity-parsers/issues/494] [open|https://github.com/uniVocity/univocity-parsers/issues/495] [bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker with no indication that anyone cares. It's not urgent, but we should consider migrating to an actively maintained CSV library in the JVM ecosystem. There are a bunch of libraries [listed here on this Maven Repository|https://mvnrepository.com/open-source/csv-libraries]. [jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text] looks interesting. I know we already use FasterXML to parse JSON. Perhaps we should use them to parse CSV as well. I'm guessing we chose Univocity back in the day because it was the fastest CSV library on the JVM. However, the last performance benchmark comparing it to others was [from February 2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018], so this may no longer be true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47082) Out of bounds error message is incorrect
[ https://issues.apache.org/jira/browse/SPARK-47082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-47082: - Summary: Out of bounds error message is incorrect (was: Out of bounds error message flips the bounds) > Out of bounds error message is incorrect > > > Key: SPARK-47082 > URL: https://issues.apache.org/jira/browse/SPARK-47082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor >
[jira] [Created] (SPARK-47082) Out of bounds error message flips the bounds
Nicholas Chammas created SPARK-47082: Summary: Out of bounds error message flips the bounds Key: SPARK-47082 URL: https://issues.apache.org/jira/browse/SPARK-47082 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Resolved] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning
[ https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-47024. -- Resolution: Not A Problem Resolving this as "Not A Problem". I mean, it _is_ a problem, but it's a basic problem with floats, and I don't think there is anything practical that can be done about it in Spark. > Sum of floats/doubles may be incorrect depending on partitioning > > > Key: SPARK-47024 > URL: https://issues.apache.org/jira/browse/SPARK-47024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.5.0, 3.3.4 >Reporter: Nicholas Chammas >Priority: Major > Labels: correctness > > I found this problem using > [Hypothesis|https://hypothesis.readthedocs.io/en/latest/]. > Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 > (and probably all prior versions as well): > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import col, sum > SUM_EXAMPLE = [ > (1.0,), > (0.0,), > (1.0,), > (9007199254740992.0,), > ] > spark = ( > SparkSession.builder > .config("spark.log.level", "ERROR") > .getOrCreate() > ) > def compare_sums(data, num_partitions): > df = spark.createDataFrame(data, "val double").coalesce(1) > result1 = df.agg(sum(col("val"))).collect()[0][0] > df = spark.createDataFrame(data, "val double").repartition(num_partitions) > result2 = df.agg(sum(col("val"))).collect()[0][0] > assert result1 == result2, f"{result1}, {result2}" > if __name__ == "__main__": > print(compare_sums(SUM_EXAMPLE, 2)) > {code} > This fails as follows: > {code:python} > AssertionError: 9007199254740994.0, 9007199254740992.0 > {code} > I suspected some kind of problem related to code generation, so tried setting > all of these to {{{}false{}}}: > * {{spark.sql.codegen.wholeStage}} > * {{spark.sql.codegen.aggregate.map.twolevel.enabled}} > * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}} > But this did not change the behavior. 
> Somehow, the partitioning of the data affects the computed sum.
[jira] [Updated] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning
[ https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-47024: - Description: I found this problem using [Hypothesis|https://hypothesis.readthedocs.io/en/latest/]. Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 (and probably all prior versions as well): {code:python} from pyspark.sql import SparkSession from pyspark.sql.functions import col, sum SUM_EXAMPLE = [ (1.0,), (0.0,), (1.0,), (9007199254740992.0,), ] spark = ( SparkSession.builder .config("spark.log.level", "ERROR") .getOrCreate() ) def compare_sums(data, num_partitions): df = spark.createDataFrame(data, "val double").coalesce(1) result1 = df.agg(sum(col("val"))).collect()[0][0] df = spark.createDataFrame(data, "val double").repartition(num_partitions) result2 = df.agg(sum(col("val"))).collect()[0][0] assert result1 == result2, f"{result1}, {result2}" if __name__ == "__main__": print(compare_sums(SUM_EXAMPLE, 2)) {code} This fails as follows: {code:python} AssertionError: 9007199254740994.0, 9007199254740992.0 {code} I suspected some kind of problem related to code generation, so tried setting all of these to {{{}false{}}}: * {{spark.sql.codegen.wholeStage}} * {{spark.sql.codegen.aggregate.map.twolevel.enabled}} * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}} But this did not change the behavior. Somehow, the partitioning of the data affects the computed sum. was:Will fill in the details shortly. Summary: Sum of floats/doubles may be incorrect depending on partitioning (was: Sum is incorrect (exact cause currently unknown)) Sadly, I think this is a case where we may not be able to do anything. The problem appears to be a classic case of floating point arithmetic going wrong. 
{code:scala} scala> 9007199254740992.0 + 1.0 val res0: Double = 9.007199254740992E15 scala> 9007199254740992.0 + 2.0 val res1: Double = 9.007199254740994E15 {code} Notice how adding {{1.0}} did not change the large value, whereas adding {{2.0}} did. So what I believe is happening is that, depending on the order in which the rows happen to be added, we either hit or do not hit this corner case. In other words, if the aggregation goes like this: {code:java} (1.0 + 1.0) + (0.0 + 9007199254740992.0) 2.0 + 9007199254740992.0 9007199254740994.0 {code} Then there is no problem. However, if we are unlucky and it goes like this: {code:java} (1.0 + 0.0) + (1.0 + 9007199254740992.0) 1.0 + 9007199254740992.0 9007199254740992.0 {code} Then we get the incorrect result shown in the description above. This violates what I believe should be an invariant in Spark: That declarative aggregates like {{sum}} do not compute different results depending on accidents of row order or partitioning. However, given that this is a basic problem of floating point arithmetic, I doubt we can really do anything here. Note that there are many such "special" numbers that have this problem, not just 9007199254740992.0: {code:scala} scala> 1.7168917017330176e+16 + 1.0 val res2: Double = 1.7168917017330176E16 scala> 1.7168917017330176e+16 + 2.0 val res3: Double = 1.7168917017330178E16 {code} > Sum of floats/doubles may be incorrect depending on partitioning > > > Key: SPARK-47024 > URL: https://issues.apache.org/jira/browse/SPARK-47024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.5.0, 3.3.4 >Reporter: Nicholas Chammas >Priority: Major > Labels: correctness > > I found this problem using > [Hypothesis|https://hypothesis.readthedocs.io/en/latest/]. 
> Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 > (and probably all prior versions as well): > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import col, sum > SUM_EXAMPLE = [ > (1.0,), > (0.0,), > (1.0,), > (9007199254740992.0,), > ] > spark = ( > SparkSession.builder > .config("spark.log.level", "ERROR") > .getOrCreate() > ) > def compare_sums(data, num_partitions): > df = spark.createDataFrame(data, "val double").coalesce(1) > result1 = df.agg(sum(col("val"))).collect()[0][0] > df = spark.createDataFrame(data, "val double").repartition(num_partitions) > result2 = df.agg(sum(col("val"))).collect()[0][0] > assert result1 == result2, f"{result1}, {result2}" > if __name__ == "__main__": > print(compare_sums(SUM_EXAMPLE, 2)) > {code} > This fails as follows: > {code:python} > AssertionError: 9007199254740994.0, 9007199254740992.0 > {code} > I suspected some kind of problem related to code generation, so tried setting > all of these to {{{}false{}}}: > * {{spark.sql.codegen.who
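The order-dependence explained above is easy to demonstrate in plain Python, since Python floats are the same IEEE-754 doubles Spark uses. A sketch of the two grouping orders from the explanation (not Spark code, just the bare arithmetic):

```python
values = [1.0, 0.0, 1.0, 9007199254740992.0]

# One partitioning: (1.0 + 1.0) + (0.0 + 9007199254740992.0)
sum_a = (values[0] + values[2]) + (values[1] + values[3])

# Another partitioning: (1.0 + 0.0) + (1.0 + 9007199254740992.0)
sum_b = (values[0] + values[1]) + (values[2] + values[3])

# 9007199254740992.0 == 2**53, the point past which doubles can no
# longer represent every integer: adding 1.0 is lost to rounding,
# while adding 2.0 is representable and survives.
assert sum_a == 9007199254740994.0
assert sum_b == 9007199254740992.0
assert sum_a != sum_b  # same multiset of values, different float sums
```

This is exactly the invariant violation in the ticket: floating-point addition is not associative, so a parallel sum can legitimately produce different results for different partitionings of the same data.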
[jira] [Created] (SPARK-47024) Sum is incorrect (exact cause currently unknown)
Nicholas Chammas created SPARK-47024: Summary: Sum is incorrect (exact cause currently unknown) Key: SPARK-47024 URL: https://issues.apache.org/jira/browse/SPARK-47024 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.4, 3.5.0, 3.4.2 Reporter: Nicholas Chammas Will fill in the details shortly.
[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.
[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46992: - Labels: correctness (was: ) > Inconsistent results with 'sort', 'cache', and AQE. > --- > > Key: SPARK-46992 > URL: https://issues.apache.org/jira/browse/SPARK-46992 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.5.0 >Reporter: Denis Tarima >Priority: Critical > Labels: correctness > > > With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes > {color:#4c9aff}sample{color} results after caching. > Moreover, when cached, {color:#4c9aff}collect{color} returns records as if > it's not cached, which is inconsistent with {color:#4c9aff}count{color} and > {color:#4c9aff}show{color}. > A script to reproduce: > {code:scala} > import spark.implicits._ > val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123) > println("NON CACHED:") > println(" count: " + df.count()) > println(" collect: " + df.collect().mkString(" ")) > df.show() > println("CACHED:") > df.cache().count() > println(" count: " + df.count()) > println(" collect: " + df.collect().mkString(" ")) > df.show() > df.unpersist() > {code} > output: > {code} > NON CACHED: > count: 2 > collect: [1] [4] > +---+ > | id| > +---+ > | 1| > | 4| > +---+ > CACHED: > count: 3 > collect: [1] [4] > +---+ > | id| > +---+ > | 1| > | 2| > | 3| > +---+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.
[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814913#comment-17814913 ] Nicholas Chammas commented on SPARK-46992: -- I can confirm the behavior described above is still present on {{master}} at [{{5d5b3a5}}|https://github.com/apache/spark/commit/5d5b3a54b7b5fb4308fe40da696ba805c72983fc]. Adding the {{correctness}} label. > Inconsistent results with 'sort', 'cache', and AQE. > --- > > Key: SPARK-46992 > URL: https://issues.apache.org/jira/browse/SPARK-46992 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.5.0 >Reporter: Denis Tarima >Priority: Critical > > > With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes > {color:#4c9aff}sample{color} results after caching. > Moreover, when cached, {color:#4c9aff}collect{color} returns records as if > it's not cached, which is inconsistent with {color:#4c9aff}count{color} and > {color:#4c9aff}show{color}. > A script to reproduce: > {code:scala} > import spark.implicits._ > val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123) > println("NON CACHED:") > println(" count: " + df.count()) > println(" collect: " + df.collect().mkString(" ")) > df.show() > println("CACHED:") > df.cache().count() > println(" count: " + df.count()) > println(" collect: " + df.collect().mkString(" ")) > df.show() > df.unpersist() > {code} > output: > {code} > NON CACHED: > count: 2 > collect: [1] [4] > +---+ > | id| > +---+ > | 1| > | 4| > +---+ > CACHED: > count: 3 > collect: [1] [4] > +---+ > | id| > +---+ > | 1| > | 2| > | 3| > +---+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814406#comment-17814406 ] Nicholas Chammas commented on SPARK-46810: -- [~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - What are your thoughts on the 3 proposed options? > Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. 
And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? > * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" > One solution is to use the following terms: > * Error class: 42 > * Error sub-class: K01 > * Error state: 42K01 > * Error condition: INCOMPLETE_TYPE_DEFINITION > * Error sub-condition: ARRAY, MAP, STRUCT > Pros: > * This terminology seems (to me at least) the most natural and intuitive. > * It aligns most closely to the SQL standard. > Cons: > * We use {{errorClass}} [all over our > codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] > – literally in thousands of places – to refer to strings like > INCOMPLETE_TYPE_DEFINITION. > ** It's probably not practical to update all these usages to say > {{errorCondition}} instead, so if we go with this approach there will be a > divide between the terminology we use in user-facing documentation vs. what > the code base uses. > ** We can perhaps rename the existing {{error-classes.json}} to > {{error-conditions.json}} but clarify the reason for this divide between code > and user docs in the documentation for {{ErrorClassesJsonReader}} . > h1. Option 2: 42 becomes an "Error Category" > Another approach is to use the following terminology: > * Error category: 42 > * Error sub-category: K01 > * Error state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. 
> * The change from calling "42" a "class" to a "category" is low impact and > may not show up in user-facing documentation at all. (See my side note below.) > Cons: > * These terms do not align with the SQL standard. > * We will have to retire the term "error condition", which we have [already > used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] > in user-facing documentation. > h1. Option 3: "Error Class" and "State Class" > * SQL state class: 42 > * SQL state sub-class: K01 > * SQL state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The change from calling "42" a "class" to
[jira] [Commented] (SPARK-40549) PYSPARK: Observation computes the wrong results when using `corr` function
[ https://issues.apache.org/jira/browse/SPARK-40549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813780#comment-17813780 ] Nicholas Chammas commented on SPARK-40549: -- I think this is just a consequence of floating point arithmetic being imprecise.
{code:python}
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
...
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0}
{code}
Unfortunately, {{corr}} seems to convert to float internally, so even if you give it decimals you will get a similar result:
{code:python}
>>> from decimal import Decimal
>>> import pyspark.sql.functions as F
>>>
>>> df = spark.createDataFrame(
...     [(Decimal(i), Decimal(i * 10)) for i in range(10)],
...     schema="id decimal, id2 decimal",
... )
>>>
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
...
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{code}
I don't think there is anything that can be done here.
> PYSPARK: Observation computes the wrong results when using `corr` function
> ---
>
> Key: SPARK-40549
> URL: https://issues.apache.org/jira/browse/SPARK-40549
> Project: Spark
> Issue Type: Bug
> Components: PySpark
>Affects Versions: 3.3.0
> Environment: {code:java}
> // lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu > Description: Ubuntu 22.04.1 LTS > Release: 22.04 > Codename: jammy {code} > {code:java} > // python -V > python 3.10.4 > {code} > {code:java} > // lshw -class cpu > *-cpu > description: CPU product: AMD Ryzen 9 3900X 12-Core Processor > vendor: Advanced Micro Devices [AMD] physical id: f bus info: > cpu@0 version: 23.113.0 serial: Unknown slot: AM4 > size: 2194MHz capacity: 4672MHz width: 64 bits clock: > 100MHz capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae > mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht > syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl > nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma > cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy > svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit > wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 > cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm > rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves > cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr > rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean > flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif > v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es cpufreq > configuration: cores=12 enabledcores=12 microcode=141561875 threads=24 > {code} >Reporter: Herminio Vazquez >Priority: Major > Labels: correctness > > Minimalistic description of the odd computation results. > When creating a new `Observation` object and computing a simple correlation > function between 2 columns, the results appear to be non-deterministic. 
> {code:java} > # Init > from pyspark.sql import SparkSession, Observation > import pyspark.sql.functions as F > df = spark.createDataFrame([(float(i), float(i*10),) for i in range(10)], > schema="id double, id2 double") > for i in range(10): > o = Observation(f"test_{i}") > df_o = df.observe(o, F.corr("id", "id2").eqNullSafe(1.0)) > df_o.count() > print(o.get) > # Results > {'(corr(id, id2) <=> 1.0)': False} > {'(corr(id, id2) <=> 1.0)': False} > {'(corr(id, id2) <=> 1.0)': False} > {'(corr(id, id2) <=> 1.0)': True} > {'(corr(id, id2) <=> 1.0)': True} > {'(corr(id, id2) <=> 1.0)': True} > {'(corr(id, id2) <=> 1.0)': True} > {'(corr(id, id2) <=> 1.0)': True} > {'(corr(id, id2) <=> 1.0)': True} > {'(corr(id, id2) <=> 1.0)': False}{code} > -
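The non-determinism described above is consistent with floating-point aggregation: addition of doubles is not associative, so the order in which partial results are combined (which can vary from run to run across partitions) can flip the last bits of the result. A minimal pure-Python sketch of the mechanism (not Spark's actual aggregation code):

```python
# Floating-point addition is not associative: the same ten values summed in
# two different orders produce two different doubles.
nums = [0.1] * 10

# Strict left-to-right accumulation.
left_to_right = 0.0
for x in nums:
    left_to_right += x

# Sum two halves first, then combine, mimicking a two-partition aggregate.
pairwise = sum(nums[:5]) + sum(nums[5:])

print(left_to_right)  # 0.9999999999999999
print(pairwise)       # 1.0
print(left_to_right == pairwise)  # False: the grouping changed the result
```

A correlation is built from several such sums, so its last digits can wobble the same way.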
[jira] [Commented] (SPARK-45786) Inaccurate Decimal multiplication and division results
[ https://issues.apache.org/jira/browse/SPARK-45786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813766#comment-17813766 ] Nicholas Chammas commented on SPARK-45786: -- [~kazuyukitanimura] - I'm just curious: How did you find this bug? Was it something you stumbled on by accident or did you search for it using something like a fuzzer?
> Inaccurate Decimal multiplication and division results
> --
>
> Key: SPARK-45786
> URL: https://issues.apache.org/jira/browse/SPARK-45786
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 3.2.4, 3.3.3, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Blocker
> Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> Decimal multiplication and division results may be inaccurate due to rounding issues.
> h2. Multiplication:
> {code:scala}
> scala> sql("select -14120025096157587712113961295153.858047 * -0.4652").show(truncate=false)
> +----------------------------------------------------+
> |(-14120025096157587712113961295153.858047 * -0.4652)|
> +----------------------------------------------------+
> |6568635674732509803675414794505.574764              |
> +----------------------------------------------------+
> {code}
> The correct answer is
> {quote}6568635674732509803675414794505.574763
> {quote}
> Please note that the last digit is 3 instead of 4 as
> {code:scala}
> scala> java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652"))
> val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
> {code}
> Since the fractional part .574763 is followed by 4644, it should not be rounded up.
> h2. Division:
> {code:scala}
> scala> sql("select -0.172787979 / 533704665545018957788294905796.5").show(truncate=false)
> +-------------------------------------------------+
> |(-0.172787979 / 533704665545018957788294905796.5)|
> +-------------------------------------------------+
> |-3.237521E-31                                    |
> +-------------------------------------------------+
> {code}
> The correct answer is
> {quote}-3.237520E-31
> {quote}
> Please note that the last digit is 0 instead of 1 as
> {code:scala}
> scala> java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"), 100, java.math.RoundingMode.DOWN)
> val res22: java.math.BigDecimal = -3.237520489418037889998826491401059986665344697406144511563561222578738E-31
> {code}
> Since the fractional part .237520 is followed by 4894..., it should not be rounded up.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
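The multiplication and division cases quoted above can be checked independently with Python's {{decimal}} module, which, like {{java.math.BigDecimal}}, can carry enough precision to hold the exact intermediate result before any rounding. A sketch using the numbers from the report:

```python
from decimal import Decimal, getcontext, ROUND_HALF_UP

# The exact product has 41 significant digits, so 50 digits of working
# precision keeps both operations exact until we round deliberately.
getcontext().prec = 50

a = Decimal("-14120025096157587712113961295153.858047")
b = Decimal("-0.4652")

exact = a * b
print(exact)  # 6568635674732509803675414794505.5747634644

# Rounding the exact product to 6 fractional digits keeps the final 3
# (the next digit is 4), matching the "correct answer" in the report.
rounded = exact.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
print(rounded)  # 6568635674732509803675414794505.574763

# The division case behaves the same way: the exact quotient begins
# -3.237520489..., so the 7-significant-digit result should end in 0, not 1.
q = Decimal("-0.172787979") / Decimal("533704665545018957788294905796.5")
print(q)  # -3.2375204894180378899988...E-31
```

Spark's results differ because it rounds an already-truncated intermediate, not the exact value.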
[jira] [Commented] (SPARK-38167) CSV parsing error when using escape='"'
[ https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813741#comment-17813741 ] Nicholas Chammas commented on SPARK-38167: -- [~marnixvandenbroek] - Could you link to the bug report you filed with Univocity? cc [~maxgekk] - I believe you have hit some parsing bugs in Univocity recently.
> CSV parsing error when using escape='"'
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 cluster.
>Reporter: Marnix van den Broek
>Priority: Major
> Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
> # reading a comma separated, double-quote quoted CSV file using the csv reader options _escape='"'_ and {_}header=True{_},
> # with a row containing a quoted empty field
> # followed by a quoted field starting with a comma and followed by one or more characters
> selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
> {code:java}
> col1,col2
> "",",a"
> {code}
> using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>
> +----+----+
> |col1|col2|
> +----+----+
> |null|  ,a|
> +----+----+ {code}
> Spark does yield this result, so far so good. However, when I select col2 from the dataframe, Spark yields an incorrect result:
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>
> +----+
> |col2|
> +----+
> |  a"|
> +----+{code}
> If you run this example with more columns in the file, and more commas in the field, e.g. ",,,a", the problem compounds, as Spark shifts many values to the right, causing unexpected and incorrect results. The inconsistency between both methods surprised me, as it implies the parsing is evaluated differently between both methods.
> I expect the bug to be located in the quote-balancing and un-escaping methods of the csv parser, but I can't find where that code is located in the code base. I'd be happy to take a look at it if anyone can point me where it is.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
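For reference, the reporter's two-line file parses unambiguously under the common doubled-quote CSV convention. Here is a sketch using Python's stdlib {{csv}} module (not Univocity, which Spark uses) showing the field values a conforming parser produces:

```python
import csv
import io

# The reporter's two-line file: a quoted empty field followed by a quoted
# field whose value begins with a comma.
data = 'col1,col2\n"",",a"\n'

# The stdlib reader (default double-quote dialect) parses "" as an empty
# field and ",a" as the two-character value ',a' -- the same result Spark
# gives for show(), but not for select('col2').
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['col1', 'col2'], ['', ',a']]
```

This is the baseline against which the shifted `select('col2')` output above is incorrect.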
[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results
[ https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-42399: - Affects Version/s: (was: 3.5.0) > CONV() silently overflows returning wrong results > - > > Key: SPARK-42399 > URL: https://issues.apache.org/jira/browse/SPARK-42399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Critical > Labels: correctness, pull-request-available > > spark-sql> SELECT > CONV(SUBSTRING('0x', > 3), 16, 10); > 18446744073709551615 > Time taken: 2.114 seconds, Fetched 1 row(s) > spark-sql> set spark.sql.ansi.enabled = true; > spark.sql.ansi.enabled true > Time taken: 0.068 seconds, Fetched 1 row(s) > spark-sql> SELECT > CONV(SUBSTRING('0x', > 3), 16, 10); > 18446744073709551615 > Time taken: 0.05 seconds, Fetched 1 row(s) > In ANSI mode we should raise an error for sure. > In non ANSI either an error or a NULL maybe be acceptable. > Alternatively, of course, we could consider if we can support arbitrary > domains since the result is a STRING again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42399) CONV() silently overflows returning wrong results
[ https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813733#comment-17813733 ] Nicholas Chammas commented on SPARK-42399: -- This issue does indeed appear to be resolved on {{master}} when ANSI mode is enabled:
{code:python}
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|18446744073709551615|
+--------------------+
>>> spark.conf.set("spark.sql.ansi.enabled", "true")
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS result").show(truncate=False)
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.ArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22003
== SQL (line 1, position 8) ==
SELECT CONV('', 16, 10) AS result
{code}
However, there is still a silent overflow when ANSI mode is disabled. The error message suggests this is intended behavior. cc [~gengliang] and [~gurwls223], who resolved SPARK-42427.
> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
> Labels: correctness, pull-request-available
>
> spark-sql> SELECT CONV(SUBSTRING('0x', 3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT CONV(SUBSTRING('0x', 3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary domains since the result is a STRING again.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
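To see the scale of the overflow in the repro above: the input to CONV is a string of 64 hex 'f' characters, i.e. a 256-bit value, while the reported output is the unsigned 64-bit maximum. A quick sketch in plain Python:

```python
# 64 hex 'f' characters encode 2**256 - 1, far beyond what fits in the
# unsigned 64-bit range that CONV works in.
value = int("f" * 64, 16)
print(value == 2**256 - 1)  # True
print(value.bit_length())   # 256

# The value Spark silently returns is exactly 2**64 - 1, the unsigned
# 64-bit maximum.
print(2**64 - 1)  # 18446744073709551615
```

This is why ANSI mode now raises ARITHMETIC_OVERFLOW for this input, while non-ANSI mode still clamps silently.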
[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results
[ https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-42399: - Affects Version/s: 3.5.0 > CONV() silently overflows returning wrong results > - > > Key: SPARK-42399 > URL: https://issues.apache.org/jira/browse/SPARK-42399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Serge Rielau >Priority: Critical > Labels: correctness, pull-request-available > > spark-sql> SELECT > CONV(SUBSTRING('0x', > 3), 16, 10); > 18446744073709551615 > Time taken: 2.114 seconds, Fetched 1 row(s) > spark-sql> set spark.sql.ansi.enabled = true; > spark.sql.ansi.enabled true > Time taken: 0.068 seconds, Fetched 1 row(s) > spark-sql> SELECT > CONV(SUBSTRING('0x', > 3), 16, 10); > 18446744073709551615 > Time taken: 0.05 seconds, Fetched 1 row(s) > In ANSI mode we should raise an error for sure. > In non ANSI either an error or a NULL maybe be acceptable. > Alternatively, of course, we could consider if we can support arbitrary > domains since the result is a STRING again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results
[ https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-42399: - Labels: correctness pull-request-available (was: pull-request-available) > CONV() silently overflows returning wrong results > - > > Key: SPARK-42399 > URL: https://issues.apache.org/jira/browse/SPARK-42399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Critical > Labels: correctness, pull-request-available > > spark-sql> SELECT > CONV(SUBSTRING('0x', > 3), 16, 10); > 18446744073709551615 > Time taken: 2.114 seconds, Fetched 1 row(s) > spark-sql> set spark.sql.ansi.enabled = true; > spark.sql.ansi.enabled true > Time taken: 0.068 seconds, Fetched 1 row(s) > spark-sql> SELECT > CONV(SUBSTRING('0x', > 3), 16, 10); > 18446744073709551615 > Time taken: 0.05 seconds, Fetched 1 row(s) > In ANSI mode we should raise an error for sure. > In non ANSI either an error or a NULL maybe be acceptable. > Alternatively, of course, we could consider if we can support arbitrary > domains since the result is a STRING again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46810: - Description: We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation. Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase: * 42 ** K01 *** INCOMPLETE_TYPE_DEFINITION ARRAY MAP STRUCT What are the names of these different levels of information? Some examples of inconsistent terminology: * [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION? * [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass? * [On this page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other places we refer to it as an "error class". I don't think we should leave this status quo as-is. I see a couple of ways to fix this. h1. 
Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" One solution is to use the following terms: * Error class: 42 * Error sub-class: K01 * Error state: 42K01 * Error condition: INCOMPLETE_TYPE_DEFINITION * Error sub-condition: ARRAY, MAP, STRUCT Pros: * This terminology seems (to me at least) the most natural and intuitive. * It aligns most closely to the SQL standard. Cons: * We use {{errorClass}} [all over our codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] – literally in thousands of places – to refer to strings like INCOMPLETE_TYPE_DEFINITION. ** It's probably not practical to update all these usages to say {{errorCondition}} instead, so if we go with this approach there will be a divide between the terminology we use in user-facing documentation vs. what the code base uses. ** We can perhaps rename the existing {{error-classes.json}} to {{error-conditions.json}} but clarify the reason for this divide between code and user docs in the documentation for {{ErrorClassesJsonReader}} . h1. Option 2: 42 becomes an "Error Category" Another approach is to use the following terminology: * Error category: 42 * Error sub-category: K01 * Error state: 42K01 * Error class: INCOMPLETE_TYPE_DEFINITION * Error sub-classes: ARRAY, MAP, STRUCT Pros: * We continue to use "error class" as we do today in our code base. * The change from calling "42" a "class" to a "category" is low impact and may not show up in user-facing documentation at all. (See my side note below.) Cons: * These terms do not align with the SQL standard. * We will have to retire the term "error condition", which we have [already used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] in user-facing documentation. h1. 
Option 3: "Error Class" and "State Class" * SQL state class: 42 * SQL state sub-class: K01 * SQL state: 42K01 * Error class: INCOMPLETE_TYPE_DEFINITION * Error sub-classes: ARRAY, MAP, STRUCT Pros: * We continue to use "error class" as we do today in our code base. * The change from calling "42" a "class" to a "state class" is low impact and may not show up in user-facing documentation at all. (See my side note below.) Cons: * "State class" vs. "Error class" is a bit confusing. * These terms do not align with the SQL standard. * We will have to retire the term "error condition", which we have [already used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] in user-facing documentation. — Side note: In any case, I believe talking about "42" and "K01" – regardless of what we end up calling them – in front of users is not helpful. I don't think anybody cares what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation.
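Whichever naming option wins, the underlying structure being named is fixed: the five-character SQLSTATE decomposes into a two-character class and a three-character subclass. A small illustrative sketch (the helper name is hypothetical, not part of Spark):

```python
# A SQLSTATE is exactly five characters: a two-character class followed by
# a three-character subclass.
def split_sqlstate(sqlstate: str) -> tuple[str, str]:
    if len(sqlstate) != 5:
        raise ValueError("a SQLSTATE is exactly five characters")
    return sqlstate[:2], sqlstate[2:]

state_class, state_subclass = split_sqlstate("42K01")
print(state_class)     # 42  (syntax error or access rule violation)
print(state_subclass)  # K01
```

The terminology debate is only about what to call these two pieces and the named conditions layered on top of them.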
[jira] [Created] (SPARK-46935) Consolidate error documentation
Nicholas Chammas created SPARK-46935: Summary: Consolidate error documentation Key: SPARK-46935 URL: https://issues.apache.org/jira/browse/SPARK-46935 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46810: - Description: We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation. Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase: * 42 ** K01 *** INCOMPLETE_TYPE_DEFINITION ARRAY MAP STRUCT What are the names of these different levels of information? Some examples of inconsistent terminology: * [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION? * [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass? * [On this page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other places we refer to it as an "error class". I don't think we should leave this status quo as-is. I see a couple of ways to fix this. h1. 
Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" One solution is to use the following terms: * Error class: 42 * Error sub-class: K01 * Error state: 42K01 * Error condition: INCOMPLETE_TYPE_DEFINITION * Error sub-condition: ARRAY, MAP, STRUCT Pros: * This terminology seems (to me at least) the most natural and intuitive. * It may also match the SQL standard. Cons: * We use {{errorClass}} [all over our codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] – literally in thousands of places – to refer to strings like INCOMPLETE_TYPE_DEFINITION. ** It's probably not practical to update all these usages to say {{errorCondition}} instead, so if we go with this approach there will be a divide between the terminology we use in user-facing documentation vs. what the code base uses. ** We can perhaps rename the existing {{error-classes.json}} to {{error-conditions.json}} but clarify the reason for this divide between code and user docs in the documentation for {{ErrorClassesJsonReader}} . h1. Option 2: 42 becomes an "Error Category" Another approach is to use the following terminology: * Error category: 42 * Error sub-category: K01 * Error state: 42K01 * Error class: INCOMPLETE_TYPE_DEFINITION * Error sub-classes: ARRAY, MAP, STRUCT Pros: * We continue to use "error class" as we do today in our code base. * The change from calling "42" a class to a category is low impact and may not show up in user-facing documentation at all. (See my side note below.) Cons: * These terms may not align with the SQL standard. * We will have to retire the term "error condition", which we have [already used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] in user-facing documentation. h1. 
Option 3: "Error Class" and "State Class" * SQL state class: 42 * SQL state sub-class: K01 * SQL state: 42K01 * Error class: INCOMPLETE_TYPE_DEFINITION * Error sub-classes: ARRAY, MAP, STRUCT — Side note: In any case, I believe talking about "42" and "K01" – regardless of what we end up calling them – in front of users is not helpful. I don't think anybody cares what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation. was: We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation. Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase: * 42 ** K01 *** INCOMPLETE_TYPE_DEFINITION ARRAY
[jira] [Updated] (SPARK-46923) Limit width of config tables in documentation and style them consistently
[ https://issues.apache.org/jira/browse/SPARK-46923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46923: - Summary: Limit width of config tables in documentation and style them consistently (was: Style config tables in documentation consistently) > Limit width of config tables in documentation and style them consistently > - > > Key: SPARK-46923 > URL: https://issues.apache.org/jira/browse/SPARK-46923 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46923) Style config tables in documentation consistently
Nicholas Chammas created SPARK-46923: Summary: Style config tables in documentation consistently Key: SPARK-46923 URL: https://issues.apache.org/jira/browse/SPARK-46923 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811923#comment-17811923 ] Nicholas Chammas commented on SPARK-46810: -- I think Option 3 is a good compromise that lets us continue calling {{INCOMPLETE_TYPE_DEFINITION}} an "error class", which perhaps would be the least disruptive to Spark developers. However, for the record, the SQL standard only seems to use the term "class" in the context of the 5-character SQLSTATE. Otherwise, the standard uses the term "condition" or "exception condition". I don't have a copy of the SQL 2016 standard handy; it is no longer available for sale on ISO's website. The only option appears to be to purchase [the SQL 2023 standard for ~$220|https://www.iso.org/standard/76583.html]. However, there is a copy of the [SQL 1992 standard available publicly|https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]. Table 23 on page 619 is relevant:
{code}
Table 23 - SQLSTATE class and subclass values

Condition                  Class   Subcondition                Subclass
ambiguous cursor name      3C      (no subclass)               000
cardinality violation      21      (no subclass)               000
connection exception       08      (no subclass)               000
                                   connection does not exist   003
                                   connection failure          006
                                   connection name in use      002
                                   SQL-client unable to        001
                                   establish SQL-connection
...
{code}
I think this maps closest to Option 1, but again if we want to go with Option 3 I think that's reasonable too. But in the case of Option 3 we should then retire [our use of the term "error condition"|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html] so that we don't use multiple terms to refer to the same thing.
> Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? 
> * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOM
[jira] [Created] (SPARK-46894) Move PySpark error conditions into standalone JSON file
Nicholas Chammas created SPARK-46894:
-------------------------------------

Summary: Move PySpark error conditions into standalone JSON file
Key: SPARK-46894
URL: https://issues.apache.org/jira/browse/SPARK-46894
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811627#comment-17811627 ]

Nicholas Chammas commented on SPARK-46810:
------------------------------------------

Thanks for sharing the relevant quote, [~srielau].

1. So just to be clear, you are saying you prefer Option 1. Is that correct? I will update the PR accordingly.
2. Is there anyone else we need buy-in from before moving forward? [~maxgekk], perhaps?

> Clarify error class terminology
> -------------------------------
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 4.0.0
> Reporter: Nicholas Chammas
> Priority: Minor
> Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation.
> Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase:
> * 42
> ** K01
> *** INCOMPLETE_TYPE_DEFINITION
> **** ARRAY
> **** MAP
> **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
> * [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION?
> * [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass?
> * [On this page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
> * Error class: 42
> * Error sub-class: K01
> * Error state: 42K01
> * Error condition: INCOMPLETE_TYPE_DEFINITION
> * Error sub-condition: ARRAY, MAP, STRUCT
> Pros:
> * This terminology seems (to me at least) the most natural and intuitive.
> * It may also match the SQL standard.
> Cons:
> * We use {{errorClass}} [all over our codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] – literally in thousands of places – to refer to strings like INCOMPLETE_TYPE_DEFINITION.
> ** It's probably not practical to update all these usages to say {{errorCondition}} instead, so if we go with this approach there will be a divide between the terminology we use in user-facing documentation vs. what the code base uses.
> ** We can perhaps rename the existing {{error-classes.json}} to {{error-conditions.json}} but clarify the reason for this divide between code and user docs in the documentation for {{ErrorClassesJsonReader}}.
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
> * Error category: 42
> * Error sub-category: K01
> * Error state: 42K01
> * Error class: INCOMPLETE_TYPE_DEFINITION
> * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
> * We continue to use "error class" as we do today in our code base.
> * The change from calling "42" a class to a category is low impact and may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
> * These terms may not align with the SQL standard.
> * We will have to retire the term "error condition", which we have [already used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] in user-facing documentation.
> ----
> Side note: In either case, I believe talking about "42" and "K01" – regardless of what we end up calling them – in front of users is not helpful. I don't think anybody cares what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation.
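The five-level hierarchy debated above can be made concrete with a small sketch. This is illustrative code only, not anything from the Spark codebase; the function and variable names are made up, and the naming follows Option 1 purely as an example. The one mechanical fact it relies on is that a SQLSTATE is five characters: a 2-character class followed by a 3-character subclass.

```python
# Illustrative sketch (not Spark code): how the levels discussed above
# relate to one another, labeled here with Option 1 terminology.

def split_sqlstate(state: str) -> tuple[str, str]:
    """Split a 5-character SQLSTATE into its 2-char class and 3-char subclass."""
    if len(state) != 5:
        raise ValueError(f"SQLSTATE must be 5 characters, got {state!r}")
    return state[:2], state[2:]

# The running example from the issue description:
error_state = "42K01"
error_class, error_subclass = split_sqlstate(error_state)  # ("42", "K01")
error_condition = "INCOMPLETE_TYPE_DEFINITION"
error_sub_conditions = ["ARRAY", "MAP", "STRUCT"]

print(error_class)     # 42
print(error_subclass)  # K01
```

Under Option 2 the same decomposition holds; only the labels change (category/sub-category instead of class/sub-class, and class/sub-class instead of condition/sub-condition).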
[jira] [Comment Edited] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811470#comment-17811470 ]

Nicholas Chammas edited comment on SPARK-46810 at 1/27/24 5:00 AM:
-------------------------------------------------------------------

[~srielau] - What do you think of the problem and proposed solutions described above? I am partial to Option 1, but certainly either solution will need buy-in from whoever cares about how we manage and document errors.

Also, you mentioned [on the PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL standard uses specific terms. Could you link to or quote the relevant parts?

was (Author: nchammas):

[~srielau] - What do you think of the problem and proposed solutions described above?

Also, you mentioned [on the PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL standard uses specific terms. Could you link to or quote the relevant parts?
[jira] [Updated] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-46810:
-------------------------------------

Description:

We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation.

Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase:

* 42
** K01
*** INCOMPLETE_TYPE_DEFINITION
**** ARRAY
**** MAP
**** STRUCT

What are the names of these different levels of information? Some examples of inconsistent terminology:

* [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION?
* [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass?
* [On this page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to fix this.

h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:

* Error class: 42
* Error sub-class: K01
* Error state: 42K01
* Error condition: INCOMPLETE_TYPE_DEFINITION
* Error sub-condition: ARRAY, MAP, STRUCT

Pros:
* This terminology seems (to me at least) the most natural and intuitive.
* It may also match the SQL standard.

Cons:
* We use {{errorClass}} [all over our codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] – literally in thousands of places – to refer to strings like INCOMPLETE_TYPE_DEFINITION.
** It's probably not practical to update all these usages to say {{errorCondition}} instead, so if we go with this approach there will be a divide between the terminology we use in user-facing documentation vs. what the code base uses.
** We can perhaps rename the existing {{error-classes.json}} to {{error-conditions.json}} but clarify the reason for this divide between code and user docs in the documentation for {{ErrorClassesJsonReader}}.

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:

* Error category: 42
* Error sub-category: K01
* Error state: 42K01
* Error class: INCOMPLETE_TYPE_DEFINITION
* Error sub-classes: ARRAY, MAP, STRUCT

Pros:
* We continue to use "error class" as we do today in our code base.
* The change from calling "42" a class to a category is low impact and may not show up in user-facing documentation at all. (See my side note below.)

Cons:
* These terms may not align with the SQL standard.
* We will have to retire the term "error condition", which we have [already used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] in user-facing documentation.

----

Side note: In either case, I believe talking about "42" and "K01" – regardless of what we end up calling them – in front of users is not helpful. I don't think anybody cares what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation.
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811470#comment-17811470 ]

Nicholas Chammas commented on SPARK-46810:
------------------------------------------

[~srielau] - What do you think of the problem and proposed solutions described above?

Also, you mentioned [on the PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL standard uses specific terms. Could you link to or quote the relevant parts?
[jira] [Updated] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-46810:
-------------------------------------

Description:

We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation.

Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase:

* 42
** K01
*** INCOMPLETE_TYPE_DEFINITION
**** ARRAY
**** MAP
**** STRUCT

What are the names of these different levels of information? Some examples of inconsistent terminology:

* [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION?
* [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass?
* [On this page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to fix this.

h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:

* Error class: 42
* Error sub-class: K01
* Error state: 42K01
* Error condition: INCOMPLETE_TYPE_DEFINITION
* Error sub-condition: ARRAY, MAP, STRUCT

Pros:
* This terminology seems (to me at least) the most natural and intuitive.
* It may also match the SQL standard.

Cons:
* We use {{errorClass}} [all over our codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] – literally in thousands of places – to refer to INCOMPLETE_TYPE_DEFINITION.
** It's probably not practical to update all these usages to say {{errorCondition}} instead, so if we go with this approach there will be a divide between the terminology we use in user-facing documentation vs. what the code base uses.
** We can perhaps rename the existing {{error-classes.json}} to {{error-conditions.json}} but clarify the reason for this divide in the documentation for {{ErrorClassesJsonReader}}.

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:

* Error category: 42
* Error sub-category: K01
* Error state: 42K01
* Error class: INCOMPLETE_TYPE_DEFINITION
* Error sub-classes: ARRAY, MAP, STRUCT

We should not use "error condition" if one of the above terms more accurately describes what we are talking about.

Side note: With this terminology, I believe talking about error categories and sub-categories in front of users is not helpful. I don't think anybody cares what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation.
[jira] [Created] (SPARK-46863) Clean up custom.css
Nicholas Chammas created SPARK-46863:
-------------------------------------

Summary: Clean up custom.css
Key: SPARK-46863
URL: https://issues.apache.org/jira/browse/SPARK-46863
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas
[jira] [Created] (SPARK-46825) Build Spark only once when building docs
Nicholas Chammas created SPARK-46825:
-------------------------------------

Summary: Build Spark only once when building docs
Key: SPARK-46825
URL: https://issues.apache.org/jira/browse/SPARK-46825
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas
[jira] [Created] (SPARK-46819) Port error class data to automation-friendly format
Nicholas Chammas created SPARK-46819:
-------------------------------------

Summary: Port error class data to automation-friendly format
Key: SPARK-46819
URL: https://issues.apache.org/jira/browse/SPARK-46819
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas

As described in SPARK-46810, we have several types of error data captured in our code and documentation. Unfortunately, a good chunk of this data is in a Markdown table that is not friendly to automation (e.g. to generate documentation, or run tests).

[https://github.com/apache/spark/blob/d1fbc4c7191aafadada1a6f7c217bf89f6cae49f/common/utils/src/main/resources/error/README.md#L121]

We should migrate this error data to an automation-friendly format.
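To make the "automation-friendly" motivation concrete: once error data lives in a structured format like JSON instead of a Markdown table, generating documentation or running tests over it takes only a few lines. The schema below is purely hypothetical, sketched for illustration; it is not the actual layout of Spark's error files.

```python
import json

# Hypothetical JSON error data (illustrative schema, not Spark's real one).
raw = """
{
  "INCOMPLETE_TYPE_DEFINITION": {
    "sqlState": "42K01",
    "subConditions": ["ARRAY", "MAP", "STRUCT"]
  }
}
"""

errors = json.loads(raw)

# Example of doc generation driven by the structured data:
# emit one documentation bullet per error.
for name, info in sorted(errors.items()):
    line = f"* {name} (SQLSTATE: {info['sqlState']})"
    print(line)
```

The same structure can back automated checks, e.g. a test asserting every entry has a valid 5-character SQLSTATE, which is impractical to enforce against a hand-maintained Markdown table.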
[jira] [Updated] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46810: - Description: We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation. Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase:
* 42
** K01
*** INCOMPLETE_TYPE_DEFINITION
**** ARRAY
**** MAP
**** STRUCT

What are the names of these different levels of information? Some examples of inconsistent terminology:
* [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION?
* [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass?
* [On this page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other places we refer to it as an "error class".

I personally like the terminology "error condition", but as we are already using "error class" very heavily throughout the codebase to refer to something like INCOMPLETE_TYPE_DEFINITION, I don't think it's practical to change at this point. To rationalize the different terms we are using, I propose the following terminology, which we should use consistently throughout our code and documentation:
* Error category: 42
* Error sub-category: K01
* Error state: 42K01
* Error class: INCOMPLETE_TYPE_DEFINITION
* Error sub-classes: ARRAY, MAP, STRUCT

We should not use "error condition" if one of the above terms more accurately describes what we are talking about. Side note: With this terminology, I believe talking about error categories and sub-categories in front of users is not helpful. I don't think anybody cares what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation.
[jira] [Updated] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46810: - Description: We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation. Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase:
* 42
** K01
*** INCOMPLETE_TYPE_DEFINITION
**** ARRAY
**** MAP
**** STRUCT

What are the names of these different levels of information? Some examples of inconsistent terminology:
* [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION?
* [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass?

I propose the following terminology, which we should use consistently throughout our code and documentation:
* Error class: 42
* Error subclass: K01
* Error state: 42K01
* Error condition: INCOMPLETE_TYPE_DEFINITION
* Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and subclasses in front of users is not helpful. I don't think anybody cares about what "42" by itself means, or what "K01" by itself means. Accordingly, we should limit how much we talk about these concepts in the user-facing documentation.
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809804#comment-17809804 ] Nicholas Chammas commented on SPARK-46810: -- [~itholic] [~gurwls223] - What do you think? cc also [~karenfeng], who I see in git blame as the original contributor of error classes.
[jira] [Created] (SPARK-46810) Clarify error class terminology
Nicholas Chammas created SPARK-46810: Summary: Clarify error class terminology Key: SPARK-46810 URL: https://issues.apache.org/jira/browse/SPARK-46810 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
We use inconsistent terminology when talking about error classes. I'd like to get some clarity on that before contributing any potential improvements to this part of the documentation. Consider [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. It has several key pieces of hierarchical information that have inconsistent names throughout our documentation and codebase:
* 42
** K01
*** INCOMPLETE_TYPE_DEFINITION
**** ARRAY
**** MAP
**** STRUCT

What are the names of these different levels of information? Some examples of inconsistent terminology:
* [Over here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we call that an "error class". So what exactly is a class, the 42 or the INCOMPLETE_TYPE_DEFINITION?
* [Over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] we call K01 the "subclass". But [over here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". So what exactly is a subclass?

I propose the following terminology, which we should use consistently throughout our code and documentation:
* Error class: 42
* Error subclass: K01
* Error state: 42K01
* Error condition: INCOMPLETE_TYPE_DEFINITION
* Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and subclasses in front of users is not helpful. I don't think anybody cares about what 42 by itself means, or what K01 by itself means. Accordingly, we should limit how much we talk about these concepts.
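The proposed split can be made concrete with a short sketch. Assumptions: this follows the SQL standard convention that a 5-character SQLSTATE is a 2-character class followed by a 3-character subclass; the function and field names below are illustrative, not Spark APIs. Under the proposal, INCOMPLETE_TYPE_DEFINITION would be the error condition and ARRAY/MAP/STRUCT its sub-conditions, which live in error-classes.json rather than in the SQLSTATE itself.

{code:python}
# Decompose an error state like "42K01" into the proposed terms.
# (Illustrative names only; not an actual Spark function.)

def split_sqlstate(state: str) -> dict:
    """Split a 5-character SQLSTATE into class and subclass."""
    if len(state) != 5:
        raise ValueError(f"SQLSTATE must be 5 characters, got {state!r}")
    return {
        "error_state": state,
        "error_class": state[:2],     # e.g. "42" (syntax error or access rule violation)
        "error_subclass": state[2:],  # e.g. "K01"
    }

parts = split_sqlstate("42K01")
assert parts == {"error_state": "42K01", "error_class": "42", "error_subclass": "K01"}
{code}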
[jira] [Created] (SPARK-46807) Include automation notice in SQL error class documents
Nicholas Chammas created SPARK-46807: Summary: Include automation notice in SQL error class documents Key: SPARK-46807 URL: https://issues.apache.org/jira/browse/SPARK-46807 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-46775) Fix formatting of Kinesis docs
Nicholas Chammas created SPARK-46775: Summary: Fix formatting of Kinesis docs Key: SPARK-46775 URL: https://issues.apache.org/jira/browse/SPARK-46775 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-46764) Reorganize Ruby script to build API docs
Nicholas Chammas created SPARK-46764: Summary: Reorganize Ruby script to build API docs Key: SPARK-46764 URL: https://issues.apache.org/jira/browse/SPARK-46764 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806954#comment-17806954 ] Nicholas Chammas commented on SPARK-45599: -- Using [Hypothesis|https://github.com/HypothesisWorks/hypothesis], I've managed to shrink the provided test case from 373 elements down to 14: {code:python} from math import nan from pyspark.sql import SparkSession HYPOTHESIS_EXAMPLE = [ (0.0,), (2.0,), (153.0,), (168.0,), (3252411229536261.0,), (7.205759403792794e+16,), (1.7976931348623157e+308,), (0.25,), (nan,), (nan,), (-0.0,), (-128.0,), (nan,), (nan,), ] spark = ( SparkSession.builder .config("spark.log.level", "ERROR") .getOrCreate() ) def compare_percentiles(data, slices): rdd = spark.sparkContext.parallelize(data, numSlices=1) df = spark.createDataFrame(rdd, "val double") result1 = df.selectExpr('percentile(val, 0.1)').collect()[0][0] rdd = spark.sparkContext.parallelize(data, numSlices=slices) df = spark.createDataFrame(rdd, "val double") result2 = df.selectExpr('percentile(val, 0.1)').collect()[0][0] assert result1 == result2, f"{result1}, {result2}" if __name__ == "__main__": compare_percentiles(HYPOTHESIS_EXAMPLE, 2) {code} Running this test fails as follows: {code:python} Traceback (most recent call last): File ".../SPARK-45599.py", line 41, in compare_percentiles(HYPOTHESIS_EXAMPLE, 2) File ".../SPARK-45599.py", line 37, in compare_percentiles assert result1 == result2, f"{result1}, {result2}" ^^ AssertionError: 0.050044, -0.0 {code} > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > -- > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0 >Reporter: Robert Joseph Evans >Priority: Critical > Labels: correctness > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because 
the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), 
(1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.43097
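The comments above trace the wrong percentile to how the accumulating hash map handles -0.0 versus 0.0. A hedged, Spark-free illustration of that failure mode (this is not Spark's actual OpenHashMap code, just one mechanism consistent with the report): the two zeros compare equal as doubles, but their IEEE-754 bit patterns differ, so a map that distinguishes them splits one value's count across two buckets, and the cumulative counts behind the percentile interpolation then depend on which zeros land where.

{code:python}
import struct
from collections import Counter

def double_bits(x: float) -> int:
    """Raw IEEE-754 bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# -0.0 and 0.0 compare equal...
assert -0.0 == 0.0
# ...but their bit patterns differ (-0.0 has the sign bit set),
# so a map keyed on the raw bits counts them separately.
assert double_bits(-0.0) != double_bits(0.0)

data = [0.0, -0.0, 1.0]
by_value = Counter(data)                          # one zero bucket: 2 keys
by_bits = Counter(double_bits(x) for x in data)   # two zero buckets: 3 keys
assert len(by_value) == 2
assert len(by_bits) == 3
{code}

A percentile computed from the split histogram can change with partitioning, which matches the numSlices-dependent answers reported above.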
[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806150#comment-17806150 ] Nicholas Chammas commented on SPARK-45599: -- cc [~dongjoon] - This is an old correctness bug with a concise reproduction.
[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-45599: - Labels: correctness (was: data-corruption)
[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806148#comment-17806148 ] Nicholas Chammas commented on SPARK-45599: -- I can confirm that this bug is still present on {{master}} at commit [a3266b411723310ec10fc1843ddababc15249db0|https://github.com/apache/spark/tree/a3266b411723310ec10fc1843ddababc15249db0]. With {{numSlices=4}} I get {{-5.924228780007003E136}} and with {{numSlices=1}} I get {{{}-4.739483957565084E136{}}}. Updating the label on this issue. I will also ping some committers to bring this bug to their attention, as correctness bugs are taken very seriously. > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > -- > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0 >Reporter: Robert Joseph Evans >Priority: Critical > Labels: data-corruption > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. 
in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), 
(-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), >
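The root cause described in this report — OpenHashMap keying on raw bit patterns — can be illustrated outside of Spark. The sketch below is plain Python, not Spark code: it shows that -0.0 and 0.0 compare equal yet carry different IEEE-754 bit patterns, so a hash map that keys on bits will count them as two distinct values even though sorting treats them as one.

```python
import math
import struct

neg_zero, pos_zero = -0.0, 0.0

# The two values compare equal...
print(neg_zero == pos_zero)  # True

# ...but their IEEE-754 bit patterns differ (sign bit set vs. clear),
# which is what a bit-pattern-keyed hash map like OpenHashMap sees.
bits = lambda x: struct.unpack(">Q", struct.pack(">d", x))[0]
print(hex(bits(neg_zero)))  # 0x8000000000000000
print(hex(bits(pos_zero)))  # 0x0

# copysign exposes the hidden sign without inspecting the bits directly.
print(math.copysign(1.0, neg_zero))  # -1.0
```

Because percentile interpolates between ranked values, splitting one logical zero into two map entries shifts the ranks and can change the interpolated result, which is consistent with the different answers seen for {{numSlices=1}} vs {{numSlices=4}}.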
[jira] [Updated] (SPARK-46395) Assign Spark configs to groups for use in documentation
[ https://issues.apache.org/jira/browse/SPARK-46395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46395: - Summary: Assign Spark configs to groups for use in documentation (was: Automatically generate SQL configuration tables for documentation) > Assign Spark configs to groups for use in documentation > --- > > Key: SPARK-46395 > URL: https://issues.apache.org/jira/browse/SPARK-46395 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.5.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available >
[jira] [Created] (SPARK-46668) Parallelize Sphinx build of Python API docs
Nicholas Chammas created SPARK-46668: Summary: Parallelize Sphinx build of Python API docs Key: SPARK-46668 URL: https://issues.apache.org/jira/browse/SPARK-46668 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-46658) Loosen Ruby dependency specs for doc build
Nicholas Chammas created SPARK-46658: Summary: Loosen Ruby dependency specs for doc build Key: SPARK-46658 URL: https://issues.apache.org/jira/browse/SPARK-46658 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Updated] (SPARK-46437) Enable conditional includes in Jekyll documentation
[ https://issues.apache.org/jira/browse/SPARK-46437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46437: - Component/s: (was: SQL) > Enable conditional includes in Jekyll documentation > --- > > Key: SPARK-46437 > URL: https://issues.apache.org/jira/browse/SPARK-46437 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available >
[jira] [Updated] (SPARK-46437) Enable conditional includes in Jekyll documentation
[ https://issues.apache.org/jira/browse/SPARK-46437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46437: - Summary: Enable conditional includes in Jekyll documentation (was: Remove unnecessary cruft from SQL built-in functions docs) > Enable conditional includes in Jekyll documentation > --- > > Key: SPARK-46437 > URL: https://issues.apache.org/jira/browse/SPARK-46437 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.5.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available >
[jira] [Created] (SPARK-46626) Bump jekyll version to support Ruby 3.3
Nicholas Chammas created SPARK-46626: Summary: Bump jekyll version to support Ruby 3.3 Key: SPARK-46626 URL: https://issues.apache.org/jira/browse/SPARK-46626 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Updated] (SPARK-46449) Add ability to create databases/schemas via Catalog API
[ https://issues.apache.org/jira/browse/SPARK-46449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46449: - Summary: Add ability to create databases/schemas via Catalog API (was: Add ability to create databases via Catalog API) > Add ability to create databases/schemas via Catalog API > --- > > Key: SPARK-46449 > URL: https://issues.apache.org/jira/browse/SPARK-46449 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Nicholas Chammas >Priority: Minor > > As of Spark 3.5, the only way to create a database is via SQL. The Catalog > API should offer an equivalent. > Perhaps something like: > {code:python} > spark.catalog.createDatabase( > name: str, > existsOk: bool = False, > comment: str = None, > location: str = None, > properties: dict = None, > ) > {code} > If {{schema}} is the preferred terminology, then we should use that instead > of {{database}}.
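Until a {{createDatabase}} method exists, the proposal above can be approximated by generating the equivalent SQL. The helper below is a hypothetical sketch — the function name and signature mirror the proposal, not any real Spark API — that builds a {{CREATE DATABASE}} statement, which could then be handed to {{spark.sql()}}:

```python
def create_database_sql(name, exists_ok=False, comment=None,
                        location=None, properties=None):
    # Build the SQL equivalent of the proposed Catalog call.
    parts = ["CREATE DATABASE"]
    if exists_ok:
        parts.append("IF NOT EXISTS")
    parts.append(f"`{name}`")
    if comment is not None:
        parts.append(f"COMMENT '{comment}'")
    if location is not None:
        parts.append(f"LOCATION '{location}'")
    if properties:
        kvs = ", ".join(f"'{k}' = '{v}'" for k, v in properties.items())
        parts.append(f"WITH DBPROPERTIES ({kvs})")
    return " ".join(parts)

# Usage with a real session would be: spark.sql(create_database_sql(...))
print(create_database_sql("analytics", exists_ok=True))
# CREATE DATABASE IF NOT EXISTS `analytics`
```

Note this sketch does not escape quotes inside {{comment}} or property values; a real implementation would go through the catalog plugin API rather than string-building SQL.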
[jira] [Updated] (SPARK-46449) Add ability to create databases via Catalog API
[ https://issues.apache.org/jira/browse/SPARK-46449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-46449: - Description: As of Spark 3.5, the only way to create a database is via SQL. The Catalog API should offer an equivalent. Perhaps something like: {code:python} spark.catalog.createDatabase( name: str, existsOk: bool = False, comment: str = None, location: str = None, properties: dict = None, ) {code} If {{schema}} is the preferred terminology, then we should use that instead of {{database}}. was: As of Spark 3.5, the only way to create a database is via SQL. The Catalog API should offer an equivalent. Perhaps something like: {code:python} spark.catalog.createDatabase( name: str, existsOk: bool = False, comment: str = None, location: str = None, properties: dict = None, ) {code} > Add ability to create databases via Catalog API > --- > > Key: SPARK-46449 > URL: https://issues.apache.org/jira/browse/SPARK-46449 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Nicholas Chammas >Priority: Minor > > As of Spark 3.5, the only way to create a database is via SQL. The Catalog > API should offer an equivalent. > Perhaps something like: > {code:python} > spark.catalog.createDatabase( > name: str, > existsOk: bool = False, > comment: str = None, > location: str = None, > properties: dict = None, > ) > {code} > If {{schema}} is the preferred terminology, then we should use that instead > of {{database}}.
[jira] [Created] (SPARK-46449) Add ability to create databases via Catalog API
Nicholas Chammas created SPARK-46449: Summary: Add ability to create databases via Catalog API Key: SPARK-46449 URL: https://issues.apache.org/jira/browse/SPARK-46449 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Nicholas Chammas As of Spark 3.5, the only way to create a database is via SQL. The Catalog API should offer an equivalent. Perhaps something like: {code:python} spark.catalog.createDatabase( name: str, existsOk: bool = False, comment: str = None, location: str = None, properties: dict = None, ) {code}
[jira] [Created] (SPARK-46437) Remove unnecessary cruft from SQL built-in functions docs
Nicholas Chammas created SPARK-46437: Summary: Remove unnecessary cruft from SQL built-in functions docs Key: SPARK-46437 URL: https://issues.apache.org/jira/browse/SPARK-46437 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.5.0 Reporter: Nicholas Chammas
[jira] [Created] (SPARK-46395) Automatically generate SQL configuration tables for documentation
Nicholas Chammas created SPARK-46395: Summary: Automatically generate SQL configuration tables for documentation Key: SPARK-46395 URL: https://issues.apache.org/jira/browse/SPARK-46395 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.5.0 Reporter: Nicholas Chammas
[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795162#comment-17795162 ] Nicholas Chammas commented on SPARK-45599: -- Per the [contributing guide|https://spark.apache.org/contributing.html], I suggest the {{correctness}} label instead of {{{}data-corruption{}}}. > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > -- > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0 >Reporter: Robert Joseph Evans >Priority: Critical > Labels: data-corruption > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > 
(3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), 
(5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.74
[jira] [Created] (SPARK-46357) Replace use of setConf with conf.set in docs
Nicholas Chammas created SPARK-46357: Summary: Replace use of setConf with conf.set in docs Key: SPARK-46357 URL: https://issues.apache.org/jira/browse/SPARK-46357 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.5.0 Reporter: Nicholas Chammas
[jira] [Commented] (SPARK-37571) decouple amplab jenkins from spark website, builds and tests
[ https://issues.apache.org/jira/browse/SPARK-37571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793347#comment-17793347 ] Nicholas Chammas commented on SPARK-37571: -- Since we've [retired|https://lists.apache.org/thread/5n59fs22rtytflbz4sz1pz32ozzfbkrx] the venerable Jenkins infrastructure, I suppose we can close this issue. > decouple amplab jenkins from spark website, builds and tests > > > Key: SPARK-37571 > URL: https://issues.apache.org/jira/browse/SPARK-37571 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > Attachments: audit.txt, spark-repo-to-be-audited.txt > > > we will be turning off jenkins on dec 23rd, and we need to decouple the build > infra from jenkins, as well as remove any amplab jenkins-specific docs on the > website, scripts and infra setup. > i'll be creating > 1 PRs for this.
[jira] [Resolved] (SPARK-37647) Expose percentile function in Scala/Python APIs
[ https://issues.apache.org/jira/browse/SPARK-37647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-37647. -- Resolution: Fixed It looks like this got added as part of Spark 3.5: [https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.percentile.html] > Expose percentile function in Scala/Python APIs > --- > > Key: SPARK-37647 > URL: https://issues.apache.org/jira/browse/SPARK-37647 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > SQL offers a percentile function (exact, not approximate) that is not > available directly in the Scala or Python DataFrame APIs. > While it is possible to invoke SQL functions from Scala or Python via > {{{}expr(){}}}, I think most users expect function parity across Scala, > Python, and SQL.
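For readers unfamiliar with what the exact (non-approximate) percentile computes, here is a minimal pure-Python sketch — a toy model, not Spark's implementation — of percentile with linear interpolation between ranks:

```python
def percentile_exact(values, p):
    """Exact percentile for 0.0 <= p <= 1.0, with linear interpolation.

    A toy model of what SQL's percentile() returns; Spark's actual
    implementation aggregates value counts before interpolating.
    """
    xs = sorted(values)
    rank = p * (len(xs) - 1)      # fractional rank in the sorted data
    lo = int(rank)
    frac = rank - lo
    if frac == 0:
        return xs[lo]
    # Interpolate between the two neighboring ranked values.
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

print(percentile_exact([1, 2, 3, 4], 0.5))  # 2.5
```

Unlike {{approx_percentile}}, this walks the full dataset, which is why exposing the exact variant in the DataFrame APIs matters for parity rather than being a mere convenience.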
[jira] [Commented] (SPARK-45390) Remove `distutils` usage
[ https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787268#comment-17787268 ] Nicholas Chammas commented on SPARK-45390: -- Ah, are you referring to [PySpark's Python dependencies|https://github.com/apache/spark/blob/4520f3b2da01badb506488b6ff2899babd8c709e/python/setup.py#L310-L330] not supporting Python 3.12? > Remove `distutils` usage > > > Key: SPARK-45390 > URL: https://issues.apache.org/jira/browse/SPARK-45390 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in > Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} > package.
[jira] [Commented] (SPARK-45390) Remove `distutils` usage
[ https://issues.apache.org/jira/browse/SPARK-45390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786401#comment-17786401 ] Nicholas Chammas commented on SPARK-45390: -- {quote}We don't promise to support all future unreleased Python versions {quote} "all future unreleased versions" is a tall ask that no-one is making. :) The relevant circumstances here are that a) Python 3.12 is already out and the backwards-incompatible changes are known and [very limited|https://docs.python.org/3/whatsnew/3.12.html], and b) Spark 4.0 may be a disruptive change and so many people may remain on Spark 3.5 for longer than usual. If we expect 3.5 -> 4.0 to be an easy migration, then backporting a fix like this to 3.5 is not as important. {quote}we need much more validation because all Python package ecosystem should work there without any issues {quote} I'm not sure what you mean here. Anyway, I suppose we could just wait and see. Maybe I'm wrong, but I suspect many users will find it surprising that Spark 3.5 doesn't work on Python 3.12, especially if this is the only (or close to the only) fix required. > Remove `distutils` usage > > > Key: SPARK-45390 > URL: https://issues.apache.org/jira/browse/SPARK-45390 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [PEP-632|https://peps.python.org/pep-0632] deprecated {{distutils}} module in > Python {{3.10}} and dropped in Python {{3.12}} in favor of {{packaging}} > package.
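For context on the migration this ticket tracks: the usual {{distutils}} casualty in build scripts is {{distutils.version.LooseVersion}}. The {{packaging}} library's {{Version}} class is the recommended replacement, but for plain dotted numeric versions a stdlib-only stand-in (a toy sketch, not what Spark ships) is enough to show the idea:

```python
def version_tuple(v):
    # Toy replacement for distutils.version.LooseVersion, valid only for
    # plain dotted numeric versions like "3.12.0". Real code should use
    # packaging.version.Version, which also handles pre-releases etc.
    return tuple(int(part) for part in v.split("."))

# Naive string comparison gets this wrong ("9" sorts after "12" lexically);
# tuple comparison does not.
print(version_tuple("3.12.0") > version_tuple("3.9.16"))  # True
print("3.12.0" > "3.9.16")                                # False
```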
[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()
[ https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598403#comment-17598403 ] Nicholas Chammas commented on SPARK-31001: -- Thanks for sharing these details. This is very helpful. Yeah, this seems like an "unofficial" answer to the original problem. It is helpful nonetheless, but as you said it will take a separate effort to formalize and document this. I agree that a formal solution will probably not use an option named with leading underscores. > Add ability to create a partitioned table via catalog.createTable() > --- > > Key: SPARK-31001 > URL: https://issues.apache.org/jira/browse/SPARK-31001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There doesn't appear to be a way to create a partitioned table using the > Catalog interface. > In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.
[jira] [Commented] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()
[ https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598115#comment-17598115 ] Nicholas Chammas commented on SPARK-31001: -- What's {{{}__partition_columns{}}}? Is that something specific to Delta, or are you saying it's a hidden feature of Spark? > Add ability to create a partitioned table via catalog.createTable() > --- > > Key: SPARK-31001 > URL: https://issues.apache.org/jira/browse/SPARK-31001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There doesn't appear to be a way to create a partitioned table using the > Catalog interface. > In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.
[jira] [Created] (SPARK-39630) Allow all Reader or Writer settings to be provided as options
Nicholas Chammas created an issue Spark / SPARK-39630 Allow all Reader or Writer settings to be provided as options Issue Type: Improvement Affects Versions: 3.3.0 Assignee: Unassigned Components: SQL Created: 28/Jun/22 21:03 Priority: Minor Reporter: Nicholas Chammas Almost all Reader or Writer settings can be provided via individual calls to `.option()` or by providing a map to `.options()`. There are notable exceptions, though, like: read/write format write mode write partitionBy, bucketBy, and sortBy These settings must be provided via dedicated method calls. Why not make it so that all settings can be provided as options? Is there a design reason not to do this? Any given DataFrameReader or DataFrameWriter (along with the streaming equivalents) should be able to "export" all of its settings as a map of options, and then in turn be reconstituted entirely from that map of options. reader1 = spark.read.option("format", "parquet").option("path", "/data") options = reader1.getOptions() reader2 = spark.read.options(options) # reader1 and reader2 are configured identically data1 = reader1.load() data2 = reader2.load() data1.collect() == data2.collect() Some
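The export/reconstitute idea in this proposal can be modeled without Spark. The sketch below uses hypothetical names ({{getOptions}}/{{get_options}} does not exist in Spark's API): a reader-like object whose full configuration round-trips through a plain options map, which is the behavior the ticket asks for.

```python
class FakeReader:
    """Toy stand-in for DataFrameReader, modeling the proposed behavior."""

    def __init__(self):
        self._options = {}

    def option(self, key, value):
        self._options[key] = value
        return self  # chainable, like the real reader

    def options(self, opts):
        self._options.update(opts)
        return self

    def get_options(self):
        # The proposed "export": every setting, format and path included,
        # comes back as a single map that can rebuild the reader.
        return dict(self._options)

reader1 = FakeReader().option("format", "parquet").option("path", "/data")
reader2 = FakeReader().options(reader1.get_options())
print(reader1.get_options() == reader2.get_options())  # True
```

In real Spark, format and path are held in dedicated fields rather than the options map, which is exactly the asymmetry the ticket points out.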
[jira] [Created] (SPARK-39582) "Since " docs on array_agg are incorrect
Nicholas Chammas created SPARK-39582: Summary: "Since " docs on array_agg are incorrect Key: SPARK-39582 URL: https://issues.apache.org/jira/browse/SPARK-39582 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Nicholas Chammas [https://spark.apache.org/docs/latest/api/sql/#array_agg] The docs currently say "Since: 2.0.0", but `array_agg` was added in Spark 3.3.0.
[jira] [Commented] (SPARK-37219) support AS OF syntax
[ https://issues.apache.org/jira/browse/SPARK-37219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537589#comment-17537589 ] Nicholas Chammas commented on SPARK-37219: -- This change will enable not just Delta, but also Iceberg to use the {{AS OF}} syntax, correct? By the way, could an admin please delete the spam comments just above (and perhaps also ban the user if that's all they comment on here)? > support AS OF syntax > > > Key: SPARK-37219 > URL: https://issues.apache.org/jira/browse/SPARK-37219 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > > https://docs.databricks.com/delta/quick-start.html#query-an-earlier-version-of-the-table-time-travel > Delta Lake time travel allows user to query an older snapshot of a Delta > table. To query an older version of a table, user needs to specify a version > or timestamp in a SELECT statement using AS OF syntax as the follows > SELECT * FROM default.people10m VERSION AS OF 0; > SELECT * FROM default.people10m TIMESTAMP AS OF '2019-01-29 00:37:58'; > This ticket is opened to add AS OF syntax in Spark
[jira] [Updated] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()
[ https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31001: - Description: There doesn't appear to be a way to create a partitioned table using the Catalog interface. In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}. was:There doesn't appear to be a way to create a partitioned table using the Catalog interface. > Add ability to create a partitioned table via catalog.createTable() > --- > > Key: SPARK-31001 > URL: https://issues.apache.org/jira/browse/SPARK-31001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There doesn't appear to be a way to create a partitioned table using the > Catalog interface. > In SQL, however, you can do this via {{{}CREATE TABLE ... PARTITIONED BY{}}}.
[jira] [Comment Edited] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528233#comment-17528233 ] Nicholas Chammas edited comment on SPARK-37222 at 4/26/22 3:44 PM: --- I've found a helpful log setting that causes Spark to print out detailed information about how exactly a plan is transformed during optimization: {code:java} spark.conf.set("spark.sql.planChangeLog.level", "warn") {code} Here's the log generated by enabling this setting and running Shawn's example: [^plan-log.log] To confirm what Shawn noted in his comment above, it looks like the chain of events that results in a loop is as follows: # ColumnPruning # FoldablePropagation __ # RemoveNoopOperators # PushDownLeftSemiAntiJoin # ColumnPruning # CollapseProject # __ What seems to be the problem is that ColumnPruning inserts some Project operators which are then removed successively by CollapseProject, RemoveNoopOperators, and PushDownLeftSemiAntiJoin. These rules go back and forth, undoing each other's work, until {{spark.sql.optimizer.maxIterations}} is exhausted. was (Author: nchammas): I've found a helpful log setting that causes Spark to print out detailed information about how exactly a plan is transformed during optimization: {code:java} spark.conf.set("spark.sql.planChangeLog.level", "warn") {code} Here's the log generated by enabling this setting and running Shawn's example: [^plan-log.log] To confirm what Shawn noted in his comment above, it looks like the chain of events that results in a loop is as follows: # PushDownLeftSemiAntiJoin # ColumnPruning # CollapseProject # FoldablePropagation # RemoveNoopOperators # What seems to be the problem is that: * ColumnPruning inserts a couple of Project operators which are then removed by CollapseProject. * CollapseProject in turn pushes up the left anti-join which is then pushed down again by PushDownLeftSemiAntiJoin. 
These three rules go back and forth, undoing each other's work, until {{spark.sql.optimizer.maxIterations}} is exhausted. > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > --- > > Key: SPARK-37222 > URL: https://issues.apache.org/jira/browse/SPARK-37222 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2, 3.2.0, 3.2.1 > Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and > with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, > 2021. > The problem does not occur with Spark 3.0.1. > >Reporter: Shawn Smith >Priority: Major > Attachments: plan-log.log > > > The query optimizer never reaches a fixed point when optimizing the query > below. This manifests as a warning: > > WARN: Max iterations (100) reached for batch Operator Optimization before > > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a > > larger value. > But the suggested fix won't help. The actual problem is that the optimizer > fails to make progress on each iteration and gets stuck in a loop. > In practice, Spark logs a warning but continues on and appears to execute the > query successfully, albeit perhaps sub-optimally. > To reproduce, paste the following into the Spark shell. 
With Spark 3.1.2 and > 3.2.0 but not 3.0.1 it will throw an exception: > {noformat} > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > {noformat} > Looking at the query plan as the optimizer iterates in > {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations: > {noformat} > Project [id#2, _gen_alias_108#108L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_108#108L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > And here's the plan after one more iteration. You can see that all that has > changed is new aliases for the column in the nested column: > "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}". > {noformat} > Project [id#2, _gen_alias_109#109L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >
[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528233#comment-17528233 ] Nicholas Chammas commented on SPARK-37222: -- I've found a helpful log setting that causes Spark to print out detailed information about how exactly a plan is transformed during optimization: {code:java} spark.conf.set("spark.sql.planChangeLog.level", "warn") {code} Here's the log generated by enabling this setting and running Shawn's example: [^plan-log.log] To confirm what Shawn noted in his comment above, it looks like the chain of events that results in a loop is as follows: # PushDownLeftSemiAntiJoin # ColumnPruning # CollapseProject # FoldablePropagation # RemoveNoopOperators # What seems to be the problem is that: * ColumnPruning inserts a couple of Project operators which are then removed by CollapseProject. * CollapseProject in turn pushes up the left anti-join which is then pushed down again by PushDownLeftSemiAntiJoin. These three rules go back and forth, undoing each other's work, until {{spark.sql.optimizer.maxIterations}} is exhausted. > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > --- > > Key: SPARK-37222 > URL: https://issues.apache.org/jira/browse/SPARK-37222 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2, 3.2.0, 3.2.1 > Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and > with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, > 2021. > The problem does not occur with Spark 3.0.1. > >Reporter: Shawn Smith >Priority: Major > Attachments: plan-log.log > > > The query optimizer never reaches a fixed point when optimizing the query > below. This manifests as a warning: > > WARN: Max iterations (100) reached for batch Operator Optimization before > > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a > > larger value. 
> But the suggested fix won't help. The actual problem is that the optimizer > fails to make progress on each iteration and gets stuck in a loop. > In practice, Spark logs a warning but continues on and appears to execute the > query successfully, albeit perhaps sub-optimally. > To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and > 3.2.0 but not 3.0.1 it will throw an exception: > {noformat} > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > {noformat} > Looking at the query plan as the optimizer iterates in > {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations: > {noformat} > Project [id#2, _gen_alias_108#108L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_108#108L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > And here's the plan after one more iteration. You can see that all that has > changed is new aliases for the column in the nested column: > "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}". 
> {noformat} > Project [id#2, _gen_alias_109#109L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_109#109L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > The optimizer continues looping and tweaking the alias until it hits the max > iteration count and bails out. > Here's an example that includes a stack trace: > {noformat} > $ bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Nested(b: Boolean, n: Long) > ca
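The non-converging rule cycle described in the comment above can be sketched with a toy fixed-point executor in plain Python (an illustration of the mechanism only, not Spark's {{RuleExecutor}}; all names here are made up):

```python
def run_to_fixed_point(plan, rules, max_iterations=100):
    """Apply each rule in order, repeatedly, until the plan stops
    changing (a fixed point) or the iteration budget is exhausted."""
    for i in range(1, max_iterations + 1):
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:
            return plan, i  # converged
        plan = new_plan
    # Spark logs a similar warning when a batch never converges.
    print(f"Max iterations ({max_iterations}) reached")
    return plan, max_iterations

# Two toy "rules" that undo each other's work while minting a fresh
# alias each round, like the Project/Join cycle described above.
alias_counter = [100]

def push_down_join(plan):
    if plan.startswith("Project("):
        alias_counter[0] += 1
        return f"Join(Project(alias_{alias_counter[0]}))"
    return plan

def collapse_project(plan):
    if plan.startswith("Join("):
        return f"Project(Join(alias_{alias_counter[0]}))"
    return plan

plan, iterations = run_to_fixed_point(
    "Project(Join(alias_100))", [push_down_join, collapse_project])
# The cycling rules never converge: 'iterations' hits the budget of 100,
# and only the alias number has changed, exactly as in the plans above.

stable, n = run_to_fixed_point("Scan()", [push_down_join, collapse_project])
# No rule fires on "Scan()", so it converges on the first pass (n == 1).
```

A converging rule set stops as soon as one full pass leaves the plan unchanged; the cycling rules never produce such a pass, so raising {{spark.sql.optimizer.maxIterations}} only delays the warning.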
[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37222: - Attachment: plan-log.log > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > ---
[jira] [Updated] (SPARK-37696) Optimizer exceeds max iterations
[ https://issues.apache.org/jira/browse/SPARK-37696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37696: - Affects Version/s: 3.2.1 > Optimizer exceeds max iterations > > > Key: SPARK-37696 > URL: https://issues.apache.org/jira/browse/SPARK-37696 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: Denis Tarima >Priority: Minor > > A specific scenario causing Spark's failure in tests and a warning in > production: > 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) > reached for batch Operator Optimization before Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) > reached for batch Operator Optimization after Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > > To reproduce run the following commands in `spark-shell`: > {{// define case class for a struct type in an array}} > {{case class S(v: Int, v2: Int)}} > > {{// prepare a table with an array of structs}} > {{Seq((10, Seq(S(1, 2)))).toDF("i", "data").write.saveAsTable("tbl")}} > > {{// select using SQL and join with a dataset using "left_anti"}} > {{spark.sql("select i, data[size(data) - 1].v from > tbl").join(Seq(10).toDF("i"), Seq("i"), "left_anti").show()}} > > The following conditions are required: > # Having additional `v2` field in `S` > # Using `{{{}data[size(data) - 1]{}}}` instead of `{{{}element_at(data, > -1){}}}` > # Using `{{{}left_anti{}}}` in join operation > > The same behavior was observed in `master` branch and `3.1.1`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
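As the repro conditions above imply, `data[size(data) - 1]` and `element_at(data, -1)` both select the last array element; `element_at` uses 1-based indexing, with negative indices counting back from the end and (with ANSI mode off) NULL for out-of-range indices. A plain-Python sketch of that indexing convention (illustrative, not Spark code):

```python
def element_at(arr, index):
    """Toy model of Spark SQL's element_at for arrays: 1-based from the
    front, negative indices count back from the end, index 0 is invalid,
    and out-of-range indices yield None (NULL, with ANSI mode off)."""
    if index == 0:
        raise ValueError("SQL array indices start at 1")
    pos = index - 1 if index > 0 else len(arr) + index
    return arr[pos] if 0 <= pos < len(arr) else None

data = [10, 20, 30]
last = element_at(data, -1)    # same element as data[len(data) - 1]
first = element_at(data, 1)
missing = element_at(data, 4)  # out of range -> None
```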
[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527527#comment-17527527 ] Nicholas Chammas commented on SPARK-37222: -- Thanks for the detailed report, [~ssmith]. I am hitting this issue as well on Spark 3.2.1, and your minimal test case also reproduces the issue for me. How did you break down the optimization into its individual steps like that? That was very helpful. I was able to use your breakdown to work around the issue by excluding {{{}PushDownLeftSemiAntiJoin{}}}: {code:java} spark.conf.set( "spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin" ){code} If I run that before running the problematic query (including your test case), it seems to work around the issue. > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > ---
[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37222: - Affects Version/s: 3.2.1 > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > ---
[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462805#comment-17462805 ] Nicholas Chammas commented on SPARK-5997: - [~tenstriker] - I believe in your case you should be able to set {{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle the data but it will still split up your output across multiple files. > Increase partition count without performing a shuffle > - > > Key: SPARK-5997 > URL: https://issues.apache.org/jira/browse/SPARK-5997 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Andrew Ash >Priority: Major > > When decreasing partition count with rdd.repartition() or rdd.coalesce(), the > user has the ability to choose whether or not to perform a shuffle. However > when increasing partition count there is no option of whether to perform a > shuffle or not -- a shuffle always occurs. > This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call > that performs a repartition to a higher partition count without a shuffle. > The motivating use case is to decrease the size of an individual partition > enough that the .toLocalIterator has significantly reduced memory pressure on > the driver, as it loads a partition at a time into the driver. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
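The effect of {{spark.sql.files.maxRecordsPerFile}} suggested in the comment above can be sketched in plain Python: each writer caps its own output chunks locally, without redistributing records across tasks (illustrative only; the helper name is made up):

```python
def split_into_files(records, max_records_per_file):
    """Split one task's output into chunks of at most max_records_per_file
    records, without moving any record to another task (i.e., no shuffle).
    A value of 0 means no cap, matching the config's default."""
    if max_records_per_file <= 0:
        return [records]
    return [records[i:i + max_records_per_file]
            for i in range(0, len(records), max_records_per_file)]

# One partition of 10 records becomes 4 capped output "files",
# with record order preserved and no data exchanged between tasks.
chunks = split_into_files(list(range(10)), 3)
```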
[jira] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997 ] Nicholas Chammas deleted comment on SPARK-5997: - was (Author: nchammas): [~tenstriker] - I believe in your case you should be able to set {{spark.sql.files.maxRecordsPerFile}} to some number. Spark will not shuffle the data but it will still split up your output across multiple files. > Increase partition count without performing a shuffle > - -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462718#comment-17462718 ] Nicholas Chammas commented on SPARK-24853: -- I would expect something like that to yield an {{{}AnalysisException{}}}. Would that address your concern, or are you suggesting that it might be difficult to catch that sort of problem cleanly? > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459601#comment-17459601 ] Nicholas Chammas commented on SPARK-24853: -- Assuming we are talking about the example I provided: Yes, {{col("count")}} would still be ambiguous. I don't know if Spark would know to catch that problem. But note that the current behavior of {{.withColumnRenamed('count', ...)}} renames all columns named "count", which is just incorrect. So allowing {{col("count")}} will either be just as incorrect as the current behavior, or it will be an improvement in that Spark may complain that the column reference is ambiguous. I'd have to try it to confirm the behavior. Of course, the main improvement offered by {{Column}} references is that users can do something like {{.withColumnRenamed(left_counts['count'], ...)}} and get the correct behavior. I didn't follow what you are getting at regarding {{{}from_json{}}}, but does that address your concern? > Support Column type for withColumn and withColumnRenamed apis > - -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
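The by-name versus by-reference distinction discussed in the comment above can be modeled with a toy column structure in plain Python (illustrative only, not Spark's implementation):

```python
class Column:
    """Toy column reference: object identity tells apart two columns
    that happen to share a name (e.g. after a join)."""
    def __init__(self, name):
        self.name = name

def rename_by_name(columns, old_name, new_name):
    # Name-based rename hits EVERY matching column -- analogous to
    # .withColumnRenamed("count", ...) after a join produced two "count"s.
    for c in columns:
        if c.name == old_name:
            c.name = new_name

def rename_by_reference(columns, target, new_name):
    # Reference-based rename touches exactly the object passed in,
    # so duplicate names are no longer ambiguous.
    for c in columns:
        if c is target:
            c.name = new_name

left_count, right_count = Column("count"), Column("count")
cols = [Column("id"), left_count, right_count]
rename_by_reference(cols, left_count, "left_count")
names = [c.name for c in cols]   # only the targeted column was renamed

cols2 = [Column("id"), Column("count"), Column("count")]
rename_by_name(cols2, "count", "n")
names2 = [c.name for c in cols2]  # both "count" columns were renamed
```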
[jira] [Resolved] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-25150. -- Fix Version/s: 3.2.0 Resolution: Fixed It looks like Spark 3.1.2 exhibits a different sort of broken behavior: {code:java} pyspark.sql.utils.AnalysisException: Column State#38 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check. {code} I don't think the join in {{zombie-analysis.py}} is ambiguous, and since this now works fine in Spark 3.2.0, that's what I'm going to mark as the "Fix Version" for this issue. The fix must have made it in somewhere between Spark 3.1.2 and 3.2.0. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 >Reporter: Nicholas Chammas >Priority: Major > Labels: correctness > Fix For: 3.2.0 > > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. 
> Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org