Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Nicholas Chammas
[dev list to bcc]

This is a question for the user list or for Stack Overflow. The dev list is for 
discussions related to the development of Spark itself.

Nick


> On May 21, 2024, at 6:58 AM, Prem Sahoo  wrote:
> 
> Hello Vibhor,
> Thanks for the suggestion.
> I am looking for other alternatives where the same dataframe 
> can be written to two destinations without re-execution and without cache or persist.
> 
> Can someone help me with scenario 2?
> How can we make Spark write to MinIO faster?
> Sent from my iPhone
> 
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta  wrote:
>> 
>> 
>> Hi Prem,
>>  
>> You can try to write to HDFS then read from HDFS and write to MinIO.
>>  
>> This will avoid re-executing the transformations for the second write.
>>  
>> You can also try persisting the dataframe using the DISK_ONLY level.
>>  
>> Regards,
>> Vibhor
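
A minimal sketch of the write-then-read approach described above (assuming an
active SparkSession `spark`; the paths and the filter transformation are only
placeholders):

import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val transformed = spark.read.parquet("hdfs:///data/input")
  .filter(col("amount") > 0)  // stand-in for the real transformations

// Write once to HDFS; the transformations execute here.
transformed.write.mode("overwrite").parquet("hdfs:///data/output")

// Read the materialized result back and copy it to MinIO (S3-compatible),
// so the transformations are not re-executed for the second write.
spark.read.parquet("hdfs:///data/output")
  .write.mode("overwrite").parquet("s3a://my-bucket/data/output")

// Alternative mentioned above: persist to disk and write twice from the persisted data.
// transformed.persist(StorageLevel.DISK_ONLY)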
>> From: Prem Sahoo 
>> Date: Tuesday, 21 May 2024 at 8:16 AM
>> To: Spark dev list 
>> Subject: EXT: Dual Write to HDFS and MinIO in faster way
>> 
>> 
>> Hello Team,
>> I am planning to write to two datasources at the same time.
>>  
>> Scenario 1:
>>  
>> Writing the same dataframe to HDFS and MinIO without re-executing the 
>> transformations and without cache(). How can we make this faster?
>>  
>> Read a parquet file, do a few transformations, and write the result to HDFS 
>> and MinIO.
>>  
>> Here, for both writes, Spark needs to execute the transformations again. How 
>> can we avoid re-executing the transformations without cache()/persist()?
>>  
>> Scenario 2:
>> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
>> Is there any way to make this write faster?
>>  
>> I don't want to repartition before writing, as repartitioning has the 
>> overhead of shuffling.
>>  
>> Please provide some inputs. 



Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification

We also have a long-standing problem with how we manage Python dependencies, 
something I’ve tried (unsuccessfully) to fix in the past.

Consider, for example, how many separate places this numpy dependency is 
installed:

1. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. 
https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. 
https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. 
https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of those installations reference a unified version requirement, so 
naturally they are inconsistent across all these different lines. Some say 
`>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several 
cases there is no version requirement specified at all.

I’m interested in trying again to fix this problem, but it needs to be in 
collaboration with a committer since I cannot fully test the release scripts. 
(This testing gap is what doomed my last attempt at fixing this problem.)

Nick


> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
> 
> After finishing the 4.0.0-preview1 RC1, I have more experience with this 
> topic now.
> 
> In fact, the main jobs of the release process, building packages and 
> documents, are tested in GitHub Actions jobs. However, the way we test them is 
> different from what we do in the release scripts.
> 
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify them. The Docker image for the release 
> process needs to set up more things, so it may not be viable to use a single 
> Dockerfile for both.
> 
> 2. the execution code is different. Use building documents as an example:
> The release scripts: 
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job: 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
> 
> It's better if we can run the release scripts as Github Action jobs, but I 
> think it's more important to do the unification now.
> 
> Thanks,
> Wenchen
> 
> 
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala wrote:
>> Hello,
>> 
>> I can answer some of your common questions with other Apache projects.
>> 
>> > Who currently has permissions for Github actions? Is there a specific 
>> > owner for that today or a different volunteer each time?
>> 
>> The Apache organization owns Github Actions, and committers (contributors 
>> with write permissions) can retrigger/cancel a Github Actions workflow, but 
>> Github Actions runners are managed by the Apache infra team.
>> 
>> > What are the current limits of GitHub Actions, who set them - and what is 
>> > the process to change those (if possible at all, but I presume not all 
>> > Apache projects have the same limits)?
>> 
>> For limits, I don't think there is any significant limit, especially since 
>> the Apache organization has 900 donated runners used by its projects, and 
>> there is an initiative from the Infra team to add self-hosted runners 
>> running on Kubernetes.
>> 
>> > Where should the artifacts be stored?
>> 
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github 
>> cache for workflow cache. But we can use Github artifacts to store any kind 
>> of package (even Docker images in the 

[jira] [Updated] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-48222:
-
Component/s: Documentation

> Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
> -
>
> Key: SPARK-48222
> URL: https://issues.apache.org/jira/browse/SPARK-48222
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>







[jira] [Created] (SPARK-48222) Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48222:


 Summary: Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
 Key: SPARK-48222
 URL: https://issues.apache.org/jira/browse/SPARK-48222
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-48176) Fix name of FIELD_ALREADY_EXISTS error condition

2024-05-07 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48176:


 Summary: Fix name of FIELD_ALREADY_EXISTS error condition
 Key: SPARK-48176
 URL: https://issues.apache.org/jira/browse/SPARK-48176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-48107) Exclude tests from Python distribution

2024-05-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-48107:


 Summary: Exclude tests from Python distribution
 Key: SPARK-48107
 URL: https://issues.apache.org/jira/browse/SPARK-48107
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Commented] (SPARK-47429) Rename errorClass to errorCondition and subClass to subCondition

2024-05-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842694#comment-17842694
 ] 

Nicholas Chammas commented on SPARK-47429:
--

I think one intermediate step we can take here is to mark the existing fields 
as deprecated, indicating that they will be renamed. That way, if we don't 
complete this renaming before the 4.0 release we at least have the deprecation 
in.

Another thing we can do in addition to deprecating the existing fields is to 
add the renamed fields and simply have them redirect to the original ones.
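
As a rough illustration of the deprecate-and-forward idea (the trait and member 
names below are hypothetical, not Spark's actual API):

{code:scala}
// Hypothetical sketch only; names do not reflect Spark's real classes.
trait HasErrorInfo {
  // Existing accessor, kept for source compatibility but marked deprecated.
  @deprecated("Use errorCondition instead", "4.0.0")
  def errorClass: Option[String]

  // New name that simply forwards to the existing field until the rename lands.
  def errorCondition: Option[String] = errorClass
}
{code}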

I will build a list of the classes, class attributes, methods, and method 
parameters that will need this kind of update. Note that this list will be 
much, much smaller than the thousands of uses that BingKun highlighted, since I 
am just focusing on the declarations.

cc [~cloud_fan] 

> Rename errorClass to errorCondition and subClass to subCondition
> 
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
> Attachments: image-2024-04-18-09-26-04-493.png
>
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.






[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition and subClass to subCondition

2024-05-01 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47429:
-
Summary: Rename errorClass to errorCondition and subClass to subCondition  
(was: Rename errorClass to errorCondition)

> Rename errorClass to errorCondition and subClass to subCondition
> 
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
> Attachments: image-2024-04-18-09-26-04-493.png
>
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.






[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition

2024-04-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47429:
-
Description: 
We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This ticket also covers renaming {{subClass}} as well.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.

  was:
We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.


> Rename errorClass to errorCondition
> ---
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.






[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2024-04-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837292#comment-17837292
 ] 

Nicholas Chammas commented on SPARK-28024:
--

[~cloud_fan] - Given the updated descriptions for Cases 2, 3, and 4, do you 
still consider there to be a problem here? Or shall we just consider this an 
acceptable difference between how Spark and Postgres handle these cases?

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






Re: [DISCUSS] SPARK-44444: Use ANSI SQL mode by default

2024-04-12 Thread Nicholas Chammas
This is a side issue, but I’d like to bring people’s attention to SPARK-28024. 

Cases 2, 3, and 4 described in that ticket are still problems today on master 
(I just rechecked) even with ANSI mode enabled.

Well, maybe not problems, but I’m flagging this since Spark’s behavior differs 
in these cases from Postgres, as described in the ticket.


> On Apr 12, 2024, at 12:09 AM, Gengliang Wang  wrote:
> 
> 
> +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly enhance 
> data quality and integrity. I fully support this initiative.
> 
> > In other words, the current Spark ANSI SQL implementation becomes the first 
> > implementation for Spark SQL users to face at first while providing
> > `spark.sql.ansi.enabled=false` in the same way without losing any capability.
> 
> BTW, the try_* functions and the SQL Error Attribution Framework will also be 
> beneficial in migrating to ANSI SQL mode.
> 
> 
> Gengliang
> 
> 
> On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun wrote:
>> Hi, All.
>> 
>> Thanks to you, we've been achieving many things and have on-going SPIPs.
>> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
>> by asking your opinions about Apache Spark's ANSI SQL mode.
>> 
>> https://issues.apache.org/jira/browse/SPARK-44111
>> Prepare Apache Spark 4.0.0
>> 
>> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
>> items for 4.0.0 because it's a big behavior change.
>> 
>> https://issues.apache.org/jira/browse/SPARK-44444
>> Use ANSI SQL mode by default
>> 
>> Historically, spark.sql.ansi.enabled was added in Apache Spark 3.0.0 and has
>> been aiming to provide better Spark SQL compatibility in a standard way.
>> We also have a daily CI to protect the behavior.
>> 
>> https://github.com/apache/spark/actions/workflows/build_ansi.yml
>> 
>> However, it's still behind the configuration flag, with several known issues, e.g.,
>> 
>> SPARK-41794 Reenable ANSI mode in test_connect_column
>> SPARK-41547 Reenable ANSI mode in test_connect_functions
>> SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
>> 
>> To be clear, we know that many DBMSes have their own implementations of the
>> SQL standard and they are not all the same. Like them, SPARK-44444 aims only to
>> enable the existing Spark configuration, `spark.sql.ansi.enabled=true`.
>> There is nothing more than that.
>> 
>> In other words, the current Spark ANSI SQL implementation becomes the first
>> implementation for Spark SQL users to face at first while providing
>> `spark.sql.ansi.enabled=false` in the same way without losing any capability.
>> 
>> If we don't want this change for some reason, we can simply exclude
>> SPARK-44444 from SPARK-44111 as a part of the Apache Spark 4.0.0 preparation.
>> It's time to make a go/no-go decision on this item as part of the global
>> optimization for the Apache Spark 4.0.0 release. After 4.0.0, it's unlikely we
>> will aim for this again for the next four years, until 2028.
>> 
>> WDYT?
>> 
>> Bests,
>> Dongjoon



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836706#comment-17836706
 ] 

Nicholas Chammas commented on SPARK-28024:
--

I've just retried cases 2-4 on master with ANSI mode enabled, and Spark's 
behavior appears to be the same as when I last checked it in February.

I also ran those same cases against PostgreSQL 16. I couldn't replicate the 
output for Case 4, and I believe there was a mistake in the original 
description of that case where the sign was flipped. So I've adjusted the sign 
accordingly and shown Spark and Postgres's behavior side-by-side.

Here is the original Case 4 with the negative sign:

{code:sql}
spark-sql (default)> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200); 
0.
{code}
 
So I don't think there is a problem there. With a positive sign, the behavior 
is different as shown in the ticket description above.

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity
 postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity
 postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>  postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}






[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
For example
Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0
{code}

Case 4:
{code:sql}
spark-sql> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}






[jira] [Created] (SPARK-47429) Rename errorClass to errorCondition

2024-03-16 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47429:


 Summary: Rename errorClass to errorCondition
 Key: SPARK-47429
 URL: https://issues.apache.org/jira/browse/SPARK-47429
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.






[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823713#comment-17823713
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - Friendly ping.

Any thoughts on how to resolve the inconsistent error terminology?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>  **** ARRAY
>  **** MAP
>  **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs

[jira] [Created] (SPARK-47271) Explain importance of statistics on SQL performance tuning page

2024-03-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47271:


 Summary: Explain importance of statistics on SQL performance 
tuning page
 Key: SPARK-47271
 URL: https://issues.apache.org/jira/browse/SPARK-47271
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-47252) Clarify that pivot may trigger an eager computation

2024-03-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47252:


 Summary: Clarify that pivot may trigger an eager computation
 Key: SPARK-47252
 URL: https://issues.apache.org/jira/browse/SPARK-47252
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Created] (SPARK-47216) Refine layout of SQL performance tuning page

2024-02-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47216:


 Summary: Refine layout of SQL performance tuning page
 Key: SPARK-47216
 URL: https://issues.apache.org/jira/browse/SPARK-47216
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas









[jira] [Commented] (SPARK-47190) Add support for checkpointing to Spark Connect

2024-02-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821286#comment-17821286
 ] 

Nicholas Chammas commented on SPARK-47190:
--

[~gurwls223] - Is there some design reason we do _not_ want to support 
checkpointing in Spark Connect? Or is it just a matter of someone taking the 
time to implement support?

If the latter, do we do so via a new method directly on {{SparkSession}}, or 
shall we somehow expose a limited version of {{spark.sparkContext}} so users 
can call the existing {{setCheckpointDir()}} method?
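
For reference, this is the pattern that works today with a classic (non-Connect) 
SparkSession; the checkpoint directory path is illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-example").getOrCreate()

// Requires direct access to the SparkContext, which Spark Connect does not expose.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val df = spark.range(1000).toDF("id")
val checkpointed = df.checkpoint()  // truncates lineage; needs the directory set above
{code}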

> Add support for checkpointing to Spark Connect
> --
>
> Key: SPARK-47190
> URL: https://issues.apache.org/jira/browse/SPARK-47190
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{sparkContext}} that underlies a given {{SparkSession}} is not 
> accessible over Spark Connect. This means you cannot call 
> {{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot 
> checkpoint a DataFrame.
> We should add support for this somehow to Spark Connect.






[jira] [Created] (SPARK-47190) Add support for checkpointing to Spark Connect

2024-02-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47190:


 Summary: Add support for checkpointing to Spark Connect
 Key: SPARK-47190
 URL: https://issues.apache.org/jira/browse/SPARK-47190
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


The {{sparkContext}} that underlies a given {{SparkSession}} is not accessible 
over Spark Connect. This means you cannot call 
{{spark.sparkContext.setCheckpointDir(...)}}, which in turn means you cannot 
checkpoint a DataFrame.

We should add support for this somehow to Spark Connect.






[jira] [Created] (SPARK-47189) Tweak column error names and text

2024-02-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47189:


 Summary: Tweak column error names and text
 Key: SPARK-47189
 URL: https://issues.apache.org/jira/browse/SPARK-47189
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47180) Migrate CSV parsing off of Univocity

2024-02-26 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47180:


 Summary: Migrate CSV parsing off of Univocity
 Key: SPARK-47180
 URL: https://issues.apache.org/jira/browse/SPARK-47180
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


Univocity appears to be unmaintained.

As of February 2024:
 * The last release was [more than 3 years 
ago|https://github.com/uniVocity/univocity-parsers/releases].
 * The last commit to {{master}} was [almost 3 years 
ago|https://github.com/uniVocity/univocity-parsers/commits/master/].
 * The website is 
[down|https://github.com/uniVocity/univocity-parsers/issues/506].
 * There are 
[multiple|https://github.com/uniVocity/univocity-parsers/issues/494] 
[open|https://github.com/uniVocity/univocity-parsers/issues/495] 
[bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker 
with no indication that anyone cares.

It's not urgent, but we should consider migrating to an actively maintained CSV 
library in the JVM ecosystem.

There are a bunch of libraries [listed here on this Maven 
Repository|https://mvnrepository.com/open-source/csv-libraries].

[jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text]
 looks interesting. I know we already use FasterXML to parse JSON. Perhaps we 
should use them to parse CSV as well.

I'm guessing we chose Univocity back in the day because it was the fastest CSV 
library on the JVM. However, the last performance benchmark comparing it to 
others was [from February 
2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018],
 so this may no longer be true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Generating config docs automatically

2024-02-22 Thread Nicholas Chammas
Thank you, Holden!

Yes, having everything live in the ConfigEntry is attractive.

The main reason I proposed an alternative where the groups are defined in YAML 
is that if the config groups are defined in ConfigEntry, then altering the 
groupings – which is relevant only to the display of config documentation – 
requires rebuilding Spark. This feels a bit off to me in terms of design.

For example, on the SQL performance tuning page there is some narrative 
documentation about caching 
<https://spark.apache.org/docs/3.5.0/sql-performance-tuning.html#caching-data-in-memory>,
 plus a table of relevant configs. If I want an additional config to show up in 
this table, I need to add it to the config group that backs the table.

With the ConfigEntry approach in #44755 
<https://github.com/apache/spark/pull/44755>, that means editing the 
appropriate ConfigEntry and rebuilding Spark before I can regenerate the config 
table.

val SOME_CONFIG = buildConf("spark.sql.someCachingRelatedConfig")
  .doc("some documentation")
  .version("2.1.0")
  .withDocumentationGroup("sql-tuning-caching-data")  // assign group to the config
With the YAML approach in #44756 <https://github.com/apache/spark/pull/44756>, 
that means editing the config group defined in the YAML file and regenerating 
the config table. No Spark rebuild required.

sql-tuning-caching-data:
- spark.sql.inMemoryColumnarStorage.compressed
- spark.sql.inMemoryColumnarStorage.batchSize
- spark.sql.someCachingRelatedConfig  # add config to the group
In both cases the config names, descriptions, defaults, etc. will be pulled 
from the ConfigEntry when building the HTML tables.
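
To make the YAML approach more concrete, here is a rough sketch of the kind of generator I have in mind, assuming PyYAML is available. The registry dict and the metadata values below are stand-ins for illustration only; the real script would pull each config’s doc, default, and version from its ConfigEntry rather than from a hard-coded mapping.

import yaml

# Hypothetical stand-in for the ConfigEntry registry; the real generator would
# query Spark for each config's documentation, default value, and version.
CONFIG_METADATA = {
    "spark.sql.inMemoryColumnarStorage.compressed": {
        "default": "true", "doc": "Compress cached columnar data.", "since": "1.0.1"},
    "spark.sql.inMemoryColumnarStorage.batchSize": {
        "default": "10000", "doc": "Batch size for columnar caching.", "since": "1.1.1"},
}

GROUPS_YAML = """
sql-tuning-caching-data:
- spark.sql.inMemoryColumnarStorage.compressed
- spark.sql.inMemoryColumnarStorage.batchSize
"""

def render_config_table(group_name):
    groups = yaml.safe_load(GROUPS_YAML)
    rows = []
    for name in groups[group_name]:
        meta = CONFIG_METADATA[name]
        rows.append(
            f"<tr><td>{name}</td><td>{meta['default']}</td>"
            f"<td>{meta['doc']}</td><td>{meta['since']}</td></tr>"
        )
    header = ("<table><tr><th>Property Name</th><th>Default</th>"
              "<th>Meaning</th><th>Since Version</th></tr>")
    return header + "".join(rows) + "</table>"

print(render_config_table("sql-tuning-caching-data"))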

I prefer the latter approach but I’m open to whatever committers are more 
comfortable with. If you prefer the former, then I’ll focus on that and ping 
you for reviews accordingly!


> On Feb 21, 2024, at 11:43 AM, Holden Karau  wrote:
> 
> I think this is a good idea. I like having everything in one source of truth 
> rather than two (so option 1 sounds like a good idea); but that’s just my 
> opinion. I'd be happy to help with reviews though.
> 
> On Wed, Feb 21, 2024 at 6:37 AM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> I know config documentation is not the most exciting thing. If there is 
>> anything I can do to make this as easy as possible for a committer to 
>> shepherd, I’m all ears!
>> 
>> 
>>> On Feb 14, 2024, at 8:53 PM, Nicholas Chammas >> <mailto:nicholas.cham...@gmail.com>> wrote:
>>> 
>>> I’m interested in automating our config documentation and need input from a 
>>> committer who is interested in shepherding this work.
>>> 
>>> We have around 60 tables of configs across our documentation. Here’s a 
>>> typical example. 
>>> <https://github.com/apache/spark/blob/736d8ab3f00e7c5ba1b01c22f6398b636b8492ea/docs/sql-performance-tuning.md?plain=1#L65-L159>
>>> 
>>> These tables span several thousand lines of manually maintained HTML, which 
>>> poses a few problems:
>>> The documentation for a given config is sometimes out of sync across the 
>>> HTML table and its source `ConfigEntry`.
>>> Internal configs that are not supposed to be documented publicly sometimes 
>>> are.
>>> Many config names and defaults are extremely long, posing formatting 
>>> problems.
>>> 
>>> Contributors waste time dealing with these issues in a losing battle to 
>>> keep everything up-to-date and consistent.
>>> 
>>> I’d like to solve all these problems by generating HTML tables 
>>> automatically from the `ConfigEntry` instances where the configs are 
>>> defined.
>>> 
>>> I’ve proposed two alternative solutions:
>>> #44755 <https://github.com/apache/spark/pull/44755>: Enhance `ConfigEntry` 
>>> so a config can be associated with one or more groups, and use that new 
>>> metadata to generate the tables we need.
>>> #44756 <https://github.com/apache/spark/pull/44756>: Add a standalone YAML 
>>> file where we define config groups, and use that to generate the tables we 
>>> need.
>>> 
>>> If you’re a committer and are interested in this problem, please chime in 
>>> on whatever approach appeals to you. If you think this is a bad idea, I’m 
>>> also eager to hear your feedback.
>>> 
>>> Nick
>>> 
> 
> 



Re: Generating config docs automatically

2024-02-21 Thread Nicholas Chammas
I know config documentation is not the most exciting thing. If there is 
anything I can do to make this as easy as possible for a committer to shepherd, 
I’m all ears!


> On Feb 14, 2024, at 8:53 PM, Nicholas Chammas  
> wrote:
> 
> I’m interested in automating our config documentation and need input from a 
> committer who is interested in shepherding this work.
> 
> We have around 60 tables of configs across our documentation. Here’s a 
> typical example. 
> <https://github.com/apache/spark/blob/736d8ab3f00e7c5ba1b01c22f6398b636b8492ea/docs/sql-performance-tuning.md?plain=1#L65-L159>
> 
> These tables span several thousand lines of manually maintained HTML, which 
> poses a few problems:
> The documentation for a given config is sometimes out of sync across the HTML 
> table and its source `ConfigEntry`.
> Internal configs that are not supposed to be documented publicly sometimes 
> are.
> Many config names and defaults are extremely long, posing formatting problems.
> 
> Contributors waste time dealing with these issues in a losing battle to keep 
> everything up-to-date and consistent.
> 
> I’d like to solve all these problems by generating HTML tables automatically 
> from the `ConfigEntry` instances where the configs are defined.
> 
> I’ve proposed two alternative solutions:
> #44755 <https://github.com/apache/spark/pull/44755>: Enhance `ConfigEntry` so 
> a config can be associated with one or more groups, and use that new metadata 
> to generate the tables we need.
> #44756 <https://github.com/apache/spark/pull/44756>: Add a standalone YAML 
> file where we define config groups, and use that to generate the tables we 
> need.
> 
> If you’re a committer and are interested in this problem, please chime in on 
> whatever approach appeals to you. If you think this is a bad idea, I’m also 
> eager to hear your feedback.
> 
> Nick
> 



[jira] [Updated] (SPARK-47082) Out of bounds error message is incorrect

2024-02-17 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47082:
-
Summary: Out of bounds error message is incorrect  (was: Out of bounds 
error message flips the bounds)

> Out of bounds error message is incorrect
> 
>
> Key: SPARK-47082
> URL: https://issues.apache.org/jira/browse/SPARK-47082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47082) Out of bounds error message flips the bounds

2024-02-17 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47082:


 Summary: Out of bounds error message flips the bounds
 Key: SPARK-47082
 URL: https://issues.apache.org/jira/browse/SPARK-47082
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Generating config docs automatically

2024-02-14 Thread Nicholas Chammas
I’m interested in automating our config documentation and need input from a 
committer who is interested in shepherding this work.

We have around 60 tables of configs across our documentation. Here’s a typical 
example. 


These tables span several thousand lines of manually maintained HTML, which 
poses a few problems:
The documentation for a given config is sometimes out of sync across the HTML 
table and its source `ConfigEntry`.
Internal configs that are not supposed to be documented publicly sometimes are.
Many config names and defaults are extremely long, posing formatting problems.

Contributors waste time dealing with these issues in a losing battle to keep 
everything up-to-date and consistent.

I’d like to solve all these problems by generating HTML tables automatically 
from the `ConfigEntry` instances where the configs are defined.

I’ve proposed two alternative solutions:
#44755 : Enhance `ConfigEntry` so a 
config can be associated with one or more groups, and use that new metadata to 
generate the tables we need.
#44756 : Add a standalone YAML file 
where we define config groups, and use that to generate the tables we need.

If you’re a committer and are interested in this problem, please chime in on 
whatever approach appeals to you. If you think this is a bad idea, I’m also 
eager to hear your feedback.

Nick



Re: How do you debug a code-generated aggregate?

2024-02-12 Thread Nicholas Chammas
OK, I figured it out. The details are in SPARK-47024 
<https://issues.apache.org/jira/browse/SPARK-47024> for anyone who’s interested.

It turned out to be a floating point arithmetic “bug”. The main reason I was 
able to figure it out was because I’ve been investigating another, unrelated 
bug (a real bug) related to floats, so these weird float corner cases have been 
top of mind.

If it weren't for that, I wonder how much progress I would have made. Though I 
could inspect the generated code, I couldn’t figure out how to get logging 
statements placed in the generated code to print somewhere I could see them.

Depending on how often we find ourselves debugging aggregates like this, it 
would be really helpful if we added some way to trace the aggregation buffer.

In any case, mystery solved. Thank you for the pointer!
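
For anyone who finds this thread later, here is the minimal snippet I used to dump the generated code. It is just a rough sketch assuming a local PySpark 3.x session; the only relevant call is explain(mode="codegen"), which prints the Java source produced for each whole-stage-codegen subtree.

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import sum
>>> spark = SparkSession.builder.master("local[2]").getOrCreate()
>>> df = spark.range(4).repartition(2).select(sum("id"))
>>> # Prints the generated Java source for each whole-stage-codegen subtree.
>>> df.explain(mode="codegen")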


> On Feb 12, 2024, at 8:39 AM, Herman van Hovell  wrote:
> 
> There is no really easy way of getting the state of the aggregation buffer, 
> unless you are willing to modify the code generation and sprinkle in some 
> logging.
> 
> What I would start with is dumping the generated code by calling 
> explain('codegen') on the DataFrame. That helped me to find similar issues in 
> most cases.
> 
> HTH
> 
> On Sun, Feb 11, 2024 at 11:26 PM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> Consider this example:
>> >>> from pyspark.sql.functions import sum
>> >>> spark.range(4).repartition(2).select(sum("id")).show()
>> +---+
>> |sum(id)|
>> +---+
>> |  6|
>> +---+
>> 
>> I’m trying to understand how this works because I’m investigating a bug in 
>> this kind of aggregate.
>> 
>> I see that doProduceWithoutKeys 
>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L98>
>>  and doConsumeWithoutKeys 
>> <https://github.com/apache/spark/blob/d02fbba6491fd17dc6bfc1a416971af7544952f3/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala#L193>
>>  are called, and I believe they are responsible for computing a declarative 
>> aggregate like `sum`. But I’m not sure how I would debug the generated code, 
>> or the inputs that drive what code gets generated.
>> 
>> Say you were running the above example and it was producing an incorrect 
>> result, and you knew the problem was somehow related to the sum. How would 
>> you troubleshoot it to identify the root cause?
>> 
>> Ideally, I would like some way to track how the aggregation buffer mutates 
>> as the computation is executed, so I can see something roughly like:
>> [0, 1, 2, 3]
>> [1, 5]
>> [6]
>> 
>> Is there some way to trace a declarative aggregate like this?
>> 
>> Nick
>> 



[jira] [Resolved] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning

2024-02-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-47024.
--
Resolution: Not A Problem

Resolving this as "Not A Problem".

I mean, it _is_ a problem, but it's a basic problem with floats, and I don't 
think there is anything practical that can be done about it in Spark.

> Sum of floats/doubles may be incorrect depending on partitioning
> 
>
> Key: SPARK-47024
> URL: https://issues.apache.org/jira/browse/SPARK-47024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 3.3.4
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
> SUM_EXAMPLE = [
> (1.0,),
> (0.0,),
> (1.0,),
> (9007199254740992.0,),
> ]
> spark = (
> SparkSession.builder
> .config("spark.log.level", "ERROR")
> .getOrCreate()
> )
> def compare_sums(data, num_partitions):
> df = spark.createDataFrame(data, "val double").coalesce(1)
> result1 = df.agg(sum(col("val"))).collect()[0][0]
> df = spark.createDataFrame(data, "val double").repartition(num_partitions)
> result2 = df.agg(sum(col("val"))).collect()[0][0]
> assert result1 == result2, f"{result1}, {result2}"
> if __name__ == "__main__":
> print(compare_sums(SUM_EXAMPLE, 2))
> {code}
> This fails as follows:
> {code:python}
> AssertionError: 9007199254740994.0, 9007199254740992.0
> {code}
> I suspected some kind of problem related to code generation, so tried setting 
> all of these to {{{}false{}}}:
>  * {{spark.sql.codegen.wholeStage}}
>  * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
>  * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}
> But this did not change the behavior.
> Somehow, the partitioning of the data affects the computed sum.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47024) Sum of floats/doubles may be incorrect depending on partitioning

2024-02-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47024:
-
Description: 
I found this problem using 
[Hypothesis|https://hypothesis.readthedocs.io/en/latest/].

Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 
(and probably all prior versions as well):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

SUM_EXAMPLE = [
(1.0,),
(0.0,),
(1.0,),
(9007199254740992.0,),
]

spark = (
SparkSession.builder
.config("spark.log.level", "ERROR")
.getOrCreate()
)


def compare_sums(data, num_partitions):
df = spark.createDataFrame(data, "val double").coalesce(1)
result1 = df.agg(sum(col("val"))).collect()[0][0]
df = spark.createDataFrame(data, "val double").repartition(num_partitions)
result2 = df.agg(sum(col("val"))).collect()[0][0]
assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
print(compare_sums(SUM_EXAMPLE, 2))
{code}
This fails as follows:
{code:python}
AssertionError: 9007199254740994.0, 9007199254740992.0
{code}
I suspected some kind of problem related to code generation, so tried setting 
all of these to {{{}false{}}}:
 * {{spark.sql.codegen.wholeStage}}
 * {{spark.sql.codegen.aggregate.map.twolevel.enabled}}
 * {{spark.sql.codegen.aggregate.splitAggregateFunc.enabled}}

But this did not change the behavior.

Somehow, the partitioning of the data affects the computed sum.

  was:Will fill in the details shortly.

Summary: Sum of floats/doubles may be incorrect depending on 
partitioning  (was: Sum is incorrect (exact cause currently unknown))

Sadly, I think this is a case where we may not be able to do anything. The 
problem appears to be a classic case of floating point arithmetic going wrong.
{code:scala}
scala> 9007199254740992.0 + 1.0
val res0: Double = 9.007199254740992E15

scala> 9007199254740992.0 + 2.0
val res1: Double = 9.007199254740994E15
{code}
Notice how adding {{1.0}} did not change the large value, whereas adding 
{{2.0}} did.

So what I believe is happening is that, depending on the order in which the 
rows happen to be added, we either hit or do not hit this corner case.

In other words, if the aggregation goes like this:
{code:java}
(1.0 + 1.0) + (0.0 + 9007199254740992.0)
2.0 + 9007199254740992.0
9007199254740994.0
{code}
Then there is no problem.

However, if we are unlucky and it goes like this:
{code:java}
(1.0 + 0.0) + (1.0 + 9007199254740992.0)
1.0 + 9007199254740992.0
9007199254740992.0
{code}
Then we get the incorrect result shown in the description above.

This violates what I believe should be an invariant in Spark: That declarative 
aggregates like {{sum}} do not compute different results depending on accidents 
of row order or partitioning.

However, given that this is a basic problem of floating point arithmetic, I 
doubt we can really do anything here.

Note that there are many such "special" numbers that have this problem, not 
just 9007199254740992.0:
{code:scala}
scala> 1.7168917017330176e+16 + 1.0
val res2: Double = 1.7168917017330176E16

scala> 1.7168917017330176e+16 + 2.0
val res3: Double = 1.7168917017330178E16
{code}
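
The same order dependence is reproducible in plain Python with the values from the reproduction above, no Spark involved, which underlines that this is a property of IEEE 754 doubles rather than of the aggregation code:
{code:python}
# One grouping of partial sums loses the 1.0 entirely ...
print((1.0 + 0.0) + (1.0 + 9007199254740992.0))  # 9007199254740992.0

# ... while another grouping keeps it.
print((1.0 + 1.0) + (0.0 + 9007199254740992.0))  # 9007199254740994.0
{code}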

> Sum of floats/doubles may be incorrect depending on partitioning
> 
>
> Key: SPARK-47024
> URL: https://issues.apache.org/jira/browse/SPARK-47024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 3.3.4
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
>
> I found this problem using 
> [Hypothesis|https://hypothesis.readthedocs.io/en/latest/].
> Here's a reproduction that fails on {{{}master{}}}, 3.5.0, 3.4.2, and 3.3.4 
> (and probably all prior versions as well):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, sum
> SUM_EXAMPLE = [
> (1.0,),
> (0.0,),
> (1.0,),
> (9007199254740992.0,),
> ]
> spark = (
> SparkSession.builder
> .config("spark.log.level", "ERROR")
> .getOrCreate()
> )
> def compare_sums(data, num_partitions):
> df = spark.createDataFrame(data, "val double").coalesce(1)
> result1 = df.agg(sum(col("val"))).collect()[0][0]
> df = spark.createDataFrame(data, "val double").repartition(num_partitions)
> result2 = df.agg(sum(col("val"))).collect()[0][0]
> assert result1 == result2, f"{result1}, {result2}"
> if __na

[jira] [Created] (SPARK-47024) Sum is incorrect (exact cause currently unknown)

2024-02-12 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47024:


 Summary: Sum is incorrect (exact cause currently unknown)
 Key: SPARK-47024
 URL: https://issues.apache.org/jira/browse/SPARK-47024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.4, 3.5.0, 3.4.2
Reporter: Nicholas Chammas


Will fill in the details shortly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



How do you debug a code-generated aggregate?

2024-02-11 Thread Nicholas Chammas
Consider this example:
>>> from pyspark.sql.functions import sum
>>> spark.range(4).repartition(2).select(sum("id")).show()
+---+
|sum(id)|
+---+
|  6|
+---+
I’m trying to understand how this works because I’m investigating a bug in this 
kind of aggregate.

I see that doProduceWithoutKeys 

 and doConsumeWithoutKeys 

 are called, and I believe they are responsible for computing a declarative 
aggregate like `sum`. But I’m not sure how I would debug the generated code, or 
the inputs that drive what code gets generated.

Say you were running the above example and it was producing an incorrect 
result, and you knew the problem was somehow related to the sum. How would you 
troubleshoot it to identify the root cause?

Ideally, I would like some way to track how the aggregation buffer mutates as 
the computation is executed, so I can see something roughly like:
[0, 1, 2, 3]
[1, 5]
[6]
Is there some way to trace a declarative aggregate like this?

Nick



[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46992:
-
Labels: correctness  (was: )

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Denis Tarima
>Priority: Critical
>  Labels: correctness
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814913#comment-17814913
 ] 

Nicholas Chammas commented on SPARK-46992:
--

I can confirm the behavior described above is still present on {{master}} at 
[{{5d5b3a5}}|https://github.com/apache/spark/commit/5d5b3a54b7b5fb4308fe40da696ba805c72983fc].

Adding the {{correctness}} label.

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Denis Tarima
>Priority: Critical
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-02-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814406#comment-17814406
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - What are your 
thoughts on the 3 proposed options?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  i

[jira] [Commented] (SPARK-40549) PYSPARK: Observation computes the wrong results when using `corr` function

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813780#comment-17813780
 ] 

Nicholas Chammas commented on SPARK-40549:
--

I think this is just a consequence of floating point arithmetic being imprecise.
{code:python}
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
... 
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0} {code}
Unfortunately, {{corr}} seems to convert to float internally, so even if you 
give it decimals you will get a similar result:
{code:python}
>>> from decimal import Decimal
>>> import pyspark.sql.functions as F
>>> 
>>> df = spark.createDataFrame(
...     [(Decimal(i), Decimal(i * 10)) for i in range(10)],
...     schema="id decimal, id2 decimal",
... )
>>> 
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
... 
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0} {code}

I don't think there is anything that can be done here.

> PYSPARK: Observation computes the wrong results when using `corr` function 
> ---
>
> Key: SPARK-40549
> URL: https://issues.apache.org/jira/browse/SPARK-40549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
> Environment: {code:java}
> // lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 22.04.1 LTS
> Release:        22.04
> Codename:       jammy {code}
> {code:java}
>  // python -V
> python 3.10.4
> {code}
> {code:java}
>  // lshw -class cpu
> *-cpu                             
> description: CPU        product: AMD Ryzen 9 3900X 12-Core Processor        
> vendor: Advanced Micro Devices [AMD]        physical id: f        bus info: 
> cpu@0        version: 23.113.0        serial: Unknown        slot: AM4        
> size: 2194MHz        capacity: 4672MHz        width: 64 bits        clock: 
> 100MHz        capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae 
> mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht 
> syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl 
> nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma 
> cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy 
> svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit 
> wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 
> cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm 
> rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves 
> cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr 
> rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean 
> flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif 
> v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es cpufreq      
>   configuration: cores=12 enabledcores=12 microcode=141561875 threads=24
> {code}
>Reporter: Herminio Vazquez
>Priority: Major
>  Labels: correctness
>
> Minimalistic description of the odd computation results.
> When creating a new `Observation` object and computing a simple correlation 
> function between 2 columns, the results appear to be non-deterministic.
> {code:java}
> # Init
> from pyspark.sql import SparkSession, Observation
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(float(i), float(i*10),) for i in range(10)], 
> schema="id double, id2 double")
> for i in range(10):
>     o = Observation(f"test_{i}")
>     df_o = df.observe(o, F.corr("id", "id2").eqNullSafe(1.0))
>     df_o.count()
> print(o.get)
> # Results
> {

[jira] [Commented] (SPARK-45786) Inaccurate Decimal multiplication and division results

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813766#comment-17813766
 ] 

Nicholas Chammas commented on SPARK-45786:
--

[~kazuyukitanimura] - I'm just curious: How did you find this bug? Was it 
something you stumbled on by accident or did you search for it using something 
like a fuzzer?

> Inaccurate Decimal multiplication and division results
> --
>
> Key: SPARK-45786
> URL: https://issues.apache.org/jira/browse/SPARK-45786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.3, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> Decimal multiplication and division results may be inaccurate due to rounding 
> issues.
> h2. Multiplication:
> {code:scala}
> scala> sql("select  -14120025096157587712113961295153.858047 * 
> -0.4652").show(truncate=false)
> ++
>   
> |(-14120025096157587712113961295153.858047 * -0.4652)|
> ++
> |6568635674732509803675414794505.574764  |
> ++
> {code}
> The correct answer is
> {quote}6568635674732509803675414794505.574763
> {quote}
> Please note that the last digit is 3 instead of 4 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652"))
> val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
> {code}
> Since the fractional part .574763 is followed by 4644, it should not be 
> rounded up.
> h2. Division:
> {code:scala}
> scala> sql("select -0.172787979 / 
> 533704665545018957788294905796.5").show(truncate=false)
> +-+
> |(-0.172787979 / 533704665545018957788294905796.5)|
> +-+
> |-3.237521E-31|
> +-+
> {code}
> The correct answer is
> {quote}-3.237520E-31
> {quote}
> Please note that the last digit is 0 instead of 1 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"),
>  100, java.math.RoundingMode.DOWN)
> val res22: java.math.BigDecimal = 
> -3.237520489418037889998826491401059986665344697406144511563561222578738E-31
> {code}
> Since the fractional part .237520 is followed by 4894..., it should not be 
> rounded up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38167) CSV parsing error when using escape='"'

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813741#comment-17813741
 ] 

Nicholas Chammas commented on SPARK-38167:
--

[~marnixvandenbroek] - Could you link to the bug report you filed with 
Univocity?

cc [~maxgekk] - I believe you have hit some parsing bugs in Univocity recently.

> CSV parsing error when using escape='"' 
> 
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
> cluster.
>Reporter: Marnix van den Broek
>Priority: Major
>  Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
>  # reading a comma separated, double-quote quoted CSV file using the csv 
> reader options _escape='"'_ and {_}header=True{_},
>  # with a row containing a quoted empty field
>  # followed by a quoted field starting with a comma and followed by one or 
> more characters
> selecting columns from the dataframe at or after the field described in 3) 
> gives incorrect and inconsistent results
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
>  
> {code:java}
> col1,col2
> "",",a"
> {code}
>  
> using the CSV reader options escape='"' (unnecessary for the example, 
> necessary for the files I'm processing) and header=True, I expect the 
> following result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>  
> +++
> |col1|col2|
> +++
> |null|  ,a|
> +++   {code}
>  
>  Spark does yield this result, so far so good. However, when I select col2 
> from the dataframe, Spark yields an incorrect result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>  
> ++
> |col2|
> ++
> |  a"|
> ++{code}
>  
> If you run this example with more columns in the file, and more commas in the 
> field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
> the right, causing unexpected and incorrect results. The inconsistency 
> between both methods surprised me, as it implies the parsing is evaluated 
> differently between both methods. 
> I expect the bug to be located in the quote-balancing and un-escaping methods 
> of the csv parser, but I can't find where that code is located in the code 
> base. I'd be happy to take a look at it if anyone can point me where it is. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Affects Version/s: (was: 3.5.0)

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813733#comment-17813733
 ] 

Nicholas Chammas commented on SPARK-42399:
--

This issue does indeed appear to be resolved on {{master}} when ANSI mode is 
enabled:
{code:java}
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS 
>>> result").show(truncate=False)
++
|result              |
++
|18446744073709551615|
++
>>> spark.conf.set("spark.sql.ansi.enabled", "true")
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS 
>>> result").show(truncate=False)
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.ArithmeticException: [ARITHMETIC_OVERFLOW] 
Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error. SQLSTATE: 22003
== SQL (line 1, position 8) ==
SELECT CONV('', 
16, 10) AS result
       

 {code}
However, there is still a silent overflow when ANSI mode is disabled. The error 
message suggests this is intended behavior.

cc [~gengliang] and [~gurwls223], who resolved SPARK-42427.

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Affects Version/s: 3.5.0

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Labels: correctness pull-request-available  (was: pull-request-available)

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non ANSI either an error or a NULL maybe be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-02-01 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It aligns most closely to the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a "class" to a "category" is low impact and may 
not show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms do not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

h1. Option 3: "Error Class" and "State Class"
 * SQL state class: 42
 * SQL state sub-class: K01
 * SQL state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a "class" to a "state class" is low impact and 
may not show up in user-facing documentation at all. (See my side note below.)

Cons:
 * "State class" vs. "Error class" is a bit confusing.
 * These terms do not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

—

Side note: In any case, I believe talking about &quo

[jira] [Created] (SPARK-46935) Consolidate error documentation

2024-01-31 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46935:


 Summary: Consolidate error documentation
 Key: SPARK-46935
 URL: https://issues.apache.org/jira/browse/SPARK-46935
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-31 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a class to a category is low impact and may not 
show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms may not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

h1. Option 3: "Error Class" and "State Class"
 * SQL state class: 42
 * SQL state sub-class: K01
 * SQL state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

—

Side note: In any case, I believe talking about "42" and "K01" – regardless of 
what we end up calling them – in front of users is not helpful. I don't think 
anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we talk about these concepts in the 
user-facing documentation.
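
To make the hierarchy above concrete, here is a rough sketch (my own 
illustration, not actual Spark code or data) of how the levels fit together. The 
inlined entry only approximates the shape of {{error-classes.json}}, and the 
message text is invented:

{code}
entry = {
    "INCOMPLETE_TYPE_DEFINITION": {      # "error condition" (Option 1) / "error class" (Option 2)
        "sqlState": "42K01",             # error state = class "42" + sub-class "K01"
        "subClass": {                    # "sub-conditions" (Option 1) / "sub-classes" (Option 2)
            "ARRAY": {"message": ["The definition of the ARRAY type is incomplete."]},
            "MAP": {"message": ["The definition of the MAP type is incomplete."]},
            "STRUCT": {"message": ["The definition of the STRUCT type is incomplete."]},
        },
    },
}

name, details = next(iter(entry.items()))
print(name, details["sqlState"], sorted(details["subClass"]))
{code}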

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several

[jira] [Updated] (SPARK-46923) Limit width of config tables in documentation and style them consistently

2024-01-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46923:
-
Summary: Limit width of config tables in documentation and style them 
consistently  (was: Style config tables in documentation consistently)

> Limit width of config tables in documentation and style them consistently
> -
>
> Key: SPARK-46923
> URL: https://issues.apache.org/jira/browse/SPARK-46923
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46923) Style config tables in documentation consistently

2024-01-30 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46923:


 Summary: Style config tables in documentation consistently
 Key: SPARK-46923
 URL: https://issues.apache.org/jira/browse/SPARK-46923
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-29 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811923#comment-17811923
 ] 

Nicholas Chammas commented on SPARK-46810:
--

I think Option 3 is a good compromise that lets us continue calling 
{{INCOMPLETE_TYPE_DEFINITION}} an "error class", which perhaps would be the 
least disruptive to Spark developers.

However, for the record, the SQL standard only seems to use the term "class" in 
the context of the 5-character SQLSTATE. Otherwise, the standard uses the term 
"condition" or "exception condition".

I don't have a copy of the SQL 2016 standard handy. It's not available for sale 
on ISO's website; the only option appears to be to purchase [the SQL 2023 
standard for ~$220|https://www.iso.org/standard/76583.html].

However, there is a copy of the [SQL 1992 standard available 
publicly|https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]. 

Table 23 on page 619 is relevant:

{code}
 Table 23 - SQLSTATE class and subclass values

 Condition                 Class   Subcondition                      Subclass

 ambiguous cursor name     3C      (no subclass)                     000
 cardinality violation     21      (no subclass)                     000
 connection exception      08      (no subclass)                     000
                                   connection does not exist         003
                                   connection failure                006
                                   connection name in use            002
                                   SQL-client unable to establish    001
                                   SQL-connection
 ...
{code}

I think this maps closest to Option 1, but again if we want to go with Option 3 
I think that's reasonable too. But in the case of Option 3 we should then 
retire [our use of the term "error 
condition"|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html] so 
that we don't use multiple terms to refer to the same thing.
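
As an aside, the split that Table 23 describes is purely positional, so mapping 
a Spark SQLSTATE onto it is mechanical. A minimal sketch (mine, not code from 
Spark or the standard):

{code}
def split_sqlstate(sqlstate: str) -> tuple[str, str]:
    """Split a 5-character SQLSTATE into its 2-character class and 3-character subclass."""
    assert len(sqlstate) == 5
    return sqlstate[:2], sqlstate[2:]

print(split_sqlstate("42K01"))  # ('42', 'K01')
print(split_sqlstate("08003"))  # ('08', '003') -- "connection does not exist" above
{code}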

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call 

[jira] [Created] (SPARK-46894) Move PySpark error conditions into standalone JSON file

2024-01-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46894:


 Summary: Move PySpark error conditions into standalone JSON file
 Key: SPARK-46894
 URL: https://issues.apache.org/jira/browse/SPARK-46894
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811627#comment-17811627
 ] 

Nicholas Chammas commented on SPARK-46810:
--

Thanks for sharing the relevant quote, [~srielau].

1. So just to be clear, you are saying you prefer Option 1. Is that correct? I 
will update the PR accordingly.

2. Is there anyone else we need buy-in from before moving forward? [~maxgekk], 
perhaps?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It may also match the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a class to a category is low impact and may 
> not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms may not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|ht

[jira] [Comment Edited] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811470#comment-17811470
 ] 

Nicholas Chammas edited comment on SPARK-46810 at 1/27/24 5:00 AM:
---

[~srielau] - What do you think of the problem and proposed solutions described 
above?

I am partial to Option 1, but certainly either solution will need buy-in from 
whoever cares about how we manage and document errors.

Also, you mentioned [on the 
PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL 
standard uses specific terms. Could you link to or quote the relevant parts?


was (Author: nchammas):
[~srielau] - What do you think of the problem and proposed solutions described 
above?

Also, you mentioned [on the 
PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL 
standard uses specific terms. Could you link to or quote the relevant parts?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It may also match the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLET

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a class to a category is low impact and may not 
show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms may not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
 in user-facing documentation.

—

Side note: In either case, I believe talking about "42" and "K01" – regardless 
of what we end up calling them – in front of users is not helpful. I don't 
think anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we talk about these concepts in the 
user-facing documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the nam

[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811470#comment-17811470
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~srielau] - What do you think of the problem and proposed solutions described 
above?

Also, you mentioned [on the 
PR|https://github.com/apache/spark/pull/44902/files#r1468258626] that the SQL 
standard uses specific terms. Could you link to or quote the relevant parts?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It may also match the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a class to a category is low impact and may 
> not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms may not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [a

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to strings like 
INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide between code 
and user docs in the documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another approach is to use the following terminology:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

Pros:
 * We continue to use "error class" as we do today in our code base.
 * The change from calling "42" a class to a category is low impact and may not 
show up in user-facing documentation at all. (See my side note below.)

Cons:
 * These terms may not align with the SQL standard.
 * We will have to retire the term "error condition", which we have [already 
used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md#L0-L1]
 in user-facing documentation.

—

Side note: In either case, I believe talking about "42" and "K01" – regardless 
of what we end up calling them – in front of users is not helpful. I don't 
think anybody cares what "42" by itself means, or what "K01" by itself means. 
Accordingly, we should limit how much we talk about these concepts in the 
user-facing documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the nam

[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-26 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I don't think we should leave this status quo as-is. I see a couple of ways to 
fix this.
h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"

One solution is to use the following terms:
 * Error class: 42
 * Error sub-class: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-condition: ARRAY, MAP, STRUCT

Pros: 
 * This terminology seems (to me at least) the most natural and intuitive.
 * It may also match the SQL standard.

Cons:
 * We use {{errorClass}} [all over our 
codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
 – literally in thousands of places – to refer to INCOMPLETE_TYPE_DEFINITION.
 ** It's probably not practical to update all these usages to say 
{{errorCondition}} instead, so if we go with this approach there will be a 
divide between the terminology we use in user-facing documentation vs. what the 
code base uses.
 ** We can perhaps rename the existing {{error-classes.json}} to 
{{error-conditions.json}} but clarify the reason for this divide in the 
documentation for {{ErrorClassesJsonReader}} .

h1. Option 2: 42 becomes an "Error Category"

Another 
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

We should not use "error condition" if one of the above terms more accurately 
describes what we are talking about.

Side note: With this terminology, I believe talking about error categories and 
sub-categories in front of users is not helpful. I don't think anybody cares 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the user-facing 
documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/commo

[jira] [Created] (SPARK-46863) Clean up custom.css

2024-01-25 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46863:


 Summary: Clean up custom.css
 Key: SPARK-46863
 URL: https://issues.apache.org/jira/browse/SPARK-46863
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46825) Build Spark only once when building docs

2024-01-23 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46825:


 Summary: Build Spark only once when building docs
 Key: SPARK-46825
 URL: https://issues.apache.org/jira/browse/SPARK-46825
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46819) Port error class data to automation-friendly format

2024-01-23 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46819:


 Summary: Port error class data to automation-friendly format
 Key: SPARK-46819
 URL: https://issues.apache.org/jira/browse/SPARK-46819
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


As described in SPARK-46810, we have several types of error data captured in 
our code and documentation.

Unfortunately, a good chunk of this data is in a Markdown table that is not 
friendly to automation (e.g. to generate documentation, or run tests).

[https://github.com/apache/spark/blob/d1fbc4c7191aafadada1a6f7c217bf89f6cae49f/common/utils/src/main/resources/error/README.md#L121]

We should migrate this error data to an automation-friendly format.
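
As a rough sketch of what "automation-friendly" could look like, even just 
converting the pipe-delimited table to JSON would make it scriptable. The column 
names and row below are placeholders, not the README's actual contents:

{code}
import json

markdown = """
| SQLSTATE | Class | Condition                  |
|----------|-------|----------------------------|
| 42K01    | 42    | incomplete type definition |
"""

rows = [
    [cell.strip() for cell in line.strip().strip("|").split("|")]
    for line in markdown.strip().splitlines()
]
header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
records = [dict(zip(header, row)) for row in body]
print(json.dumps(records, indent=2))
{code}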

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?
 * [On this 
page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
 we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
places we refer to it as an "error class".

I personally like the terminology "error condition", but as we are already 
using "error class" very heavily throughout the codebase to refer to something 
like INCOMPLETE_TYPE_DEFINITION, I don't think it's practical to change at this 
point.

To rationalize the different terms we are using, I propose the following 
terminology, which we should use consistently throughout our code and 
documentation:
 * Error category: 42
 * Error sub-category: K01
 * Error state: 42K01
 * Error class: INCOMPLETE_TYPE_DEFINITION
 * Error sub-classes: ARRAY, MAP, STRUCT

We should not use "error condition" if one of the above terms more accurately 
describes what we are talking about.

Side note: With this terminology, I believe talking about error categories and 
sub-categories in front of users is not helpful. I don't think anybody cares 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the user-facing 
documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody 

Re: End-user troubleshooting of bad c-ares interaction with router

2024-01-23 Thread Nicholas Chammas via c-ares
To close the loop on this discussion, I’ve filed the following issue with the 
gRPC folks:

https://github.com/grpc/grpc/issues/35638

Thank you again for all of your help. I would not have been able to understand 
what’s going on without it.


> On Jan 23, 2024, at 11:43 AM, Brad House  wrote:
> 
> Yeah, it does clearly show them enqueuing IPv4 and IPv6 requests separately.  
> So either they need to add logic similar to c-ares has internally with 
> https://github.com/c-ares/c-ares/pull/551 or just use ares_getaddrinfo() 
> instead of ares_gethostbyname() with address family AF_UNSPEC and let c-ares 
> do the right thing.
> 
> 
> 
> On 1/23/24 11:25 AM, Nicholas Chammas wrote:
>> Thank you for all the troubleshooting help, Brad.
>> 
>> I am using gRPC via Apache Spark Connect (a Python library), so I am two 
>> levels removed from c-ares itself. Looking in the Python virtual environment 
>> where gRPC is installed, I’m not sure what file to run otool on. The only 
>> seemingly relevant file I could find is called cygrpc.cpython-311-darwin.so, 
>> and otool didn’t turn up anything interesting on it.
>> 
>> I will take this issue up with the gRPC folks.
>> 
>> I see in several places that the gRPC folks are using ares_gethostbyname:
>> https://github.com/grpc/grpc/blob/v1.60.0/src/core/lib/event_engine/ares_resolver.cc#L287-L293
>> https://github.com/grpc/grpc/blob/v1.60.0/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc#L748-L758
>> https://github.com/grpc/grpc/blob/v1.60.0/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc#L1075-L1086
>> 
>> 
>>> On Jan 22, 2024, at 1:39 PM, Brad House  
>>> <mailto:b...@brad-house.com> wrote:
>>> 
>>> Are you using gRPC installed via homebrew or is it bundled with something 
>>> else?  Usually package maintainers like homebrew will dynamically link to 
>>> the system versions of dependencies so they can be updated independently.  
>>> You might be able to run otool -L on grpc to see what c-ares library its 
>>> picking up (and if none are listed, it might be compiled in statically).
>>> 
>>> That said, according to your grpc logs, it appears that grpc may be itself 
> performing both A and AAAA queries and expect responses to both of those.  
> I see the "A" reply comes back but the "AAAA" reply never comes and it 
> bails at that point.  Many years ago c-ares didn't have a way to request 
> both A and AAAA records with one query, but does these days via 
>>> ares_getaddrinfo(), and it was recently enhanced with logic to assist in 
>>> the exact scenario you are seeing, basically it will stop retrying when at 
>>> least one address family is returned. 
>>> 
>>> You might need to escalate this to the gRPC folks.
>>> 
>>> On 1/22/24 12:10 PM, Nicholas Chammas wrote:
>>>> Here’s the output of adig and ahost 
>>>> <https://gist.github.com/nchammas/a4c9873d8158c323796e9b47c064e63a#file-adig-ahost-txt>,
>>>>  both with and without the DNS servers set directly on the network 
>>>> interface (vs. just on the router).
>>>> 
>>>> I also learned that gRPC 1.60.0 may be using c-ares 1.19.1 
>>>> <https://github.com/grpc/grpc/tree/v1.60.0/third_party/cares>, though 
>>>> again that’s just via looking at the gRPC source and not via some runtime 
>>>> query.
>>>> 
>>>> 
>>>>> On Jan 21, 2024, at 7:34 AM, Brad House  
>>>>> <mailto:b...@brad-house.com> wrote:
>>>>> 
>>>>> I think homebrew distributes the 'adig' and 'ahost' utilities from 
>>>>> c-ares.  Can you try using those to do the same lookup so we can see the 
>>>>> results?
>>>>> 
>>>>> On 1/19/24 11:01 AM, Nicholas Chammas wrote:
>>>>>> 
>>>>>>> On Jan 17, 2024, at 3:38 PM, Brad House  
>>>>>>> <mailto:b...@brad-house.com> wrote:
>>>>>>> What version of c-ares is installed?
>>>>>>> 
>>>>>> Sorry about the delay in responding. Answering this question is more 
>>>>>> difficult than I expected.
>>>>>> 
>>>>>> I know that Spark Connect is running gRPC 1.60.0. Looking through the 
>>>>>> gRPC repo, I see mention of c-ares 1.13.0 
>>>>>> <https://github.com/grpc/grpc/blob/v1.60.0/cmake/cares.cmake#L42>, but I 
>>>>>>

Re: End-user troubleshooting of bad c-ares interaction with router

2024-01-23 Thread Nicholas Chammas via c-ares
Thank you for all the troubleshooting help, Brad.

I am using gRPC via Apache Spark Connect (a Python library), so I am two levels 
removed from c-ares itself. Looking in the Python virtual environment where 
gRPC is installed, I’m not sure what file to run otool on. The only seemingly 
relevant file I could find is called cygrpc.cpython-311-darwin.so, and otool 
didn’t turn up anything interesting on it.

I will take this issue up with the gRPC folks.

I see in several places that the gRPC folks are using ares_gethostbyname:
https://github.com/grpc/grpc/blob/v1.60.0/src/core/lib/event_engine/ares_resolver.cc#L287-L293
https://github.com/grpc/grpc/blob/v1.60.0/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc#L748-L758
https://github.com/grpc/grpc/blob/v1.60.0/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc#L1075-L1086
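
In case it helps anyone else who lands on this thread, the workaround quoted 
further down (telling gRPC to skip c-ares entirely) looks roughly like this from 
Python; the Spark Connect endpoint is just a placeholder:

import os

# Must be set before gRPC builds its DNS resolver, i.e. before the channel is created.
os.environ["GRPC_DNS_RESOLVER"] = "native"

from pyspark.sql import SparkSession

# Placeholder endpoint; substitute your own Spark Connect server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()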


> On Jan 22, 2024, at 1:39 PM, Brad House  wrote:
> 
> Are you using gRPC installed via homebrew or is it bundled with something 
> else?  Usually package maintainers like homebrew will dynamically link to the 
> system versions of dependencies so they can be updated independently.  You 
> might be able to run otool -L on grpc to see what c-ares library its picking 
> up (and if none are listed, it might be compiled in statically).
> 
> That said, according to your grpc logs, it appears that grpc may be itself 
> performing both A and AAAA queries and expect responses to both of those.  I 
> see the "A" reply comes back but the "AAAA" reply never comes and it bails at 
> that point.  Many years ago c-ares didn't have a way to request both A and 
> AAAA records with one query, but does these days via ares_getaddrinfo(), and 
> it was recently enhanced with logic to assist in the exact scenario you are 
> seeing, basically it will stop retrying when at least one address family is 
> returned. 
> 
> You might need to escalate this to the gRPC folks.
> 
> On 1/22/24 12:10 PM, Nicholas Chammas wrote:
>> Here’s the output of adig and ahost 
>> <https://gist.github.com/nchammas/a4c9873d8158c323796e9b47c064e63a#file-adig-ahost-txt>,
>>  both with and without the DNS servers set directly on the network interface 
>> (vs. just on the router).
>> 
>> I also learned that gRPC 1.60.0 may be using c-ares 1.19.1 
>> <https://github.com/grpc/grpc/tree/v1.60.0/third_party/cares>, though again 
>> that’s just via looking at the gRPC source and not via some runtime query.
>> 
>> 
>>> On Jan 21, 2024, at 7:34 AM, Brad House  
>>> <mailto:b...@brad-house.com> wrote:
>>> 
>>> I think homebrew distributes the 'adig' and 'ahost' utilities from c-ares.  
>>> Can you try using those to do the same lookup so we can see the results?
>>> 
>>> On 1/19/24 11:01 AM, Nicholas Chammas wrote:
>>>> 
>>>>> On Jan 17, 2024, at 3:38 PM, Brad House  
>>>>> <mailto:b...@brad-house.com> wrote:
>>>>> What version of c-ares is installed?
>>>>> 
>>>> Sorry about the delay in responding. Answering this question is more 
>>>> difficult than I expected.
>>>> 
>>>> I know that Spark Connect is running gRPC 1.60.0. Looking through the 
>>>> gRPC repo, I see mention of c-ares 1.13.0 
>>>> <https://github.com/grpc/grpc/blob/v1.60.0/cmake/cares.cmake#L42>, but I 
>>>> don’t know how that translates to my runtime. Homebrew tells me I have 
>>>> c-ares 1.25.0 installed, but again, I’m not sure if that’s what I’m 
>>>> actually running.
>>>> 
>>>> Is there a way I can directly query the version of c-ares being run via 
>>>> Spark Connect / gRPC? I asked this question on the gRPC forum 
>>>> <https://groups.google.com/g/grpc-io/c/3tZCa48Xvh8> but no response yet.
>>>> 
>>>> For the record, I know that c-ares is involved because if I tell gRPC to 
>>>> not use it (via GRPC_DNS_RESOLVER=native 
>>>> <https://github.com/grpc/grpc/blob/b34d98fbd47834845e3f9cdaa4aa706f1aa4eddb/doc/environment_variables.md>)
>>>>  then my problem disappears.
>>>>> What DNS servers are configured on your MacOS system when its not 
>>>>> operating properly?  The output of "scutil --dns" would be helpful here.
>>>>> 
>>>> Here’s that output. 
>>>> <https://gist.github.com/nchammas/a4c9873d8158c323796e9b47c064e63a#file-scutil-dns-txt>
>>>>  I believe 192.168.1.1 is just my local router, and on there is where I 
>>>> have the default DNS servers set to 1.1.1.1 and 1.0.0.1.
>>>> 
>> 

-- 
c-ares mailing list
c-ares@lists.haxx.se
https://lists.haxx.se/mailman/listinfo/c-ares


[jira] [Updated] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46810:
-
Description: 
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what "42" by itself means, or what "K01" by itself means. Accordingly, we 
should limit how much we talk about these concepts in the user-facing 
documentation.

  was:
We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
  ARRAY
  MAP
  STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what 42 by itself means, or what K01 by itself means. Accordingly, we should 
limit how much we talk about these concepts.


> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>      Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing 

[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809804#comment-17809804
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~itholic] [~gurwls223] - What do you think?

cc also [~karenfeng], who I see in git blame as the original contributor of 
error classes.

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
> **** ARRAY
> **** MAP
> **** STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
> I propose the following terminology, which we should use consistently 
> throughout our code and documentation:
>  * Error class: 42
>  * Error subclass: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-conditions: ARRAY, MAP, STRUCT
> Side note: With this terminology, I believe talking about error classes and 
> subclasses in front of users is not helpful. I don't think anybody cares 
> about what 42 by itself means, or what K01 by itself means. Accordingly, we 
> should limit how much we talk about these concepts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46810) Clarify error class terminology

2024-01-23 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46810:


 Summary: Clarify error class terminology
 Key: SPARK-46810
 URL: https://issues.apache.org/jira/browse/SPARK-46810
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas


We use inconsistent terminology when talking about error classes. I'd like to 
get some clarity on that before contributing any potential improvements to this 
part of the documentation.

Consider 
[INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
 It has several key pieces of hierarchical information that have inconsistent 
names throughout our documentation and codebase:
 * 42
 ** K01
 *** INCOMPLETE_TYPE_DEFINITION
 **** ARRAY
 **** MAP
 **** STRUCT

What are the names of these different levels of information?

Some examples of inconsistent terminology:
 * [Over 
here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
 we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION we 
call that an "error class". So what exactly is a class, the 42 or the 
INCOMPLETE_TYPE_DEFINITION?
 * [Over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
 we call K01 the "subclass". But [over 
here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
 we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
So what exactly is a subclass?

I propose the following terminology, which we should use consistently 
throughout our code and documentation:
 * Error class: 42
 * Error subclass: K01
 * Error state: 42K01
 * Error condition: INCOMPLETE_TYPE_DEFINITION
 * Error sub-conditions: ARRAY, MAP, STRUCT

Side note: With this terminology, I believe talking about error classes and 
subclasses in front of users is not helpful. I don't think anybody cares about 
what 42 by itself means, or what K01 by itself means. Accordingly, we should 
limit how much we talk about these concepts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46807) Include automation notice in SQL error class documents

2024-01-22 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46807:


 Summary: Include automation notice in SQL error class documents
 Key: SPARK-46807
 URL: https://issues.apache.org/jira/browse/SPARK-46807
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[grpc-io] Re: What version of c-ares is gRPC running?

2024-01-22 Thread Nicholas Chammas
Looks like the information I'm looking for may be here:

https://github.com/grpc/grpc/tree/v1.60.0/third_party/cares

If I'm understanding this correctly, gRPC 1.60.0 should be running c-ares 
1.19.1.
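
For anyone else trying to pin this down: I don't see a runtime query for the 
bundled c-ares, but you can at least confirm the installed grpcio version from 
Python and then read off the c-ares pin from that tag's third_party/cares 
submodule. Roughly:

import grpc
print(grpc.__version__)  # e.g. "1.60.0"; then check third_party/cares at that tag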

On Thursday, January 18, 2024 at 8:13:34 PM UTC-5 Nicholas Chammas wrote:

> How can I tell what version of c-ares gRPC is running?
>
> I am running a Spark Connect program which uses gRPC under the hood. I 
> tried enabling some gRPC debug information as follows:
>
> GRPC_TRACE=cares_resolver,cares_address_sorting,dns_resolver 
> GRPC_VERBOSITY=DEBUG python my-script.py
>
> But even though I see log lines related to c-ares, I don't see anything 
> that tells me what version of c-ares is running.
>
> I believe Spark Connect is using gRPC 1.60.0 under the hood, and looking 
> through the source I see mention of c-ares 1.13.0 
> <https://github.com/grpc/grpc/blob/v1.60.0/cmake/cares.cmake#L42>. But 
> this looks like a conditional build instruction, and I am not sure how this 
> translates to my runtime.
>
> So is there any way I can be sure of the version of c-ares that gRPC is 
> running on my system?
>
> Nick
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to grpc-io+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/c09d5c61-373d-4827-9261-c9edc14b99dan%40googlegroups.com.


Re: End-user troubleshooting of bad c-ares interaction with router

2024-01-22 Thread Nicholas Chammas via c-ares
Here’s the output of adig and ahost 
<https://gist.github.com/nchammas/a4c9873d8158c323796e9b47c064e63a#file-adig-ahost-txt>,
 both with and without the DNS servers set directly on the network interface 
(vs. just on the router).

I also learned that gRPC 1.60.0 may be using c-ares 1.19.1 
<https://github.com/grpc/grpc/tree/v1.60.0/third_party/cares>, though again 
that’s just via looking at the gRPC source and not via some runtime query.


> On Jan 21, 2024, at 7:34 AM, Brad House  wrote:
> 
> I think homebrew distributes the 'adig' and 'ahost' utilities from c-ares.  
> Can you try using those to do the same lookup so we can see the results?
> 
> On 1/19/24 11:01 AM, Nicholas Chammas wrote:
>> 
>>> On Jan 17, 2024, at 3:38 PM, Brad House  
>>> <mailto:b...@brad-house.com> wrote:
>>> What version of c-ares is installed?
>>> 
>> Sorry about the delay in responding. Answering this question is more 
>> difficult than I expected.
>> 
>> I know that Spark Connect is running gRPC 1.60.0. Looking through the gRPC 
>> repo, I see mention of c-ares 1.13.0 
>> <https://github.com/grpc/grpc/blob/v1.60.0/cmake/cares.cmake#L42>, but I 
>> don’t know how that translates to my runtime. Homebrew tells me I have 
>> c-ares 1.25.0 installed, but again, I’m not sure if that’s what I’m actually 
>> running.
>> 
>> Is there a way I can directly query the version of c-ares being run via 
>> Spark Connect / gRPC? I asked this question on the gRPC forum 
>> <https://groups.google.com/g/grpc-io/c/3tZCa48Xvh8> but no response yet.
>> 
>> For the record, I know that c-ares is involved because if I tell gRPC to not 
>> use it (via GRPC_DNS_RESOLVER=native 
>> <https://github.com/grpc/grpc/blob/b34d98fbd47834845e3f9cdaa4aa706f1aa4eddb/doc/environment_variables.md>)
>>  then my problem disappears.
>>> What DNS servers are configured on your MacOS system when its not operating 
>>> properly?  The output of "scutil --dns" would be helpful here.
>>> 
>> Here’s that output. 
>> <https://gist.github.com/nchammas/a4c9873d8158c323796e9b47c064e63a#file-scutil-dns-txt>
>>  I believe 192.168.1.1 is just my local router, and on there is where I have 
>> the default DNS servers set to 1.1.1.1 and 1.0.0.1.
>> 

-- 
c-ares mailing list
c-ares@lists.haxx.se
https://lists.haxx.se/mailman/listinfo/c-ares


[jira] [Commented] (AVRO-3923) Add Avro 1.11.3 release blog

2024-01-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809199#comment-17809199
 ] 

Nicholas Chammas commented on AVRO-3923:


Silly question but: Is this URL supposed to 404?

[https://avro.apache.org/docs/1.11.3/specification/]

Where are the docs for 1.11.3?

> Add Avro 1.11.3 release blog
> 
>
> Key: AVRO-3923
> URL: https://issues.apache.org/jira/browse/AVRO-3923
> Project: Apache Avro
>  Issue Type: Improvement
>  Components: website
>Affects Versions: 1.11.3
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
Oh, that’s a very interesting dashboard. I was familiar with the Matomo snippet 
but never looked up where exactly those metrics were going.

I see that the Kinesis docs do indeed have around 650 views in the past month, 
but for Kafka I see 11K and 1.3K views for the Structured Streaming and DStream 
docs, respectively. Big difference there, though maybe that's because Kinesis 
doesn’t have docs for structured streaming. Hard to tell.

These statistics also raise questions about the future of the R API, though 
that’s a topic for another thread.

Nick


> On Jan 20, 2024, at 1:05 PM, Sean Owen  wrote:
> 
> I'm not aware of much usage. but that doesn't mean a lot.
> 
> FWIW, in the past month or so, the Kinesis docs page got about 700 views, 
> compared to about 1400 for Kafka
> https://analytics.apache.org/index.php?module=CoreHome=index=yesterday=day=40#?idSite=40=range=2023-12-15,2024-01-20=General_Actions=Actions_SubmenuPageTitles
> 
> Those are "low" in general, compared to the views for streaming pages, which 
> got tens of thousands of views.
> 
> I do feel like it's unmaintained, and do feel like it might be a stretch to 
> leave it lying around until Spark 5.
> It's not exactly unused though.
> 
> I would not object to removing it unless there is some voice of support here.
> 
> On Sat, Jan 20, 2024 at 10:38 AM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> From the dev thread: What else could be removed in Spark 4? 
>> <https://lists.apache.org/thread/shxj7qmrtqbxqf85lrlsv6510892ktnz>
>>> On Aug 17, 2023, at 1:44 AM, Yang Jie >> <mailto:yangji...@apache.org>> wrote:
>>> 
>>> I would like to know how we should handle the two Kinesis-related modules 
>>> in Spark 4.0. They have a very low frequency of code updates, and because 
>>> the corresponding tests are not continuously executed in any GitHub Actions 
>>> pipeline, so I think they significantly lack quality assurance. On top of 
>>> that, I am not certain if the test cases, which require AWS credentials in 
>>> these modules, get verified during each Spark version release.
>> 
>> Did we ever reach a decision about removing Kinesis in Spark 4?
>> 
>> I was cleaning up some docs related to Kinesis and came across a reference 
>> to some Java API docs that I could not find 
>> <https://github.com/apache/spark/pull/44802#discussion_r1459337001>. And 
>> looking around I came across both this email thread and this thread on JIRA 
>> <https://issues.apache.org/jira/browse/SPARK-45720?focusedCommentId=17787227=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17787227>
>>  about potentially removing Kinesis.
>> 
>> But as far as I can tell we haven’t made a clear decision one way or the 
>> other.
>> 
>> Nick
>> 



Removing Kinesis in Spark 4

2024-01-20 Thread Nicholas Chammas
From the dev thread: What else could be removed in Spark 4? 

> On Aug 17, 2023, at 1:44 AM, Yang Jie  wrote:
> 
> I would like to know how we should handle the two Kinesis-related modules in 
> Spark 4.0. They have a very low frequency of code updates, and because the 
> corresponding tests are not continuously executed in any GitHub Actions 
> pipeline, so I think they significantly lack quality assurance. On top of 
> that, I am not certain if the test cases, which require AWS credentials in 
> these modules, get verified during each Spark version release.

Did we ever reach a decision about removing Kinesis in Spark 4?

I was cleaning up some docs related to Kinesis and came across a reference to 
some Java API docs that I could not find. And looking around I came across both 
this email thread and this thread on JIRA about potentially removing Kinesis.

But as far as I can tell we haven’t made a clear decision one way or the other.

Nick



[jira] [Created] (SPARK-46775) Fix formatting of Kinesis docs

2024-01-19 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46775:


 Summary: Fix formatting of Kinesis docs
 Key: SPARK-46775
 URL: https://issues.apache.org/jira/browse/SPARK-46775
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: End-user troubleshooting of bad c-ares interaction with router

2024-01-19 Thread Nicholas Chammas via c-ares

> On Jan 17, 2024, at 3:38 PM, Brad House  wrote:
> What version of c-ares is installed?
> 
Sorry about the delay in responding. Answering this question is more difficult 
than I expected.

I know that Spark Connect is running gRPC 1.60.0. Looking through the gRPC 
repo, I see mention of c-ares 1.13.0, but I don’t know how that translates to my 
runtime. Homebrew tells me I have c-ares 1.25.0 installed, but again, I’m not 
sure if that’s what I’m actually running.

Is there a way I can directly query the version of c-ares being run via Spark 
Connect / gRPC? I asked this question on the gRPC forum but no response yet.

For the record, I know that c-ares is involved because if I tell gRPC to not 
use it (via GRPC_DNS_RESOLVER=native) then my problem disappears.
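
In case it helps anyone reproduce this, the toggle on the Python side is just an 
environment variable (rough sketch; the connection string is a placeholder):

```
import os
# Set before grpc is loaded; exporting it in the shell before launching works too.
os.environ["GRPC_DNS_RESOLVER"] = "native"

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://<host>").getOrCreate()
```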
> What DNS servers are configured on your MacOS system when its not operating 
> properly?  The output of "scutil --dns" would be helpful here.
> 
Here’s that output. I believe 192.168.1.1 is just my local router, and on there 
is where I have the default DNS servers set to 1.1.1.1 and 1.0.0.1.

-- 
c-ares mailing list
c-ares@lists.haxx.se
https://lists.haxx.se/mailman/listinfo/c-ares


[jira] [Created] (SPARK-46764) Reorganize Ruby script to build API docs

2024-01-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46764:


 Summary: Reorganize Ruby script to build API docs
 Key: SPARK-46764
 URL: https://issues.apache.org/jira/browse/SPARK-46764
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[grpc-io] What version of c-ares is gRPC running?

2024-01-18 Thread Nicholas Chammas
How can I tell what version of c-ares gRPC is running?

I am running a Spark Connect program which uses gRPC under the hood. I 
tried enabling some gRPC debug information as follows:

GRPC_TRACE=cares_resolver,cares_address_sorting,dns_resolver 
GRPC_VERBOSITY=DEBUG python my-script.py

But even though I see log lines related to c-ares, I don't see anything 
that tells me what version of c-ares is running.

I believe Spark Connect is using gRPC 1.60.0 under the hood, and looking 
through the source I see mention of c-ares 1.13.0 
. But this 
looks like a conditional build instruction, and I am not sure how this 
translates to my runtime.

So is there any way I can be sure of the version of c-ares that gRPC is 
running on my system?

Nick

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to grpc-io+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/c24e5292-cbc3-4254-86a6-0fc4d19a59d1n%40googlegroups.com.


End-user troubleshooting of bad c-ares interaction with router

2024-01-17 Thread Nicholas Chammas via c-ares
Hello,

I am trying to troubleshoot a problem as an end-user of c-ares. I use a library 
(Apache Spark Connect) that uses gRPC, which in turn uses c-ares. I am two 
levels removed from c-ares itself and
am a little out of my depth.

I have a little Python script that connects to a remote Apache Spark cluster 
via Spark Connect and runs a test query. When I run this script on my home 
network, it takes over 20 seconds to run. When I tether my workstation to my 
phone (which is connected via LTE), the same script runs in a second or two. In 
both cases the script runs successfully.

I enabled some gRPC debug flags which print out a lot of information. This led 
me to c-ares, as I believe the difference in runtime is related somehow to DNS.

I’ve extracted the log lines output by gRPC related to c-ares. (Be sure to 
scroll down to see both files; there is one for home and one for LTE.) The gRPC 
codebase is hosted on GitHub, where you can find the grpc_ares_wrapper.cc file 
mentioned in the log files.

I tried changing the DNS servers configured in my home router but that didn’t 
seem to help. Interestingly, however, if I set the same DNS servers already 
configured in my home router directly on the network interface I’m using, the 
20 second delay disappears:

```
networksetup -setdnsservers "My Network" 1.1.1.1 1.0.0.1
```

But this setting doesn’t persist across restarts, and only Spark Connect seems 
to have this problem. It seems there is some kind of bad interaction between 
c-ares and my router.

How can I dig deeper to understand what’s going wrong with my home network? I 
checked the c-ares docs  but I don’t see a way 
for an end-user to enable debug output from c-ares, e.g. via an environment 
variable.

Any suggestions? I’m running macOS 14.2.1. The router is an Apple AirPort.

Nick

-- 
c-ares mailing list
c-ares@lists.haxx.se
https://lists.haxx.se/mailman/listinfo/c-ares


[jira] [Commented] (RAT-352) Enable use of wildcard expressions in exclude file

2024-01-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807311#comment-17807311
 ] 

Nicholas Chammas commented on RAT-352:
--

> would it make sense to provide a CLI option that reads a .gitignore instead 
> of a .ratexclude file to allow for your feature request?

The Spark project needs separate listings in .gitignore vs. .rat-excludes. So 
as long as the new option simply changes how the patterns are interpreted (from 
regex to wildcard), then we can update our existing .rat-excludes to work with 
the new option.

The goal (for me at least) is to be able to look at .gitignore and 
.rat-excludes and interpret the entries in there the same way. I think it's 
more intuitive and easier to manage.

> Enable use of wildcard expressions in exclude file
> --
>
> Key: RAT-352
> URL: https://issues.apache.org/jira/browse/RAT-352
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: cli
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Due to the widespread use of git, I would find it much more intuitive if 
> .rat-excludes worked like .gitignore. I think most people on the Spark 
> project would agree (though, fair disclosure, I haven't polled them).
> Would it make sense to add a CLI option instructing RAT to interpret entries 
> in the exclude file as wildcard expressions (as opposed to regular 
> expressions) that work more or less like .gitignore?
> This feature request is somewhat related to RAT-265.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806954#comment-17806954
 ] 

Nicholas Chammas commented on SPARK-45599:
--

Using [Hypothesis|https://github.com/HypothesisWorks/hypothesis], I've managed 
to shrink the provided test case from 373 elements down to 14:

{code:python}
from math import nan
from pyspark.sql import SparkSession

HYPOTHESIS_EXAMPLE = [
(0.0,),
(2.0,),
(153.0,),
(168.0,),
(3252411229536261.0,),
(7.205759403792794e+16,),
(1.7976931348623157e+308,),
(0.25,),
(nan,),
(nan,),
(-0.0,),
(-128.0,),
(nan,),
(nan,),
]

spark = (
SparkSession.builder
.config("spark.log.level", "ERROR")
.getOrCreate()
)


def compare_percentiles(data, slices):
rdd = spark.sparkContext.parallelize(data, numSlices=1)
df = spark.createDataFrame(rdd, "val double")
result1 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

rdd = spark.sparkContext.parallelize(data, numSlices=slices)
df = spark.createDataFrame(rdd, "val double")
result2 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
{code}

Running this test fails as follows:

{code:python}
Traceback (most recent call last):  
  File ".../SPARK-45599.py", line 41, in 
compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
  File ".../SPARK-45599.py", line 37, in compare_percentiles
assert result1 == result2, f"{result1}, {result2}"
   ^^
AssertionError: 0.050044, -0.0
{code}
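
For reference, the shrinking came from a property-based test along these lines 
(just a sketch; it assumes the {{compare_percentiles}} helper above):

{code:python}
from hypothesis import given, settings
from hypothesis import strategies as st

# Rows are 1-tuples of doubles, mirroring the "val double" schema above.
rows = st.tuples(st.floats(allow_nan=True, allow_infinity=True, width=64))

@settings(deadline=None)
@given(data=st.lists(rows, min_size=1))
def test_percentile_ignores_partitioning(data):
    compare_percentiles(data, slices=2)
{code}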

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (

[jira] [Created] (RAT-352) Enable use of wildcard expressions in exclude file

2024-01-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created RAT-352:


 Summary: Enable use of wildcard expressions in exclude file
 Key: RAT-352
 URL: https://issues.apache.org/jira/browse/RAT-352
 Project: Apache Rat
  Issue Type: Improvement
  Components: cli
Reporter: Nicholas Chammas


Due to the widespread use of git, I would find it much more intuitive if 
.rat-excludes worked like .gitignore. I think most people on the Spark project 
would agree (though, fair disclosure, I haven't polled them).

Would it make sense to add a CLI option instructing RAT to interpret entries in 
the exclude file as wildcard expressions (as opposed to regular expressions) 
that work more or less like .gitignore?

This feature request is somewhat related to RAT-265.
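
To illustrate the difference with Python's standard library (just a sketch of 
the two interpretations, not RAT's actual matching code), an entry like 
{{*.iml}} reads naturally as a wildcard but is not even a valid regular 
expression:

{code:python}
import fnmatch
import re

entry = "*.iml"
filename = "foo.iml"

# Interpreted as a .gitignore-style wildcard, the entry does what you expect:
fnmatch.fnmatch(filename, entry)   # True

# Interpreted as a regular expression, the same entry is invalid:
# re.match(entry, filename)        # raises re.error ("nothing to repeat")

# The regex spelling of the same intent would be:
re.match(r".*\.iml$", filename)    # matches
{code}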



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RAT-323) Harmonize UIs

2024-01-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806930#comment-17806930
 ] 

Nicholas Chammas commented on RAT-323:
--

Big +1 for enabling the CLI to use SCM ignores as excludes.

The Apache Spark project uses RAT via the CLI, and I am currently trying to 
clean up the configured excludes there because it's a [total 
mess|https://github.com/apache/spark/blob/c0ff0f579daa21dcc6004058537d275a0dd2920f/dev/.rat-excludes].
 This is partly because RAT is not using the project's existing .gitignore 
files, and partly because people expect .rat-excludes to work the same way as 
.gitignore.

> Harmonize UIs
> -
>
> Key: RAT-323
> URL: https://issues.apache.org/jira/browse/RAT-323
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: cli
>Affects Versions: 0.16
>Reporter: Claude Warren
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The UIs (CLI, ANT and Maven) were all developed separately and have different 
> options.
> There is an overlap in some functionality and the functionality of some UIs 
> is not found in others.
> This task is to do two things:
>  # collect all the UI options, and ensure that they are all supported in the 
> ReportConfiguration class. 
>  # modify the UIs so that the names of the options are the same (or as 
> similar as possible) across the three UIs.  Renamed methods are to be 
> deprecated in favour of new methods.
>  
> Example:
> apache-rat-plugin has 3 options: parseSCMIgnoresAsExcludes, 
> useEclipseDefaultExcludes, useIdeaDefaultExcludes that change the file 
> filter.  These are options that would be useful in all UIs and should be 
> moved to the ReportConfiguration so that any UI can set them.
> By harmonization I mean that options like the above are extracted from the 
> specific UIs where they are implemented and moved to the ReportConfiguration 
> so that the implementations are in one place and can be shared across all UIs.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806150#comment-17806150
 ] 

Nicholas Chammas commented on SPARK-45599:
--

cc [~dongjoon] - This is an old correctness bug with a concise reproduction.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e

[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-45599:
-
Labels: correctness  (was: data-corruption)

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.169696988

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806148#comment-17806148
 ] 

Nicholas Chammas commented on SPARK-45599:
--

I can confirm that this bug is still present on {{master}} at commit 
[a3266b411723310ec10fc1843ddababc15249db0|https://github.com/apache/spark/tree/a3266b411723310ec10fc1843ddababc15249db0].

With {{numSlices=4}} I get {{-5.924228780007003E136}} and with {{numSlices=1}} 
I get {{-4.739483957565084E136}}.

Updating the label on this issue. I will also ping some committers to bring 
this bug to their attention, as correctness bugs are taken very seriously.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.45014771701

[jira] [Updated] (SPARK-46395) Assign Spark configs to groups for use in documentation

2024-01-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46395:
-
Summary: Assign Spark configs to groups for use in documentation  (was: 
Automatically generate SQL configuration tables for documentation)

> Assign Spark configs to groups for use in documentation
> ---
>
> Key: SPARK-46395
> URL: https://issues.apache.org/jira/browse/SPARK-46395
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.5.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46668) Parallelize Sphinx build of Python API docs

2024-01-10 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46668:


 Summary: Parallelize Sphinx build of Python API docs
 Key: SPARK-46668
 URL: https://issues.apache.org/jira/browse/SPARK-46668
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Install Ruby 3 to build the docs

2024-01-10 Thread Nicholas Chammas
Just a quick heads up that, while Ruby 2.7 will continue to work, you should 
plan to install Ruby 3 in the near future in order to build the docs. (I 
recommend using rbenv  to manage multiple Ruby 
versions.)

Ruby 2 reached EOL in March 2023 
. We will be 
unable to upgrade some of our Ruby dependencies to their latest versions until 
we are using Ruby 3.

This is not a problem today but will likely become a problem in the near future.

For more details, please refer to this pull request 
.

Best,
Nick



[jira] [Created] (SPARK-46658) Loosen Ruby dependency specs for doc build

2024-01-10 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46658:


 Summary: Loosen Ruby dependency specs for doc build
 Key: SPARK-46658
 URL: https://issues.apache.org/jira/browse/SPARK-46658
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46437) Enable conditional includes in Jekyll documentation

2024-01-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46437:
-
Component/s: (was: SQL)

> Enable conditional includes in Jekyll documentation
> ---
>
> Key: SPARK-46437
> URL: https://issues.apache.org/jira/browse/SPARK-46437
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.5.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46437) Enable conditional includes in Jekyll documentation

2024-01-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46437:
-
Summary: Enable conditional includes in Jekyll documentation  (was: Remove 
unnecessary cruft from SQL built-in functions docs)

> Enable conditional includes in Jekyll documentation
> ---
>
> Key: SPARK-46437
> URL: https://issues.apache.org/jira/browse/SPARK-46437
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.5.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46626) Bump jekyll version to support Ruby 3.3

2024-01-08 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-46626:


 Summary: Bump jekyll version to support Ruby 3.3
 Key: SPARK-46626
 URL: https://issues.apache.org/jira/browse/SPARK-46626
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[grpc-io] Re: QUEUE_TIMEOUT on one network but not another

2024-01-05 Thread Nicholas Chammas
The formatting came out weird on my original post (especially for the code 
blocks), so here it is again as a GitHub 
gist: https://gist.github.com/nchammas/5eb46cbbcc8f5fc197cefbc2b0add819

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to grpc-io+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/9a80bd88-43fe-4f19-ac95-5e68cc10b720n%40googlegroups.com.


[grpc-io] QUEUE_TIMEOUT on one network but not another

2024-01-04 Thread Nicholas Chammas


I have a very simple Spark Connect Python script that is showing very strange 
behavior that I am having trouble 
debugging. Spark Connect uses gRPC under the hood to communicate with a 
remote Apache Spark cluster. The cluster I’m pointed at is on a private 
network, so sharing the script is probably not helpful since you won’t be 
able to run it.

The gist of the issue is that this script connects to the cluster and runs 
a simple test query. (Think SELECT 1 against a database.)

When I run the script from my home network, it takes 20-30 seconds to run. 
When I tether my workstation to my phone and run the same script, it takes 
1-2 seconds. So the issue is somehow connected to my home network.

I suspected DNS at first, but switching between Cloudflare, Quad9, and 
Google DNS all yield the same result on my home network.

I profiled my script using cProfile and the call that came up as taking all 
the time is this one:
method 'next_event' of 'grpc._cython.cygrpc.SegregatedCall' 

This call is made via this Spark Connect method.
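
(For context, the profile above came from something along these lines — a 
sketch, where main() stands in for the script body:)

import cProfile
import pstats

cProfile.run("main()", "profile.out")   # main() runs the Spark Connect query
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)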

So I enabled gRPC tracing as described in the troubleshooting guide 
 to dig 
further:
GRPC_VERBOSITY=debug GRPC_TRACE=all python -u test.py &> home-network.txt 
GRPC_VERBOSITY=debug GRPC_TRACE=all python -u test.py &> lte-network.txt 

I would share the full output but it’s a lot of noise mixed with some 
secrets and keys that I would have to redact. But I did find what I believe 
to be the key difference between these two traces.

home-network.txt has a bunch of lines that lte-network.txt doesn’t have, 
which look like they are related to a network issue:
I0104 16:17:04.205512000 140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403024, tv_nsec: 40551, clock_type: 1 }, reserved=0x0) 
D0104 16:17:04.249492000 140704379816832 grpc_ares_wrapper.cc:366] (c-ares 
resolver) request:0x7fb63e1d38f0 readable on c-ares fd: 8 D0104 
16:17:04.249705000 140704379816832 grpc_ares_wrapper.cc:670] (c-ares 
resolver) request:0x7fb63e1d38f0 on_hostbyname_done_locked qtype=A 
host=dbc-REDACTED.cloud.databricks.com ARES_SUCCESS D0104 
16:17:04.249733000 140704379816832 grpc_ares_wrapper.cc:712] (c-ares 
resolver) request:0x7fb63e1d38f0 c-ares resolver gets a AF_INET result: 
addr: 44.REDACTED port: 443 D0104 16:17:04.249763000 140704379816832 
grpc_ares_wrapper.cc:193] (c-ares resolver) request:0x7fb63e1d38f0 Ref 
ev_driver 0x7fb63e1fcd60 D0104 16:17:04.249781000 140704379816832 
grpc_ares_wrapper.cc:449] (c-ares resolver) request:0x7fb63e1d38f0 notify 
read on: c-ares fd: 8 D0104 16:17:04.249794000 140704379816832 
grpc_ares_wrapper.cc:204] (c-ares resolver) request:0x7fb63e1d38f0 Unref 
ev_driver 0x7fb63e1fcd60 D0104 16:17:04.26775 140704379816832 
grpc_ares_wrapper.cc:366] (c-ares resolver) request:0x7fb63e1d38f0 readable 
on c-ares fd: 8 D0104 16:17:04.267957000 140704379816832 
grpc_ares_wrapper.cc:193] (c-ares resolver) request:0x7fb63e1d38f0 Ref 
ev_driver 0x7fb63e1fcd60 D0104 16:17:04.268002000 140704379816832 
grpc_ares_wrapper.cc:449] (c-ares resolver) request:0x7fb63e1d38f0 notify 
read on: c-ares fd: 8 D0104 16:17:04.268026000 140704379816832 
grpc_ares_wrapper.cc:204] (c-ares resolver) request:0x7fb63e1d38f0 Unref 
ev_driver 0x7fb63e1fcd60 # NOTE: Both traces are more or less identical up 
to this point. # The following part, however, is specific to 
home-network.txt. I0104 16:17:04.407392000 140704379816832 
completion_queue.cc:1069] RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 
16:17:04.407523000 140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403024, tv_nsec: 607516000, clock_type: 1 }, reserved=0x0) 
I0104 16:17:04.609106000 140704379816832 completion_queue.cc:1069] 
RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 16:17:04.609209000 
140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403024, tv_nsec: 809195000, clock_type: 1 }, reserved=0x0) 
I0104 16:17:04.809394000 140704379816832 completion_queue.cc:1069] 
RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 16:17:04.80943 
140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403025, tv_nsec: 9426000, clock_type: 1 }, reserved=0x0) I0104 
16:17:05.011529000 140704379816832 completion_queue.cc:1069] 
RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 16:17:05.011597000 
140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403025, tv_nsec: 211592000, clock_type: 1 }, reserved=0x0) 
I0104 16:17:05.201301000 123145868943360 

[jira] [Updated] (SPARK-46449) Add ability to create databases/schemas via Catalog API

2023-12-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46449:
-
Summary: Add ability to create databases/schemas via Catalog API  (was: Add 
ability to create databases via Catalog API)

> Add ability to create databases/schemas via Catalog API
> ---
>
> Key: SPARK-46449
> URL: https://issues.apache.org/jira/browse/SPARK-46449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> As of Spark 3.5, the only way to create a database is via SQL. The Catalog 
> API should offer an equivalent.
> Perhaps something like:
> {code:python}
> spark.catalog.createDatabase(
> name: str,
> existsOk: bool = False,
> comment: str = None,
> location: str = None,
> properties: dict = None,
> )
> {code}
> If {{schema}} is the preferred terminology, then we should use that instead 
> of {{database}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46449) Add ability to create databases via Catalog API

2023-12-28 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46449:
-
Description: 
As of Spark 3.5, the only way to create a database is via SQL. The Catalog API 
should offer an equivalent.

Perhaps something like:
{code:python}
spark.catalog.createDatabase(
name: str,
existsOk: bool = False,
comment: str = None,
location: str = None,
properties: dict = None,
)
{code}

If {{schema}} is the preferred terminology, then we should use that instead of 
{{database}}.
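
For comparison, the SQL-only route that exists today looks like the following 
(a sketch; the {{createDatabase}} call above is the proposed API and does not 
exist yet):

{code:python}
# Today: DDL via SQL is the only option.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS my_db
    COMMENT 'example database'
    LOCATION '/tmp/my_db'
    WITH DBPROPERTIES ('owner' = 'nick')
""")

# Proposed equivalent (hypothetical):
# spark.catalog.createDatabase(
#     "my_db",
#     existsOk=True,
#     comment="example database",
#     location="/tmp/my_db",
#     properties={"owner": "nick"},
# )
{code}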

  was:
As of Spark 3.5, the only way to create a database is via SQL. The Catalog API 
should offer an equivalent.

Perhaps something like:
{code:python}
spark.catalog.createDatabase(
name: str,
existsOk: bool = False,
comment: str = None,
location: str = None,
properties: dict = None,
)
{code}


> Add ability to create databases via Catalog API
> ---
>
> Key: SPARK-46449
> URL: https://issues.apache.org/jira/browse/SPARK-46449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> As of Spark 3.5, the only way to create a database is via SQL. The Catalog 
> API should offer an equivalent.
> Perhaps something like:
> {code:python}
> spark.catalog.createDatabase(
>     name: str,
>     existsOk: bool = False,
>     comment: str = None,
>     location: str = None,
>     properties: dict = None,
> )
> {code}
> If {{schema}} is the preferred terminology, then we should use that instead 
> of {{database}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will look up the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.
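
For illustration, a minimal sketch of that behavior (assuming Spark 3.4+ for 
the pyspark.errors module; the table name is a placeholder and is assumed not 
to exist):

from pyspark.sql import SparkSession
from pyspark.errors import AnalysisException, ParseException

spark = SparkSession.builder.appName("sql-validation-demo").getOrCreate()

try:
    # Syntactically valid, but the referenced table is assumed absent.
    spark.sql("SELECT * FROM table_that_does_not_exist")
except ParseException as e:
    print(f"Syntax error: {e}")
except AnalysisException as e:
    # Raised at analysis time (e.g. TABLE_OR_VIEW_NOT_FOUND) even though the
    # statement parses cleanly -- spark.sql() is not a parse-only check.
    print(f"Analysis error: {e}")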

This question of validating SQL is already discussed on Stack Overflow 
. You may find some useful tips 
there.

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>       /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM <table_name> WHERE <column_name> = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM <table_name> WHERE <column_name> = some value
> --------------^^^
> 
> Here we only check for syntax errors, not the semantics of the query. We are 
> not validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary
> This method validates syntax but will not catch semantic errors.
> If you need more comprehensive validation, consider using a testing framework 
> and a small dataset.
> For complex queries, using a linter or code analysis tool can help identify 
> potential issues.
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam  > wrote:
>> Hello,
>> Is there a way to validate PySpark SQL for syntax errors only? I cannot 
>> connect to the actual data set to perform this validation. Any help would 
>> be appreciated.
>> 
>> 
>> Thanks
>> Ram


