[jira] [Assigned] (SPARK-47253) Allow LiveEventBus to stop without completely draining the event queue

2024-04-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-47253:
---

Assignee: TakawaAkirayo

> Allow LiveEventBus to stop without completely draining the event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Assignee: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
>
> #Problem statement:
> SparkContext.stop() can hang for a long time in LiveEventBus.stop() when the
> registered listeners are slow.
> #User scenarios:
> We have a centralized service with multiple instances that regularly executes
> users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process, with an account
> defined in the task.
> 2. Instantiate listeners by class name and register them with the
> SparkContext. The JARs containing the listener classes are uploaded to the
> service by the user.
> 3. Prepare resources.
> 4. Run the user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, SparkSession.stop() waits for the LiveEventBus to stop, which
> requires each listener to completely drain the remaining events.
> Since the listeners are implemented by users, we cannot prevent heavy work
> inside a listener on each event. There are cases where a single heavy job has
> over 30,000 tasks, and it can take 30 minutes for a listener to process all of
> the remaining events, because each event acquires a coarse-grained global lock
> and updates internal status in a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the
> server-side perspective, we need a guarantee that the stop operation finishes
> quickly.
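For illustration, here is a minimal Scala sketch of the scenario described above. The listener and object names (SlowStatusListener, ScheduledTaskRunner) are made up for this example and are not from the issue or the PR; the point is only that a user-supplied listener doing heavy, lock-protected work per event makes the final drain in LiveEventBus.stop() arbitrarily slow.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Hypothetical user-supplied listener: every event takes a coarse-grained lock
// and simulates a status update to a remote database, so draining tens of
// thousands of queued events at stop() time can take a very long time.
class SlowStatusListener extends SparkListener {
  private val globalLock = new Object
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = globalLock.synchronized {
    Thread.sleep(50) // stand-in for the remote database update
  }
}

// Hypothetical service-side runner following steps 1-5 above.
object ScheduledTaskRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("user-task").getOrCreate()
    spark.sparkContext.addSparkListener(new SlowStatusListener) // step 2
    spark.sql("SELECT count(*) FROM range(1000000)").collect()  // step 4
    spark.stop() // step 5: blocks until LiveEventBus drains all queued events
  }
}
{code}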



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47253) Allow LiveEventBus to stop without completely draining the event queue

2024-04-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47253.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45367
[https://github.com/apache/spark/pull/45367]

> Allow LiveEventBus to stop without completely draining the event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Assignee: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> #Problem statement:
> SparkContext.stop() can hang for a long time in LiveEventBus.stop() when the
> registered listeners are slow.
> #User scenarios:
> We have a centralized service with multiple instances that regularly executes
> users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process, with an account
> defined in the task.
> 2. Instantiate listeners by class name and register them with the
> SparkContext. The JARs containing the listener classes are uploaded to the
> service by the user.
> 3. Prepare resources.
> 4. Run the user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, SparkSession.stop() waits for the LiveEventBus to stop, which
> requires each listener to completely drain the remaining events.
> Since the listeners are implemented by users, we cannot prevent heavy work
> inside a listener on each event. There are cases where a single heavy job has
> over 30,000 tasks, and it can take 30 minutes for a listener to process all of
> the remaining events, because each event acquires a coarse-grained global lock
> and updates internal status in a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the
> server-side perspective, we need a guarantee that the stop operation finishes
> quickly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47812) Support Serializing Spark Sessions in ForEachBatch

2024-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47812.
--
Resolution: Fixed

Issue resolved by pull request 46002
[https://github.com/apache/spark/pull/46002]

> Support Serializing Spark Sessions in ForEachBatch
> --
>
> Key: SPARK-47812
> URL: https://issues.apache.org/jira/browse/SPARK-47812
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.1
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SparkSessions using Connect should be serialized when used in ForEachBatch 
> and friends.
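For context, a hedged sketch of the kind of usage involved, assuming the Connect Scala client's SparkSession.builder().remote(...) entry point and the (DataFrame, Long) => Unit foreachBatch overload; the endpoint and object name are illustrative and not taken from the PR.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForEachBatchSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a Spark Connect endpoint is running at sc://localhost.
    val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()

    val query = spark.readStream.format("rate").load()
      .writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // The batch function uses a SparkSession reference; under Connect the
        // function is serialized and shipped, so the session it uses must be
        // serializable as well.
        batch.sparkSession.sql(s"SELECT $batchId AS batch_id").show()
      }
      .start()

    query.awaitTermination(10000) // let a few micro-batches run, then stop
    query.stop()
    spark.stop()
  }
}
{code}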



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47812) Support Serializing Spark Sessions in ForEachBatch

2024-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47812:


Assignee: Martin Grund

> Support Serializing Spark Sessions in ForEachBatch
> --
>
> Key: SPARK-47812
> URL: https://issues.apache.org/jira/browse/SPARK-47812
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.1
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SparkSessions using Connect should be serialized when used in ForEachBatch 
> and friends.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47757) Reenable ResourceProfileTests for pyspark-connect

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47757:
---
Labels: pull-request-available  (was: )

> Reenable ResourceProfileTests for pyspark-connect
> -
>
> Key: SPARK-47757
> URL: https://issues.apache.org/jira/browse/SPARK-47757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46574) Upgrade maven plugin to latest version

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46574:
---
Labels: pull-request-available  (was: )

> Upgrade maven plugin to latest version
> --
>
> Key: SPARK-46574
> URL: https://issues.apache.org/jira/browse/SPARK-46574
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47840) Remove foldable propagation across Streaming Aggregate/Join nodes

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47840:
---
Labels: pull-request-available  (was: )

> Remove foldable propagation across Streaming Aggregate/Join nodes
> -
>
> Key: SPARK-47840
> URL: https://issues.apache.org/jira/browse/SPARK-47840
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>
> Streaming queries with a Union of 2 data streams followed by an Aggregate
> (groupBy) can produce incorrect results if the grouping key is a constant
> literal for the micro-batch duration.
> The query produces incorrect results because the query optimizer recognizes
> the literal value in the grouping key as foldable and replaces the grouping
> key expression with the actual literal value. This optimization is correct
> for batch queries. However, streaming queries also read information from the
> StateStore, and the output contains both the results from the StateStore
> (computed in previous micro-batches) and data from the input sources (computed
> in this micro-batch). The HashAggregate node after the StateStore always reads
> the grouping key value as the optimized literal (as the grouping key
> expression is optimized into a literal by the optimizer). This ends up
> replacing keys in the StateStore with the literal value, resulting in
> incorrect output.
> See the example logical and physical plans below for a query performing a
> union on 2 data streams followed by a groupBy. Note that the name#4 expression
> has been optimized to ds1. The streaming query's Aggregate adds a
> StateStoreSave node as a child of HashAggregate; however, any grouping key
> read from the StateStore will still be read as ds1 due to the optimization.
>  
> *Optimized Logical Plan*
> {quote}=== Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
> === Old Plan ===
> WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
> Complete, 0
> +- Aggregate [name#4|#4], [name#4, count(1) AS count#31L|#4, count(1) AS 
> count#31L]
>    +- Project [ds1 AS name#4|#4]
>       +- StreamingDataSourceV2ScanRelation[value#1|#1] MemoryStreamDataSource
> === New Plan ===
> WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
> Complete, 0
> +- Aggregate [ds1], [ds1 AS name#4, count(1) AS count#31L|#4, count(1) AS 
> count#31L]
>    +- Project [ds1 AS name#4|#4]
>       +- StreamingDataSourceV2ScanRelation[value#1|#1] MemoryStreamDataSource
> 
> {quote}
> *Corresponding Physical Plan*
> {quote}WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: 
> org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@2b4c6242],
>  
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3143/1859075634@35709d26
> +- HashAggregate(keys=[ds1#39|#39], functions=[finalmerge_count(merge 
> count#38L) AS count(1)#30L|#38L) AS count(1)#30L], output=[name#4, 
> count#31L|#4, count#31L])
>    +- StateStoreSave [ds1#39|#39], state info [ checkpoint = 
> [file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state|file:///tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state],
>  runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, 
> numPartitions = 5], Complete, 0, 0, 2
>       +- HashAggregate(keys=[ds1#39|#39], functions=[merge_count(merge 
> count#38L) AS count#38L|#38L) AS count#38L], output=[ds1#39, count#38L|#39, 
> count#38L])
>          +- StateStoreRestore [ds1#39|#39], state info [ checkpoint = 
> [file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state|file:///tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state],
>  runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, 
> numPartitions = 5], 2
>             +- HashAggregate(keys=[ds1#39|#39], functions=[merge_count(merge 
> count#38L) AS count#38L|#38L) AS count#38L], output=[ds1#39, count#38L|#39, 
> count#38L])
>                +- HashAggregate(keys=[ds1 AS ds1#39|#39], 
> functions=[partial_count(1) AS count#38L|#38L], output=[ds1#39, 
> count#38L|#39, count#38L])
>                   +- Project
>                      +- MicroBatchScan[value#1|#1] MemoryStreamDataSource
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47839) Fix Aggregate bug in RewriteWithExpression

2024-04-12 Thread Kelvin Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kelvin Jiang updated SPARK-47839:
-
Issue Type: Bug  (was: Improvement)

> Fix Aggregate bug in RewriteWithExpression
> --
>
> Key: SPARK-47839
> URL: https://issues.apache.org/jira/browse/SPARK-47839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> The following query will fail:
> ```
> SELECT NULLIF(id + 1, 1)
> from range(10)
> group by id
> ```
> This is because `NullIf` gets rewritten to `With`, then 
> `RewriteWithExpression` tries to pull `id + 1` out of the aggregate, 
> resulting in an invalid plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47839) Fix Aggregate bug in RewriteWithExpression

2024-04-12 Thread Kelvin Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kelvin Jiang updated SPARK-47839:
-
Description: 
The following query will fail:

{code:SQL}
SELECT NULLIF(id + 1, 1)
from range(10)
group by id
{code}

This is because {{NullIf}} gets rewritten to {{With}}, then 
{{RewriteWithExpression}} tries to pull common expression {{id + 1}} out of the 
aggregate, resulting in an invalid plan.

  was:
The following query will fail:

```
SELECT NULLIF(id + 1, 1)
from range(10)
group by id
```

This is because `NullIf` gets rewritten to `With`, then `RewriteWithExpression` 
tries to pull `id + 1` out of the aggregate, resulting in an invalid plan.


> Fix Aggregate bug in RewriteWithExpression
> --
>
> Key: SPARK-47839
> URL: https://issues.apache.org/jira/browse/SPARK-47839
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> The following query will fail:
> {code:SQL}
> SELECT NULLIF(id + 1, 1)
> from range(10)
> group by id
> {code}
> This is because {{NullIf}} gets rewritten to {{With}}, then 
> {{RewriteWithExpression}} tries to pull common expression {{id + 1}} out of 
> the aggregate, resulting in an invalid plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47840) Remove foldable propagation across Streaming Aggregate/Join nodes

2024-04-12 Thread Bhuwan Sahni (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuwan Sahni updated SPARK-47840:
-
Description: 
Streaming queries with a Union of 2 data streams followed by an Aggregate
(groupBy) can produce incorrect results if the grouping key is a constant
literal for the micro-batch duration.

The query produces incorrect results because the query optimizer recognizes the
literal value in the grouping key as foldable and replaces the grouping key
expression with the actual literal value. This optimization is correct for
batch queries. However, streaming queries also read information from the
StateStore, and the output contains both the results from the StateStore
(computed in previous micro-batches) and data from the input sources (computed
in this micro-batch). The HashAggregate node after the StateStore always reads
the grouping key value as the optimized literal (as the grouping key expression
is optimized into a literal by the optimizer). This ends up replacing keys in
the StateStore with the literal value, resulting in incorrect output.

See the example logical and physical plans below for a query performing a union
on 2 data streams followed by a groupBy. Note that the name#4 expression has
been optimized to ds1. The streaming query's Aggregate adds a StateStoreSave
node as a child of HashAggregate; however, any grouping key read from the
StateStore will still be read as ds1 due to the optimization.

 

*Optimized Logical Plan*
{quote}=== Applying Rule 
org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===

=== Old Plan ===

WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
Complete, 0
+- Aggregate [name#4|#4], [name#4, count(1) AS count#31L|#4, count(1) AS 
count#31L]
   +- Project [ds1 AS name#4|#4]
      +- StreamingDataSourceV2ScanRelation[value#1|#1] MemoryStreamDataSource

=== New Plan ===

WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
Complete, 0
+- Aggregate [ds1], [ds1 AS name#4, count(1) AS count#31L|#4, count(1) AS 
count#31L]
   +- Project [ds1 AS name#4|#4]
      +- StreamingDataSourceV2ScanRelation[value#1|#1] MemoryStreamDataSource


{quote}
*Corresponding Physical Plan*
{quote}WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: 
org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@2b4c6242],
 
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3143/1859075634@35709d26
+- HashAggregate(keys=[ds1#39|#39], functions=[finalmerge_count(merge 
count#38L) AS count(1)#30L|#38L) AS count(1)#30L], output=[name#4, 
count#31L|#4, count#31L])
   +- StateStoreSave [ds1#39|#39], state info [ checkpoint = 
[file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state|file:///tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state],
 runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions 
= 5], Complete, 0, 0, 2
      +- HashAggregate(keys=[ds1#39|#39], functions=[merge_count(merge 
count#38L) AS count#38L|#38L) AS count#38L], output=[ds1#39, count#38L|#39, 
count#38L])
         +- StateStoreRestore [ds1#39|#39], state info [ checkpoint = 
[file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state|file:///tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state],
 runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions 
= 5], 2
            +- HashAggregate(keys=[ds1#39|#39], functions=[merge_count(merge 
count#38L) AS count#38L|#38L) AS count#38L], output=[ds1#39, count#38L|#39, 
count#38L])
               +- HashAggregate(keys=[ds1 AS ds1#39|#39], 
functions=[partial_count(1) AS count#38L|#38L], output=[ds1#39, count#38L|#39, 
count#38L])
                  +- Project
                     +- MicroBatchScan[value#1|#1] MemoryStreamDataSource
{quote}
 

  was:
Streaming queries with Union of 2 data streams followed by an Aggregate 
(groupBy) can produce incorrect results if the grouping key is a constant 
literal for micro-batch duration.

The query produces incorrect results because the query optimizer recognizes the 
literal value in the grouping key as foldable and replaces the grouping key 
expression with the actual literal value. This optimization is correct for 
batch queries. However Streaming queries also read information from StateStore, 
and the output contains both the results from StateStore (computed in previous 
microbatches) and data from input sources (computed in this microbatch). The 
HashAggregate node after StateStore always reads grouping key value as the 
optimized literal (as the grouping key expression is optimized into a literal 
by the optimizer). This ends up replacing keys in StateStore with the literal 
value resulting in incorrect output. 

See an example logical and physical plan below for a query performing a union 
on 2 data streams, followed by a groupBy. Note that the name#4 

[jira] [Updated] (SPARK-47840) Remove foldable propagation across Streaming Aggregate/Join nodes

2024-04-12 Thread Bhuwan Sahni (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuwan Sahni updated SPARK-47840:
-
Description: 
Streaming queries with a Union of 2 data streams followed by an Aggregate
(groupBy) can produce incorrect results if the grouping key is a constant
literal for the micro-batch duration.

The query produces incorrect results because the query optimizer recognizes the
literal value in the grouping key as foldable and replaces the grouping key
expression with the actual literal value. This optimization is correct for
batch queries. However, streaming queries also read information from the
StateStore, and the output contains both the results from the StateStore
(computed in previous micro-batches) and data from the input sources (computed
in this micro-batch). The HashAggregate node after the StateStore always reads
the grouping key value as the optimized literal (as the grouping key expression
is optimized into a literal by the optimizer). This ends up replacing keys in
the StateStore with the literal value, resulting in incorrect output.

See the example logical and physical plans below for a query performing a union
on 2 data streams followed by a groupBy. Note that the name#4 expression has
been optimized to ds1. The streaming query's Aggregate adds a StateStoreSave
node as a child of HashAggregate; however, any grouping key read from the
StateStore will still be read as ds1 due to the optimization.

 

### Optimized Logical Plan
{quote}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation 
===

=== Old Plan ===

WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
Complete, 0
+- Aggregate [name#4|#4], [name#4, count(1) AS count#31L|#4, count(1) AS 
count#31L]
   +- Project [ds1 AS name#4|#4]
      +- StreamingDataSourceV2ScanRelation[value#1|#1] MemoryStreamDataSource

=== New Plan ===

WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
Complete, 0
+- Aggregate [ds1], [ds1 AS name#4, count(1) AS count#31L|#4, count(1) AS 
count#31L]
   +- Project [ds1 AS name#4|#4]
      +- StreamingDataSourceV2ScanRelation[value#1|#1] MemoryStreamDataSource


{quote}




### Corresponding Physical Plan
{quote}
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: 
org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@2b4c6242],
 
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3143/1859075634@35709d26
+- HashAggregate(keys=[ds1#39|#39], functions=[finalmerge_count(merge 
count#38L) AS count(1)#30L|#38L) AS count(1)#30L], output=[name#4, 
count#31L|#4, count#31L])
   +- StateStoreSave [ds1#39|#39], state info [ checkpoint = 
[file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state|file:///tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state],
 runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions 
= 5], Complete, 0, 0, 2
      +- HashAggregate(keys=[ds1#39|#39], functions=[merge_count(merge 
count#38L) AS count#38L|#38L) AS count#38L], output=[ds1#39, count#38L|#39, 
count#38L])
         +- StateStoreRestore [ds1#39|#39], state info [ checkpoint = 
[file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state|file:///tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state],
 runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions 
= 5], 2
            +- HashAggregate(keys=[ds1#39|#39], functions=[merge_count(merge 
count#38L) AS count#38L|#38L) AS count#38L], output=[ds1#39, count#38L|#39, 
count#38L])
               +- HashAggregate(keys=[ds1 AS ds1#39|#39], 
functions=[partial_count(1) AS count#38L|#38L], output=[ds1#39, count#38L|#39, 
count#38L])
                  +- Project
                     +- MicroBatchScan[value#1|#1] MemoryStreamDataSource
{quote}
 

  was:
Streaming queries with Union of 2 data streams followed by an Aggregate 
(groupBy) can produce incorrect results if the grouping key is a constant 
literal for micro-batch duration.

The query produces incorrect results because the query optimizer recognizes the 
literal value in the grouping key as foldable and replaces the grouping key 
expression with the actual literal value. This optimization is correct for 
batch queries. However Streaming queries also read information from StateStore, 
and the output contains both the results from StateStore (computed in previous 
microbatches) and data from input sources (computed in this microbatch). The 
HashAggregate node after StateStore always reads grouping key value as the 
optimized literal (as the grouping key expression is optimized into a literal 
by the optimizer). This ends up replacing keys in StateStore with the literal 
value resulting in incorrect output. 

See an example logical and physical plan below for a query performing a union 
on 2 data streams, followed by a groupBy. Note that the 

[jira] [Created] (SPARK-47840) Remove foldable propagation across Streaming Aggregate/Join nodes

2024-04-12 Thread Bhuwan Sahni (Jira)
Bhuwan Sahni created SPARK-47840:


 Summary: Remove foldable propagation across Streaming 
Aggregate/Join nodes
 Key: SPARK-47840
 URL: https://issues.apache.org/jira/browse/SPARK-47840
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.5.1, 4.0.0
Reporter: Bhuwan Sahni


Streaming queries with a Union of 2 data streams followed by an Aggregate
(groupBy) can produce incorrect results if the grouping key is a constant
literal for the micro-batch duration.

The query produces incorrect results because the query optimizer recognizes the
literal value in the grouping key as foldable and replaces the grouping key
expression with the actual literal value. This optimization is correct for
batch queries. However, streaming queries also read information from the
StateStore, and the output contains both the results from the StateStore
(computed in previous micro-batches) and data from the input sources (computed
in this micro-batch). The HashAggregate node after the StateStore always reads
the grouping key value as the optimized literal (as the grouping key expression
is optimized into a literal by the optimizer). This ends up replacing keys in
the StateStore with the literal value, resulting in incorrect output.

See the example logical and physical plans below for a query performing a union
on 2 data streams followed by a groupBy. Note that the name#4 expression has
been optimized to ds1. The streaming query's Aggregate adds a StateStoreSave
node as a child of HashAggregate; however, any grouping key read from the
StateStore will still be read as ds1 due to the optimization.

### Optimized Logical Plan

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation 
===

=== Old Plan ===

WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
Complete, 0
+- Aggregate [name#4], [name#4, count(1) AS count#31L]
   +- Project [ds1 AS name#4]
      +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource


=== New Plan ===

WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, 
Complete, 0
+- Aggregate [ds1], [ds1 AS name#4, count(1) AS count#31L]
   +- Project [ds1 AS name#4]
      +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource



```


### Corresponding Physical Plan

```
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: 
org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@2b4c6242],
 
org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3143/1859075634@35709d26
+- HashAggregate(keys=[ds1#39], functions=[finalmerge_count(merge count#38L) AS 
count(1)#30L], output=[name#4, count#31L])
   +- StateStoreSave [ds1#39], state info [ checkpoint = 
file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state, runId 
= 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions = 5], 
Complete, 0, 0, 2
      +- HashAggregate(keys=[ds1#39], functions=[merge_count(merge count#38L) 
AS count#38L], output=[ds1#39, count#38L])
         +- StateStoreRestore [ds1#39], state info [ checkpoint = 
file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state, runId 
= 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions = 5], 2
            +- HashAggregate(keys=[ds1#39], functions=[merge_count(merge 
count#38L) AS count#38L], output=[ds1#39, count#38L])
               +- HashAggregate(keys=[ds1 AS ds1#39], 
functions=[partial_count(1) AS count#38L], output=[ds1#39, count#38L])
                  +- Project
                     +- MicroBatchScan[value#1] MemoryStreamDataSource

```
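For reference, a minimal Scala reproduction sketch of the scenario above: two memory streams are tagged with constant literals, unioned, and grouped by that constant column, with data arriving on only one stream per micro-batch. The object and query names are illustrative; this is not the test added by the PR.

{code:scala}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{count, lit}

object FoldableGroupingKeyRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("SPARK-47840-repro").getOrCreate()
    import spark.implicits._
    implicit val sqlCtx: SQLContext = spark.sqlContext

    val input1 = MemoryStream[Int]
    val input2 = MemoryStream[Int]
    // Each branch tags its rows with a constant literal name ("ds1" / "ds2").
    val df1 = input1.toDF().select(lit("ds1").as("name"), $"value")
    val df2 = input2.toDF().select(lit("ds2").as("name"), $"value")

    // Group by the constant-per-branch column; within a micro-batch where only
    // one branch has data, the grouping key is a foldable literal.
    val counts = df1.union(df2).groupBy($"name").agg(count(lit(1)).as("count"))

    val query = counts.writeStream
      .format("memory").queryName("out").outputMode("complete").start()

    input1.addData(1, 2, 3)      // micro-batch 1: only ds1 has data
    query.processAllAvailable()
    input2.addData(4, 5)         // micro-batch 2: only ds2 has data
    query.processAllAvailable()

    spark.sql("SELECT * FROM out ORDER BY name").show()
    query.stop()
    spark.stop()
  }
}
{code}

Before the fix, the optimizer can fold the constant grouping key as shown in the plans above, so keys restored from the state store are rewritten to the literal and the counts come out wrong.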



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47839) Fix Aggregate bug in RewriteWithExpression

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47839:
---
Labels: pull-request-available  (was: )

> Fix Aggregate bug in RewriteWithExpression
> --
>
> Key: SPARK-47839
> URL: https://issues.apache.org/jira/browse/SPARK-47839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kelvin Jiang
>Priority: Major
>  Labels: pull-request-available
>
> The following query will fail:
> ```
> SELECT NULLIF(id + 1, 1)
> from range(10)
> group by id
> ```
> This is because `NullIf` gets rewritten to `With`, then 
> `RewriteWithExpression` tries to pull `id + 1` out of the aggregate, 
> resulting in an invalid plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47839) Fix Aggregate bug in RewriteWithExpression

2024-04-12 Thread Kelvin Jiang (Jira)
Kelvin Jiang created SPARK-47839:


 Summary: Fix Aggregate bug in RewriteWithExpression
 Key: SPARK-47839
 URL: https://issues.apache.org/jira/browse/SPARK-47839
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kelvin Jiang


The following query will fail:

```
SELECT NULLIF(id + 1, 1)
from range(10)
group by id
```

This is because `NullIf` gets rewritten to `With`, then `RewriteWithExpression` 
tries to pull `id + 1` out of the aggregate, resulting in an invalid plan.
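For convenience, the same failing query as a self-contained Scala snippet; the object name is made up and this is only an illustration of the reported repro.

{code:scala}
import org.apache.spark.sql.SparkSession

object NullIfAggregateRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-47839-repro").getOrCreate()
    // NULLIF(id + 1, 1) is rewritten to a With expression; before the fix,
    // RewriteWithExpression pulls `id + 1` out of the Aggregate and the
    // resulting plan is invalid.
    spark.sql("SELECT NULLIF(id + 1, 1) FROM range(10) GROUP BY id").show()
    spark.stop()
  }
}
{code}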



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42369) Fix constructor for java.nio.DirectByteBuffer for Java 21+

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-42369:
---
Labels: pull-request-available  (was: )

> Fix constructor for java.nio.DirectByteBuffer for Java 21+
> --
>
> Key: SPARK-42369
> URL: https://issues.apache.org/jira/browse/SPARK-42369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API
>Affects Versions: 3.5.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the latest JDK, the constructor {{DirectByteBuffer(long, int)}} was 
> replaced with {{{}DirectByteBuffer(long, long){}}}. We just want to support 
> both by probing for the legacy one first and falling back to the newer one 
> second.
> This change is completely transparent for the end-user, and makes sure Spark 
> works transparently on the latest JDK as well.
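A hedged Scala sketch of the probing strategy described above; this is not Spark's actual Platform code, and the object name, method name, and the --add-opens note are assumptions made for the illustration.

{code:scala}
import java.nio.ByteBuffer

object DirectByteBufferCtor {
  private val cls = Class.forName("java.nio.DirectByteBuffer")

  // Probe for the legacy DirectByteBuffer(long, int) constructor first and
  // fall back to the newer DirectByteBuffer(long, long) used on recent JDKs.
  private val ctor =
    try cls.getDeclaredConstructor(java.lang.Long.TYPE, java.lang.Integer.TYPE)
    catch {
      case _: NoSuchMethodException =>
        cls.getDeclaredConstructor(java.lang.Long.TYPE, java.lang.Long.TYPE)
    }
  private val legacySignature = ctor.getParameterTypes()(1) == java.lang.Integer.TYPE
  ctor.setAccessible(true) // may require --add-opens java.base/java.nio=ALL-UNNAMED

  // Wrap an off-heap address as a ByteBuffer using whichever constructor exists.
  def wrap(address: Long, size: Long): ByteBuffer = {
    val sizeArg: AnyRef =
      if (legacySignature) java.lang.Integer.valueOf(size.toInt)
      else java.lang.Long.valueOf(size)
    ctor.newInstance(java.lang.Long.valueOf(address), sizeArg).asInstanceOf[ByteBuffer]
  }
}
{code}

Either way, a caller of DirectByteBufferCtor.wrap(address, size) is unaffected by which JDK constructor is present, which is the transparency the description refers to.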



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} with 
{{set spark.sql.ansi.enabled=true;}} as compared to the default behavior on 
PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW or }}{{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46122) Disable spark.sql.legacy.createHiveTableByDefault by default

2024-04-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836708#comment-17836708
 ] 

Dongjoon Hyun commented on SPARK-46122:
---

Gentle ping, [~yumwang]. You can start a discussion thread for this if you
want.

> Disable spark.sql.legacy.createHiveTableByDefault by default
> 
>
> Key: SPARK-46122
> URL: https://issues.apache.org/jira/browse/SPARK-46122
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836706#comment-17836706
 ] 

Nicholas Chammas commented on SPARK-28024:
--

I've just retried cases 2-4 on master with ANSI mode enabled, and Spark's 
behavior appears to be the same as when I last checked it in February.

I also ran those same cases against PostgreSQL 16. I couldn't replicate the 
output for Case 4, and I believe there was a mistake in the original 
description of that case where the sign was flipped. So I've adjusted the sign 
accordingly and shown Spark and Postgres's behavior side-by-side.

Here is the original Case 4 with the negative sign:

{code:sql}
spark-sql (default)> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200); 
0.
{code}
 
So I don't think there is a problem there. With a positive sign, the behavior 
is different as shown in the ticket description above.

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity
 postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity
 postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}

  was:
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
+
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
>  postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
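A note for readers reproducing Case 1: the wrap-around results above come from Spark's legacy (non-ANSI) evaluation mode. With spark.sql.ansi.enabled=true the same expressions raise an arithmetic overflow error instead of returning wrong values. A minimal sketch, assuming only a local SparkSession (not taken from this ticket):

{code:java}
import org.apache.spark.sql.SparkSession

// Minimal sketch: the overflowing multiplication from Case 1 under legacy vs. ANSI mode.
object OverflowModesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("overflow-sketch").getOrCreate()

    // Legacy mode: the result silently wraps around and returns -2.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT int(2147483647) * int(2) AS product").show()

    // ANSI mode: the same query is expected to fail with an arithmetic overflow error.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    try {
      spark.sql("SELECT int(2147483647) * int(2) AS product").show()
    } catch {
      case e: Exception => println(s"ANSI mode raised: ${e.getMessage}")
    }

    spark.stop()
  }
}
{code}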



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2024-04-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-28024:
-
Description: 
As compared to PostgreSQL 16.

Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}
Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0

postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
 float8 | float8 
--------+--------
  1e-69 | -1e-69 {code}
Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0

postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
double precision);
ERROR:  "10e-400" is out of range for type double precision
LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
                    ^ {code}
Case 4:
{code:sql}
spark-sql (default)> select exp(1.2345678901234E200);
Infinity

postgres=# select exp(1.2345678901234E200);
ERROR:  value overflows numeric format {code}

  was:
For example
Case 1:
{code:sql}
select tinyint(128) * tinyint(2); -- 0
select smallint(2147483647) * smallint(2); -- -2
select int(2147483647) * int(2); -- -2
SELECT smallint((-32768)) * smallint(-1); -- -32768
{code}

Case 2:
{code:sql}
spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
0.0 -0.0
{code}

Case 3:
{code:sql}
spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
0.0 -0.0
{code}

Case 4:
{code:sql}
spark-sql> select exp(-1.2345678901234E200);
0.0

postgres=# select exp(-1.2345678901234E200);
ERROR:  value overflows numeric format
{code}


> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> As compared to PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> --------+--------
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44518) Completely make hive as a data source

2024-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44518:
--
Parent: (was: SPARK-44111)
Issue Type: Improvement  (was: Sub-task)

> Completely make hive as a data source
> -
>
> Key: SPARK-44518
> URL: https://issues.apache.org/jira/browse/SPARK-44518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: He Qi
>Priority: Major
>
> Now, Hive is treated differently from other data sources. Within the Spark 
> project, Hive carries a lot of special-case logic and adds to the maintenance cost. 
> In Presto, by contrast, Hive is just a connector. Is it possible to make Hive 
> a regular data source completely?
> Surely, I know that it's very difficult. There are many historical and 
> compatibility problems. Could we reduce these problems as much as we can if 
> we release 4.0?
> I just want to start a discussion to collect more people's suggestions. Any 
> suggestion is welcome. I just feel 4.0 is a good opportunity to discuss this 
> issue.
> If I am wrong, please feel free to point it out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44518) Completely make hive as a data source

2024-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44518:
--
Target Version/s:   (was: 4.0.0)

> Completely make hive as a data source
> -
>
> Key: SPARK-44518
> URL: https://issues.apache.org/jira/browse/SPARK-44518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: He Qi
>Priority: Major
> Fix For: 4.0.0
>
>
> Now, Hive is treated differently from other data sources. Within the Spark 
> project, Hive carries a lot of special-case logic and adds to the maintenance cost. 
> In Presto, by contrast, Hive is just a connector. Is it possible to make Hive 
> a regular data source completely?
> Surely, I know that it's very difficult. There are many historical and 
> compatibility problems. Could we reduce these problems as much as we can if 
> we release 4.0?
> I just want to start a discussion to collect more people's suggestions. Any 
> suggestion is welcome. I just feel 4.0 is a good opportunity to discuss this 
> issue.
> If I am wrong, please feel free to point it out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44518) Completely make hive as a data source

2024-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44518:
--
Fix Version/s: (was: 4.0.0)

> Completely make hive as a data source
> -
>
> Key: SPARK-44518
> URL: https://issues.apache.org/jira/browse/SPARK-44518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: He Qi
>Priority: Major
>
> Now, Hive is treated differently from other data sources. Within the Spark 
> project, Hive carries a lot of special-case logic and adds to the maintenance cost. 
> In Presto, by contrast, Hive is just a connector. Is it possible to make Hive 
> a regular data source completely?
> Surely, I know that it's very difficult. There are many historical and 
> compatibility problems. Could we reduce these problems as much as we can if 
> we release 4.0?
> I just want to start a discussion to collect more people's suggestions. Any 
> suggestion is welcome. I just feel 4.0 is a good opportunity to discuss this 
> issue.
> If I am wrong, please feel free to point it out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44518) Completely make hive as a data source

2024-04-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836704#comment-17836704
 ] 

Dongjoon Hyun commented on SPARK-44518:
---

In addition, I'll convert this to an independent JIRA because there has been no 
activity so far. Please note that we can bring it back if there is any 
progress before the Apache Spark 4.0.0 deadline.

> Completely make hive as a data source
> -
>
> Key: SPARK-44518
> URL: https://issues.apache.org/jira/browse/SPARK-44518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: He Qi
>Priority: Major
>
> Now, Hive is treated differently from other data sources. Within the Spark 
> project, Hive carries a lot of special-case logic and adds to the maintenance cost. 
> In Presto, by contrast, Hive is just a connector. Is it possible to make Hive 
> a regular data source completely?
> Surely, I know that it's very difficult. There are many historical and 
> compatibility problems. Could we reduce these problems as much as we can if 
> we release 4.0?
> I just want to start a discussion to collect more people's suggestions. Any 
> suggestion is welcome. I just feel 4.0 is a good opportunity to discuss this 
> issue.
> If I am wrong, please feel free to point it out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44518) Completely make hive as a data source

2024-04-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836703#comment-17836703
 ] 

Dongjoon Hyun commented on SPARK-44518:
---

Hi, [~roryqi], this looks like a question rather than a concrete suggestion. 

According to the Apache Spark community policy, I removed the `Target Version` 
from this JIRA.
- https://spark.apache.org/contributing.html

{code}
Do not set the following fields:
- Fix Version. This is assigned by committers only when resolved.
- Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version.
{code}

> Completely make hive as a data source
> -
>
> Key: SPARK-44518
> URL: https://issues.apache.org/jira/browse/SPARK-44518
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: He Qi
>Priority: Major
> Fix For: 4.0.0
>
>
> Now, Hive is treated differently from other data sources. Within the Spark 
> project, Hive carries a lot of special-case logic and adds to the maintenance cost. 
> In Presto, by contrast, Hive is just a connector. Is it possible to make Hive 
> a regular data source completely?
> Surely, I know that it's very difficult. There are many historical and 
> compatibility problems. Could we reduce these problems as much as we can if 
> we release 4.0?
> I just want to start a discussion to collect more people's suggestions. Any 
> suggestion is welcome. I just feel 4.0 is a good opportunity to discuss this 
> issue.
> If I am wrong, please feel free to point it out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44116) Utilize Hadoop vectorized APIs

2024-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44116:
--
Parent: (was: SPARK-44111)
Issue Type: New Feature  (was: Sub-task)

> Utilize Hadoop vectorized APIs
> --
>
> Key: SPARK-44116
> URL: https://issues.apache.org/jira/browse/SPARK-44116
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Hadoop 3.3.5+ supports vectorized APIs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44444) Enabled ANSI mode by default

2024-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-4:
-

Assignee: Dongjoon Hyun

> Enabled ANSI mode by default
> 
>
> Key: SPARK-4
> URL: https://issues.apache.org/jira/browse/SPARK-4
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> To avoid data issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47837) Official Spark Docker Container images are available from DockerHub in 2 locations

2024-04-12 Thread Marty Kandes (Jira)
Marty Kandes created SPARK-47837:


 Summary: Official Spark Docker Container images are available from 
DockerHub in 2 locations
 Key: SPARK-47837
 URL: https://issues.apache.org/jira/browse/SPARK-47837
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.5.1, 3.5.0, 3.4.1
Reporter: Marty Kandes


The downloads section of the project's website [1] provides a link [2] to an 
official set of Docker containers hosted on DockerHub [3]. However, most of 
these containers are quite outdated now, and it appears there is a new 
location where the latest official containers are maintained and distributed on 
DockerHub [4]. It would be nice to fix/update the documentation to guide users 
to the new / recommended location of the official containers maintained or 
endorsed by the project and/or community.

[1] [https://spark.apache.org/downloads.html]

[2] [https://hub.docker.com/r/apache/spark-py/tags]

[3] [https://hub.docker.com/r/apache/spark]

[4] [https://hub.docker.com/_/spark]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47765) Add SET COLLATION to parser rules

2024-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47765.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45946
[https://github.com/apache/spark/pull/45946]

> Add SET COLLATION to parser rules
> -
>
> Key: SPARK-47765
> URL: https://issues.apache.org/jira/browse/SPARK-47765
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47765) Add SET COLLATION to parser rules

2024-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47765:
---

Assignee: Mihailo Milosevic

> Add SET COLLATION to parser rules
> -
>
> Key: SPARK-47765
> URL: https://issues.apache.org/jira/browse/SPARK-47765
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47836) Performance problem with QuantileSummaries

2024-04-12 Thread Tanel Kiis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836606#comment-17836606
 ] 

Tanel Kiis commented on SPARK-47836:


I would be willing to make a PR, but I do not know the right solution.
Migrating from a custom solution to the KllDoublesSketch would be the cleanest, 
but its results are nondeterministic

> Performance problem with QuantileSummaries
> --
>
> Key: SPARK-47836
> URL: https://issues.apache.org/jira/browse/SPARK-47836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Tanel Kiis
>Priority: Major
>
> SPARK-29336 caused a severe performance regression.
> In practice, a partial_aggregate with several approx_percentile calls ran in less 
> than an hour, while the final aggregation after the exchange would have taken over a 
> week.
> A plain percentile ran in about the same time for the first part, and its final 
> aggregate ran very quickly.
> I made a benchmark, and it reveals that the merge operation is very, very 
> slow: 
> https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e
> From my experiments it looks like it is n^2 in the number of partitions 
> (the number of partial aggregations to merge).
> When I reverted the changes made in this PR, the "Only insert" and 
> "Insert & merge" cases were very similar.
> The cause seems to be that compressImmut does not reduce the number of samples 
> almost at all after merges and just keeps iterating over an ever-growing list.
> I was not able to figure out how to fix the issue without just reverting the 
> PR.
> I also added a benchmark with KllDoublesSketch from the Apache DataSketches 
> project and it worked even better than this class did before the PR.
> The only downside is that it is non-deterministic. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47836) Performance problem with QuantileSummaries

2024-04-12 Thread Tanel Kiis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836605#comment-17836605
 ] 

Tanel Kiis commented on SPARK-47836:



{noformat}
QuantileSummaries:                Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------
Only insert                                 168             171           8         6.0         167.6       1.0X
Insert & merge                             6690            6792         143         0.1        6690.3       0.0X
KllFloatsSketch insert                       44              47           6        22.5          44.4       3.8X
KllFloatsSketch Insert & merge               55              57           6        18.3          54.7       3.1X
{noformat}


{code:java}
import scala.util.Random

import org.apache.datasketches.kll.{KllDoublesSketch, KllSketch}

import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
import org.apache.spark.sql.catalyst.util.QuantileSummaries

object QuantileSummariesBenchmark extends BenchmarkBase {

  def test(name: String, numValues: Int): Unit = {
runBenchmark(name) {
  val values = (1 to numValues).map(_ => Random.nextDouble())

  val benchmark = new Benchmark(name, numValues, output = output)
  benchmark.addCase("Only insert") { _: Int =>
var summaries = new QuantileSummaries(
  compressThreshold = QuantileSummaries.defaultCompressThreshold,
  relativeError = QuantileSummaries.defaultRelativeError)

for (value <- values) {
  summaries = summaries.insert(value)
}
summaries = summaries.compress()

println("Median: " + summaries.query(0.5))
  }

  benchmark.addCase("Insert & merge") { _: Int =>
// Insert values in batches of 1000 and merge the summaries.
val summaries = values.grouped(1000).map(vs => {
  var partialSummaries = new QuantileSummaries(
compressThreshold = QuantileSummaries.defaultCompressThreshold,
relativeError = QuantileSummaries.defaultRelativeError)

  for (value <- vs) {
partialSummaries = partialSummaries.insert(value)
  }
  partialSummaries.compress()
}).reduce(_.merge(_))

println("Median: " + summaries.query(0.5))
  }

  benchmark.addCase("KllFloatsSketch insert") { _: Int =>
    // Insert all values into a single KLL sketch.
val summaries = KllDoublesSketch.newHeapInstance(
  KllSketch.getKFromEpsilon(QuantileSummaries.defaultRelativeError, 
true)
)

for (value <- values) {
  summaries.update(value)
}

println("Median: " + summaries.getQuantile(0.5))
  }

  benchmark.addCase("KllFloatsSketch Insert & merge") { _: Int =>
// Insert values in batches of 1000 and merge the summaries.
val summaries = values.grouped(1000).map(vs => {
  val partialSummaries = KllDoublesSketch.newHeapInstance(
KllSketch.getKFromEpsilon(QuantileSummaries.defaultRelativeError, 
true)
  )

  for (value <- vs) {
partialSummaries.update(value)
  }

  partialSummaries
}).reduce((a, b) => {
  a.merge(b)
  a
})

println("Median: " + summaries.getQuantile(0.5))
  }

  benchmark.run()
}
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
test("QuantileSummaries", 1_000_000)
  }
}
{code}


> Performance problem with QuantileSummaries
> --
>
> Key: SPARK-47836
> URL: https://issues.apache.org/jira/browse/SPARK-47836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Tanel Kiis
>Priority: Major
>
> SPARK-29336 caused a severe performance regression.
> In practice, a partial_aggregate with several approx_percentile calls ran in less 
> than an hour, while the final aggregation after the exchange would have taken over a 
> week.
> A plain percentile ran in about the same time for the first part, and its final 
> aggregate ran very quickly.
> I made a benchmark, and it reveals that the merge operation is very, very 
> slow: 
> https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e
> From my experiments it looks like it is n^2 in the number of partitions 
> (the number of partial aggregations to merge).
> When I reverted the changes made in this PR, the "Only insert" and 
> "Insert & merge" cases were very similar.
> The cause seems to be that compressImmut does not reduce the number of samples 
> almost at all after merges and just keeps iterating over an ever-growing list.
> I was not able to figure out how to fix the issue without just reverting the 
> PR.
> I also added a benchmark with KllDoublesSketch from the Apache DataSketches 
> project and it worked even better than 

[jira] [Updated] (SPARK-47836) Performance problem with QuantileSummaries

2024-04-12 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-47836:
---
Description: 
SPARK-29336 caused a severe performance regression.
In practice, a partial_aggregate with several approx_percentile calls ran in less 
than an hour, while the final aggregation after the exchange would have taken over a 
week.
A plain percentile ran in about the same time for the first part, and its final 
aggregate ran very quickly.

I made a benchmark, and it reveals that the merge operation is very, very slow: 
https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e
From my experiments it looks like it is n^2 in the number of partitions 
(the number of partial aggregations to merge).
When I reverted the changes made in this PR, the "Only insert" and "Insert 
& merge" cases were very similar.

The cause seems to be that compressImmut does not reduce the number of samples 
almost at all after merges and just keeps iterating over an ever-growing list.
I was not able to figure out how to fix the issue without just reverting the PR.

I also added a benchmark with KllDoublesSketch from the Apache DataSketches 
project and it worked even better than this class did before the PR.
The only downside is that it is non-deterministic. 

> Performance problem with QuantileSummaries
> --
>
> Key: SPARK-47836
> URL: https://issues.apache.org/jira/browse/SPARK-47836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Tanel Kiis
>Priority: Major
>
> SPARK-29336 caused a severe performance regression.
> In practice, a partial_aggregate with several approx_percentile calls ran in less 
> than an hour, while the final aggregation after the exchange would have taken over a 
> week.
> A plain percentile ran in about the same time for the first part, and its final 
> aggregate ran very quickly.
> I made a benchmark, and it reveals that the merge operation is very, very 
> slow: 
> https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e
> From my experiments it looks like it is n^2 in the number of partitions 
> (the number of partial aggregations to merge).
> When I reverted the changes made in this PR, the "Only insert" and 
> "Insert & merge" cases were very similar.
> The cause seems to be that compressImmut does not reduce the number of samples 
> almost at all after merges and just keeps iterating over an ever-growing list.
> I was not able to figure out how to fix the issue without just reverting the 
> PR.
> I also added a benchmark with KllDoublesSketch from the Apache DataSketches 
> project and it worked even better than this class did before the PR.
> The only downside is that it is non-deterministic. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-29336:
---
Labels: pull-request-available  (was: )

> The implementation of QuantileSummaries.merge  does not guarantee that the 
> relativeError will be respected 
> ---
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Assignee: Guilherme Souza
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient 
> quantile 
> algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
>  in another language, and I was using Spark's as a reference.
> In my analysis, I believe I have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue I've found:
>  
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = values.indices.map(i => (i+1).toDouble / 
> values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
>   val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) 
> => Math.abs(expected - answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% 
> max error. In this scenario, one would expect to observe a maximum error of 
> 10 ranks or less (10% of 100). However, the output I observe is:
>  
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the 
> scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not 
> used to it)
> Interestingly enough, if I change from five to one partition the code works 
> as expected and gives 10 every time. This seems to point to some problem at 
> the [merge 
> logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]
> The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
> the history) suggest the published paper is not clear on how that should be 
> done and, honestly, I was not confident in the current approach either.
> I've found SPARK-21184 that reports the same problem, but it was 
> unfortunately closed with no fix applied.
> In my external implementation I believe I have found a sound way to 
> implement the merge method. [Here is my take in Rust, if 
> relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]
> I'd be really glad to add unit tests and contribute my implementation adapted 
> to Scala.
>  I'd love to hear your opinion on the matter.
> Best regards
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47835) Remove switch for remoteReadNioBufferConversion

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47835:
---
Labels: pull-request-available  (was: )

> Remove switch for remoteReadNioBufferConversion
> ---
>
> Key: SPARK-47835
> URL: https://issues.apache.org/jira/browse/SPARK-47835
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47835) Remove switch for remoteReadNioBufferConversion

2024-04-12 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-47835:
-

 Summary: Remove switch for remoteReadNioBufferConversion
 Key: SPARK-47835
 URL: https://issues.apache.org/jira/browse/SPARK-47835
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 4.0.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-12 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47410:
-
Description: 
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changes introduced herein should simplify the addition of new collation-aware 
operations and ensure consistent testing across the codebase.

 

To further support the addition of collation support for new Spark expressions, 
here are a couple of guidelines to follow:

 

// 1. Collation-aware expression implementation

CollationSupport.java
 * should serve as a static entry point for collation-aware expressions, 
providing custom support
 * for example: collation support is added one Spark expression at a time, each with its corresponding implementation
 * also note that: CollationAwareUTF8String should be used for collation-aware 
UTF8String operations & other utility methods

CollationFactory.java
 * should continue to serve as a static provider for high-level collation 
interface
 * for example: interacting with external ICU components such as Collator, 
StringSearch, etc.
 * also note that: no low-level / expression-specific code should generally be 
found here

UTF8String.java
 * should be largely collation-unaware, and generally be used only as storage, 
nothing else
 * for example: don’t change this class at all (with the only one-time 
exception of: semanticEquals/Compare)
 * also note that: no collation-aware operation implementations (using 
collationId) should be put here

stringExpressions.scala / regexpExpressions.scala / other 
“sql.catalyst.expressions” (for example: Between.scala)
 * should only contain minimal changes in order to re-route collation-aware 
implementations to CollationSupport
 * for example: most changes should be in relation to: adding collationId, 
using correct data types, replacements, etc.
 * also note that: nullSafeEval & doGenCode should likely not introduce extra 
branching based on collationId

 

// 2. Collation-aware expression testing

CollationSuite.scala
 * should be used for testing more general collation concepts
 * for example: collate/collation expressions, collation names, DDL, casting, 
aggregate, shuffle, join, etc.
 * also note that: no extra tests should generally be added for newly supported 
string expressions

CollationSupportSuite.java
 * should be used for expression unit tests, these tests should be as rigorous 
as possible in order to cover various cases
 * for example: unit tests that test collation-aware expression implementation 
for various collations (binary, lowercase, ICU)
 * also note that: these tests should generally be written after adding 
appropriate expression support in CollationSupport.java

CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / 
CollationExpressionSuite.scala
 * should be used for expression end-to-end tests, these tests should only 
cover crucial expression behaviour
 * for example: SQL tests that verify query execution results, expected return 
data types, casting, unsupported collation handling, etc.
 * also note that: these tests should generally be written after enabling 
appropriate expression support in stringExpressions.scala

 

// 3. Closing notes
 * Carefully think about performance implications of newly added custom 
collation-aware expression implementation
 * for example: be very careful with extra string allocations (UTF8Strings -> 
(Java) String -> UTF8Strings, etc.)
 * also note that: some operations introduce very heavy performance penalties 
(we should avoid the ones we can)

 
 * Make sure to test all newly added expressions completely (unit tests, 
end-to-end tests, etc.)
 * for example: consider edge cases, such as: empty strings, uppercase and lowercase 
mix, different byte-length chars, etc.
 * also note that: all similar tests should be uniform & readable and be kept 
in one place for various expressions

 
 * Consider how new expressions interact with the rest of the system (casting; 
collation support level - use correct AbstractStringType, etc.)
 * for example: we should watch out for casting, test it thoroughly, and use 
CollationTypeCasts if needed for new expressions
 * also note that: some expressions won’t support all collation types, so that 
should be clearly reflected in tests and dataTypes
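To make the intended routing concrete, here is a small, hypothetical sketch of the structure described above. None of these names are the real Spark APIs; it uses plain Strings instead of UTF8String and stubs out the ICU branch, purely to illustrate how an expression-level call can dispatch on a collationId through a static entry point.

{code:java}
// Hypothetical sketch only: mirrors the CollationSupport / CollationAwareUTF8String
// split described above. Collation ids, class names and method signatures are
// illustrative assumptions, not the actual Spark code.
object CollationSupportSketch {

  // Illustrative collation ids (a CollationFactory-style concern).
  private val UTF8_BINARY = 0
  private val UTF8_BINARY_LCASE = 1

  // Static entry point, one method per collation-aware Spark expression.
  def contains(left: String, right: String, collationId: Int): Boolean = collationId match {
    case UTF8_BINARY       => left.contains(right)                  // plain binary comparison
    case UTF8_BINARY_LCASE => lowercaseContains(left, right)        // collation-aware helper
    case _                 => icuContains(left, right, collationId) // ICU-backed collations (stubbed)
  }

  // CollationAwareUTF8String-style utility method for lowercase comparison.
  private def lowercaseContains(left: String, right: String): Boolean =
    left.toLowerCase.contains(right.toLowerCase)

  // Stand-in for an ICU StringSearch-based implementation.
  private def icuContains(left: String, right: String, collationId: Int): Boolean =
    throw new UnsupportedOperationException(s"ICU collation $collationId is not implemented in this sketch")
}
{code}

An expression class (e.g. in stringExpressions.scala) would then only carry the collationId and delegate to such an entry point from nullSafeEval/doGenCode, keeping the branching out of the expression itself.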

  was:
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changed introduced herein should simplify addition of new collation-aware 

[jira] [Updated] (SPARK-29089) DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-29089:
---
Labels: pull-request-available  (was: )

> DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when 
> reading large amount of S3 files
> --
>
> Key: SPARK-29089
> URL: https://issues.apache.org/jira/browse/SPARK-29089
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Arwin S Tio
>Assignee: Arwin S Tio
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.1.0
>
>
> When using DataFrameReader#csv to read many S3 files (in my case 300k), I've 
> noticed that it took about an hour for the files to be loaded on the driver.
>  
>  You can see the timestamp difference when the log from InMemoryFileIndex 
> occurs from 7:45 to 8:54:
> {quote}19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
>  19/09/06 07:44:42 INFO SparkContext: Submitted application: 
> LoglineParquetGenerator
>  ...
>  19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered 
> StateStoreCoordinator endpoint
>  19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under: [300K files...]
> {quote}
>  
> A major source of the bottleneck comes from 
> DataSource#checkAndGlobPathIfNecessary, which will [(possibly) glob|#L549] 
> and do a [FileSystem#exists|#L557] on all the paths in a single thread. On 
> S3, these are slow network calls.
> After a discussion on the mailing list [0], it was suggested that an 
> improvement could be to:
>   
>  * have SparkHadoopUtils differentiate between files returned by 
> globStatus(), which therefore exist, and those which it didn't glob for; 
> it will only need to check the latter. 
>  * add parallel execution to the glob and existence checks
>   
> I am currently working on a patch that implements this improvement.
>  [0] 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html]
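A rough sketch of what the suggested improvement could look like (this is not the patch itself; the method shape, thread count, and error handling are assumptions): paths are checked on a small thread pool, and paths that came back from globStatus() are trusted to exist, so only non-globbed paths need an extra exists() call.

{code:java}
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Rough sketch: glob and existence checks fanned out over a thread pool.
object CheckPathsSketch {
  def checkAndGlobPaths(paths: Seq[String], hadoopConf: Configuration, numThreads: Int = 40): Seq[Path] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = paths.map { p =>
        Future {
          val path = new Path(p)
          val fs = path.getFileSystem(hadoopConf)
          val globbed = Option(fs.globStatus(path)).map(_.toSeq.map(_.getPath)).getOrElse(Seq.empty)
          if (globbed.nonEmpty) {
            globbed                       // paths returned by globStatus() are known to exist
          } else if (fs.exists(path)) {
            Seq(path)                     // non-glob path: a single existence check
          } else {
            throw new java.io.FileNotFoundException(s"Path does not exist: $p")
          }
        }
      }
      futures.flatMap(f => Await.result(f, Duration.Inf))
    } finally {
      pool.shutdown()
    }
  }
}
{code}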



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47834) Mark deprecated functions with `@deprecated` in `SQLImplicits`

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47834:
---
Labels: pull-request-available  (was: )

> Mark deprecated functions with `@deprecated` in `SQLImplicits`
> --
>
> Key: SPARK-47834
> URL: https://issues.apache.org/jira/browse/SPARK-47834
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47834) Mark deprecated functions with `@deprecated` in `SQLImplicits`

2024-04-12 Thread Yang Jie (Jira)
Yang Jie created SPARK-47834:


 Summary: Mark deprecated functions with `@deprecated` in 
`SQLImplicits`
 Key: SPARK-47834
 URL: https://issues.apache.org/jira/browse/SPARK-47834
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47833) Supply caller stacktrace for checkAndGlobPathIfNecessary AnalysisException

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47833:
---
Labels: pull-request-available  (was: )

> Supply caller stacktrace for checkAndGlobPathIfNecessary AnalysisException
> ---
>
> Key: SPARK-47833
> URL: https://issues.apache.org/jira/browse/SPARK-47833
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.3
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47833) Supply caller stacktrace for checkAndGlobPathIfNecessary AnalysisException

2024-04-12 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-47833:
-

 Summary: Supply caller stacktrace for checkAndGlobPathIfNecessary 
AnalysisException
 Key: SPARK-47833
 URL: https://issues.apache.org/jira/browse/SPARK-47833
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.1.3
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47832) Fix problematic test in TPC-DS Collations test when ANSI flag is set

2024-04-12 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47832:
--
Summary: Fix problematic test in TPC-DS Collations test when ANSI flag is 
set  (was: Skip problematic test in TPC-DS Collations test when ANSI flag is 
set)

> Fix problematic test in TPC-DS Collations test when ANSI flag is set
> 
>
> Key: SPARK-47832
> URL: https://issues.apache.org/jira/browse/SPARK-47832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> "Build / ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" CI is broken by TPC-DS 
> collations test. Error:
> {code:java}
> [info] - q35-v2.7 *** FAILED *** (2 seconds, 695 milliseconds)
> 3489[info]   java.lang.Exception: Expected "[null f   d   0   
> 1   0.0 0   0   2   1   2.0 2   2   2 
>   1   2.0 2   2
> 3490
> ...
> null  m   m   4   1   4.0 4   4   1   1   
> 1.0 1   1   3   1   3.0 3   3]", but got 
> "[org.apache.spark.sparkexception
> 3589[info] {
> 3590[info]   "errorclass" : "_legacy_error_temp_2250",
> 3591[info]   "messageparameters" : {
> 3592[info] "analyzetblmsg" : " or analyze these tables through: analyze 
> table `spark_catalog`.`tpcds_utf8`.`customer_demographics` compute 
> statistics;.",
> 3593[info] "autobroadcastjointhreshold" : 
> "spark.sql.autobroadcastjointhreshold",
> 3594[info] "drivermemory" : "spark.driver.memory"
> 3595[info]   }
> 3596[info] }]"
> 3597[info] Error using configs:
> 3598[info]   at 
> org.apache.spark.sql.TPCDSCollationQueryTestSuite.$anonfun$runQuery$1(TPCDSCollationQueryTestSuite.scala:228)
> 3599
> ... {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47832) Skip problematic test in TPC-DS Collations test when ANSI flag is set

2024-04-12 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47832:
-

 Summary: Skip problematic test in TPC-DS Collations test when ANSI 
flag is set
 Key: SPARK-47832
 URL: https://issues.apache.org/jira/browse/SPARK-47832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Nikola Mandic
 Fix For: 4.0.0


"Build / ANSI (master, Hadoop 3, JDK 17, Scala 2.13)" CI is broken by TPC-DS 
collations test. Error:
{code:java}
[info] - q35-v2.7 *** FAILED *** (2 seconds, 695 milliseconds)
3489[info]   java.lang.Exception: Expected "[null   f   d   0   
1   0.0 0   0   2   1   2.0 2   2   2   
1   2.0 2   2
3490
...
nullm   m   4   1   4.0 4   4   1   1   
1.0 1   1   3   1   3.0 3   3]", but got 
"[org.apache.spark.sparkexception
3589[info] {
3590[info]   "errorclass" : "_legacy_error_temp_2250",
3591[info]   "messageparameters" : {
3592[info] "analyzetblmsg" : " or analyze these tables through: analyze 
table `spark_catalog`.`tpcds_utf8`.`customer_demographics` compute 
statistics;.",
3593[info] "autobroadcastjointhreshold" : 
"spark.sql.autobroadcastjointhreshold",
3594[info] "drivermemory" : "spark.driver.memory"
3595[info]   }
3596[info] }]"
3597[info] Error using configs:
3598[info]   at 
org.apache.spark.sql.TPCDSCollationQueryTestSuite.$anonfun$runQuery$1(TPCDSCollationQueryTestSuite.scala:228)
3599
... {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47831) Run Pandas API on Spark for pyspark-connect package

2024-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47831.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46001
[https://github.com/apache/spark/pull/46001]

> Run Pandas API on Spark for pyspark-connect package
> ---
>
> Key: SPARK-47831
> URL: https://issues.apache.org/jira/browse/SPARK-47831
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47831) Run Pandas API on Spark for pyspark-connect package

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47831:
---
Labels: pull-request-available  (was: )

> Run Pandas API on Spark for pyspark-connect package
> ---
>
> Key: SPARK-47831
> URL: https://issues.apache.org/jira/browse/SPARK-47831
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47831) Run Pandas API on Spark for pyspark-connect package

2024-04-12 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47831:


 Summary: Run Pandas API on Spark for pyspark-connect package
 Key: SPARK-47831
 URL: https://issues.apache.org/jira/browse/SPARK-47831
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47816) Document the lazy evaluation of views in spark.{sql, table}

2024-04-12 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-47816:
-

Assignee: Ruifeng Zheng

> Document the lazy evaluation of views in spark.{sql, table}
> ---
>
> Key: SPARK-47816
> URL: https://issues.apache.org/jira/browse/SPARK-47816
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47816) Document the lazy evaluation of views in spark.{sql, table}

2024-04-12 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-47816.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46007
[https://github.com/apache/spark/pull/46007]

> Document the lazy evaluation of views in spark.{sql, table}
> ---
>
> Key: SPARK-47816
> URL: https://issues.apache.org/jira/browse/SPARK-47816
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47819:
---
Labels: pull-request-available  (was: )

> Use asynchronous callback for execution cleanup
> ---
>
> Key: SPARK-47819
> URL: https://issues.apache.org/jira/browse/SPARK-47819
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Expired sessions are regularly checked and cleaned up by a maintenance 
> thread. However, currently, this process is synchronous. Therefore, in rare 
> cases, interrupting the execution thread of a query in a session can take 
> hours, causing the entire maintenance process to stall, resulting in a large 
> amount of memory not being cleared.
> We address this by introducing asynchronous callbacks for execution cleanup, 
> avoiding synchronous joins of execution threads, and preventing the 
> maintenance thread from stalling in the above scenarios. To be more specific, 
> instead of calling {{runner.join()}} in ExecutorHolder.close(), we set a 
> post-cleanup function as the callback through 
> {{runner.processOnCompletion}}, which will be called asynchronously once 
> the execution runner is completed or interrupted. In this way, the 
> maintenance thread won't get blocked on {{join}}ing an execution thread.
>  
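A simplified, generic sketch of the callback pattern described above (names here are invented for illustration and are not the Spark Connect classes): close() registers a cleanup action and returns immediately instead of joining the execution thread.

{code:java}
import java.util.concurrent.atomic.AtomicReference

// Generic sketch of the pattern: the runner invokes a registered callback when it
// finishes or is interrupted, so nobody has to block on join().
class RunnerSketch(work: () => Unit) {
  private val onCompletion = new AtomicReference[() => Unit](() => ())

  private val thread = new Thread(() => {
    try work()
    finally onCompletion.get().apply() // fires on normal completion or interruption
  })

  def start(): Unit = thread.start()
  def interrupt(): Unit = thread.interrupt()

  // Same spirit as processOnCompletion: register the post-cleanup step.
  def processOnCompletion(callback: () => Unit): Unit = onCompletion.set(callback)
}

class ExecuteHolderSketch(runner: RunnerSketch) {
  // Before: close() would join() the runner and could block for hours.
  // After: close() only registers the cleanup and returns, so the maintenance
  // thread that calls it never stalls on a long-running execution.
  def close(cleanup: () => Unit): Unit = runner.processOnCompletion(cleanup)
}
{code}

A real implementation also has to handle the race where the runner already finished before the callback is registered; that detail is omitted here for brevity.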



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-12 Thread Vladimir Golubev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836505#comment-17836505
 ] 

Vladimir Golubev commented on SPARK-47418:
--

I'll work on that.

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Implement {*}contains{*}, {*}startsWith{*}, and *endsWith* built-in string 
> Spark functions using optimized lowercase comparison approach introduced by 
> [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to 
> the latest design and code structure imposed by [~uros-db] in 
> https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation 
> support is introduced for Spark SQL expressions. In addition, review previous 
> Jira tickets under the current parent in order to understand how 
> *StringPredicate* expressions are currently used and tested in Spark:
>  * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
>  * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
>  * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]
> These tickets should help you understand what changes were introduced in 
> order to enable collation support for these functions. Lastly, feel free to 
> use your chosen Spark SQL Editor to play around with the existing functions 
> and learn more about how they work.
>  
> The goal for this Jira ticket is to improve the UTF8_BINARY_LCASE 
> implementation for the {*}contains{*}, {*}startsWith{*}, and *endsWith* 
> functions so that they use the optimized lowercase comparison approach (following 
> the general logic in Nikola's PR), and benchmark the results accordingly. As 
> for testing, the currently existing unit test cases and end-to-end tests 
> should already fully cover the expected behaviour of *StringPredicate* 
> expressions for all collation types. In other words, the objective of this 
> ticket is only to enhance the internal implementation, without introducing 
> any user-facing changes to Spark SQL API.
>  
> Finally, feel free to refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
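As a rough illustration of what an optimized lowercase comparison can look like (assumptions only, not the referenced PR): instead of lowercasing and copying both strings up front, compare characters lowercased on the fly, so no intermediate strings are allocated. The sketch below works on java.lang.String and Character.toLowerCase for readability; the real implementation operates on UTF8String bytes and full Unicode case mapping.

{code:java}
// Illustrative only: allocation-free lowercase versions of contains/startsWith/endsWith.
object LowercasePredicatesSketch {

  // Compares right against left starting at position pos, lowercasing on the fly.
  private def matchesAt(left: String, right: String, pos: Int): Boolean = {
    var i = 0
    while (i < right.length) {
      if (Character.toLowerCase(left.charAt(pos + i)) != Character.toLowerCase(right.charAt(i))) {
        return false
      }
      i += 1
    }
    true
  }

  def lowercaseStartsWith(left: String, right: String): Boolean =
    right.length <= left.length && matchesAt(left, right, 0)

  def lowercaseEndsWith(left: String, right: String): Boolean =
    right.length <= left.length && matchesAt(left, right, left.length - right.length)

  def lowercaseContains(left: String, right: String): Boolean = {
    if (right.isEmpty) return true
    var start = 0
    while (start + right.length <= left.length) {
      if (matchesAt(left, right, start)) return true
      start += 1
    }
    false
  }
}
{code}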



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47830) Reenable MemoryProfilerParityTests for pyspark-connect

2024-04-12 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47830:


 Summary: Reenable MemoryProfilerParityTests for pyspark-connect
 Key: SPARK-47830
 URL: https://issues.apache.org/jira/browse/SPARK-47830
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47733) Add operational metrics for TWS operators

2024-04-12 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47733:


Assignee: Anish Shrigondekar  (was: Jing Zhan)

> Add operational metrics for TWS operators
> -
>
> Key: SPARK-47733
> URL: https://issues.apache.org/jira/browse/SPARK-47733
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Add metrics to improve observability for the newly added operator 
> TransformWithState and some changes we've made to RocksDB.
> Proposed metrics to add:
>  * on the RocksDB StateStore metrics side, we will add the following:
>  ** num external col families
>  ** num internal col families
>  * on the operator side, we will add the following:
>  ** number of state vars
>  ** count of state vars by type
>  ** output mode
>  ** timeout mode
>  ** registered timers in batch
>  ** expired timers in batch
>  ** initial state enabled or not
>  ** number of state vars removed in batch



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47733) Add operational metrics for TWS operators

2024-04-12 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47733:


Assignee: Jing Zhan

> Add operational metrics for TWS operators
> -
>
> Key: SPARK-47733
> URL: https://issues.apache.org/jira/browse/SPARK-47733
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Assignee: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
>
> Add metrics to improve observability for the newly added operator 
> TransformWithState and some changes we've made to RocksDB.
> Proposed metrics to add:
>  * on the RocksDB StateStore metrics side, we will add the following:
>  ** num external col families
>  ** num internal col families
>  * on the operator side, we will add the following:
>  ** number of state vars
>  ** count of state vars by type
>  ** output mode
>  ** timeout mode
>  ** registered timers in batch
>  ** expired timers in batch
>  ** initial state enabled or not
>  ** number of state vars removed in batch



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47829) Text Datasource supports zstd compression codec

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47829:
---
Labels: pull-request-available  (was: )

> Text Datasource supports zstd compression codec
> ---
>
> Key: SPARK-47829
> URL: https://issues.apache.org/jira/browse/SPARK-47829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: xorsum
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {code:java}
> drop table if exists tmp_compressed_datasource;
> CREATE TABLE tmp_compressed_datasource (key STRING) USING TEXT OPTIONS (compression 'zstd');
> INSERT INTO TABLE tmp_compressed_datasource values ('a'),('b'),('c');
> select * from tmp_compressed_datasource;
> {code}
> {code:java}
> java.lang.IllegalArgumentException: Codec [zstd] is not available. Known 
> codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.
>     at 
> org.apache.spark.sql.catalyst.util.CompressionCodecs$.getCodecClassName(CompressionCodecs.scala:53)
>     at 
> org.apache.spark.sql.execution.datasources.text.TextOptions.$anonfun$compressionCodec$1(TextOptions.scala:38)
>     at scala.Option.map(Option.scala:230)
>     at 
> org.apache.spark.sql.execution.datasources.text.TextOptions.(TextOptions.scala:38)
>     at 
> org.apache.spark.sql.execution.datasources.text.TextOptions.(TextOptions.scala:33)
>     at 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:72)
>     at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:123)
>     at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
> {code}
>  
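One possible direction, sketched purely for illustration: extend the short-name table 
that the text datasource consults so that "zstd" resolves to Hadoop's ZStandardCodec. 
The map and method below only mirror the names visible in the stack trace above; treat 
the exact field, wiring, and error text as assumptions rather than the actual patch.

{code:scala}
import org.apache.hadoop.io.compress._

// Illustrative short-name -> codec-class table; the "zstd" entry is the
// assumed addition, alongside the codecs the error message already lists.
val shortCompressionCodecNames: Map[String, String] = Map(
  "bzip2"   -> classOf[BZip2Codec].getName,
  "deflate" -> classOf[DeflateCodec].getName,
  "gzip"    -> classOf[GzipCodec].getName,
  "lz4"     -> classOf[Lz4Codec].getName,
  "snappy"  -> classOf[SnappyCodec].getName,
  "zstd"    -> classOf[ZStandardCodec].getName) // new entry in this sketch

// Lookup performed when OPTIONS (compression 'zstd') is parsed.
def getCodecClassName(name: String): String =
  shortCompressionCodecNames.getOrElse(
    name.toLowerCase,
    throw new IllegalArgumentException(s"Codec [$name] is not available."))
{code}

With such an entry in place, the CREATE TABLE ... OPTIONS (compression 'zstd') statement 
above would resolve a codec instead of failing.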



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47829) Text Datasource supports zstd compression codec

2024-04-12 Thread xorsum (Jira)
xorsum created SPARK-47829:
--

 Summary: Text Datasource supports zstd compression codec
 Key: SPARK-47829
 URL: https://issues.apache.org/jira/browse/SPARK-47829
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: xorsum





{code:java}
drop table if exists tmp_compressed_datasource;
CREATE TABLE tmp_compressed_datasource (key STRING) USING TEXT OPTIONS (compression 'zstd');
INSERT INTO TABLE tmp_compressed_datasource values ('a'),('b'),('c');
select * from tmp_compressed_datasource;
{code}



{code:java}
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs 
are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.
    at 
org.apache.spark.sql.catalyst.util.CompressionCodecs$.getCodecClassName(CompressionCodecs.scala:53)
    at 
org.apache.spark.sql.execution.datasources.text.TextOptions.$anonfun$compressionCodec$1(TextOptions.scala:38)
    at scala.Option.map(Option.scala:230)
    at 
org.apache.spark.sql.execution.datasources.text.TextOptions.(TextOptions.scala:38)
    at 
org.apache.spark.sql.execution.datasources.text.TextOptions.(TextOptions.scala:33)
    at 
org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:72)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:123)
    at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)

{code}

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43861) History Server deletes event log of running Spark application

2024-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43861:
---
Labels: pull-request-available  (was: )

> History Server deletes event log of running Spark application
> 
>
> Key: SPARK-43861
> URL: https://issues.apache.org/jira/browse/SPARK-43861
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: shengkui leng
>Priority: Minor
>  Labels: pull-request-available
>
> When the Spark history cleaner is enabled:
> {noformat}
> spark.history.fs.cleaner.enabled=true
> spark.history.fs.cleaner.interval=1d
> spark.history.fs.cleaner.maxAge=7d
> spark.history.fs.cleaner.maxNum=10
> {noformat}
> Spark will delete the history of running Spark applications; the following are 
> some messages picked from the Spark log:
> {noformat}
> 153 23/05/26 14:19:43 INFO FsHistoryProvider: Parsing 
> /tmp/spark-events/local-1685081895370 to re-build UI...
> 154 23/05/26 14:19:43 INFO FsHistoryProvider: Finished parsing 
> /tmp/spark-events/local-1685081895370
> 155 23/05/28 13:09:53 INFO FsHistoryProvider: Deleting expired event log for 
> app-20230524141002-
> 156 23/05/29 13:09:53 INFO FsHistoryProvider: Deleting expired event log for 
> local-1684999095807
> 157 23/05/29 13:09:53 INFO FsHistoryProvider: Deleting expired event log for 
> app-20230524171730-.inprogress
> 158 23/05/29 13:34:24 INFO ApplicationCache: Failed to load application 
> attempt app-20230524171730-/
> {noformat}
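
A minimal sketch of the kind of guard the fix implies, assuming a hypothetical helper 
that picks expiry candidates; the expiredCandidates name and the ".inprogress" check 
below are illustrative and are not the actual FsHistoryProvider code.

{code:scala}
import java.io.File

// Illustrative filter: never treat a still-running application's event log
// (conventionally written with an ".inprogress" suffix) as expired,
// regardless of how old its last modification time is.
def expiredCandidates(logs: Seq[File], maxAgeMs: Long, nowMs: Long): Seq[File] = {
  def isInProgress(f: File): Boolean = f.getName.endsWith(".inprogress")
  logs.filter(f => !isInProgress(f) && nowMs - f.lastModified() > maxAgeMs)
}
{code}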



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47800) Add method for converting v2 identifier to table identifier

2024-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47800.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45985
[https://github.com/apache/spark/pull/45985]

> Add method for converting v2 identifier to table identifier
> ---
>
> Key: SPARK-47800
> URL: https://issues.apache.org/jira/browse/SPARK-47800
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Move the conversion of a v2 identifier object to a v1 table identifier into a new method.
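
For context, a minimal sketch of what such a helper could look like, assuming the usual 
v2 Identifier layout of a namespace array plus a name; the method name and the 
single-level-namespace restriction are assumptions, not the merged change.

{code:scala}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.connector.catalog.Identifier

// Illustrative conversion: a v2 identifier with at most one namespace level
// maps onto a v1 TableIdentifier(table, Option(database)).
def toV1TableIdentifier(ident: Identifier): TableIdentifier =
  ident.namespace() match {
    case Array()   => TableIdentifier(ident.name())
    case Array(db) => TableIdentifier(ident.name(), Some(db))
    case ns        => throw new IllegalArgumentException(
      s"Cannot convert multi-part namespace ${ns.mkString(".")} to a v1 identifier")
  }
{code}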



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47800) Add method for converting v2 identifier to table identifier

2024-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47800:
---

Assignee: Uros Stankovic

> Add method for converting v2 identifier to table identifier
> ---
>
> Key: SPARK-47800
> URL: https://issues.apache.org/jira/browse/SPARK-47800
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
>
> Move the conversion of a v2 identifier object to a v1 table identifier into a new method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47827) Missing warnings for deprecated features

2024-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47827.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46021
[https://github.com/apache/spark/pull/46021]

> Missing warnings for deprecated features
> 
>
> Key: SPARK-47827
> URL: https://issues.apache.org/jira/browse/SPARK-47827
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Some APIs will be removed but are missing deprecation warnings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47827) Missing warnings for deprecated features

2024-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47827:


Assignee: Haejoon Lee

> Missing warnings for deprecated features
> 
>
> Key: SPARK-47827
> URL: https://issues.apache.org/jira/browse/SPARK-47827
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Some APIs will be removed but are missing deprecation warnings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org