[jira] [Resolved] (SPARK-47420) Fix CollationSupport test output

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47420.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46058
[https://github.com/apache/spark/pull/46058]

> Fix CollationSupport test output
> 
>
> Key: SPARK-47420
> URL: https://issues.apache.org/jira/browse/SPARK-47420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47769) Add schema_of_variant_agg expression.

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47769.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45934
[https://github.com/apache/spark/pull/45934]

> Add schema_of_variant_agg expression.
> -
>
> Key: SPARK-47769
> URL: https://issues.apache.org/jira/browse/SPARK-47769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47769) Add schema_of_variant_agg expression.

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47769:
---

Assignee: Chenhao Li

> Add schema_of_variant_agg expression.
> -
>
> Key: SPARK-47769
> URL: https://issues.apache.org/jira/browse/SPARK-47769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47463) An error occurred while pushing down the filter of if expression for iceberg datasource.

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47463.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45589
[https://github.com/apache/spark/pull/45589]

> An error occurred while pushing down the filter of if expression for iceberg 
> datasource.
> 
>
> Key: SPARK-47463
> URL: https://issues.apache.org/jira/browse/SPARK-47463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Spark 3.5.0
> Iceberg 1.4.3
>Reporter: Zhen Wang
>Assignee: Zhen Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Reproduce:
> {code:java}
> create table t1(c1 int) using iceberg;
> select * from
> (select if(c1 = 1, c1, null) as c1 from t1) t
> where t.c1 > 0; {code}
> Error:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> optimization failed with an internal error. You hit a bug in Spark or the 
> Spark plugins you use. Please, report this bug to the corresponding 
> communities or vendors, and provide the full stack trace.
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:107)
>   at 
> org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:536)
>   at 
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:548)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:148)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:144)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:162)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:182)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:179)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:238)
>   at 
> org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:284)
>   at 
> org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:252)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:117)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4327)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:3580)
>   at 
> org.apache.kyuubi.engine.spark.operation.ExecuteStatement.fullCollectResult(ExecuteStatement.scala:72)
>   at 
> org.apache.kyuubi.engine.spark.operation.ExecuteStatement.collectAsIterator(ExecuteStatement.scala:164)
>   at 
> org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:87)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.kyuubi.engine.spark.operation.SparkOperation.$anonfun$withLocalProperties$1(SparkOperation.scala:155)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
>   at 
> org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:139)
>   at 
> org.apache.kyuubi.engine.spark.operation.ExecuteStatement.executeStatement(ExecuteStatement.scala:81)
>   at 
> org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$1.run(ExecuteStatement.scala:103)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.AssertionError: assertion failed

[jira] [Resolved] (SPARK-47862) Connect generated protos can't be pickled

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47862.
--
Resolution: Fixed

Issue resolved by pull request 46068
[https://github.com/apache/spark/pull/46068]

> Connect generated protos can't be pickled
> -
>
> Key: SPARK-47862
> URL: https://issues.apache.org/jira/browse/SPARK-47862
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When Spark Connect generates the protobuf files, they're manually adjusted 
> and moved to the right folder. However, we did not fix the package for the 
> descriptor. This breaks serializing them to proto.






[jira] [Assigned] (SPARK-47862) Connect generated protos can't be pickled

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47862:


Assignee: Martin Grund

> Connect generated protos can't be pickled
> -
>
> Key: SPARK-47862
> URL: https://issues.apache.org/jira/browse/SPARK-47862
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When Spark Connect generates the protobuf files, they're manually adjusted 
> and moved to the right folder. However, we did not fix the package for the 
> descriptor. This breaks serializing them to proto.






[jira] [Assigned] (SPARK-47840) Remove foldable propagation across Streaming Aggregate/Join nodes

2024-04-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47840:


Assignee: Bhuwan Sahni

> Remove foldable propagation across Streaming Aggregate/Join nodes
> -
>
> Key: SPARK-47840
> URL: https://issues.apache.org/jira/browse/SPARK-47840
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
>
> Streaming queries with a Union of 2 data streams followed by an Aggregate 
> (groupBy) can produce incorrect results if the grouping key is a constant 
> literal for the micro-batch duration.
> The query produces incorrect results because the query optimizer recognizes 
> the literal value in the grouping key as foldable and replaces the grouping 
> key expression with the actual literal value. This optimization is correct 
> for batch queries. However, streaming queries also read information from the 
> StateStore, and the output contains both the results from the StateStore 
> (computed in previous microbatches) and data from input sources (computed in 
> this microbatch). The HashAggregate node after the StateStore always reads 
> the grouping key value as the optimized literal (as the grouping key expression 
> is optimized into a literal by the optimizer). This ends up replacing keys in 
> the StateStore with the literal value, resulting in incorrect output.
> See an example logical and physical plan below for a query performing a union 
> of 2 data streams, followed by a groupBy. Note that the name#4 expression has 
> been optimized to ds1. The streaming query Aggregate adds a StateStoreSave node 
> as a child of HashAggregate; however, any grouping key read from the StateStore 
> will still be read as ds1 due to the optimization.
>  
> *Optimized Logical Plan*
> {quote}=== Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
> === Old Plan ===
> WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, Complete, 0
> +- Aggregate [name#4], [name#4, count(1) AS count#31L]
>    +- Project [ds1 AS name#4]
>       +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
> === New Plan ===
> WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, Complete, 0
> +- Aggregate [ds1], [ds1 AS name#4, count(1) AS count#31L]
>    +- Project [ds1 AS name#4]
>       +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
> {quote}
> *Corresponding Physical Plan*
> {quote}WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@2b4c6242], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3143/1859075634@35709d26
> +- HashAggregate(keys=[ds1#39], functions=[finalmerge_count(merge count#38L) AS count(1)#30L], output=[name#4, count#31L])
>    +- StateStoreSave [ds1#39], state info [ checkpoint = file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state, runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions = 5], Complete, 0, 0, 2
>       +- HashAggregate(keys=[ds1#39], functions=[merge_count(merge count#38L) AS count#38L], output=[ds1#39, count#38L])
>          +- StateStoreRestore [ds1#39], state info [ checkpoint = file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state, runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions = 5], 2
>             +- HashAggregate(keys=[ds1#39], functions=[merge_count(merge count#38L) AS count#38L], output=[ds1#39, count#38L])
>                +- HashAggregate(keys=[ds1 AS ds1#39], functions=[partial_count(1) AS count#38L], output=[ds1#39, count#38L])
>                   +- Project
>                      +- MicroBatchScan[value#1] MemoryStreamDataSource
> {quote}
>  
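A minimal sketch (not taken from the ticket; the MemoryStream usage and the source names are illustrative assumptions) of the kind of query this describes: two sources tagged with constant literals, unioned, then grouped on that literal column.

{code:scala}
// Hedged sketch of a streaming Union + groupBy on a literal-valued column.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext

val source1 = MemoryStream[Int]
val source2 = MemoryStream[Int]

// Each branch carries a constant literal, so the optimizer can fold the grouping key.
val counts = source1.toDF().select(lit("ds1").as("name"))
  .union(source2.toDF().select(lit("ds2").as("name")))
  .groupBy("name")
  .count()

val query = counts.writeStream
  .format("memory")
  .queryName("counts")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/spark-47840-checkpoint")
  .start()

source1.addData(1, 2)
source2.addData(3)
query.processAllAvailable()
spark.table("counts").show()  // with the bug, state keys can collapse to "ds1"
{code}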






[jira] [Resolved] (SPARK-47840) Remove foldable propagation across Streaming Aggregate/Join nodes

2024-04-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47840.
--
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46035
[https://github.com/apache/spark/pull/46035]

> Remove foldable propagation across Streaming Aggregate/Join nodes
> -
>
> Key: SPARK-47840
> URL: https://issues.apache.org/jira/browse/SPARK-47840
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> Streaming queries with a Union of 2 data streams followed by an Aggregate 
> (groupBy) can produce incorrect results if the grouping key is a constant 
> literal for the micro-batch duration.
> The query produces incorrect results because the query optimizer recognizes 
> the literal value in the grouping key as foldable and replaces the grouping 
> key expression with the actual literal value. This optimization is correct 
> for batch queries. However, streaming queries also read information from the 
> StateStore, and the output contains both the results from the StateStore 
> (computed in previous microbatches) and data from input sources (computed in 
> this microbatch). The HashAggregate node after the StateStore always reads 
> the grouping key value as the optimized literal (as the grouping key expression 
> is optimized into a literal by the optimizer). This ends up replacing keys in 
> the StateStore with the literal value, resulting in incorrect output.
> See an example logical and physical plan below for a query performing a union 
> of 2 data streams, followed by a groupBy. Note that the name#4 expression has 
> been optimized to ds1. The streaming query Aggregate adds a StateStoreSave node 
> as a child of HashAggregate; however, any grouping key read from the StateStore 
> will still be read as ds1 due to the optimization.
>  
> *Optimized Logical Plan*
> {quote}=== Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
> === Old Plan ===
> WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, Complete, 0
> +- Aggregate [name#4], [name#4, count(1) AS count#31L]
>    +- Project [ds1 AS name#4]
>       +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
> === New Plan ===
> WriteToMicroBatchDataSource MemorySink, eb67645e-30fc-41a8-8006-35bb7649c202, Complete, 0
> +- Aggregate [ds1], [ds1 AS name#4, count(1) AS count#31L]
>    +- Project [ds1 AS name#4]
>       +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
> {quote}
> *Corresponding Physical Plan*
> {quote}WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@2b4c6242], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3143/1859075634@35709d26
> +- HashAggregate(keys=[ds1#39], functions=[finalmerge_count(merge count#38L) AS count(1)#30L], output=[name#4, count#31L])
>    +- StateStoreSave [ds1#39], state info [ checkpoint = file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state, runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions = 5], Complete, 0, 0, 2
>       +- HashAggregate(keys=[ds1#39], functions=[merge_count(merge count#38L) AS count#38L], output=[ds1#39, count#38L])
>          +- StateStoreRestore [ds1#39], state info [ checkpoint = file:/tmp/streaming.metadata-e470782a-18a3-463c-9e61-3a10d0bdf180/state, runId = 4dedecca-910c-4518-855e-456702617414, opId = 0, ver = 0, numPartitions = 5], 2
>             +- HashAggregate(keys=[ds1#39], functions=[merge_count(merge count#38L) AS count#38L], output=[ds1#39, count#38L])
>                +- HashAggregate(keys=[ds1 AS ds1#39], functions=[partial_count(1) AS count#38L], output=[ds1#39, count#38L])
>                   +- Project
>                      +- MicroBatchScan[value#1] MemoryStreamDataSource
> {quote}
>  




[jira] [Assigned] (SPARK-47371) XML: Ignore row tags in CDATA Tokenizer

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47371:


Assignee: Yousof Hosny

> XML: Ignore row tags in CDATA Tokenizer
> ---
>
> Key: SPARK-47371
> URL: https://issues.apache.org/jira/browse/SPARK-47371
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Assignee: Yousof Hosny
>Priority: Minor
>  Labels: pull-request-available
>
> The current parser does not recognize CDATA sections and thus will read row 
> tags that are enclosed within a CDATA section. The expected behavior is for 
> none of the following rows to be read, but they are all read. 
> {code:java}
> // BUG:  rowTag in CDATA section
> val xmlString="""
> 
> 
> {code}
>  






[jira] [Resolved] (SPARK-47371) XML: Ignore row tags in CDATA Tokenizer

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47371.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45487
[https://github.com/apache/spark/pull/45487]

> XML: Ignore row tags in CDATA Tokenizer
> ---
>
> Key: SPARK-47371
> URL: https://issues.apache.org/jira/browse/SPARK-47371
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Assignee: Yousof Hosny
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The current parser does not recognize CDATA sections and thus will read row 
> tags that are enclosed within a CDATA section. The expected behavior is for 
> none of the following rows to be read, but they are all read. 
> {code:java}
> // BUG:  rowTag in CDATA section
> val xmlString="""
> 
> 
> {code}
>  
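(The XML literal above was stripped by the mail formatting; a hypothetical reconstruction of the kind of input that demonstrates the problem, assuming a rowTag of ROW, is sketched below.)

{code:scala}
// Hypothetical input (not the original literal): the <ROW> element sits inside
// a CDATA section, so the tokenizer should ignore it and produce zero rows.
val xmlString = """<ROOT>
  <NOTE><![CDATA[ <ROW><id>1</id></ROW> ]]></NOTE>
</ROOT>"""
{code}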






[jira] [Resolved] (SPARK-47866) Deflaky PythonForeachWriterSuite

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47866.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46070
[https://github.com/apache/spark/pull/46070]

> Deflaky PythonForeachWriterSuite
> 
>
> Key: SPARK-47866
> URL: https://issues.apache.org/jira/browse/SPARK-47866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47866) Deflaky PythonForeachWriterSuite

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47866:


Assignee: Dongjoon Hyun

> Deflaky PythonForeachWriterSuite
> 
>
> Key: SPARK-47866
> URL: https://issues.apache.org/jira/browse/SPARK-47866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-46810) Clarify error class terminology

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46810.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44902
[https://github.com/apache/spark/pull/44902]

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> h1. Option 3: "Error Class" and "State Class"
>  * SQL state class: 42
>  * SQL state sub-class: K01
>  * SQL state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change 

[jira] [Assigned] (SPARK-46810) Clarify error class terminology

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46810:
---

Assignee: Nicholas Chammas

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> h1. Option 3: "Error Class" and "State Class"
>  * SQL state class: 42
>  * SQL state sub-class: K01
>  * SQL state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "state class" is low impact 
> and may not show up in user-facing documentation at all. (See my 

[jira] [Updated] (SPARK-47867) Support Variant in JSON scan.

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47867:
---
Labels: pull-request-available  (was: )

> Support Variant in JSON scan.
> -
>
> Key: SPARK-47867
> URL: https://issues.apache.org/jira/browse/SPARK-47867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47673) [Arbitrary State Support] State TTL support - ListState

2024-04-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47673.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45932
[https://github.com/apache/spark/pull/45932]

> [Arbitrary State Support] State TTL support - ListState
> ---
>
> Key: SPARK-47673
> URL: https://issues.apache.org/jira/browse/SPARK-47673
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for expiring state values based on TTL for ListState in the 
> transformWithState operator.






[jira] [Created] (SPARK-47867) Support Variant in JSON scan.

2024-04-15 Thread Chenhao Li (Jira)
Chenhao Li created SPARK-47867:
--

 Summary: Support Variant in JSON scan.
 Key: SPARK-47867
 URL: https://issues.apache.org/jira/browse/SPARK-47867
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Chenhao Li









[jira] [Resolved] (SPARK-47804) Add Dataframe cache debug log

2024-04-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47804.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45990
[https://github.com/apache/spark/pull/45990]

> Add Dataframe cache debug log
> -
>
> Key: SPARK-47804
> URL: https://issues.apache.org/jira/browse/SPARK-47804
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add debug log for dataframe cache.






[jira] [Updated] (SPARK-47866) Deflaky PythonForeachWriterSuite

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47866:
---
Labels: pull-request-available  (was: )

> Deflaky PythonForeachWriterSuite
> 
>
> Key: SPARK-47866
> URL: https://issues.apache.org/jira/browse/SPARK-47866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47865) Deflaky PythonForeachWriterSuite

2024-04-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47865:
-

 Summary: Deflaky PythonForeachWriterSuite
 Key: SPARK-47865
 URL: https://issues.apache.org/jira/browse/SPARK-47865
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-47866) Deflaky PythonForeachWriterSuite

2024-04-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47866:
-

 Summary: Deflaky PythonForeachWriterSuite
 Key: SPARK-47866
 URL: https://issues.apache.org/jira/browse/SPARK-47866
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-43394) Upgrade maven to 3.8.8

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43394:
--
Fix Version/s: 3.4.4

> Upgrade maven to 3.8.8
> --
>
> Key: SPARK-43394
> URL: https://issues.apache.org/jira/browse/SPARK-43394
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.5.0, 3.4.4
>
>







[jira] [Updated] (SPARK-47828) DataFrameWriterV2.overwrite fails with invalid plan

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47828:
--
Fix Version/s: 3.5.2

> DataFrameWriterV2.overwrite fails with invalid plan
> ---
>
> Key: SPARK-47828
> URL: https://issues.apache.org/jira/browse/SPARK-47828
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>







[jira] [Created] (SPARK-47864) Enhance "Installation" page to cover all installable options

2024-04-15 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-47864:
---

 Summary: Enhance "Installation" page to cover all installable 
options
 Key: SPARK-47864
 URL: https://issues.apache.org/jira/browse/SPARK-47864
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


Like the Installation page from pandas, we might need to cover all installable 
options, with their related dependencies, in our Installation documentation.






[jira] [Assigned] (SPARK-47855) Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47855:
-

Assignee: Ruifeng Zheng

> Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect
> 
>
> Key: SPARK-47855
> URL: https://issues.apache.org/jira/browse/SPARK-47855
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47855) Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47855.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46056
[https://github.com/apache/spark/pull/46056]

> Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect
> 
>
> Key: SPARK-47855
> URL: https://issues.apache.org/jira/browse/SPARK-47855
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-47860) Upgrade `kubernetes-client` to 6.12.0

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47860.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46066
[https://github.com/apache/spark/pull/46066]

> Upgrade `kubernetes-client` to 6.12.0
> -
>
> Key: SPARK-47860
> URL: https://issues.apache.org/jira/browse/SPARK-47860
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-
Parent: SPARK-46837
Issue Type: Sub-task  (was: Bug)

> endsWith and startsWith don't work correctly for some collations
> 
>
> Key: SPARK-47863
> URL: https://issues.apache.org/jira/browse/SPARK-47863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different byte lengths can be equal under 
> case-insensitive and lowercase collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testContains*
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testEndsWith*
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.






[jira] [Created] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-47863:


 Summary: endsWith and startsWith don't work correctly for some 
collations
 Key: SPARK-47863
 URL: https://issues.apache.org/jira/browse/SPARK-47863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different byte lengths can be equal under case-insensitive and lowercase 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.testContains*
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.testEndsWith*

The first passes, since it uses *StringSearch* directly; the second one 
does not.
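A small JVM-level illustration (a hedged sketch with plain Java strings, not Spark's UTF8String code) of why a fixed-offset comparison cannot work here: the logically equal suffixes "İo" and "i̇o" occupy different numbers of chars/bytes.

{code:scala}
// "İ" (U+0130) lower-cases to "i" + combining dot above (U+0069 U+0307),
// so the case-insensitively equal suffixes differ in length.
val str    = "The İo"   // ends with the two-char suffix "İo"
val suffix = "i̇o"       // three chars: i, U+0307, o

// A fixed-offset, case-insensitive tail comparison (what byte-offset matching
// effectively does) lines up the wrong characters and fails.
println(str.regionMatches(true, str.length - suffix.length, suffix, 0, suffix.length)) // false

// Proper case mapping shows the suffix really is there.
println(str.toLowerCase(java.util.Locale.ROOT).endsWith(suffix)) // true
{code}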






[jira] [Updated] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-
Description: 
*CollationSupport.EndsWIth* and *CollationSupport.StartsWith* use 
{*}CollationAwareUTF8String.matchAt{*}, which operates byte offsets to compare 
prefixes/suffixes. This is not correct, since sometimes string parts 
(suffix/prefix) of different lengths are actually equal in context of 
case-insensitive and lower-case collations.

Example test cases that highlight the problem:

{{{}- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.*{}}}{{{}{*}testContains{*}.{}}} 
{{{}- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
}}{*}{{CollationSupportSuite.{*}{}}}{{{}{*}testEndsWith{*}.{}}}

{{The first passes, since it uses *StringSearch* directly, the second one does 
not.}}

  was:
*CollationSupport.EndsWIth* and *CollationSupport.StartsWith* use 
{*}CollationAwareUTF8String.matchAt{*}, which operates byte offsets to compare 
prefixes/suffixes. This is not correct, since sometimes string parts 
(suffix/prefix) of different lengths are actually equal in context of 
case-insensitive and lower-case collations.



Example test cases that highlight the problem:

{{{}- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.*{}}}{{{}{*}testContains{*}.{}}} 
{{- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
}}{*}{{CollationSupportSuite.}}{*}{{{}{*}testEndsWith{*}.{}}}

{{{}The first passes, since it uses *StringSearch* directly, the second one 
does not.{}}}{{{}{}}}


> endsWith and startsWith don't work correctly for some collations
> 
>
> Key: SPARK-47863
> URL: https://issues.apache.org/jira/browse/SPARK-47863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different byte lengths can be equal under 
> case-insensitive and lowercase collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testContains*
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testEndsWith*
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.






[jira] [Updated] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-
Description: 
*CollationSupport.EndsWIth* and *CollationSupport.StartsWith* use 
{*}CollationAwareUTF8String.matchAt{*}, which operates byte offsets to compare 
prefixes/suffixes. This is not correct, since sometimes string parts 
(suffix/prefix) of different lengths are actually equal in context of 
case-insensitive and lower-case collations.

Example test cases that highlight the problem:

{{{}- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.*{}}}{{{}{*}testContains{*}.{}}} 
{{{}- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.*{}}}{{{}{*}testEndsWith{*}.{}}}

{{The first passes, since it uses *StringSearch* directly, the second one does 
not.}}

  was:
*CollationSupport.EndsWIth* and *CollationSupport.StartsWith* use 
{*}CollationAwareUTF8String.matchAt{*}, which operates byte offsets to compare 
prefixes/suffixes. This is not correct, since sometimes string parts 
(suffix/prefix) of different lengths are actually equal in context of 
case-insensitive and lower-case collations.

Example test cases that highlight the problem:

{{{}- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.*{}}}{{{}{*}testContains{*}.{}}} 
{{{}- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
}}{*}{{CollationSupportSuite.{*}{}}}{{{}{*}testEndsWith{*}.{}}}

{{The first passes, since it uses *StringSearch* directly, the second one does 
not.}}


> endsWith and startsWith don't work correctly for some collations
> 
>
> Key: SPARK-47863
> URL: https://issues.apache.org/jira/browse/SPARK-47863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different byte lengths can be equal under 
> case-insensitive and lowercase collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testContains*
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testEndsWith*
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.






[jira] [Updated] (SPARK-47862) Connect generated protos can't be pickled

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47862:
---
Labels: pull-request-available  (was: )

> Connect generated protos can't be pickled
> -
>
> Key: SPARK-47862
> URL: https://issues.apache.org/jira/browse/SPARK-47862
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When Spark Connect generates the protobuf files, they're manually adjusted 
> and moved to the right folder. However, we did not fix the package for the 
> descriptor. This breaks serializing them to proto.






[jira] [Created] (SPARK-47862) Connect generated protos can't be pickled

2024-04-15 Thread Martin Grund (Jira)
Martin Grund created SPARK-47862:


 Summary: Connect generated protos can't be pickled
 Key: SPARK-47862
 URL: https://issues.apache.org/jira/browse/SPARK-47862
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.1
Reporter: Martin Grund
 Fix For: 4.0.0


When Spark Connect generates the protobuf files, they're manually adjusted and 
moved to the right folder. However, we did not fix the package for the 
descriptor. This breaks serializing them to proto.






[jira] [Resolved] (SPARK-47603) Resource managers: Migrate logWarn with variables to structured logging framework

2024-04-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47603.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45957
[https://github.com/apache/spark/pull/45957]

> Resource managers: Migrate logWarn with variables to structured logging 
> framework
> -
>
> Key: SPARK-47603
> URL: https://issues.apache.org/jira/browse/SPARK-47603
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-47860) Upgrade `kubernetes-client` to 6.12.0

2024-04-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47860:
-

 Summary: Upgrade `kubernetes-client` to 6.12.0
 Key: SPARK-47860
 URL: https://issues.apache.org/jira/browse/SPARK-47860
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Kubernetes
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-47859) Why does javaRDD().mapPartitions lead to the memory leak in this case?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Description: 
Hello Spark community. I have an Java Spark Structured Streaming application:
Unless I am doing silly mistake, the JedisCluster closed in the finally block, 
but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And the function
{code:java}
public class MyFunction implements FlatMapFunction<Iterator<Row>, Row> {

    ...

    @Override
    public Iterator<Row> call(Iterator<Row> rowIterator) throws Exception {

        List<Row> output;
        JedisCluster redis = new JedisCluster(jedisConfiguration);

        try {
            output = new ArrayList<>();

            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                Long var1 = row.getAs("var1");
                Long var2 = row.getAs("var2");

                var redisKey = "some_key";
                var result = redis.hgetAll(redisKey);

                if (!result.isEmpty()) {
                    output.add(RowFactory.create(
                            var1,
                            var2,
                            result.getOrDefault("some_id", null)));
                }
            }
        } finally {
            if (redis != null) {
                try {
                    redis.close();
                } catch (Exception e) {
                    throw new RuntimeException("Failed to close Redis connection: " + e);
                }
            }
        }
        return output.iterator();
    }
} {code}
It actually works for a couple of days and then dies. I can't figure out what
causes the memory leak in the Driver.

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!

  was:
Hello Spark community. I have an Java Spark Structured Streaming application:
Unless I am doing silly mistake, the JedisCluster closed in the finally block, 
but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And the function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!


> Why does javaRDD().mapPartitions lead to the memory leak in this case?
> --
>
> Key: SPARK-47859
>

[jira] [Comment Edited] (SPARK-47859) Why does javaRDD().mapPartitions lead to the memory leak in this case?

2024-04-15 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837420#comment-17837420
 ] 

Leo Timofeyev edited comment on SPARK-47859 at 4/15/24 7:40 PM:


This is likely a duplicate of https://issues.apache.org/jira/browse/SPARK-35262


was (Author: JIRAUSER303957):
This is likely duplicate https://issues.apache.org/jira/browse/SPARK-35262

> Why does javaRDD().mapPartitions lead to the memory leak in this case?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png
>
>
> Hello Spark community. I have an Java Spark Structured Streaming application:
> Unless I am doing silly mistake, the JedisCluster closed in the finally 
> block, but still some memory leak. 
> {code:java}
> FlatMapFunction, Row> myFunction = new 
> MyFunction(jedisConfiguration);
> StructType  structSchema = getSchema();
> VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
> Dataset dataset = getDataset();
>         dataset.persist();
> JavaRDD processedRDD = 
> dataset.javaRDD().mapPartitions(myFunction);
> Dataset processedDS = 
> sparkSession().createDataFrame(processedRDD, structSchema);
> parquetWriter.write(processedDS);
> dataset.unpersist();
> };
> DataStreamWriter dataStream = dataset
> .writeStream()
> .foreachBatch(forEachFunc)
> .outputMode(outputMode)
> .option("checkpointLocation", checkpointLocation);
>  {code}
> And the function
> {code:java}
> public class MyFunction implements FlatMapFunction, Row> {
>...
> @Override
> public Iterator call(Iterator rowIterator) throws Exception {
> List output;
> JedisCluster redis = new JedisCluster(jedisConfiguration);
> try {
> output = new ArrayList<>();
> while (rowIterator.hasNext()) {
> Row row = rowIterator.next();
> Long var1 = row.getAs("var1");
> Long var2 = row.getAs("var2");
> var redisKey = "some_key";
> var result = redis.hgetAll(redisKey);
> if (!result.isEmpty()) {
> output.add(RowFactory.create(
> var1,
> var2,
> result.getOrDefault("some_id", null)));
> }
> }
> } finally {
> if (redis != null) {
> try {
> redis.close();
> } catch (Exception e) {
> throw new RuntimeException("Failed to close Redis 
> connection: " + e);
> }
> }
> }
> return output.iterator();
> }
> } {code}
> It actually works couple of days then dies. Can't figure out what does cause 
> memory leak in the  Driver?
> Tested with Spark 3.3.2 and 3.5.0
> Grafana board of the Driver's Memory Pool 
> !Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47859) Why does javaRDD().mapPartitions lead to the memory leak in this case?

2024-04-15 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837420#comment-17837420
 ] 

Leo Timofeyev commented on SPARK-47859:
---

This is likely duplicate

> Why does javaRDD().mapPartitions lead to the memory leak in this case?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png
>
>
> Hello Spark community. I have an Java Spark Structured Streaming application:
> Unless I am doing silly mistake, the JedisCluster closed in the finally 
> block, but still some memory leak. 
> {code:java}
> FlatMapFunction, Row> myFunction = new 
> MyFunction(jedisConfiguration);
> StructType  structSchema = getSchema();
> VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
> Dataset dataset = getDataset();
>         dataset.persist();
> JavaRDD processedRDD = 
> dataset.javaRDD().mapPartitions(myFunction);
> Dataset processedDS = 
> sparkSession().createDataFrame(processedRDD, structSchema);
> parquetWriter.write(processedDS);
> dataset.unpersist();
> };
> DataStreamWriter dataStream = dataset
> .writeStream()
> .foreachBatch(forEachFunc)
> .outputMode(outputMode)
> .option("checkpointLocation", checkpointLocation);
>  {code}
> And the function
> {code:java}
> public class MyFunction implements FlatMapFunction, Row> {
>...
> @Override
> public Iterator call(Iterator rowIterator) throws Exception {
> List output;
> JedisCluster redis = new JedisCluster(jedisConfiguration);
> try {
> output = new ArrayList<>();
> while (rowIterator.hasNext()) {
> Row row = rowIterator.next();
> Long var1 = row.getAs("var1");
> Long var2 = row.getAs("var2");
> var redisKey = "some_key";
> var result = redis.hgetAll(redisKey);
> if (!result.isEmpty()) {
> output.add(RowFactory.create(
> var1,
> var2,
> result.getOrDefault("some_id", null)));
> }
> }
> } finally {
> if (redis != null) {
> try {
> redis.close();
> } catch (Exception e) {
> throw new RuntimeException("Failed to close Redis 
> connection: " + e);
> }
> }
> }
> return output.iterator();
> }
> } {code}
> It actually works couple of days then dies. Can't figure out what does cause 
> memory leak in the  Driver?
> Tested with Spark 3.3.2 and 3.5.0
> Grafana board of the Driver's Memory Pool 
> !Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47859) Why does javaRDD().mapPartitions lead to the memory leak in this case?

2024-04-15 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837420#comment-17837420
 ] 

Leo Timofeyev edited comment on SPARK-47859 at 4/15/24 7:39 PM:


This is likely duplicate https://issues.apache.org/jira/browse/SPARK-35262


was (Author: JIRAUSER303957):
This is likely duplicate

> Why does javaRDD().mapPartitions lead to the memory leak in this case?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png
>
>
> Hello Spark community. I have an Java Spark Structured Streaming application:
> Unless I am doing silly mistake, the JedisCluster closed in the finally 
> block, but still some memory leak. 
> {code:java}
> FlatMapFunction, Row> myFunction = new 
> MyFunction(jedisConfiguration);
> StructType  structSchema = getSchema();
> VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
> Dataset dataset = getDataset();
>         dataset.persist();
> JavaRDD processedRDD = 
> dataset.javaRDD().mapPartitions(myFunction);
> Dataset processedDS = 
> sparkSession().createDataFrame(processedRDD, structSchema);
> parquetWriter.write(processedDS);
> dataset.unpersist();
> };
> DataStreamWriter dataStream = dataset
> .writeStream()
> .foreachBatch(forEachFunc)
> .outputMode(outputMode)
> .option("checkpointLocation", checkpointLocation);
>  {code}
> And the function
> {code:java}
> public class MyFunction implements FlatMapFunction, Row> {
>...
> @Override
> public Iterator call(Iterator rowIterator) throws Exception {
> List output;
> JedisCluster redis = new JedisCluster(jedisConfiguration);
> try {
> output = new ArrayList<>();
> while (rowIterator.hasNext()) {
> Row row = rowIterator.next();
> Long var1 = row.getAs("var1");
> Long var2 = row.getAs("var2");
> var redisKey = "some_key";
> var result = redis.hgetAll(redisKey);
> if (!result.isEmpty()) {
> output.add(RowFactory.create(
> var1,
> var2,
> result.getOrDefault("some_id", null)));
> }
> }
> } finally {
> if (redis != null) {
> try {
> redis.close();
> } catch (Exception e) {
> throw new RuntimeException("Failed to close Redis 
> connection: " + e);
> }
> }
> }
> return output.iterator();
> }
> } {code}
> It actually works couple of days then dies. Can't figure out what does cause 
> memory leak in the  Driver?
> Tested with Spark 3.3.2 and 3.5.0
> Grafana board of the Driver's Memory Pool 
> !Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35262) Memory leak when dataset is being persisted

2024-04-15 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837418#comment-17837418
 ] 

Leo Timofeyev commented on SPARK-35262:
---

Hello [~dnskrv] [~iamelin],
I can confirm that the issue still exists in 3.5.0.

> Memory leak when dataset is being persisted
> ---
>
> Key: SPARK-35262
> URL: https://issues.apache.org/jira/browse/SPARK-35262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Igor Amelin
>Priority: Major
>
> If a Java or Scala application with a SparkSession runs for a long time and 
> persists a lot of datasets, it can crash because of a memory leak.
> I've noticed the following. When we have a dataset and persist it, the 
> SparkSession used to load that dataset is cloned in CacheManager, and this 
> clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But 
> this clone isn't removed from the list of listeners after that, e.g. after 
> unpersisting the dataset. If we persist a lot of datasets, the SparkSession 
> is cloned and added to `ListenerBus` many times. This leads to a memory leak 
> since the `listenersPlusTimers` list becomes very large.
> I've found out that the SparkSession is cloned in CacheManager when the 
> parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and 
> `spark.sql.adaptive.enabled` are true. The first one is true by default, and 
> this default behavior leads to the problem. When auto bucketed scan is 
> disabled, the SparkSession isn't cloned, and there are no duplicates in 
> ListenerBus, so the memory leak doesn't occur.
> Here is a small Java application to reproduce the memory leak: 
> [https://github.com/iamelin/spark-memory-leak]
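
A minimal sketch of the workaround implied above (disabling auto bucketed scans so that persisting a dataset does not clone the SparkSession). The class and session names are illustrative; only the config key is taken from the description, and whether this fully removes the leak on 3.5.0 is not confirmed here.
{code:java}
import org.apache.spark.sql.SparkSession;

public class AutoBucketedScanWorkaround {
    public static void main(String[] args) {
        // Config key quoted from the description above; everything else is illustrative.
        SparkSession spark = SparkSession.builder()
                .appName("listener-leak-workaround")
                .config("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
                .getOrCreate();

        // The same flag can also be flipped at runtime, before datasets are persisted.
        spark.conf().set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false");

        spark.stop();
    }
}
{code}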



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47859) Why does javaRDD().mapPartitions lead to the memory leak in this case?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Summary: Why does javaRDD().mapPartitions lead to the memory leak in this 
case?  (was: Why does  lead to the memory leak?)

> Why does javaRDD().mapPartitions lead to the memory leak in this case?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png
>
>
> Hello Spark community. I have an Java Spark Structured Streaming application:
> Unless I am doing silly mistake, the JedisCluster closed in the finally 
> block, but still some memory leak. 
> {code:java}
> FlatMapFunction, Row> myFunction = new 
> MyFunction(jedisConfiguration);
> StructType  structSchema = getSchema();
> VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
> Dataset dataset = getDataset();
>         dataset.persist();
> JavaRDD processedRDD = 
> dataset.javaRDD().mapPartitions(myFunction);
> Dataset processedDS = 
> sparkSession().createDataFrame(processedRDD, structSchema);
> parquetWriter.write(processedDS);
> dataset.unpersist();
> };
> DataStreamWriter dataStream = dataset
> .writeStream()
> .foreachBatch(forEachFunc)
> .outputMode(outputMode)
> .option("checkpointLocation", checkpointLocation);
>  {code}
> And the function
> {code:java}
> public class MyFunction implements FlatMapFunction, Row> {
>...
> @Override
> public Iterator call(Iterator rowIterator) throws Exception {
> List output;
> JedisCluster redis = new JedisCluster(jedisConfiguration);
> try {
> output = new ArrayList<>();
> while (rowIterator.hasNext()) {
> Row row = rowIterator.next();
> Long var1 = row.getAs("var1");
> Long var2 = row.getAs("var2");
> var redisKey = "some_key";
> var result = redis.hgetAll(redisKey);
> if (!result.isEmpty()) {
> output.add(RowFactory.create(
> var1,
> var2,
> result.getOrDefault("some_id", null)));
> }
> }
> } finally {
> if (redis != null) {
> try {
> redis.close();
> } catch (Exception e) {
> throw new RuntimeException("Failed to close Redis 
> connection: " + e);
> }
> }
> }
> return output.iterator();
> }
> } {code}
> It actually works couple of days then dies. Can't figure out what does cause 
> memory leak in the  Driver?
> Tested with Spark 3.3.2 and 3.5.0
> Grafana board of the Driver's Memory Pool 
> !Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47859) Why does lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Summary: Why does  lead to the memory leak?  (was: Why does this lead to 
the memory leak?)

> Why does  lead to the memory leak?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png
>
>
> Hello Spark community. I have an Java Spark Structured Streaming application:
> Unless I am doing silly mistake, the JedisCluster closed in the finally 
> block, but still some memory leak. 
> {code:java}
> FlatMapFunction, Row> myFunction = new 
> MyFunction(jedisConfiguration);
> StructType  structSchema = getSchema();
> VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
> Dataset dataset = getDataset();
>         dataset.persist();
> JavaRDD processedRDD = 
> dataset.javaRDD().mapPartitions(myFunction);
> Dataset processedDS = 
> sparkSession().createDataFrame(processedRDD, structSchema);
> parquetWriter.write(processedDS);
> dataset.unpersist();
> };
> DataStreamWriter dataStream = dataset
> .writeStream()
> .foreachBatch(forEachFunc)
> .outputMode(outputMode)
> .option("checkpointLocation", checkpointLocation);
>  {code}
> And the function
> {code:java}
> public class MyFunction implements FlatMapFunction, Row> {
>...
> @Override
> public Iterator call(Iterator rowIterator) throws Exception {
> List output;
> JedisCluster redis = new JedisCluster(jedisConfiguration);
> try {
> output = new ArrayList<>();
> while (rowIterator.hasNext()) {
> Row row = rowIterator.next();
> Long var1 = row.getAs("var1");
> Long var2 = row.getAs("var2");
> var redisKey = "some_key";
> var result = redis.hgetAll(redisKey);
> if (!result.isEmpty()) {
> output.add(RowFactory.create(
> var1,
> var2,
> result.getOrDefault("some_id", null)));
> }
> }
> } finally {
> if (redis != null) {
> try {
> redis.close();
> } catch (Exception e) {
> throw new RuntimeException("Failed to close Redis 
> connection: " + e);
> }
> }
> }
> return output.iterator();
> }
> } {code}
> It actually works couple of days then dies. Can't figure out what does cause 
> memory leak in the  Driver?
> Tested with Spark 3.3.2 and 3.5.0
> Grafana board of the Driver's Memory Pool 
> !Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47859) Why does this lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Description: 
Hello Spark community. I have an Java Spark Structured Streaming application:
Unless I am doing silly mistake, the JedisCluster closed in the finally block, 
but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And the function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!

  was:
Hello Spark community. I have an Java Spark Structured Streaming application:
Unless I am doing silly mistake, the JedisCluster closed in the finally block, 
but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!


> Why does this lead to the memory leak?
> --
>
> Key: SPARK-47859
> URL: 

[jira] [Updated] (SPARK-47859) Why does this lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Description: 
Hello Spark community. I have an Java Spark Structured Streaming application:
JedisCluster closed in finally block, but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!

  was:
Hello Spark community. I have an Java Spark Structured Streaming application:
JedisCluster closed in finally block, but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!


> Why does this lead to the memory leak?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 

[jira] [Updated] (SPARK-47859) Why does this lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Description: 
Hello Spark community. I have an Java Spark Structured Streaming application:
Unless I am doing silly mistake, the JedisCluster closed in the finally block, 
but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!

  was:
Hello Spark community. I have an Java Spark Structured Streaming application:
JedisCluster closed in finally block, but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Tested with Spark 3.3.2 and 3.5.0

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!


> Why does this lead to the memory leak?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  

[jira] [Updated] (SPARK-47859) Why does this lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Description: 
Hello Spark community. I have an Java Spark Structured Streaming application:
JedisCluster closed in finally block, but still some memory leak. 
{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}
And function
{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?

Grafana board of the Driver's Memory Pool 
!Screenshot 2024-04-15 at 20.43.22.png|width=875,height=169!

  was:
Hello Spark community. I have an Java Spark Structured Streaming application:
JedisCluster closed in finally block, but still some memory leak. 


{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}

And function

{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?


> Why does this lead to the memory leak?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png

[jira] [Created] (SPARK-47859) Why does this lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)
Leo Timofeyev created SPARK-47859:
-

 Summary: Why does this lead to the memory leak?
 Key: SPARK-47859
 URL: https://issues.apache.org/jira/browse/SPARK-47859
 Project: Spark
  Issue Type: IT Help
  Components: Spark Core
Affects Versions: 3.5.0, 3.3.2
Reporter: Leo Timofeyev
 Attachments: Screenshot 2024-04-15 at 20.43.22.png

Hello Spark community. I have an Java Spark Structured Streaming application:
JedisCluster closed in finally block, but still some memory leak. 


{code:java}
FlatMapFunction, Row> myFunction = new 
MyFunction(jedisConfiguration);
StructType  structSchema = getSchema();

VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
Dataset dataset = getDataset();
        dataset.persist();
JavaRDD processedRDD = dataset.javaRDD().mapPartitions(myFunction);
Dataset processedDS = sparkSession().createDataFrame(processedRDD, 
structSchema);
parquetWriter.write(processedDS);
dataset.unpersist();
};

DataStreamWriter dataStream = dataset
.writeStream()
.foreachBatch(forEachFunc)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation);

 {code}

And function

{code:java}
public class MyFunction implements FlatMapFunction, Row> {

   ...

@Override
public Iterator call(Iterator rowIterator) throws Exception {

List output;
JedisCluster redis = new JedisCluster(jedisConfiguration);

try {
output = new ArrayList<>();

while (rowIterator.hasNext()) {
Row row = rowIterator.next();
Long var1 = row.getAs("var1");
Long var2 = row.getAs("var2");

var redisKey = "some_key";
var result = redis.hgetAll(redisKey);

if (!result.isEmpty()) {
output.add(RowFactory.create(
var1,
var2,
result.getOrDefault("some_id", null)));
}
}
} finally {
if (redis != null) {
try {
redis.close();
} catch (Exception e) {
throw new RuntimeException("Failed to close Redis 
connection: " + e);
}
}
}
return output.iterator();
}
} {code}
It actually works couple of days then dies. Can't figure out what does cause 
memory leak in the  Driver?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47859) Why does this lead to the memory leak?

2024-04-15 Thread Leo Timofeyev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leo Timofeyev updated SPARK-47859:
--
Attachment: Screenshot 2024-04-15 at 20.43.22.png

> Why does this lead to the memory leak?
> --
>
> Key: SPARK-47859
> URL: https://issues.apache.org/jira/browse/SPARK-47859
> Project: Spark
>  Issue Type: IT Help
>  Components: Spark Core
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Leo Timofeyev
>Priority: Major
> Attachments: Screenshot 2024-04-15 at 20.43.22.png
>
>
> Hello Spark community. I have an Java Spark Structured Streaming application:
> JedisCluster closed in finally block, but still some memory leak. 
> {code:java}
> FlatMapFunction, Row> myFunction = new 
> MyFunction(jedisConfiguration);
> StructType  structSchema = getSchema();
> VoidFunction2, Long> forEachFunc = (dataset, aLong) -> {
> Dataset dataset = getDataset();
>         dataset.persist();
> JavaRDD processedRDD = 
> dataset.javaRDD().mapPartitions(myFunction);
> Dataset processedDS = 
> sparkSession().createDataFrame(processedRDD, structSchema);
> parquetWriter.write(processedDS);
> dataset.unpersist();
> };
> DataStreamWriter dataStream = dataset
> .writeStream()
> .foreachBatch(forEachFunc)
> .outputMode(outputMode)
> .option("checkpointLocation", checkpointLocation);
>  {code}
> And function
> {code:java}
> public class MyFunction implements FlatMapFunction, Row> {
>...
> @Override
> public Iterator call(Iterator rowIterator) throws Exception {
> List output;
> JedisCluster redis = new JedisCluster(jedisConfiguration);
> try {
> output = new ArrayList<>();
> while (rowIterator.hasNext()) {
> Row row = rowIterator.next();
> Long var1 = row.getAs("var1");
> Long var2 = row.getAs("var2");
> var redisKey = "some_key";
> var result = redis.hgetAll(redisKey);
> if (!result.isEmpty()) {
> output.add(RowFactory.create(
> var1,
> var2,
> result.getOrDefault("some_id", null)));
> }
> }
> } finally {
> if (redis != null) {
> try {
> redis.close();
> } catch (Exception e) {
> throw new RuntimeException("Failed to close Redis 
> connection: " + e);
> }
> }
> }
> return output.iterator();
> }
> } {code}
> It actually works couple of days then dies. Can't figure out what does cause 
> memory leak in the  Driver?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47694) Make max message size configurable on client side

2024-04-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-47694:
-

Assignee: Robert Dillitz  (was: Martin Grund)

> Make max message size configurable on client side
> -
>
> Key: SPARK-47694
> URL: https://issues.apache.org/jira/browse/SPARK-47694
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Robert Dillitz
>Assignee: Robert Dillitz
>Priority: Major
>  Labels: pull-request-available
>
> Follow-up to SPARK-42816: Make the limit configurable on the client side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47694) Make max message size configurable on client side

2024-04-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-47694.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Make max message size configurable on client side
> -
>
> Key: SPARK-47694
> URL: https://issues.apache.org/jira/browse/SPARK-47694
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Robert Dillitz
>Assignee: Robert Dillitz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Follow-up to SPARK-42816: Make the limit configurable on the client side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47856) Oracle: Document Mapping Spark SQL Data Types from Oracle

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47856.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46059
[https://github.com/apache/spark/pull/46059]

> Oracle: Document Mapping Spark SQL Data Types from Oracle
> -
>
> Key: SPARK-47856
> URL: https://issues.apache.org/jira/browse/SPARK-47856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47856) Oracle: Document Mapping Spark SQL Data Types from Oracle

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47856:
-

Assignee: Kent Yao

> Oracle: Document Mapping Spark SQL Data Types from Oracle
> -
>
> Key: SPARK-47856
> URL: https://issues.apache.org/jira/browse/SPARK-47856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47856) Oracle: Document Mapping Spark SQL Data Types from Oracle and add tests

2024-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47856:
--
Summary: Oracle: Document Mapping Spark SQL Data Types from Oracle and add 
tests  (was: Oracle: Document Mapping Spark SQL Data Types from Oracle)

> Oracle: Document Mapping Spark SQL Data Types from Oracle and add tests
> ---
>
> Key: SPARK-47856
> URL: https://issues.apache.org/jira/browse/SPARK-47856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell updated SPARK-47819:
--
Affects Version/s: 3.5.1
   3.5.0

> Use asynchronous callback for execution cleanup
> ---
>
> Key: SPARK-47819
> URL: https://issues.apache.org/jira/browse/SPARK-47819
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Expired sessions are regularly checked and cleaned up by a maintenance 
> thread. However, currently, this process is synchronous. Therefore, in rare 
> cases, interrupting the execution thread of a query in a session can take 
> hours, causing the entire maintenance process to stall, resulting in a large 
> amount of memory not being cleared.
> We address this by introducing asynchronous callbacks for execution cleanup, 
> avoiding synchronous joins of execution threads, and preventing the 
> maintenance thread from stalling in the above scenarios. To be more specific, 
> instead of calling {{runner.join()}} in ExecutorHolder.close(), we set a 
> post-cleanup function as the callback through 
> {{{}runner.processOnCompletion{}}}, which will be called asynchronously once 
> the execution runner is completed or interrupted. In this way, the 
> maintenance thread won't get blocked on {{{}join{}}}ing an execution thread.
>  
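
As a rough illustration of the callback pattern described above, using plain java.util.concurrent types (these are not the actual Spark Connect classes, and all names are made up): the cleanup work is registered as a completion callback instead of joining the execution thread.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCleanupSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Stand-in for the execution runner of a query.
        CompletableFuture<Void> runner = CompletableFuture.runAsync(() -> {
            // Long-running (possibly hard-to-interrupt) query execution.
        }, pool);

        // Instead of runner.join(), register the post-cleanup work as a callback
        // that fires once the runner completes or is interrupted.
        runner.whenComplete((result, error) ->
                System.out.println("execution resources released asynchronously"));

        // The maintenance thread returns immediately and never blocks here.
        pool.shutdown();
    }
}
{code}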



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47357) Add support for Upper, Lower, InitCap (all collations)

2024-04-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47357.
-
Fix Version/s: 4.0.0
 Assignee: Mihailo Milosevic
   Resolution: Fixed

> Add support for Upper, Lower, InitCap (all collations)
> --
>
> Key: SPARK-47357
> URL: https://issues.apache.org/jira/browse/SPARK-47357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47429) Rename errorClass to errorCondition

2024-04-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-47429:
-
Description: 
We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This ticket covers renaming {{subClass}} as well.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.

  was:
We've agreed on the parent task to rename {{errorClass}} to align it more 
closely with the SQL standard, and take advantage of the opportunity to break 
backwards compatibility offered by the Spark version change from 3.5 to 4.0.

This is a subtask so the changes are in their own PR and easier to review apart 
from other things.


> Rename errorClass to errorCondition
> ---
>
> Key: SPARK-47429
> URL: https://issues.apache.org/jira/browse/SPARK-47429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We've agreed on the parent task to rename {{errorClass}} to align it more 
> closely with the SQL standard, and take advantage of the opportunity to break 
> backwards compatibility offered by the Spark version change from 3.5 to 4.0.
> This ticket also covers renaming {{subClass}} as well.
> This is a subtask so the changes are in their own PR and easier to review 
> apart from other things.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2024-04-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837292#comment-17837292
 ] 

Nicholas Chammas commented on SPARK-28024:
--

[~cloud_fan] - Given the updated descriptions for Cases 2, 3, and 4, do you 
still consider there to be a problem here? Or shall we just consider this an 
acceptable difference between how Spark and Postgres handle these cases?

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> Spark on {{master}} at commit {{de00ac8a05aedb3a150c8c10f76d1fe5496b1df3}} 
> with {{set spark.sql.ansi.enabled=true;}} as compared to the default behavior 
> on PostgreSQL 16.
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> With ANSI mode enabled, this case is no longer an issue. All 4 of the above 
> statements now yield {{CAST_OVERFLOW}} or {{ARITHMETIC_OVERFLOW}} errors.
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> postgres=# select cast('10e-70' as float), cast('-10e-70' as float);
>  float8 | float8 
> +
>   1e-69 | -1e-69 {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> postgres=# select cast('10e-400' as double precision), cast('-10e-400' as 
> double precision);
> ERROR:  "10e-400" is out of range for type double precision
> LINE 1: select cast('10e-400' as double precision), cast('-10e-400' ...
>                     ^ {code}
> Case 4:
> {code:sql}
> spark-sql (default)> select exp(1.2345678901234E200);
> Infinity
> postgres=# select exp(1.2345678901234E200);
> ERROR:  value overflows numeric format {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47858) Refactoring the structure for DataFrame error context

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47858:
---
Labels: pull-request-available  (was: )

> Refactoring the structure for DataFrame error context
> -
>
> Key: SPARK-47858
> URL: https://issues.apache.org/jira/browse/SPARK-47858
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation for PySpark DataFrame error context could be more 
> flexible by addressing some hacky spots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47858) Refactoring the structure for DataFrame error context

2024-04-15 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-47858:
---

 Summary: Refactoring the structure for DataFrame error context
 Key: SPARK-47858
 URL: https://issues.apache.org/jira/browse/SPARK-47858
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


The current implementation for PySpark DataFrame error context could be more 
flexible by addressing some hacky spots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47356) Add support for ConcatWs & Elt (all collations)

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47356:
---
Labels: pull-request-available  (was: )

> Add support for ConcatWs & Elt (all collations)
> ---
>
> Key: SPARK-47356
> URL: https://issues.apache.org/jira/browse/SPARK-47356
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47356) Add support for ConcatWs & Elt (all collations)

2024-04-15 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47356:
--
Summary: Add support for ConcatWs & Elt (all collations)  (was: ConcatWs & 
Elt (all collations))

> Add support for ConcatWs & Elt (all collations)
> ---
>
> Key: SPARK-47356
> URL: https://issues.apache.org/jira/browse/SPARK-47356
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47420) Fix CollationSupport test output

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47420:
---
Labels: pull-request-available  (was: )

> Fix CollationSupport test output
> 
>
> Key: SPARK-47420
> URL: https://issues.apache.org/jira/browse/SPARK-47420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47412) StringLPad, StringRPad (all collations)

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47412:
---
Labels: pull-request-available  (was: )

> StringLPad, StringRPad (all collations)
> ---
>
> Key: SPARK-47412
> URL: https://issues.apache.org/jira/browse/SPARK-47412
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *StringLPad* & *StringRPad* built-in string 
> functions in Spark. First confirm what is the expected behaviour for these 
> functions when given collated strings, then move on to the implementation 
> that would enable handling strings of all collation types. Implement the 
> corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringLPad* & *StringRPad* 
> functions so that they support all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
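As a rough illustration of the kind of E2E check described above, a minimal sketch follows; the {{collate}} expression and the {{UNICODE_CI}} collation name used here are assumptions, and the actual suite conventions and syntax may differ.

{code:scala}
// Sketch only: exercises lpad/rpad end-to-end on collated input.
// `collate(...)` and the collation name UNICODE_CI are assumptions here.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("lpad-collation-sketch").getOrCreate()

// lpad/rpad should pad to the requested length regardless of the collation of the input.
val row = spark.sql(
  "SELECT lpad(collate('abc', 'UNICODE_CI'), 5, ' ') AS l, " +
  "       rpad(collate('abc', 'UNICODE_CI'), 5, ' ') AS r").collect().head

assert(row.getString(0) == "  abc" && row.getString(1) == "abc  ")
spark.stop()
{code}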



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47856) Oracle: Document Mapping Spark SQL Data Types from Oracle

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47856:
---
Labels: pull-request-available  (was: )

> Oracle: Document Mapping Spark SQL Data Types from Oracle
> -
>
> Key: SPARK-47856
> URL: https://issues.apache.org/jira/browse/SPARK-47856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47356) ConcatWs & Elt (all collations)

2024-04-15 Thread Mihailo Milosevic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837145#comment-17837145
 ] 

Mihailo Milosevic commented on SPARK-47356:
---

Working on this.

> ConcatWs & Elt (all collations)
> ---
>
> Key: SPARK-47356
> URL: https://issues.apache.org/jira/browse/SPARK-47356
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47856) Oracle: Document Mapping Spark SQL Data Types from Oracle

2024-04-15 Thread Kent Yao (Jira)
Kent Yao created SPARK-47856:


 Summary: Oracle: Document Mapping Spark SQL Data Types from Oracle
 Key: SPARK-47856
 URL: https://issues.apache.org/jira/browse/SPARK-47856
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47413) Substring, Right, Left (all collations)

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47413:
--

Assignee: Apache Spark

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
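A similar minimal sketch for Substring, Right and Left over collated input (again assuming the {{collate}} expression and the {{UNICODE_CI}} collation name; both are illustrative):

{code:scala}
// Sketch only: substring/right/left over a collated value; the collation should
// not change which characters are returned.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("substring-collation-sketch").getOrCreate()

val row = spark.sql(
  "SELECT substring(collate('Spark SQL', 'UNICODE_CI'), 1, 5) AS s, " +
  "       right(collate('Spark SQL', 'UNICODE_CI'), 3)        AS r, " +
  "       left(collate('Spark SQL', 'UNICODE_CI'), 5)         AS l").collect().head

assert(row.getString(0) == "Spark" && row.getString(1) == "SQL" && row.getString(2) == "Spark")
spark.stop()
{code}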



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47594) Connector module: Migrate logInfo with variables to structured logging framework

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47594:
--

Assignee: (was: Apache Spark)

> Connector module: Migrate logInfo with variables to structured logging 
> framework
> 
>
> Key: SPARK-47594
> URL: https://issues.apache.org/jira/browse/SPARK-47594
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47413) Substring, Right, Left (all collations)

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47413:
--

Assignee: (was: Apache Spark)

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47566) SubstringIndex

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47566:
--

Assignee: (was: Apache Spark)

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm what is the expected behaviour for these functions when 
> given collated strings, and then move on to implementation and testing. One 
> way to go about this is to consider using {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* functions 
> so that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
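For reference, a small standalone sketch of the ICU _StringSearch_ service mentioned above, using the ICU4J API; the root locale and PRIMARY strength chosen here are illustrative and not necessarily what Spark uses:

{code:scala}
import java.text.StringCharacterIterator
import com.ibm.icu.text.{Collator, RuleBasedCollator, SearchIterator, StringSearch}
import com.ibm.icu.util.ULocale

// Collation-aware substring search: find "apache" inside "www.Apache.ORG"
// ignoring case differences (PRIMARY strength is an illustrative choice).
val collator = Collator.getInstance(ULocale.ROOT).asInstanceOf[RuleBasedCollator]
collator.setStrength(Collator.PRIMARY)

val target  = "www.Apache.ORG"
val pattern = "apache"
val search  = new StringSearch(pattern, new StringCharacterIterator(target), collator)

val idx = search.first() // index of the first collation-aware match, or SearchIterator.DONE
if (idx != SearchIterator.DONE) {
  println(s"matched at offset $idx with length ${search.getMatchLength}")
}
{code}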



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47566) SubstringIndex

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47566:
--

Assignee: Apache Spark

> SubstringIndex
> --
>
> Key: SPARK-47566
> URL: https://issues.apache.org/jira/browse/SPARK-47566
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Milan Dankovic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *SubstringIndex* built-in string function in 
> Spark. First confirm what is the expected behaviour for these functions when 
> given collated strings, and then move on to implementation and testing. One 
> way to go about this is to consider using {_}StringSearch{_}, an efficient 
> ICU service for string matching. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *SubstringIndex* functions 
> so that they support all collation types currently supported in Spark. To 
> understand what changes were introduced in order to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks in this parent (for example: Contains, 
> StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class, as well as _StringSearch_ using the 
> [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47413) Substring, Right, Left (all collations)

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47413:
--

Assignee: (was: Apache Spark)

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47594) Connector module: Migrate logInfo with variables to structured logging framework

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47594:
--

Assignee: (was: Apache Spark)

> Connector module: Migrate logInfo with variables to structured logging 
> framework
> 
>
> Key: SPARK-47594
> URL: https://issues.apache.org/jira/browse/SPARK-47594
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47413) Substring, Right, Left (all collations)

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47413:
--

Assignee: Apache Spark

> Substring, Right, Left (all collations)
> ---
>
> Key: SPARK-47413
> URL: https://issues.apache.org/jira/browse/SPARK-47413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *Substring* built-in string function in 
> Spark (including *Right* and *Left* functions). First confirm what is the 
> expected behaviour for these functions when given collated strings, then move 
> on to the implementation that would enable handling strings of all collation 
> types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how this function should be used with collation in SparkSQL, and feel free to 
> use your chosen Spark SQL Editor to experiment with the existing functions to 
> learn more about how they work. In addition, look into the possible use-cases 
> and implementation of similar functions within other open-source DBMS, 
> such as [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the {*}Substring{*}, 
> {*}Right{*}, and *Left* functions so that they support all collation types 
> currently supported in Spark. To understand what changes were introduced in 
> order to enable full collation support for other existing functions in Spark, 
> take a look at the Spark PRs and Jira tickets for completed tasks in this 
> parent (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47420) Fix CollationSupport test output

2024-04-15 Thread Uroš Bojanić (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47420:
-
Summary: Fix CollationSupport test output  (was: TBD)

> Fix CollationSupport test output
> 
>
> Key: SPARK-47420
> URL: https://issues.apache.org/jira/browse/SPARK-47420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47584) SQL core: Migrate logWarn with variables to structured logging framework

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47584:
---
Labels: pull-request-available  (was: )

> SQL core: Migrate logWarn with variables to structured logging framework
> 
>
> Key: SPARK-47584
> URL: https://issues.apache.org/jira/browse/SPARK-47584
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47855) Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47855:
---
Labels: pull-request-available  (was: )

> Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect
> 
>
> Key: SPARK-47855
> URL: https://issues.apache.org/jira/browse/SPARK-47855
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47854) [PYTHON] Avoid shadowing python built-ins in python function variable naming

2024-04-15 Thread Liu Cao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Cao updated SPARK-47854:

Description: 
Given that spark 4.0.0 is upcoming I wonder if we should at least consider 
renaming certain function variable naming in python. Otherwise, we may need to 
wait another 4 years to do so.

Example

[https://github.com/apache/spark/blob/e6b7950f553cff5adc02b8b5195e79c3c97c/python/pyspark/sql/functions/builtin.py#L12768]

There are 8 uses of `len` and 35 `str` as variable names, both of which are 
python built-ins. Shadowing `str` is somewhat dangerous in that the following 
would be non-sensical. 

 
{code:python}
def foo(str: "ColumnOrName", bar: "ColumnOrName"):
    # str is the parameter now, so it cannot be used as the built-in type here
    bar = lit(bar) if isinstance(bar, str) else bar
{code}
 

 

Now obviously this would be breaking change for user code if the function is 
called with kwargs style. If we rename `str` to `src` or `col`, old code 
calling `foo(str="x", bar="y")` would break; though `foo("x", bar="y")` would 
be fine.

 

Is this change a possibility? Or are we thinking that the kwargs breaking 
change is not enough of a benefit to make?

 

 

 

  was:
Given that spark 4.0.0 is upcoming I wonder if we should at least consider 
renaming certain function variable naming in python. Otherwise, we may need to 
wait another 4 years to do so.

Example

[https://github.com/apache/spark/blob/e6b7950f553cff5adc02b8b5195e79c3c97c/python/pyspark/sql/functions/builtin.py#L12768]

There are 8 uses of `len` and 35 `str` as variable names, both of which are 
python built-ins. Shadowing `str` is somewhat dangerous in that the following 
would be non-sensical. 

 
{code:python}
def foo(str: "ColumnOrName", bar: "ColumnOrName"):
    bar = lit(bar) if isinstance(bar, str) else bar  # str is the parameter now, cannot be used as the type
{code}
 

 

Now obviously this would be breaking change for user code if the function is 
called with kwargs style. If we rename `str` to `src` or `col`, old code 
calling `foo(str="x", bar="y")` would break; though `foo("x", bar="y")` would 
be fine.

 

Is this change a possibility? Or are we thinking that the kwargs breaking 
change is not enough of a benefit to make?

 

 

 


> [PYTHON] Avoid shadowing python built-ins in python function variable naming
> 
>
> Key: SPARK-47854
> URL: https://issues.apache.org/jira/browse/SPARK-47854
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.1, 3.5.0, 3.5.1, 3.3.4
>Reporter: Liu Cao
>Priority: Major
>
> Given that spark 4.0.0 is upcoming I wonder if we should at least consider 
> renaming certain function variable naming in python. Otherwise, we may need 
> to wait another 4 years to do so.
> Example
> [https://github.com/apache/spark/blob/e6b7950f553cff5adc02b8b5195e79c3c97c/python/pyspark/sql/functions/builtin.py#L12768]
> There are 8 uses of `len` and 35 `str` as variable names, both of which are 
> python built-ins. Shadowing `str` is somewhat dangerous in that the following 
> would be non-sensical. 
>  
> {code:python}
> def foo(str: "ColumnOrName", bar: "ColumnOrName"):
>     # str is the parameter now, so it cannot be used as the built-in type here
>     bar = lit(bar) if isinstance(bar, str) else bar
> {code}
>  
>  
> Now obviously this would be breaking change for user code if the function is 
> called with kwargs style. If we rename `str` to `src` or `col`, old code 
> calling `foo(str="x", bar="y")` would break; though `foo("x", bar="y")` would 
> be fine.
>  
> Is this change a possibility? Or are we thinking that the kwargs breaking 
> change is not enough of a benefit to make?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47855) Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` in Connect

2024-04-15 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47855:
-

 Summary: Warn `spark.sql.execution.arrow.pyspark.fallback.enabled` 
in Connect
 Key: SPARK-47855
 URL: https://issues.apache.org/jira/browse/SPARK-47855
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47851) Document pyspark-connect package

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47851.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46054
[https://github.com/apache/spark/pull/46054]

> Document pyspark-connect package
> 
>
> Key: SPARK-47851
> URL: https://issues.apache.org/jira/browse/SPARK-47851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47853) jdbc connect to duckdb with error Unrecognized configuration property "path"

2024-04-15 Thread WeiNan Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WeiNan Zhao updated SPARK-47853:

Description: 
Link issue: [https://github.com/duckdb/duckdb/issues/11651]
 # reproduce python code 

```python

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Example Application") \
        .config("spark.master", "local") \
        .config("spark.jars.packages",
                
"io.delta:delta-core_2.12:2.4.0,org.xerial:sqlite-jdbc:3.45.2.0,org.duckdb:duckdb_jdbc:0.9.2")
 \
        .getOrCreate()

    spark.sql(
        f"""
            create table default.movies  
            using jdbc
            options (url "jdbc:duckdb:database/duckdb.db" , driver 
"org.duckdb.DuckDBDriver" , dbtable "duckdb.main.test");
            """
    )

    spark.sql("select * from default.movies").show()

    spark.stop()

```

2. error log

```

16:28:57    Runtime Error in model movies (models/sources/movies.sql)
  An error occurred while calling o40.sql.
  : java.sql.SQLException: Invalid Input Error: Unrecognized configuration 
property "path"
        at org.duckdb.DuckDBNative.duckdb_jdbc_startup(Native Method)
        at org.duckdb.DuckDBConnection.newConnection(DuckDBConnection.java:48)
        at org.duckdb.DuckDBDriver.connect(DuckDBDriver.java:41)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
    ```

3.  relative spark code

[https://github.com/apache/spark/blob/e6b7950f553cff5adc02b8b5195e79c3c97c/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L64]

Why do we need to replicate the `path` into the JDBC connection?

 

 

  was:
Link issue: _https://github.com/duckdb/duckdb/issues/11651_
 # reproduce python code 

```python

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Example Application") \
        .config("spark.master", "local") \
        .config("spark.jars.packages",
                
"io.delta:delta-core_2.12:2.4.0,org.xerial:sqlite-jdbc:3.45.2.0,org.duckdb:duckdb_jdbc:0.9.2")
 \
        .getOrCreate()

    spark.sql(
        f"""
            create table default.movies  
            using jdbc
            options (url "jdbc:duckdb:database/duckdb.db" , driver 
"org.duckdb.DuckDBDriver" , dbtable "duckdb.main.test");
            """
    )

    spark.sql("select * from default.movies").show()

    spark.stop()

```

2. error log

```

16:28:57    Runtime Error in model movies (models/sources/movies.sql)
  An error occurred while calling o40.sql.
  : java.sql.SQLException: Invalid Input Error: Unrecognized configuration 
property "path"
        at org.duckdb.DuckDBNative.duckdb_jdbc_startup(Native Method)
        at org.duckdb.DuckDBConnection.newConnection(DuckDBConnection.java:48)
        at org.duckdb.DuckDBDriver.connect(DuckDBDriver.java:41)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:123)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:119)
        at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:50)
        at 
org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:506)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:228)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:183)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
        at 

[jira] [Assigned] (SPARK-47851) Document pyspark-connect package

2024-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47851:


Assignee: Hyukjin Kwon

> Document pyspark-connect package
> 
>
> Key: SPARK-47851
> URL: https://issues.apache.org/jira/browse/SPARK-47851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47853) jdbc connect to duckdb with error Unrecognized configuration property "path"

2024-04-15 Thread WeiNan Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WeiNan Zhao updated SPARK-47853:

Description: 
Link issue: _https://github.com/duckdb/duckdb/issues/11651_
 # reproduce python code 

```python

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Example Application") \
        .config("spark.master", "local") \
        .config("spark.jars.packages",
                
"io.delta:delta-core_2.12:2.4.0,org.xerial:sqlite-jdbc:3.45.2.0,org.duckdb:duckdb_jdbc:0.9.2")
 \
        .getOrCreate()

    spark.sql(
        f"""
            create table default.movies  
            using jdbc
            options (url "jdbc:duckdb:database/duckdb.db" , driver 
"org.duckdb.DuckDBDriver" , dbtable "duckdb.main.test");
            """
    )

    spark.sql("select * from default.movies").show()

    spark.stop()

```

2. error log

```

16:28:57    Runtime Error in model movies (models/sources/movies.sql)
  An error occurred while calling o40.sql.
  : java.sql.SQLException: Invalid Input Error: Unrecognized configuration 
property "path"
        at org.duckdb.DuckDBNative.duckdb_jdbc_startup(Native Method)
        at org.duckdb.DuckDBConnection.newConnection(DuckDBConnection.java:48)
        at org.duckdb.DuckDBDriver.connect(DuckDBDriver.java:41)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:123)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:119)
        at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:50)
        at 
org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:506)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:228)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:183)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
        at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
        at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
        at 

[jira] [Updated] (SPARK-47853) jdbc connect to duckdb with error Unrecognized configuration property "path"

2024-04-15 Thread WeiNan Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WeiNan Zhao updated SPARK-47853:

Description: 
# reproduce python code 

```python

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Example Application") \
        .config("spark.master", "local") \
        .config("spark.jars.packages",
                
"io.delta:delta-core_2.12:2.4.0,org.xerial:sqlite-jdbc:3.45.2.0,org.duckdb:duckdb_jdbc:0.9.2")
 \
        .getOrCreate()

    spark.sql(
        f"""
            create table default.movies  
            using jdbc
            options (url "jdbc:duckdb:database/duckdb.db" , driver 
"org.duckdb.DuckDBDriver" , dbtable "duckdb.main.test");
            """
    )

    spark.sql("select * from default.movies").show()

    spark.stop()

```

2. error log

```

16:28:57    Runtime Error in model movies (models/sources/movies.sql)
  An error occurred while calling o40.sql.
  : java.sql.SQLException: Invalid Input Error: Unrecognized configuration 
property "path"
        at org.duckdb.DuckDBNative.duckdb_jdbc_startup(Native Method)
        at org.duckdb.DuckDBConnection.newConnection(DuckDBConnection.java:48)
        at org.duckdb.DuckDBDriver.connect(DuckDBDriver.java:41)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:123)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:119)
        at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:50)
        at 
org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:506)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:228)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:183)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
        at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
        at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
        at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
        at 

[jira] [Created] (SPARK-47853) jdbc connect to duckdb with error Unrecognized configuration property "path"

2024-04-15 Thread WeiNan Zhao (Jira)
WeiNan Zhao created SPARK-47853:
---

 Summary: jdbc connect to duckdb with error Unrecognized 
configuration property "path"
 Key: SPARK-47853
 URL: https://issues.apache.org/jira/browse/SPARK-47853
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0
Reporter: WeiNan Zhao


# reproduce python code 

```python

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Example Application") \
        .config("spark.master", "local") \
        .config("spark.jars.packages",
                
"io.delta:delta-core_2.12:2.4.0,org.xerial:sqlite-jdbc:3.45.2.0,org.duckdb:duckdb_jdbc:0.9.2")
 \
        .getOrCreate()

    spark.sql(
        f"""
            create table default.movies  
            using jdbc
            options (url "jdbc:duckdb:database/duckdb.db" , driver 
"org.duckdb.DuckDBDriver" , dbtable "duckdb.main.test");
            """
    )

    spark.sql("select * from default.movies").show()

    spark.stop()

```

2. error log

```

16:28:57    Runtime Error in model movies (models/sources/movies.sql)
  An error occurred while calling o40.sql.
  : java.sql.SQLException: Invalid Input Error: Unrecognized configuration 
property "path"
        at org.duckdb.DuckDBNative.duckdb_jdbc_startup(Native Method)
        at org.duckdb.DuckDBConnection.newConnection(DuckDBConnection.java:48)
        at org.duckdb.DuckDBDriver.connect(DuckDBDriver.java:41)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
        at 
org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:123)
        at 
org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:119)
        at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:50)
        at 
org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:506)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:228)
        at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:183)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
        at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
  

[jira] [Updated] (SPARK-47760) Re-enable Avro function doctests

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47760:
---
Labels: pull-request-available  (was: )

> Re-enable Avro function doctests
> ---
>
> Key: SPARK-47760
> URL: https://issues.apache.org/jira/browse/SPARK-47760
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47851) Document pyspark-connect package

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47851:
---
Labels: pull-request-available  (was: )

> Document pyspark-connect package
> 
>
> Key: SPARK-47851
> URL: https://issues.apache.org/jira/browse/SPARK-47851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47852) Support DataFrameQueryContext for reverse operations

2024-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47852:
---
Labels: pull-request-available  (was: )

> Support DataFrameQueryContext for reverse operations
> 
>
> Key: SPARK-47852
> URL: https://issues.apache.org/jira/browse/SPARK-47852
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> To improve error message for reverse ops



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47788) Ensure the same hash partitioning scheme/hash function is used across batches

2024-04-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47788.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45971
[https://github.com/apache/spark/pull/45971]

> Ensure the same hash partitioning scheme/hash function is used across batches
> -
>
> Key: SPARK-47788
> URL: https://issues.apache.org/jira/browse/SPARK-47788
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Fanyue Xia
>Assignee: Fanyue Xia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> To really make sure that any changes to the hash function / partitioner in Spark 
> don't cause logical correctness issues in existing running streaming 
> queries, we should add a new unit test to ensure hash function stability is 
> maintained.
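A minimal sketch of such a pinning test; the keys, partition count, and pinned assignments below are placeholders that would be recorded once from a known-good Spark release:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hash, lit, pmod}

// Sketch: record the partition assignment of a fixed set of keys under the
// current hash function, so any future change to hashing/partitioning shows
// up as a diff against the pinned values.
val spark = SparkSession.builder().master("local[1]").appName("hash-stability-sketch").getOrCreate()
import spark.implicits._

val numPartitions = 8
val observed = Seq("user-1", "user-2", "user-3").toDF("key")
  .select($"key", pmod(hash($"key"), lit(numPartitions)).as("part"))
  .collect()
  .map(r => r.getString(0) -> r.getInt(1))
  .toMap

// In the real test the map below would hold values captured from a reference
// release; the numbers here are placeholders, not actual hash results.
val pinned = Map("user-1" -> 0, "user-2" -> 0, "user-3" -> 0)
println(s"observed=$observed pinned=$pinned")
// assert(observed == pinned)  // enable once the pinned values are recorded

spark.stop()
{code}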



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47788) Ensure the same hash partitioning scheme/hash function is used across batches

2024-04-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47788:


Assignee: Fanyue Xia

> Ensure the same hash partitioning scheme/hash function is used across batches
> -
>
> Key: SPARK-47788
> URL: https://issues.apache.org/jira/browse/SPARK-47788
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Fanyue Xia
>Assignee: Fanyue Xia
>Priority: Major
>  Labels: pull-request-available
>
> To really make sure that any changes to the hash function / partitioner in Spark 
> don't cause logical correctness issues in existing running streaming 
> queries, we should add a new unit test to ensure hash function stability is 
> maintained.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47848) Fix thread safe access for loadedMaps in close

2024-04-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47848.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46048
[https://github.com/apache/spark/pull/46048]

> Fix thread safe access for loadedMaps in close
> --
>
> Key: SPARK-47848
> URL: https://issues.apache.org/jira/browse/SPARK-47848
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Fix thread safe access for loadedMaps in close



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org