[jira] [Resolved] (SPARK-47221) Uses AbstractParser instead of CsvParser for CSV parser signature

2024-02-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47221.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45328
[https://github.com/apache/spark/pull/45328]

> Uses AbstractParser instead of CsvParser for CSV parser signature
> -
>
> Key: SPARK-47221
> URL: https://issues.apache.org/jira/browse/SPARK-47221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://github.com/uniVocity/univocity-parsers has been inactive for the last 
> three years, and we can no longer land some bug fixes. We may want to leverage 
> its interface and maintain our own CSV parser; this is the base work for that.
> In any event, we should use the higher-level class if it fits, for better maintainability.
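A hedged sketch of the signature change this describes, assuming the univocity classes CsvParser, CsvParserSettings, and AbstractParser (CsvParser extends AbstractParser<CsvParserSettings>); the method names below are illustrative, not Spark's actual code:

{code:java}
import com.univocity.parsers.common.AbstractParser;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

class ParserSignatureSketch {
  // Before: the signature is tied to the concrete univocity CsvParser.
  static String[] parseLineConcrete(CsvParser parser, String line) {
    return parser.parseLine(line);
  }

  // After: only the abstract parser contract is required, so a different
  // implementation (for example Spark's own parser) could be plugged in later.
  static String[] parseLineAbstract(AbstractParser<CsvParserSettings> parser, String line) {
    return parser.parseLine(line);
  }
}
{code}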



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47223) Update usage of deprecated Thread.getId() to Thread.threadId()

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47223:
---
Labels: pull-request-available  (was: )

> Update usage of deprecated Thread.getId() to Thread.threadId()
> --
>
> Key: SPARK-47223
> URL: https://issues.apache.org/jira/browse/SPARK-47223
> Project: Spark
>  Issue Type: Request
>  Components: Spark Core, SQL
>Affects Versions: 3.5.1
>Reporter: Neil Gupta
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> Update usage of deprecated Thread.getId() to Thread.threadId().
>  
> Currently in Spark there are still multiple references to the deprecated 
> [Thread.getId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#getId()] method, even though the current build uses 
> Java 21. The Java documentation asks that such usages be switched to the 
> [Thread.threadId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#threadId()] method instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47223) Update usage of deprecated Thread.getId() to Thread.threadId()

2024-02-28 Thread Neil Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821925#comment-17821925
 ] 

Neil Gupta commented on SPARK-47223:


I can take a stab at this one myself.

> Update usage of deprecated Thread.getId() to Thread.threadId()
> --
>
> Key: SPARK-47223
> URL: https://issues.apache.org/jira/browse/SPARK-47223
> Project: Spark
>  Issue Type: Request
>  Components: Spark Core, SQL
>Affects Versions: 3.5.1
>Reporter: Neil Gupta
>Priority: Trivial
> Fix For: 3.5.1
>
>
> Update usage of deprecated Thread.getId() to Thread.threadId().
>  
> Currently in Spark there are still multiple references to the deprecated 
> [Thread.getId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#getId()] method, even though the current build uses 
> Java 21. The Java documentation asks that such usages be switched to the 
> [Thread.threadId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#threadId()] method instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47223) Update usage of deprecated Thread.getId() to Thread.threadId()

2024-02-28 Thread Neil Gupta (Jira)
Neil Gupta created SPARK-47223:
--

 Summary: Update usage of deprecated Thread.getId() to 
Thread.threadId()
 Key: SPARK-47223
 URL: https://issues.apache.org/jira/browse/SPARK-47223
 Project: Spark
  Issue Type: Request
  Components: Spark Core, SQL
Affects Versions: 3.5.1
Reporter: Neil Gupta
 Fix For: 3.5.1


Update usage of deprecated Thread.getId() to Thread.threadId().

 

Currently in Spark there are still multiple references to the deprecated 
[Thread.getId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#getId()] 
method, even though the current build uses Java 21. The Java documentation asks 
that such usages be switched to the 
[Thread.threadId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#threadId()] 
method instead.
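For illustration, a minimal hedged sketch of the kind of mechanical change involved (the call site is hypothetical):

{code:java}
// Before: Thread.getId() is deprecated (since Java 19).
long id = Thread.currentThread().getId();

// After: Thread.threadId() returns the same value and is the
// non-deprecated replacement; Spark currently builds on Java 21.
long newId = Thread.currentThread().threadId();
{code}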



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47223) Update usage of deprecated Thread.getId() to Thread.threadId()

2024-02-28 Thread Neil Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Gupta updated SPARK-47223:
---
Description: 
Update usage of deprecated Thread.getId() to Thread.threadId().

 

Currently in Spark there are still multiple references to the deprecated 
[Thread.getId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#getId()] 
method, even though the current build uses Java 21. The Java documentation asks 
that such usages be switched to the 
[Thread.threadId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#threadId()] 
method instead.

  was:
Update usage of deprecated Thread.getId() to Thread.threadId().

 

Currently in Spark there are still multiple references to the deprecated 
[Thread.getId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#getId()] 
method, even though the current build uses Java 21. The Java documentation asks 
that such usages be switched to the 
[Thread.threadId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#threadId()] 
method instead.


> Update usage of deprecated Thread.getId() to Thread.threadId()
> --
>
> Key: SPARK-47223
> URL: https://issues.apache.org/jira/browse/SPARK-47223
> Project: Spark
>  Issue Type: Request
>  Components: Spark Core, SQL
>Affects Versions: 3.5.1
>Reporter: Neil Gupta
>Priority: Trivial
> Fix For: 3.5.1
>
>
> Update usage of deprecated Thread.getId() to Thread.threadId().
>  
> Currently in Spark there are still multiple references to the deprecated 
> [Thread.getId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#getId()] method, even though the current build uses 
> Java 21. The Java documentation asks that such usages be switched to the 
> [Thread.threadId()|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.html#threadId()] method instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47148) Avoid to materialize AQE ExchangeQueryStageExec on the cancellation

2024-02-28 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-47148:
---
Summary: Avoid to materialize AQE ExchangeQueryStageExec on the 
cancellation  (was: Avoid to materialize AQE QueryStages on the cancellation)

> Avoid to materialize AQE ExchangeQueryStageExec on the cancellation
> ---
>
> Key: SPARK-47148
> URL: https://issues.apache.org/jira/browse/SPARK-47148
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 4.0.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: pull-request-available
>
> AQE can materialize a *ShuffleQueryStage* during cancellation. This causes 
> unnecessary stage materialization by submitting a shuffle job. Under normal 
> circumstances, if the stage has not been materialized yet (i.e. 
> ShuffleQueryStage.shuffleFuture is not initialized), it should simply be 
> skipped without materializing it.
> A sample use case:
> *1- Stage Materialization Steps:*
> When stage materialization fails:
> {code:java}
> 1.1- ShuffleQueryStage1 - is materialized successfully,
> 1.2- ShuffleQueryStage2 - materialization fails,
> 1.3- ShuffleQueryStage3 - not materialized yet, so 
> ShuffleQueryStage3.shuffleFuture is not initialized yet{code}
> *2- Stage Cancellation Steps:*
> {code:java}
> 2.1- ShuffleQueryStage1 - is canceled because it is already materialized,
> 2.2- ShuffleQueryStage2 - is an earlyFailedStage, so it is currently skipped by 
> default by AQE because it could not be materialized,
> 2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, 
> but cancellation is still attempted, and cancelling it currently requires 
> materializing it first.{code}
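A hedged, self-contained sketch of the intended guard (illustrative names only, not the actual Spark AQE API): only cancel a stage whose shuffle future already exists, so the cancellation path never materializes the stage itself.

{code:java}
import java.util.concurrent.CompletableFuture;

class StageCancellationSketch {
  static class ShuffleQueryStage {
    CompletableFuture<Void> shuffleFuture;        // null until materialization starts
    void cancel() { shuffleFuture.cancel(true); } // only valid once materialized
  }

  static void cancelIfMaterialized(ShuffleQueryStage stage) {
    if (stage.shuffleFuture == null) {
      return;       // case 2.3 above: never materialized, skip instead of materializing to cancel
    }
    stage.cancel(); // case 2.1 above: already materialized, safe to cancel
  }
}
{code}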



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47222) fileCompressionFactor should be applied to the size of the table

2024-02-28 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821923#comment-17821923
 ] 

Yuming Wang commented on SPARK-47222:
-

https://github.com/apache/spark/pull/45329

> fileCompressionFactor should be applied to the size of the table
> 
>
> Key: SPARK-47222
> URL: https://issues.apache.org/jira/browse/SPARK-47222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47222) fileCompressionFactor should be applied to the size of the table

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47222:
---
Labels: pull-request-available  (was: )

> fileCompressionFactor should be applied to the size of the table
> 
>
> Key: SPARK-47222
> URL: https://issues.apache.org/jira/browse/SPARK-47222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47222) fileCompressionFactor should be applied to the size of the table

2024-02-28 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-47222:
---

 Summary: fileCompressionFactor should be applied to the size of 
the table
 Key: SPARK-47222
 URL: https://issues.apache.org/jira/browse/SPARK-47222
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47221) Uses AbstractParser instead of CsvParser for CSV parser signature

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47221:
---
Labels: pull-request-available  (was: )

> Uses AbstractParser instead of CsvParser for CSV parser signature
> -
>
> Key: SPARK-47221
> URL: https://issues.apache.org/jira/browse/SPARK-47221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> https://github.com/uniVocity/univocity-parsers has been inactive for the last 
> three years, and we can no longer land some bug fixes. We may want to leverage 
> its interface and maintain our own CSV parser; this is the base work for that.
> In any event, we should use the higher-level class if it fits, for better maintainability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47146) Possible thread leak when doing sort merge join

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47146:
---
Labels: pull-request-available  (was: )

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
>
> I have a long-running Spark job and stumbled upon an executor taking up a lot of 
> threads, leaving no threads available on the server. Querying thread details via 
> jstack shows a large number of threads named read-ahead. Checking the code confirms 
> that these threads are created by ReadAheadInputStream. This class creates a 
> single-threaded thread pool when it is initialized:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is shut down by ReadAheadInputStream#close(). 
> The call stack for the normal close() path is:
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once the 
> data in the stream has been read all the way through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> iterator in order
>   // to avoid performance overhead. This check is added here in `loadNext()` 
> instead of in
>   // `hasNext()` because it's technically possible 

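One hedged way to guard against this kind of leak, sketched below under the assumption that the spill reader's stream should be closed when the task ends even if the join never drains it; this is an illustration, not necessarily the fix taken in the linked pull request:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.apache.spark.TaskContext;
import org.apache.spark.io.ReadAheadInputStream;
import org.apache.spark.util.TaskCompletionListener;

class ReadAheadCleanupSketch {
  static InputStream openWithTaskCleanup(InputStream spillStream, int bufferSize) {
    ReadAheadInputStream in = new ReadAheadInputStream(spillStream, bufferSize);
    TaskContext ctx = TaskContext.get();
    if (ctx != null) {
      // Close the stream (and shut down its single-threaded "read-ahead" pool) when
      // the task completes, regardless of how much of the spill file was consumed.
      ctx.addTaskCompletionListener(new TaskCompletionListener() {
        @Override
        public void onTaskCompletion(TaskContext context) {
          try {
            in.close();
          } catch (IOException e) {
            // best-effort cleanup
          }
        }
      });
    }
    return in;
  }
}
{code}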
[jira] [Commented] (SPARK-46762) Spark Connect 3.5 Classloading issue with external jar

2024-02-28 Thread Zhen Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821918#comment-17821918
 ] 

Zhen Li commented on SPARK-46762:
-

[~tenstriker] Can you provide more info to reproduce your error? The classloading 
problem is a bit hard to debug. It would be helpful if you could give us a command 
or a test that reproduces the error.

> Spark Connect 3.5 Classloading issue with external jar
> --
>
> Key: SPARK-46762
> URL: https://issues.apache.org/jira/browse/SPARK-46762
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: nirav patel
>Priority: Major
> Attachments: Screenshot 2024-02-22 at 2.04.37 PM.png, Screenshot 
> 2024-02-22 at 2.04.49 PM.png
>
>
> We are seeing the following `java.lang.ClassCastException` in Spark executors 
> when using Spark Connect 3.5 with an external Spark SQL catalog jar, 
> iceberg-spark-runtime-3.5_2.12-1.4.3.jar.
> We also set "spark.executor.userClassPathFirst=true"; otherwise the child class 
> gets loaded by MutableClassLoader and the parent class by ChildFirstURLClassLoader, 
> which causes a ClassCastException as well.
>  
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3) (spark35-m.c.mycomp-dev-test.internal executor 2): 
> java.lang.ClassCastException: class 
> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
> class org.apache.iceberg.Table 
> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed 
> module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
> org.apache.iceberg.Table is in unnamed module of loader 
> org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
>     at 
> org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>     at 
> org.apache.iceberg.spark.source.RowDataReader.(RowDataReader.java:50)
>     at 
> org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>     at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>     at org.apach...{code}
>  
> `org.apache.iceberg.spark.source.SerializableTableWithSize` is a child of 
> `org.apache.iceberg.Table` and they are both in only one jar  
> `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` 
> We verified that there's only one jar of 
> `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when spark-connect server 
> is started. 
> Looking further into the error, it seems the classloader itself is instantiated 
> multiple times somewhere. I can see two instances: 
> org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and 
> org.apache.spark.util.ChildFirstURLClassLoader @4b18b943 
>  
> *Affected version:*
> Spark 3.5 with spark-connect_2.12:3.5.0
>  
> *Not affected versions and variations:*
> Spark 3.4 with spark-connect_2.12:3.4.0 works fine with the external jar.
> It also works with the plain Spark 3.5 spark-submit script directly (i.e. without 
> using spark-connect 3.5).
>  
> An issue has been opened with Iceberg as well: 
> [https://github.com/apache/iceberg/issues/8978]
> It has also been discussed on the Iceberg dev list: 
> [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1]
>  
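A hedged, self-contained illustration of the failure mode described above: the same class name loaded by two different ChildFirstURLClassLoader instances yields two distinct runtime types, so casting between them throws ClassCastException (the jar path is hypothetical):

{code:java}
import java.net.URL;
import org.apache.spark.util.ChildFirstURLClassLoader;

class TwoLoadersSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical local path to the Iceberg runtime jar.
    URL[] jars = { new URL("file:/tmp/iceberg-spark-runtime-3.5_2.12-1.4.3.jar") };
    ClassLoader parent = TwoLoadersSketch.class.getClassLoader();

    ChildFirstURLClassLoader loaderA = new ChildFirstURLClassLoader(jars, parent);
    ChildFirstURLClassLoader loaderB = new ChildFirstURLClassLoader(jars, parent);

    Class<?> tableA = loaderA.loadClass("org.apache.iceberg.Table");
    Class<?> tableB = loaderB.loadClass("org.apache.iceberg.Table");

    // Same fully-qualified name, different defining loaders => different types.
    System.out.println(tableA == tableB);                // false
    System.out.println(tableA.isAssignableFrom(tableB)); // false, so casts fail
  }
}
{code}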



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Updated] (SPARK-47078) Documentation for SparkSession-based Profilers

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47078:
---
Labels: pull-request-available  (was: )

> Documentation for SparkSession-based Profilers
> --
>
> Key: SPARK-47078
> URL: https://issues.apache.org/jira/browse/SPARK-47078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43744) Spark Connect scala UDF serialization pulling in unrelated classes not available on server

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43744:
---
Labels: SPARK-43745 pull-request-available  (was: SPARK-43745)

> Spark Connect scala UDF serialization pulling in unrelated classes not 
> available on server
> --
>
> Key: SPARK-43744
> URL: https://issues.apache.org/jira/browse/SPARK-43744
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Assignee: Zhen Li
>Priority: Major
>  Labels: SPARK-43745, pull-request-available
> Fix For: 3.5.0
>
>
> [https://github.com/apache/spark/pull/41487] moved the "interrupt all - 
> background queries, foreground interrupt" and "interrupt all - foreground 
> queries, background interrupt" tests from ClientE2ETestSuite into a new, 
> isolated suite, SparkSessionE2ESuite, to avoid an inexplicable UDF 
> serialization issue.
>  
> When these tests are moved back to ClientE2ETestSuite and when testing with
> {code:java}
> build/mvn clean install -DskipTests -Phive
> build/mvn test -pl connector/connect/client/jvm -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite{code}
>  
> the tests fail with
> {code:java}
> 23/05/22 15:44:11 ERROR SparkConnectService: Error during: execute. UserId: . 
> SessionId: 0f4013ca-3af9-443b-a0e5-e339a827e0cf.
> java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/connect/client/SparkResult
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.getDeclaredMethod(Class.java:2128)
> at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1643)
> at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:79)
> at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:520)
> at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.ObjectStreamClass.(ObjectStreamClass.java:494)
> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
> at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1815)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1640)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
> at org.apache.spark.util.Utils$.deserialize(Utils.scala:148)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.org$apache$spark$sql$connect$planner$SparkConnectPlanner$$unpackUdf(SparkConnectPlanner.scala:1353)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner$TypedScalaUdf$.apply(SparkConnectPlanner.scala:761)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformTypedMapPartitions(SparkConnectPlanner.scala:531)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformMapPartitions(SparkConnectPlanner.scala:495)
> at 
> org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:143)
> at 
> org.apache.spark.sql.connect.service.SparkConnectStreamHandler.handlePlan(SparkConnectStreamHandler.scala:100)
> at 
> org.apache.spark.sql.connect.service.SparkConnectStreamHandler.$anonfun$handle$2(SparkConnectStreamHandler.scala:87)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at 

[jira] [Commented] (SPARK-46762) Spark Connect 3.5 Classloading issue with external jar

2024-02-28 Thread nirav patel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821911#comment-17821911
 ] 

nirav patel commented on SPARK-46762:
-

[~zhenli] Is this related to the following?

 

[https://github.com/apache/spark/pull/42069]

[https://github.com/apache/spark/commit/6d0fed9a18ff87e73fdf1ee46b6b0d2df8dd5a1b#diff-53329bc2a642e88cff40148f7788a440de6eebfcc5e74c67c0c1dd9ccb46b0e7]

https://issues.apache.org/jira/browse/SPARK-43744

 

> Spark Connect 3.5 Classloading issue with external jar
> --
>
> Key: SPARK-46762
> URL: https://issues.apache.org/jira/browse/SPARK-46762
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: nirav patel
>Priority: Major
> Attachments: Screenshot 2024-02-22 at 2.04.37 PM.png, Screenshot 
> 2024-02-22 at 2.04.49 PM.png
>
>
> We are seeing the following `java.lang.ClassCastException` in Spark executors 
> when using Spark Connect 3.5 with an external Spark SQL catalog jar, 
> iceberg-spark-runtime-3.5_2.12-1.4.3.jar.
> We also set "spark.executor.userClassPathFirst=true"; otherwise the child class 
> gets loaded by MutableClassLoader and the parent class by ChildFirstURLClassLoader, 
> which causes a ClassCastException as well.
>  
> {code:java}
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3) (spark35-m.c.mycomp-dev-test.internal executor 2): 
> java.lang.ClassCastException: class 
> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to 
> class org.apache.iceberg.Table 
> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed 
> module of loader org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053; 
> org.apache.iceberg.Table is in unnamed module of loader 
> org.apache.spark.util.ChildFirstURLClassLoader @4b18b943)
>     at 
> org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>     at 
> org.apache.iceberg.spark.source.RowDataReader.(RowDataReader.java:50)
>     at 
> org.apache.iceberg.spark.source.SparkRowReaderFactory.createReader(SparkRowReaderFactory.java:45)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>     at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>     at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>     at org.apach...{code}
>  
> `org.apache.iceberg.spark.source.SerializableTableWithSize` is a child of 
> `org.apache.iceberg.Table` and they are both in only one jar  
> `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` 
> We verified that there's only one jar of 
> `iceberg-spark-runtime-3.5_2.12-1.4.3.jar` loaded when spark-connect server 
> is started. 
> Looking further into the error, it seems the classloader itself is instantiated 
> multiple times somewhere. I can see two instances: 
> org.apache.spark.util.ChildFirstURLClassLoader @5e7ae053 and 
> org.apache.spark.util.ChildFirstURLClassLoader @4b18b943 
>  
> *Affected version:*
> Spark 3.5 with spark-connect_2.12:3.5.0
>  
> *Not affected versions and variations:*
> Spark 3.4 with spark-connect_2.12:3.4.0 works fine with the external jar.
> It also works with the plain Spark 3.5 spark-submit script directly (i.e. without 
> using spark-connect 3.5).
>  
> An issue has been opened with Iceberg as well: 
> [https://github.com/apache/iceberg/issues/8978]
> It has also been discussed on the Iceberg dev list: 
> [https://lists.apache.org/thread/5q1pdqqrd1h06hgs8vx9ztt60z5yv8n1]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1

2024-02-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-47206:

Fix Version/s: 3.5.1
   (was: 4.0.0)

> Add official image Dockerfile for Apache Spark 3.5.1
> 
>
> Key: SPARK-47206
> URL: https://issues.apache.org/jira/browse/SPARK-47206
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.5.1
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1

2024-02-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang reassigned SPARK-47206:
---

Assignee: Yikun Jiang

> Add official image Dockerfile for Apache Spark 3.5.1
> 
>
> Key: SPARK-47206
> URL: https://issues.apache.org/jira/browse/SPARK-47206
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.5.1
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47206) Add official image Dockerfile for Apache Spark 3.5.1

2024-02-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang resolved SPARK-47206.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 59
[https://github.com/apache/spark-docker/pull/59]

> Add official image Dockerfile for Apache Spark 3.5.1
> 
>
> Key: SPARK-47206
> URL: https://issues.apache.org/jira/browse/SPARK-47206
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Docker
>Affects Versions: 3.5.1
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47220) log4j race condition during shutdown

2024-02-28 Thread Holden Karau (Jira)
Holden Karau created SPARK-47220:


 Summary: log4j race condition during shutdown
 Key: SPARK-47220
 URL: https://issues.apache.org/jira/browse/SPARK-47220
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Holden Karau
Assignee: Holden Karau


There is a race condition during shutdown which can result in a few different 
errors:
 * ERROR Attempted to append to non-started appender
 * ERROR Unable to write to stream

Since I've only seen it during stop() triggered within a shutdown hook, I 
believe this is caused by the parallel execution of shutdown hooks (see 
[https://stackoverflow.com/questions/17400136/how-to-log-within-shutdown-hooks-with-log4j2]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47214) Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments

2024-02-28 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-47214.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45319
[https://github.com/apache/spark/pull/45319]

> Create API for 'analyze' method to differentiate constant NULL arguments and 
> other types of arguments
> -
>
> Key: SPARK-47214
> URL: https://issues.apache.org/jira/browse/SPARK-47214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47214) Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments

2024-02-28 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-47214:
-

Assignee: Daniel

> Create API for 'analyze' method to differentiate constant NULL arguments and 
> other types of arguments
> -
>
> Key: SPARK-47214
> URL: https://issues.apache.org/jira/browse/SPARK-47214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47218) XML: Ignore commented Row Tags in XML tokenizer

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47218:
---
Labels: pull-request-available  (was: )

> XML: Ignore commented Row Tags in XML tokenizer
> ---
>
> Key: SPARK-47218
> URL: https://issues.apache.org/jira/browse/SPARK-47218
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>  Labels: pull-request-available
>
> The following returns rows that were inside comments:
> {code:java}
> // BUG: rowTag in comment -- incorrectly processed 
> display(spark.read.xml(write(""" 1 
>  """))){code}
> This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
> databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]
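A hedged, self-contained illustration of the reported behaviour, assuming the built-in XML source and its rowTag option (the file contents and path are made up):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class CommentedRowTagSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("commented-rowtag-sketch")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical file /tmp/commented-row.xml containing:
    //   <ROWSET>
    //     <ROW><a>1</a></ROW>
    //     <!-- <ROW><a>2</a></ROW> -->
    //   </ROWSET>
    Dataset<Row> df = spark.read()
        .format("xml")
        .option("rowTag", "ROW")
        .load("/tmp/commented-row.xml");

    // Expected: 1 row (the commented-out <ROW> should be ignored).
    // Reported bug: the row inside the XML comment is also returned.
    df.show();
  }
}
{code}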



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47135) Implement error classes for Kafka data loss exceptions

2024-02-28 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47135.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45221
[https://github.com/apache/spark/pull/45221]

> Implement error classes for Kafka data loss exceptions 
> ---
>
> Key: SPARK-47135
> URL: https://issues.apache.org/jira/browse/SPARK-47135
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: B. Micheal Okutubo
>Assignee: B. Micheal Okutubo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In the Kafka connector code, there are several places that throw a Java 
> *IllegalStateException* to report data loss while reading from Kafka. We want 
> to properly classify those exceptions using the new error framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47218) XML: Ignore commented Row Tags in XML tokenizer

2024-02-28 Thread Yousof Hosny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yousof Hosny updated SPARK-47218:
-
Summary: XML: Ignore commented Row Tags in XML tokenizer  (was: XML: Ignore 
commented row tags in XML tokenizer)

> XML: Ignore commented Row Tags in XML tokenizer
> ---
>
> Key: SPARK-47218
> URL: https://issues.apache.org/jira/browse/SPARK-47218
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>
> The following returns rows that were inside comments:
> {code:java}
> // BUG: rowTag in comment -- incorrectly processed 
> display(spark.read.xml(write(""" 1 
>  """))){code}
> This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
> databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47135) Implement error classes for Kafka data loss exceptions

2024-02-28 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47135:


Assignee: B. Micheal Okutubo

> Implement error classes for Kafka data loss exceptions 
> ---
>
> Key: SPARK-47135
> URL: https://issues.apache.org/jira/browse/SPARK-47135
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: B. Micheal Okutubo
>Assignee: B. Micheal Okutubo
>Priority: Major
>  Labels: pull-request-available
>
> In the Kafka connector code, there are several places that throw a Java 
> *IllegalStateException* to report data loss while reading from Kafka. We want 
> to properly classify those exceptions using the new error framework.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47218) XML: Ignore commented row tags in XML tokenizer

2024-02-28 Thread Yousof Hosny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yousof Hosny updated SPARK-47218:
-
Summary: XML: Ignore commented row tags in XML tokenizer  (was: XML: Skip 
rowTag in a comment)

> XML: Ignore commented row tags in XML tokenizer
> ---
>
> Key: SPARK-47218
> URL: https://issues.apache.org/jira/browse/SPARK-47218
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>
> The following returns rows that were inside comments:
> {code:java}
> // BUG: rowTag in comment -- incorrectly processed 
> display(spark.read.xml(write(""" 1 
>  """))){code}
> This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
> databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47219) XML: Ignore commented row tags in XML tokenizer

2024-02-28 Thread Sandip Agarwala (Jira)
Sandip Agarwala created SPARK-47219:
---

 Summary: XML: Ignore commented row tags in XML tokenizer
 Key: SPARK-47219
 URL: https://issues.apache.org/jira/browse/SPARK-47219
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Sandip Agarwala






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47215) Reduce the number of required threads in MasterSuite

2024-02-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47215.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45320
[https://github.com/apache/spark/pull/45320]

> Reduce the number of required threads in MasterSuite
> 
>
> Key: SPARK-47215
> URL: https://issues.apache.org/jira/browse/SPARK-47215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47215) Reduce the number of required threads in MasterSuite

2024-02-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47215:
-

Assignee: Dongjoon Hyun

> Reduce the number of required threads in MasterSuite
> 
>
> Key: SPARK-47215
> URL: https://issues.apache.org/jira/browse/SPARK-47215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44265) Built-in XML data source support

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44265:
---
Labels: pull-request-available  (was: )

> Built-in XML data source support
> 
>
> Key: SPARK-44265
> URL: https://issues.apache.org/jira/browse/SPARK-44265
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Critical
>  Labels: pull-request-available
>
> XML is a widely used data format. An external spark-xml package 
> ([https://github.com/databricks/spark-xml]) is available to read and write 
> XML data in Spark. Making spark-xml built-in will provide a better user 
> experience for Spark SQL and Structured Streaming. The proposal is to inline 
> the code from the spark-xml package.
>  
> Here is the link to the 
> [SPIP|https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47218) XML: Skip rowTag in a comment

2024-02-28 Thread Yousof Hosny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yousof Hosny updated SPARK-47218:
-
Description: 
The following returns rows that were inside comments:
```

{{// BUG: rowTag in comment -- incorrectly processed
display(spark.read.xml(write(""" 1 
 """)))}}

```

This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]

  was:
The following returns rows that were inside comments:
display(spark.read.xml(write(""" 1 
 """)))
 
This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]


> XML: Skip rowTag in a comment
> -
>
> Key: SPARK-47218
> URL: https://issues.apache.org/jira/browse/SPARK-47218
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>
> The following returns rows that were inside comments:
> ```
> {{// BUG: rowTag in comment -- incorrectly processed
> display(spark.read.xml(write(""" 1 
>  """)))}}
> ```
> This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
> databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47218) XML: Skip rowTag in a comment

2024-02-28 Thread Yousof Hosny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yousof Hosny updated SPARK-47218:
-
Description: 
The following returns rows that were inside comments:

{code:java}
// BUG: rowTag in comment -- incorrectly processed 
display(spark.read.xml(write(""" 1 
 """))){code}

This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]

  was:
The following returns rows that were inside comments:
```

{{// BUG: rowTag in comment -- incorrectly processed
display(spark.read.xml(write(""" 1 
 """)))}}

```

This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]


> XML: Skip rowTag in a comment
> -
>
> Key: SPARK-47218
> URL: https://issues.apache.org/jira/browse/SPARK-47218
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>
> The following returns rows that were inside comments:
> {code:java}
> // BUG: rowTag in comment -- incorrectly processed 
> display(spark.read.xml(write(""" 1 
>  """))){code}
> This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
> databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47218) XML: Skip rowTag in a comment

2024-02-28 Thread Yousof Hosny (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yousof Hosny updated SPARK-47218:
-
Summary: XML: Skip rowTag in a comment  (was: XML: Skip rowTag in a 
comment,)

> XML: Skip rowTag in a comment
> -
>
> Key: SPARK-47218
> URL: https://issues.apache.org/jira/browse/SPARK-47218
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>
> The following returns rows that were inside comments:
> display(spark.read.xml(write(""" 1 
>  """)))
>  
> This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
> databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47218) XML: Skip rowTag in a comment,

2024-02-28 Thread Yousof Hosny (Jira)
Yousof Hosny created SPARK-47218:


 Summary: XML: Skip rowTag in a comment,
 Key: SPARK-47218
 URL: https://issues.apache.org/jira/browse/SPARK-47218
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yousof Hosny


The following returns rows that were inside comments:
display(spark.read.xml(write(""" 1 
 """)))
 
This has been reported before: [How to Ignore XML comments like this · Issue #208 · 
databricks/spark-xml|https://github.com/databricks/spark-xml/issues/208]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure

2024-02-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47217:
-
Description: 
In some flavours of nested joins involving repetition of a relation, the 
projected columns, when passed to the DataFrame.select API in the form of 
df.column, can result in a plan resolution failure because attribute resolution 
does not happen.

A scenario in which this happens is
{noformat}
  Project ( dataframeA.column("col-a") )
          |
        Join2
       /     \
    Join1     DataFrame A
    /     \
DataFrame A   DataFrame B
{noformat}
In such cases, if the right leg of Join2 (DataFrame A) gets re-aliased due to 
de-duplication of relations, and the project uses a Column definition obtained 
from the original DataFrame A, its exprId will not match the re-aliased 
DataFrame A under Join2, causing a resolution failure.

  was:
In some flavours of self-join queries or nested joins involving repetition of 
a relation, the projected columns, when passed to the DataFrame.select API in 
the form of df.column, can result in a plan resolution failure because 
attribute resolution does not happen.

A scenario in which this happens is
{noformat}
  Project ( dataframeA.column("col-a") )
          |
        Join2
       /     \
    Join1     DataFrame A
    /     \
DataFrame A   DataFrame B
{noformat}
In such cases, if the right leg of Join2 (DataFrame A) gets re-aliased due to 
de-duplication of relations, and the project uses a Column definition obtained 
from the original DataFrame A, its exprId will not match the re-aliased 
DataFrame A under Join2, causing a resolution failure.


> De-duplication of Relations in Joins, can result in plan resolution failure
> ---
>
> Key: SPARK-47217
> URL: https://issues.apache.org/jira/browse/SPARK-47217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: Spark-SQL
>
> In some flavours of nested joins involving repetition of a relation, the 
> projected columns, when passed to the DataFrame.select API in the form of 
> df.column, can result in a plan resolution failure because attribute 
> resolution does not happen.
> A scenario in which this happens is
> {noformat}
>   Project ( dataframeA.column("col-a") )
>           |
>         Join2
>        /     \
>     Join1     DataFrame A
>     /     \
> DataFrame A   DataFrame B
> {noformat}
> In such cases, if the right leg of Join2 (DataFrame A) gets re-aliased due to 
> de-duplication of relations, and the project uses a Column definition obtained 
> from the original DataFrame A, its exprId will not match the re-aliased 
> DataFrame A under Join2, causing a resolution failure.
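A hedged, self-contained sketch of the reported pattern (column and table names are made up): dfA appears twice in the nested join, and the final select() uses a Column captured from the original dfA, whose exprId may no longer match the re-aliased copy.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class DedupJoinSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dedup-join-sketch")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> dfA = spark.range(10).toDF("col_a");
    Dataset<Row> dfB = spark.range(10).toDF("col_b");

    // Join1: dfA joined with dfB.
    Dataset<Row> join1 = dfA.join(dfB, dfA.col("col_a").equalTo(dfB.col("col_b")));
    // Join2: the result joined with dfA again, so dfA is repeated in the plan.
    Dataset<Row> join2 = join1.join(dfA, join1.col("col_b").equalTo(dfA.col("col_a")));

    // De-duplication may re-alias the second dfA; dfA.col("col_a") then refers to an
    // attribute that may no longer exist in join2's output, and resolution can fail.
    join2.select(dfA.col("col_a")).show();
  }
}
{code}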



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47216) Refine layout of SQL performance tuning page

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47216:
---
Labels: pull-request-available  (was: )

> Refine layout of SQL performance tuning page
> 
>
> Key: SPARK-47216
> URL: https://issues.apache.org/jira/browse/SPARK-47216
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure

2024-02-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47217:
-
Description: 
In case of some flavours of self join queries or nested joins involving 
repetition of relation, the projected columns when passed to the 
DataFrame.select API , as form of df.column , can result in plan resolution 
failure due to attribute resolution not happening.

A scenario in which this happens is
{noformat}
   
  Project ( dataframe A.column("col-a") )
 |
  Join2
  || 
   Join1  DataFrame A  
  |
 DataFrame ADataFrame B

{noformat}
In such cases, If it so happens that Join2 - right leg DataFrame A gets 
re-aliased due to De-Duplication of relations, and if the project uses Column 
definition obtained from DataFrame A, its exprId will not match the re-aliased 
Join2 - right Leg- DataFrame A , causing resolution failure.

  was:
In case of some flavours of nested self join queries,  the projected columns 
when passed to the DataFrame.select API ,  as form of df.column ,  can result 
in plan resolution failure due to attribute resolution not happening.

A scenario in which this happens is
 
{noformat}
   
  Project ( dataframe A.column("col-a") )
 |
  Join2
  || 
   Join1  DataFrame A  
  |
 DataFrame ADataFrame B

{noformat}


In such cases, If it so happens that  Join2 - right leg DataFrame A gets 
re-aliased due to De-Duplication of relations,  and if the project uses Column 
definition obtained from DataFrame A, its exprId will not match the re-aliased  
Join2  - right Leg- DataFrame A , causing resolution failure.


> De-duplication of Relations in Joins, can result in plan resolution failure
> ---
>
> Key: SPARK-47217
> URL: https://issues.apache.org/jira/browse/SPARK-47217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: Spark-SQL
>
> In case of some flavours of self join queries or nested joins involving 
> repetition of relation, the projected columns when passed to the 
> DataFrame.select API , as form of df.column , can result in plan resolution 
> failure due to attribute resolution not happening.
> A scenario in which this happens is
> {noformat}
>
>   Project ( dataframe A.column("col-a") )
>  |
>   Join2
>   || 
>Join1  DataFrame A  
>   |
>  DataFrame ADataFrame B
> {noformat}
> In such cases, If it so happens that Join2 - right leg DataFrame A gets 
> re-aliased due to De-Duplication of relations, and if the project uses Column 
> definition obtained from DataFrame A, its exprId will not match the 
> re-aliased Join2 - right Leg- DataFrame A , causing resolution failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure

2024-02-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47217:
-
Description: 
In case of some flavours of nested self join queries,  the projected columns 
when passed to the DataFrame.select API ,  as form of df.column ,  can result 
in plan resolution failure due to attribute resolution not happening.

A scenario in which this happens is
 
{noformat}
   
  Project ( dataframe A.column("col-a") )
 |
  Join2
  || 
   Join1  DataFrame A  
  |
 DataFrame ADataFrame B

{noformat}


In such cases, If it so happens that  Join2 - right leg DataFrame A gets 
re-aliased due to De-Duplication of relations,  and if the project uses Column 
definition obtained from DataFrame A, its exprId will not match the re-aliased  
Join2  - right Leg- DataFrame A , causing resolution failure.

  was:
In case of some flavours of nested self join queries,  the projected columns 
when passed to the DataFrame.select API ,  as form of df.column ,  can result 
in plan resolution failure due to attribute resolution not happening.

A scenario in which this happens is

   Project ( dataframe A.column("col-a") )
 |
  Join2
  |DataFrame A  
   Join1
  |
DataFrame ADataFrame B


In such cases, If it so happens that  Join2 - right leg DataFrame A gets 
re-aliased due to De-Duplication of relations,  and if the project uses Column 
definition obtained from DataFrame A, its exprId will not match the re-aliased  
Join2  - right Leg- DataFrame A , causing resolution failure.


> De-duplication of Relations in Joins, can result in plan resolution failure
> ---
>
> Key: SPARK-47217
> URL: https://issues.apache.org/jira/browse/SPARK-47217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: Spark-SQL
>
> In case of some flavours of nested self join queries,  the projected columns 
> when passed to the DataFrame.select API ,  as form of df.column ,  can result 
> in plan resolution failure due to attribute resolution not happening.
> A scenario in which this happens is
>  
> {noformat}
>
>   Project ( dataframe A.column("col-a") )
>  |
>   Join2
>   || 
>Join1  DataFrame A  
>   |
>  DataFrame ADataFrame B
> {noformat}
> In such cases, If it so happens that  Join2 - right leg DataFrame A gets 
> re-aliased due to De-Duplication of relations,  and if the project uses 
> Column definition obtained from DataFrame A, its exprId will not match the 
> re-aliased  Join2  - right Leg- DataFrame A , causing resolution failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47217) De-duplication of Relations in Joins, can result in plan resolution failure

2024-02-28 Thread Asif (Jira)
Asif created SPARK-47217:


 Summary: De-duplication of Relations in Joins, can result in plan 
resolution failure
 Key: SPARK-47217
 URL: https://issues.apache.org/jira/browse/SPARK-47217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1
Reporter: Asif


In case of some flavours of nested self join queries,  the projected columns 
when passed to the DataFrame.select API ,  as form of df.column ,  can result 
in plan resolution failure due to attribute resolution not happening.

A scenario in which this happens is

   Project ( dataframe A.column("col-a") )
 |
  Join2
  |DataFrame A  
   Join1
  |
DataFrame ADataFrame B


In such cases, If it so happens that  Join2 - right leg DataFrame A gets 
re-aliased due to De-Duplication of relations,  and if the project uses Column 
definition obtained from DataFrame A, its exprId will not match the re-aliased  
Join2  - right Leg- DataFrame A , causing resolution failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47216) Refine layout of SQL performance tuning page

2024-02-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47216:


 Summary: Refine layout of SQL performance tuning page
 Key: SPARK-47216
 URL: https://issues.apache.org/jira/browse/SPARK-47216
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47215) Reduce the number of required threads in MasterSuite

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47215:
---
Labels: pull-request-available  (was: )

> Reduce the number of required threads in MasterSuite
> 
>
> Key: SPARK-47215
> URL: https://issues.apache.org/jira/browse/SPARK-47215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47214) Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47214:
---
Labels: pull-request-available  (was: )

> Create API for 'analyze' method to differentiate constant NULL arguments and 
> other types of arguments
> -
>
> Key: SPARK-47214
> URL: https://issues.apache.org/jira/browse/SPARK-47214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47215) Reduce the number of required threads in MasterSuite

2024-02-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47215:
-

 Summary: Reduce the number of required threads in MasterSuite
 Key: SPARK-47215
 URL: https://issues.apache.org/jira/browse/SPARK-47215
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47214) Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments

2024-02-28 Thread Daniel (Jira)
Daniel created SPARK-47214:
--

 Summary: Create API for 'analyze' method to differentiate constant 
NULL arguments and other types of arguments
 Key: SPARK-47214
 URL: https://issues.apache.org/jira/browse/SPARK-47214
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47207) Support `spark.driver.timeout` and `DriverTimeoutPlugin`

2024-02-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47207.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45313
[https://github.com/apache/spark/pull/45313]

> Support `spark.driver.timeout` and `DriverTimeoutPlugin`
> 
>
> Key: SPARK-47207
> URL: https://issues.apache.org/jira/browse/SPARK-47207
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47209) Upgrade slf4j to 2.0.12

2024-02-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47209:
-

Assignee: Yang Jie

> Upgrade slf4j to 2.0.12
> ---
>
> Key: SPARK-47209
> URL: https://issues.apache.org/jira/browse/SPARK-47209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> https://www.slf4j.org/news.html#2.0.12



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47209) Upgrade slf4j to 2.0.12

2024-02-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47209.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45315
[https://github.com/apache/spark/pull/45315]

> Upgrade slf4j to 2.0.12
> ---
>
> Key: SPARK-47209
> URL: https://issues.apache.org/jira/browse/SPARK-47209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://www.slf4j.org/news.html#2.0.12



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2024-02-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821741#comment-17821741
 ] 

Dongjoon Hyun commented on SPARK-41392:
---

To [~ste...@apache.org], as you know, we update only according to official 
releases, not RCs. However, feel free to make a PR if you want.

> spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
> ---
>
> Key: SPARK-41392
> URL: https://issues.apache.org/jira/browse/SPARK-41392
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE
> {code}
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> {code}
> full stack
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] = 
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] = 
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] = 
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] = 
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] = 
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] = 
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
> [ERROR] urls[20] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
> [ERROR] urls[21] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
> [ERROR] urls[22] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
> [ERROR] urls[23] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
> [ERROR] urls[24] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
> [ERROR] urls[25] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
> [ERROR] urls[26] = 
> 

[jira] [Resolved] (SPARK-47197) Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell

2024-02-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE resolved SPARK-47197.
--
Resolution: Not A Problem

https://github.com/apache/spark/pull/45309#issuecomment-1969269354

> Failed to connect HiveMetastore when using iceberg with HiveCatalog on 
> spark-sql or spark-shell
> ---
>
> Key: SPARK-47197
> URL: https://issues.apache.org/jira/browse/SPARK-47197
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.2.3, 3.5.1
>Reporter: YUBI LEE
>Priority: Major
>  Labels: pull-request-available
>
> I can't connect to kerberized HiveMetastore when using iceberg with 
> HiveCatalog on spark-sql or spark-shell.
> I think this issue is caused by the fact that there is no way to get 
> HIVE_DELEGATION_TOKEN when using spark-sql or spark-shell.
> ([https://github.com/apache/spark/blob/v3.5.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/security/HiveDelegationTokenProvider.scala#L78-L83)]
>  
> {code:java}
>     val currentToken = 
> UserGroupInformation.getCurrentUser().getCredentials().getToken(tokenAlias)
>     currentToken == null && UserGroupInformation.isSecurityEnabled &&
>       hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
>       (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) 
> ||
>         (!Utils.isClientMode(sparkConf) && !sparkConf.contains(KEYTAB))) 
> {code}
> There should be a way to force obtaining HIVE_DELEGATION_TOKEN even when using 
> spark-sql or spark-shell.
> A possible way is to obtain HIVE_DELEGATION_TOKEN if the configuration below is 
> set?
> {code:java}
> spark.security.credentials.hive.enabled   true {code}
>  
> {code:java}
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
> ...
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed {code}
>  
>  
> {code:java}
> spark-sql> select * from temp.test_hive_catalog;
> ...
> ...
> 24/02/28 07:42:04 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) 
> (machine1.example.com executor 2): 
> org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
> Metastore
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:84)
>         at 
> org.apache.iceberg.hive.HiveClientPool.newClient(HiveClientPool.java:34)
>         at org.apache.iceberg.ClientPoolImpl.get(ClientPoolImpl.java:125)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:56)
>         at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
>         at 
> org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:122)
>         at 
> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:158)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
>         at 
> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
>         at 
> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:124)
>         at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:111)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.overlayTableProperties(HiveIcebergStorageHandler.java:276)
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.configureInputJobProperties(HiveIcebergStorageHandler.java:86)
>         at 
> org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:426)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:456)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:342)
>         at 
> org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:342)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
>         at scala.Option.foreach(Option.scala:407)
>         at 
> org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:181)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:178)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:247)
>         at 

[jira] [Updated] (SPARK-47211) Fix ignored PySpark Connect string collation

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47211:
---
Labels: pull-request-available  (was: )

> Fix ignored PySpark Connect string collation
> 
>
> Key: SPARK-47211
> URL: https://issues.apache.org/jira/browse/SPARK-47211
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When using Connect with PySpark, string collation silently gets dropped:
> {code:java}
> Client connected to the Spark Connect server at localhost
> SparkSession available as 'spark'.
> >>> spark.sql("select 'abc' collate 'UNICODE'")
> DataFrame[collate(abc): string]
> >>> from pyspark.sql.types import StructType, StringType, StructField
> >>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
> DataFrame[id: string]
> {code}
> Instead of "string" type in dataframe, we should be seeing "string COLLATE 
> 'UNICODE'".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47211) Fix ignored PySpark Connect string collation

2024-02-28 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47211:
--
Component/s: Connect

> Fix ignored PySpark Connect string collation
> 
>
> Key: SPARK-47211
> URL: https://issues.apache.org/jira/browse/SPARK-47211
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
> Fix For: 4.0.0
>
>
> When using Connect with PySpark, string collation silently gets dropped:
> {code:java}
> Client connected to the Spark Connect server at localhost
> SparkSession available as 'spark'.
> >>> spark.sql("select 'abc' collate 'UNICODE'")
> DataFrame[collate(abc): string]
> >>> from pyspark.sql.types import StructType, StringType, StructField
> >>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
> DataFrame[id: string]
> {code}
> Instead of "string" type in dataframe, we should be seeing "string COLLATE 
> 'UNICODE'".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47213) Proposal about moving on from the Shepherd terminology in SPIPs to "Mentor"

2024-02-28 Thread Mich Talebzadeh (Jira)
Mich Talebzadeh created SPARK-47213:
---

 Summary: Proposal about moving on from the Shepherd terminology in 
SPIPs to "Mentor"
 Key: SPARK-47213
 URL: https://issues.apache.org/jira/browse/SPARK-47213
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.5.1
 Environment: Documentation. SPIP form submission
Reporter: Mich Talebzadeh
 Fix For: 4.0.0


As an active member I am proposing that we replace the current terminology 
"SPIP Shepherd" with the more respectful and inclusive term "SPIP Mentor." Over 
the past few years we have tried to replace some past terminologies with more 
acceptable ones.

While some may not find "Shepherd" offensive, it can unintentionally imply 
passivity or dependence on community members, which might not accurately 
reflect their expertise and contributions. Additionally, the shepherd-sheep 
dynamic might be interpreted as hierarchical, which does not align with the 
collaborative and open nature of the Spark community.

*"SPIP Mentor"* better emphasizes the collaborative nature of the process, 
focusing on supporting and guiding members while respecting their strengths and 
contributions. It also avoids any potentially offensive or hierarchical 
connotations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47212) When creating Jira please use the word "mentor" instead of "Shepherd" in SPIP

2024-02-28 Thread Mich Talebzadeh (Jira)
Mich Talebzadeh created SPARK-47212:
---

 Summary: When creating Jira please use the word "mentor" instead 
of "Shepherd" in SPIP
 Key: SPARK-47212
 URL: https://issues.apache.org/jira/browse/SPARK-47212
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.3.4
Reporter: Mich Talebzadeh


As an active member I am proposing that we replace the current terminology 
"SPIP Shepherd" with the more respectful and inclusive term "SPIP Mentor." Over 
the past few years we have tried to replace some past terminologies with more 
acceptable ones.

While some may not find "Shepherd" offensive, it can unintentionally imply 
passivity or dependence on community members, which might not accurately 
reflect their expertise and contributions. Additionally, the shepherd-sheep 
dynamic might be interpreted as hierarchical, which does not align with the 
collaborative and open nature of the Spark community.

*"SPIP Mentor"* better emphasizes the collaborative nature of the process, 
focusing on supporting and guiding members while respecting their strengths and 
contributions. It also avoids any potentially offensive or hierarchical 
connotations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-41392:
---
Labels: pull-request-available  (was: )

> spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
> ---
>
> Key: SPARK-41392
> URL: https://issues.apache.org/jira/browse/SPARK-41392
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>
> on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE
> {code}
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> {code}
> full stack
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] = 
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] = 
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] = 
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] = 
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] = 
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] = 
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
> [ERROR] urls[20] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
> [ERROR] urls[21] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
> [ERROR] urls[22] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
> [ERROR] urls[23] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
> [ERROR] urls[24] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
> [ERROR] urls[25] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
> [ERROR] urls[26] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar
> [ERROR] urls[27] = 
> 

[jira] [Updated] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2024-02-28 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-41392:
---
Priority: Major  (was: Minor)

> spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
> ---
>
> Key: SPARK-41392
> URL: https://issues.apache.org/jira/browse/SPARK-41392
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Major
>
> on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE
> {code}
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> {code}
> full stack
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] = 
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] = 
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] = 
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] = 
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] = 
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] = 
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
> [ERROR] urls[20] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
> [ERROR] urls[21] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
> [ERROR] urls[22] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
> [ERROR] urls[23] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
> [ERROR] urls[24] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
> [ERROR] urls[25] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
> [ERROR] urls[26] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar
> [ERROR] urls[27] = 
> 

[jira] [Commented] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2024-02-28 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821694#comment-17821694
 ] 

Steve Loughran commented on SPARK-41392:


Hadoop 3.4.0 RC2 exhibits this; spark needs its patches in

> spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
> ---
>
> Key: SPARK-41392
> URL: https://issues.apache.org/jira/browse/SPARK-41392
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE
> {code}
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> {code}
> full stack
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] = 
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] = 
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] = 
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] = 
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] = 
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] = 
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
> [ERROR] urls[20] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
> [ERROR] urls[21] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
> [ERROR] urls[22] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
> [ERROR] urls[23] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
> [ERROR] urls[24] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
> [ERROR] urls[25] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
> [ERROR] urls[26] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar
> [ERROR] urls[27] = 
> 

[jira] [Updated] (SPARK-47168) Disable parquet filter pushdown for non default collated strings

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47168:
---
Labels: pull-request-available  (was: )

> Disable parquet filter pushdown for non default collated strings
> 
>
> Key: SPARK-47168
> URL: https://issues.apache.org/jira/browse/SPARK-47168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47211) Fix ignored PySpark Connect string collation

2024-02-28 Thread Nikola Mandic (Jira)
Nikola Mandic created SPARK-47211:
-

 Summary: Fix ignored PySpark Connect string collation
 Key: SPARK-47211
 URL: https://issues.apache.org/jira/browse/SPARK-47211
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Nikola Mandic
 Fix For: 4.0.0


When using Connect with PySpark, string collation silently gets dropped:
{code:java}
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> spark.sql("select 'abc' collate 'UNICODE'")
DataFrame[collate(abc): string]
>>> from pyspark.sql.types import StructType, StringType, StructField
>>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
DataFrame[id: string]
{code}
Instead of "string" type in dataframe, we should be seeing "string COLLATE 
'UNICODE'".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47210) Implicit casting on collated expressions

2024-02-28 Thread Mihailo Milosevic (Jira)
Mihailo Milosevic created SPARK-47210:
-

 Summary: Implicit casting on collated expressions
 Key: SPARK-47210
 URL: https://issues.apache.org/jira/browse/SPARK-47210
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Mihailo Milosevic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46919) Upgrade `grpcio*` and `grpc-java` to 1.62

2024-02-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-46919:
-
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Upgrade `grpcio*` and `grpc-java` to 1.62
> -
>
> Key: SPARK-46919
> URL: https://issues.apache.org/jira/browse/SPARK-46919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Connect
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46919) Upgrade `grpcio*` and `grpc-java` to 1.62

2024-02-28 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-46919:
-
Summary: Upgrade `grpcio*` and `grpc-java` to 1.62  (was: Upgrade `grpcio*` 
to 1.60.0 and `grpc-java` to 1.61.0)

> Upgrade `grpcio*` and `grpc-java` to 1.62
> -
>
> Key: SPARK-46919
> URL: https://issues.apache.org/jira/browse/SPARK-46919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47209) Upgrade slf4j to 2.0.12

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47209:
---
Labels: pull-request-available  (was: )

> Upgrade slf4j to 2.0.12
> ---
>
> Key: SPARK-47209
> URL: https://issues.apache.org/jira/browse/SPARK-47209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> https://www.slf4j.org/news.html#2.0.12



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47209) Upgrade slf4j to 2.0.12

2024-02-28 Thread Yang Jie (Jira)
Yang Jie created SPARK-47209:


 Summary: Upgrade slf4j to 2.0.12
 Key: SPARK-47209
 URL: https://issues.apache.org/jira/browse/SPARK-47209
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie


https://www.slf4j.org/news.html#2.0.12



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42040) SPJ: Introduce a new API for V2 input partition to report partition size

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-42040:
---
Labels: pull-request-available  (was: )

> SPJ: Introduce a new API for V2 input partition to report partition size
> 
>
> Key: SPARK-42040
> URL: https://issues.apache.org/jira/browse/SPARK-42040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> It's useful for a {{InputPartition}} to also report its size (in bytes), so 
> that Spark can use the info to decide whether partition grouping should be 
> applied or not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47132) Mistake in Docstring for Pyspark's Dataframe.head()

2024-02-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47132.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> Mistake in Docstring for Pyspark's Dataframe.head()
> ---
>
> Key: SPARK-47132
> URL: https://issues.apache.org/jira/browse/SPARK-47132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Albert Ziegler
>Assignee: Albert Ziegler
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2024-02-22-11-18-02-429.png, 
> image-2024-02-22-11-21-30-460.png
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The docstring claims that {{head(n)}} would return a {{Row}} (rather than a 
> list of rows) iff n == 1, but that's incorrect.
> Type hints, example, and implementation show that the difference between row 
> or list of rows lies in whether n is supplied at all -- if it isn't, 
> {{head()}} returns a {{{}Row{}}}, if it is, even if it is 1, {{head(n)}} 
> returns a list.
>  
> A suggestion to fix is here: https://github.com/apache/spark/pull/45197
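A quick illustration of the behaviour the corrected docstring describes (assuming a SparkSession bound to {{spark}}, as in the PySpark shell):

{code:python}
df = spark.range(3)

df.head()   # Row(id=0)                -- no argument: a single Row (or None if empty)
df.head(1)  # [Row(id=0)]              -- n supplied, even n == 1: a list of Rows
df.head(2)  # [Row(id=0), Row(id=1)]
{code}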



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47208) Allow overriding base overhead memory

2024-02-28 Thread Joao Correia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821601#comment-17821601
 ] 

Joao Correia commented on SPARK-47208:
--

https://github.com/apache/spark/pull/45240

> Allow overriding base overhead memory
> -
>
> Key: SPARK-47208
> URL: https://issues.apache.org/jira/browse/SPARK-47208
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 3.5.1
>Reporter: Joao Correia
>Priority: Major
>  Labels: pull-request-available
>
> We can already select the desired overhead memory directly via the 
> _'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not 
> present, the overhead memory calculation goes as follows:
> {code:java}
> overhead_memory = Max(384, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor')
> where the 'memoryOverheadFactor' flag defaults to 0.1{code}
> There are certain times where being able to override the 384 MiB minimum 
> directly can be beneficial. We may have a scenario where a lot of off-heap 
> operations are performed (e.g. using package managers or native 
> compression/decompression) where we don't need a large JVM heap but still 
> need a significant amount of memory in the Spark node. 
> Using the '{_}memoryOverheadFactor{_}' flag may not be appropriate, since we 
> may not want the overhead allocation to scale directly with JVM memory, for 
> cost-saving/resource-limitation reasons.
> As such, I propose the addition of a 
> 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override 
> the 384 MiB value used in the overhead calculation.
> The memory overhead calculation will now be :
> {code:java}
> min_memory = 
> sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)
> overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor'){code}
> PR: https://github.com/apache/spark/pull/45240  
>  
>  
>  
>  
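A small worked example of the calculation above, written in Python for illustration; the {{min_memory_overhead}} parameter below stands for the proposed {{spark.driver/executor.minMemoryOverhead}} flag, which does not exist yet:

{code:python}
def overhead_mib(container_memory_mib, factor=0.1, min_memory_overhead=384):
    # Current rule: max(384 MiB, memory * memoryOverheadFactor).
    # The proposal only makes the 384 MiB floor configurable.
    return max(min_memory_overhead, int(container_memory_mib * factor))

overhead_mib(2048)                            # -> 384 (the floor wins over 204)
overhead_mib(2048, min_memory_overhead=1024)  # -> 1024 with the proposed flag
{code}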



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47208) Allow overriding base overhead memory

2024-02-28 Thread Joao Correia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joao Correia updated SPARK-47208:
-
Component/s: Spark Core

> Allow overriding base overhead memory
> -
>
> Key: SPARK-47208
> URL: https://issues.apache.org/jira/browse/SPARK-47208
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 3.5.1
>Reporter: Joao Correia
>Priority: Major
>  Labels: pull-request-available
>
> We can already select the desired overhead memory directly via the 
> _'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not 
> present, the overhead memory calculation goes as follows:
> {code:java}
> overhead_memory = Max(384, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor')
> where the 'memoryOverheadFactor' flag defaults to 0.1{code}
> There are certain times where being able to override the 384 MiB minimum 
> directly can be beneficial. We may have a scenario where a lot of off-heap 
> operations are performed (e.g. using package managers or native 
> compression/decompression) where we don't need a large JVM heap but still 
> need a significant amount of memory in the Spark node. 
> Using the '{_}memoryOverheadFactor{_}' flag may not be appropriate, since we 
> may not want the overhead allocation to scale directly with JVM memory, for 
> cost-saving/resource-limitation reasons.
> As such, I propose the addition of a 
> 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override 
> the 384 MiB value used in the overhead calculation.
> The memory overhead calculation will now be :
> {code:java}
> min_memory = 
> sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)
> overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor'){code}
> PR: https://github.com/apache/spark/pull/45240  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47132) Mistake in Docstring for Pyspark's Dataframe.head()

2024-02-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47132:


Assignee: Albert Ziegler

> Mistake in Docstring for Pyspark's Dataframe.head()
> ---
>
> Key: SPARK-47132
> URL: https://issues.apache.org/jira/browse/SPARK-47132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Albert Ziegler
>Assignee: Albert Ziegler
>Priority: Trivial
>  Labels: pull-request-available
> Attachments: image-2024-02-22-11-18-02-429.png, 
> image-2024-02-22-11-21-30-460.png
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The docstring claims that {{head(n)}} would return a {{Row}} (rather than a 
> list of rows) iff n == 1, but that's incorrect.
> Type hints, the examples, and the implementation show that the difference between 
> a row and a list of rows lies in whether n is supplied at all: if it isn't, 
> {{head()}} returns a {{{}Row{}}}; if it is, even when n == 1, {{head(n)}} 
> returns a list.
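> A minimal PySpark sketch of the actual behaviour (the DataFrame contents here are 
> made up purely for illustration):
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
> df.head()    # no argument: a single Row -> Row(id=1, value='a')
> df.head(1)   # n supplied, even n == 1: a list -> [Row(id=1, value='a')]
> {code}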
>  
> A suggestion to fix is here: https://github.com/apache/spark/pull/45197



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47147) Fix Pyspark collated string conversion error

2024-02-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47147.
--
Resolution: Fixed

Issue resolved by pull request 45257
[https://github.com/apache/spark/pull/45257]

> Fix Pyspark collated string conversion error
> 
>
> Key: SPARK-47147
> URL: https://issues.apache.org/jira/browse/SPARK-47147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When running the PySpark shell in non-Spark-Connect mode, the query "SELECT 'abc' 
> COLLATE 'UCS_BASIC_LCASE'" produces the following error:
> {code:java}
> AssertionError: Undefined error message parameter for error class: 
> CANNOT_PARSE_DATATYPE. Parameters: {'error': "Undefined error message 
> parameter for error class: CANNOT_PARSE_DATATYPE. Parameters: {'error': 
> 'string(UCS_BASIC_LCASE)'}"}
> {code}
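> A minimal repro sketch for the classic (non-Connect) PySpark shell; collecting the 
> result forces the collated string type through the Python-side datatype parsing 
> where the assertion above is raised:
> {code:python}
> # run inside bin/pyspark on a build with collation support
> df = spark.sql("SELECT 'abc' COLLATE 'UCS_BASIC_LCASE' AS s")
> df.printSchema()
> df.collect()
> {code}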



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47208) Allow overriding base overhead memory

2024-02-28 Thread Joao Correia (Jira)
Joao Correia created SPARK-47208:


 Summary: Allow overriding base overhead memory
 Key: SPARK-47208
 URL: https://issues.apache.org/jira/browse/SPARK-47208
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes, YARN
Affects Versions: 3.5.1
Reporter: Joao Correia


We can already select the desired overhead memory directly via the 
_'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not 
present, the overhead memory calculation is as follows:
{code:java}
overhead_memory = Max(384, 'spark.driver/executor.memory' * 
'spark.driver/executor.memoryOverheadFactor')

where the 'memoryOverheadFactor' flag defaults to 0.1{code}

There are certain cases where being able to override the 384 MiB minimum directly 
can be beneficial. We may have a scenario where a lot of off-heap operations 
are performed (e.g. using package managers or native compression/decompression) 
in which we don't need a large JVM heap but still need a 
significant amount of memory in the Spark node. 

Using the '{_}memoryOverheadFactor{_}' flag may not be appropriate, since we 
may not want the overhead allocation to scale directly with JVM memory, 
for cost-saving or resource-limitation reasons.

As such, I propose the addition of a 'spark.driver/executor.minMemoryOverhead' 
flag, which can be used to override the 384 MiB value used in the overhead 
calculation.

The memory overhead calculation will then be:
{code:java}
min_memory = 
sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)

overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 
'spark.driver/executor.memoryOverheadFactor'){code}
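
As a rough illustration of the proposed usage (the 'minMemoryOverhead' flag below is 
the one proposed in this ticket and is not yet released), a PySpark session could 
raise the overhead floor without touching the factor:
{code:python}
from pyspark.sql import SparkSession

# Keep the default 10% overhead factor, but lift the minimum overhead
# from 384 MiB to 1 GiB for native/off-heap heavy workloads.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "2g")
    .config("spark.executor.memoryOverheadFactor", "0.1")
    .config("spark.executor.minMemoryOverhead", "1g")  # proposed flag, illustration only
    .getOrCreate()
)
{code}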

PR: https://github.com/apache/spark/pull/45240  

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47102:
--

Assignee: (was: Apache Spark)

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds a COLLATION_ENABLED config to `SQLConf` and introduces a new error 
> class, `COLLATION_SUPPORT_NOT_ENABLED`, to report errors appropriately when this 
> feature under development is used. 
> *Why are the changes needed?*
> We want to gate collations behind this flag. These changes disable 
> usage of the `collate` and `collation` functions, along with any `COLLATE` syntax, 
> when the flag is set to false. By default, the flag is set to false.
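> A hypothetical usage sketch (the exact conf key below is an assumption for 
> illustration; only the COLLATION_ENABLED constant is named in this ticket):
> {code:python}
> # assumed key name, for illustration only
> spark.conf.set("spark.sql.collation.enabled", "true")
> spark.sql("SELECT collation('abc' COLLATE 'UCS_BASIC_LCASE')").show()
> # with the flag at its default (false), the same statement is expected to fail
> # with error class COLLATION_SUPPORT_NOT_ENABLED
> spark.conf.set("spark.sql.collation.enabled", "false")
> {code}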



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47094:
--

Assignee: Apache Spark

> SPJ : Dynamically rebalance number of buckets when they are not equal
> -
>
> Key: SPARK-47094
> URL: https://issues.apache.org/jira/browse/SPARK-47094
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Himadri Pal
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> SPJ: Storage Partition Join works with Iceberg tables when both tables 
> have the same number of buckets. As part of this feature request, we would like 
> Spark to gather the bucket counts from both tables and 
> dynamically rebalance them by coalescing or repartitioning so 
> that SPJ still applies. In this case we would still have to shuffle, but 
> that would be better than no SPJ at all.
> Use case: 
> Many times we do not control the input tables, so it is not 
> possible to change the partitioning scheme on those tables. As consumers, we 
> would still like them to participate in SPJ when joined with other tables and 
> output tables that have a different number of buckets.
> In these scenarios, we would need to read those tables and rewrite them with a 
> matching number of buckets for SPJ to work; this extra step could 
> outweigh the benefit of less shuffle via SPJ. Also, when multiple 
> different tables are joined, each table needs to be rewritten with a matching 
> number of buckets. 
> If this feature is implemented, SPJ functionality will be more powerful.
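> For context, a minimal sketch of the storage-partitioned-join settings involved 
> today (assuming the v2 bucketing confs available in recent releases; the table 
> names are placeholders, and the automatic rebalancing requested here does not 
> exist yet):
> {code:python}
> # SPJ only avoids the shuffle when the bucket counts on both sides already
> # match -- the gap this ticket asks Spark to close automatically.
> spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
> spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
> result = spark.sql("SELECT * FROM cat.db.orders o JOIN cat.db.customers c ON o.cust_id = c.cust_id")
> {code}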



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47102:
--

Assignee: Apache Spark

> Add COLLATION_ENABLED config flag
> -
>
> Key: SPARK-47102
> URL: https://issues.apache.org/jira/browse/SPARK-47102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> *What changes were proposed in this pull request?*
> This PR adds a COLLATION_ENABLED config to `SQLConf` and introduces a new error 
> class, `COLLATION_SUPPORT_NOT_ENABLED`, to report errors appropriately when this 
> feature under development is used. 
> *Why are the changes needed?*
> We want to gate collations behind this flag. These changes disable 
> usage of the `collate` and `collation` functions, along with any `COLLATE` syntax, 
> when the flag is set to false. By default, the flag is set to false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47207) Support `spark.driver.timeout` and `DriverTimeoutPlugin`

2024-02-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47207:
-

Assignee: Dongjoon Hyun

> Support `spark.driver.timeout` and `DriverTimeoutPlugin`
> 
>
> Key: SPARK-47207
> URL: https://issues.apache.org/jira/browse/SPARK-47207
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47145:
---

Assignee: Uros Stankovic

> Provide table identifier to scan node when DS v2 strategy is applied
> 
>
> Key: SPARK-47145
> URL: https://issues.apache.org/jira/browse/SPARK-47145
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, the DataSourceScanExec node can accept a table identifier, and that 
> information can be useful for later logging, debugging, etc., but 
> DataSourceV2Strategy does not provide that information to the scan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied

2024-02-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47145.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45200
[https://github.com/apache/spark/pull/45200]

> Provide table identifier to scan node when DS v2 strategy is applied
> 
>
> Key: SPARK-47145
> URL: https://issues.apache.org/jira/browse/SPARK-47145
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, the DataSourceScanExec node can accept a table identifier, and that 
> information can be useful for later logging, debugging, etc., but 
> DataSourceV2Strategy does not provide that information to the scan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47207) Support `spark.driver.timeout` and `DriverTimeoutPlugin`

2024-02-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47207:
---
Labels: pull-request-available  (was: )

> Support `spark.driver.timeout` and `DriverTimeoutPlugin`
> 
>
> Key: SPARK-47207
> URL: https://issues.apache.org/jira/browse/SPARK-47207
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47207) Support `spark.driver.timeout` and `DriverTimeoutPlugin`

2024-02-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47207:
-

 Summary: Support `spark.driver.timeout` and `DriverTimeoutPlugin`
 Key: SPARK-47207
 URL: https://issues.apache.org/jira/browse/SPARK-47207
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47205) Upgrade docker-java to 3.3.5

2024-02-28 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47205.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45307
[https://github.com/apache/spark/pull/45307]

> Upgrade docker-java to 3.3.5
> 
>
> Key: SPARK-47205
> URL: https://issues.apache.org/jira/browse/SPARK-47205
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46077) Error in postgresql when pushing down filter by timestamp_ntz field

2024-02-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-46077.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45261
[https://github.com/apache/spark/pull/45261]

> Error in postgresql when pushing down filter by timestamp_ntz field
> ---
>
> Key: SPARK-46077
> URL: https://issues.apache.org/jira/browse/SPARK-46077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Marina Krasilnikova
>Assignee: Pablo Langa Blanco
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> code to reproduce:
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("test-app")
> .master("local[*]")
> .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
> .getOrCreate();
> String url = "...";
> String catalogPropPrefix = "spark.sql.catalog.myc";
> sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
> sparkSession.conf().set(catalogPropPrefix + ".url", url);
> Map<String, String> options = new HashMap<>();
> options.put("driver", "org.postgresql.Driver");
> // options.put("pushDownPredicate", "false");  it works fine if this line is 
> uncommented
> Dataset<Row> dataset = sparkSession.read()
> .options(options)
> .table("myc.demo.`My table`");
> dataset.createOrReplaceTempView("view1");
> String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
> Dataset<Row> result = sparkSession.sql(sql);
> result.show();
> result.printSchema();
> Field `my date` is of type timestamp. This code results in an 
> org.postgresql.util.PSQLException (syntax error).
>  
>  
> String sql = "select * from view1 where `my date` = to_timestamp('2021-04-01 
> 00:00:00', 'yyyy-MM-dd HH:mm:ss')";  // this query also doesn't work
> String sql = "select * from view1 where `my date` = date_trunc('DAY', 
> to_timestamp('2021-04-01 00:00:00', 'yyyy-MM-dd HH:mm:ss'))";  // but this is 
> OK
>  
> Is this a bug, or did I get something wrong?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46077) Error in postgresql when pushing down filter by timestamp_ntz field

2024-02-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-46077:


Assignee: Pablo Langa Blanco

> Error in postgresql when pushing down filter by timestamp_ntz field
> ---
>
> Key: SPARK-46077
> URL: https://issues.apache.org/jira/browse/SPARK-46077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Marina Krasilnikova
>Assignee: Pablo Langa Blanco
>Priority: Minor
>  Labels: pull-request-available
>
> code to reproduce:
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("test-app")
> .master("local[*]")
> .config("spark.sql.timestampType", "TIMESTAMP_NTZ")
> .getOrCreate();
> String url = "...";
> String catalogPropPrefix = "spark.sql.catalog.myc";
> sparkSession.conf().set(catalogPropPrefix, JDBCTableCatalog.class.getName());
> sparkSession.conf().set(catalogPropPrefix + ".url", url);
> Map<String, String> options = new HashMap<>();
> options.put("driver", "org.postgresql.Driver");
> // options.put("pushDownPredicate", "false");  it works fine if this line is 
> uncommented
> Dataset<Row> dataset = sparkSession.read()
> .options(options)
> .table("myc.demo.`My table`");
> dataset.createOrReplaceTempView("view1");
> String sql = "select * from view1 where `my date` = '2021-04-01 00:00:00'";
> Dataset<Row> result = sparkSession.sql(sql);
> result.show();
> result.printSchema();
> Field `my date` is of type timestamp. This code results in an 
> org.postgresql.util.PSQLException (syntax error).
>  
>  
> String sql = "select * from view1 where `my date` = to_timestamp('2021-04-01 
> 00:00:00', 'yyyy-MM-dd HH:mm:ss')";  // this query also doesn't work
> String sql = "select * from view1 where `my date` = date_trunc('DAY', 
> to_timestamp('2021-04-01 00:00:00', 'yyyy-MM-dd HH:mm:ss'))";  // but this is 
> OK
>  
> Is this a bug, or did I get something wrong?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org