[jira] [Assigned] (SPARK-48317) Enable test_udtf_with_analyze_using_archive and test_udtf_with_analyze_using_file

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48317:


Assignee: Hyukjin Kwon

> Enable test_udtf_with_analyze_using_archive and 
> test_udtf_with_analyze_using_file
> -
>
> Key: SPARK-48317
> URL: https://issues.apache.org/jira/browse/SPARK-48317
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48317) Enable test_udtf_with_analyze_using_archive and test_udtf_with_analyze_using_file

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48317.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46632
[https://github.com/apache/spark/pull/46632]

> Enable test_udtf_with_analyze_using_archive and 
> test_udtf_with_analyze_using_file
> -
>
> Key: SPARK-48317
> URL: https://issues.apache.org/jira/browse/SPARK-48317
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48319) Test `assert_true` and `raise_error` with the same error class as Spark Classic

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48319:
---
Labels: pull-request-available  (was: )

> Test `assert_true` and `raise_error` with the same error class as Spark 
> Classic
> ---
>
> Key: SPARK-48319
> URL: https://issues.apache.org/jira/browse/SPARK-48319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48319) Test `assert_true` and `raise_error` with more specific error class

2024-05-16 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48319:
-

 Summary: Test `assert_true` and `raise_error` with more specific 
error class
 Key: SPARK-48319
 URL: https://issues.apache.org/jira/browse/SPARK-48319
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Updated] (SPARK-48319) Test `assert_true` and `raise_error` with the same error class as Spark Classic

2024-05-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-48319:
--
Summary: Test `assert_true` and `raise_error` with the same error class as 
Spark Classic  (was: Test `assert_true` and `raise_error` with more specific 
error class)

> Test `assert_true` and `raise_error` with the same error class as Spark 
> Classic
> ---
>
> Key: SPARK-48319
> URL: https://issues.apache.org/jira/browse/SPARK-48319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Created] (SPARK-48318) Hash join support for strings with collation (complex types)

2024-05-16 Thread Jira
Uroš Bojanić created SPARK-48318:


 Summary: Hash join support for strings with collation (complex 
types)
 Key: SPARK-48318
 URL: https://issues.apache.org/jira/browse/SPARK-48318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić









[jira] [Updated] (SPARK-48000) Hash join support for strings with collation (StringType only)

2024-05-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-48000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-48000:
-
Summary: Hash join support for strings with collation (StringType only)  
(was: Hash join support for strings with collation)

> Hash join support for strings with collation (StringType only)
> --
>
> Key: SPARK-48000
> URL: https://issues.apache.org/jira/browse/SPARK-48000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48316) Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48316:


Assignee: Hyukjin Kwon

> Fix comments for SparkFrameMethodsParityTests.test_coalesce and 
> test_repartition
> 
>
> Key: SPARK-48316
> URL: https://issues.apache.org/jira/browse/SPARK-48316
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48316) Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48316.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46629
[https://github.com/apache/spark/pull/46629]

> Fix comments for SparkFrameMethodsParityTests.test_coalesce and 
> test_repartition
> 
>
> Key: SPARK-48316
> URL: https://issues.apache.org/jira/browse/SPARK-48316
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48316) Fix comments for SparkFrameMethodsParityTests.test_coalesce and test_repartition

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48316:
-
Summary: Fix comments for SparkFrameMethodsParityTests.test_coalesce and 
test_repartition  (was: Enable SparkFrameMethodsParityTests.test_coalesce and 
test_repartition)

> Fix comments for SparkFrameMethodsParityTests.test_coalesce and 
> test_repartition
> 
>
> Key: SPARK-48316
> URL: https://issues.apache.org/jira/browse/SPARK-48316
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48317) Enable test_udtf_with_analyze_using_archive and test_udtf_with_analyze_using_file

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48317:
---
Labels: pull-request-available  (was: )

> Enable test_udtf_with_analyze_using_archive and 
> test_udtf_with_analyze_using_file
> -
>
> Key: SPARK-48317
> URL: https://issues.apache.org/jira/browse/SPARK-48317
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48317) Enable test_udtf_with_analyze_using_archive and test_udtf_with_analyze_using_file

2024-05-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48317:


 Summary: Enable test_udtf_with_analyze_using_archive and 
test_udtf_with_analyze_using_file
 Key: SPARK-48317
 URL: https://issues.apache.org/jira/browse/SPARK-48317
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon









[jira] [Resolved] (SPARK-48306) Improve UDT in error message

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48306.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46616
[https://github.com/apache/spark/pull/46616]

> Improve UDT in error message
> 
>
> Key: SPARK-48306
> URL: https://issues.apache.org/jira/browse/SPARK-48306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48316) Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48316:
---
Labels: pull-request-available  (was: )

> Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition
> --
>
> Key: SPARK-48316
> URL: https://issues.apache.org/jira/browse/SPARK-48316
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48238) Spark fail to start due to class o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48238:
-
Parent: (was: SPARK-47970)
Issue Type: Bug  (was: Sub-task)

> Spark fail to start due to class 
> o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
> ---
>
> Key: SPARK-48238
> URL: https://issues.apache.org/jira/browse/SPARK-48238
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Blocker
>  Labels: pull-request-available
>
> I tested the latest master branch; it failed to start in YARN mode
> {code:java}
> dev/make-distribution.sh --tgz -Phive,hive-thriftserver,yarn{code}
>  
> {code:java}
> $ bin/spark-sql --master yarn
> WARNING: Using incubator modules: jdk.incubator.vector
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2024-05-10 17:58:17 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2024-05-10 17:58:18 WARN Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> 2024-05-10 17:58:25 ERROR SparkContext: Error initializing SparkContext.
> org.sparkproject.jetty.util.MultiException: Multiple exceptions
>     at 
> org.sparkproject.jetty.util.MultiException.ifExceptionThrow(MultiException.java:117)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:751)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:902)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:514) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2$adapted(SparkUI.scala:81)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:935) 
> ~[scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1$adapted(SparkUI.scala:79)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.ui.SparkUI.attachAllHandlers(SparkUI.scala:79) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext.$anonfun$new$31(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.SparkContext.$anonfun$new$31$adapted(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.SparkContext.(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2963) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1118)
>  ~[spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.getOrElse(Option.scala:201) [scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1112)
>  [spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
>  

[jira] [Created] (SPARK-48316) Enable SparkFrameMethodsParityTests.test_coalesce and test_repartition

2024-05-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48316:


 Summary: Enable SparkFrameMethodsParityTests.test_coalesce and 
test_repartition
 Key: SPARK-48316
 URL: https://issues.apache.org/jira/browse/SPARK-48316
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon









[jira] [Resolved] (SPARK-48310) Cached Properties Should return copies instead of values

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48310.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46621
[https://github.com/apache/spark/pull/46621]

> Cached Properties Should return copies instead of values
> 
>
> Key: SPARK-48310
> URL: https://issues.apache.org/jira/browse/SPARK-48310
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When returning cached properties for schema and columns, a user might 
> inadvertently modify the cached values.
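
For illustration only (the class and property names below are hypothetical, not the actual Spark Connect code), a minimal sketch of the idea: a cached property hands out a copy, so a caller mutating the returned value cannot corrupt the cache.

{code:python}
import copy
from functools import cached_property


class ConnectFrameSketch:
    """Hypothetical stand-in for a Connect DataFrame holding a cached schema."""

    def __init__(self, fields):
        self._fields = fields

    @cached_property
    def _cached_schema(self):
        # Computed once, then reused on every access.
        return list(self._fields)

    @property
    def schema(self):
        # Return a copy instead of the cached object itself.
        return copy.deepcopy(self._cached_schema)


df = ConnectFrameSketch(["id", "name"])
s = df.schema
s.append("oops")                      # mutates only the caller's copy
assert df.schema == ["id", "name"]    # cache is unchanged
{code}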






[jira] [Updated] (SPARK-48314) FileStreamSource shouldn't double cache files for availableNow

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48314:
---
Labels: pull-request-available  (was: )

> FileStreamSource shouldn't double cache files for availableNow
> --
>
> Key: SPARK-48314
> URL: https://issues.apache.org/jira/browse/SPARK-48314
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Adam Binford
>Priority: Major
>  Labels: pull-request-available
>
> FileStreamSource loads and saves all files at initialization for 
> Trigger.AvailableNow. However, files will also be cached in unreadFiles, which 
> is wasteful and causes the issues identified in 
> https://issues.apache.org/jira/browse/SPARK-44924 for streams that read 
> more than 10k files per batch. We should always skip the unreadFiles cache 
> when using the AvailableNow trigger, as there is no need for it there.






[jira] [Resolved] (SPARK-48268) Add a configuration for SparkContext.setCheckpointDir

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48268.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46571
[https://github.com/apache/spark/pull/46571]

> Add a configuration for SparkContext.setCheckpointDir
> -
>
> Key: SPARK-48268
> URL: https://issues.apache.org/jira/browse/SPARK-48268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> It would be useful to have a configuration that sets the checkpoint directory, instead of requiring an explicit SparkContext.setCheckpointDir call.
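
As a minimal sketch of the current API and the configuration-style alternative this issue asks for (the configuration key below is assumed for illustration, not necessarily the final name):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Today the checkpoint directory must be set programmatically on the context:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

# The proposal is to allow the same thing via configuration at launch time, e.g.
#   spark-submit --conf spark.checkpoint.dir=/tmp/spark-checkpoints ...
# (configuration key name assumed here for illustration)
{code}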






[jira] [Assigned] (SPARK-48268) Add a configuration for SparkContext.setCheckpointDir

2024-05-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48268:


Assignee: Hyukjin Kwon

> Add a configuration for SparkContext.setCheckpointDir
> -
>
> Key: SPARK-48268
> URL: https://issues.apache.org/jira/browse/SPARK-48268
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> It would be useful to have a configuration that sets the checkpoint directory, instead of requiring an explicit SparkContext.setCheckpointDir call.






[jira] [Updated] (SPARK-43815) Add to_varchar alias for to_char SQL function

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43815:
---
Labels: pull-request-available  (was: )

> Add to_varchar alias for to_char SQL function
> -
>
> Key: SPARK-43815
> URL: https://issues.apache.org/jira/browse/SPARK-43815
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Richard Yu
>Assignee: Richard Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> We want to add to_varchar as an alias for the to_char function.
> For users migrating to Spark SQL from an engine that supports to_varchar 
> rather than to_char, this alias minimizes the changes needed to keep their 
> applications compatible with Spark SQL syntax.
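
A minimal usage sketch, assuming the alias keeps to_char's existing signature (the format string below is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# to_char already exists; to_varchar would behave identically as an alias.
spark.sql("SELECT to_char(454, '999') AS formatted").show()
spark.sql("SELECT to_varchar(454, '999') AS formatted").show()  # alias form
{code}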






[jira] [Created] (SPARK-48315) Create user-facing error for null locale in CSV options

2024-05-16 Thread Michael Zhang (Jira)
Michael Zhang created SPARK-48315:
-

 Summary: Create user-facing error for null locale in CSV options
 Key: SPARK-48315
 URL: https://issues.apache.org/jira/browse/SPARK-48315
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 3.5.0, 4.0.0, 3.5.2
Reporter: Michael Zhang
 Fix For: 4.0.0, 3.5.2


When a user incorrectly sets the `locale` option to `null` for CSV, a 
NullPointerException is thrown. We should wrap the exception in a user-facing 
error so the user understands what the issue is.
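
A minimal sketch of how the bad option can reach the CSV reader today (the file path is illustrative); the proposal is that this should surface a clear user-facing error instead of a raw NullPointerException:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Passing None for the locale option forwards a null value to the JVM-side
# CSV options and currently fails with a NullPointerException.
df = spark.read.option("locale", None).csv("/tmp/example.csv", header=True)
{code}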






[jira] [Created] (SPARK-48314) FileStreamSource shouldn't double cache files for availableNow

2024-05-16 Thread Adam Binford (Jira)
Adam Binford created SPARK-48314:


 Summary: FileStreamSource shouldn't double cache files for 
availableNow
 Key: SPARK-48314
 URL: https://issues.apache.org/jira/browse/SPARK-48314
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Adam Binford


FileStreamSource loads and saves all files at initialization for 
Trigger.AvailableNow. However, files will also be cached in unreadFiles, which 
is wasteful and causes the issues identified in 
https://issues.apache.org/jira/browse/SPARK-44924 for streams that read more 
than 10k files per batch. We should always skip the unreadFiles cache when 
using the AvailableNow trigger, as there is no need for it there.
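
For context, a minimal sketch of the affected usage pattern (paths are illustrative); the fix itself is internal to FileStreamSource:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A file-source stream run with the AvailableNow trigger: all input files are
# listed up front, so the internal unreadFiles cache adds nothing here.
stream = spark.readStream.format("text").load("/data/incoming")
query = (
    stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "/data/checkpoints/ingest")
    .option("path", "/data/output")
    .trigger(availableNow=True)
    .start()
)
query.awaitTermination()
{code}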






[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48302:
-
Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0")}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.

  was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.


> Null values in map columns of PyArrow tables are replaced with empty lists
> --
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to 
> {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
> with empty lists.
> The PySpark function where this happens is 
> {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.
> A possible fix for this will involve adding a {{mask}} argument to 
> {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
> PySpark will still support for a while) won't have this argument, we will 
> need to do a check like:
> {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0")}}
> or
> {{from inspect import signature}}
> {{"mask" in signature(pa.MapArray.from_arrays).parameters}}
> and only pass {{mask}} if that's true.
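
A minimal sketch of the capability check described above, assuming the upstream PyArrow change eventually adds a mask parameter (the sample arrays are illustrative):

{code:python}
from inspect import signature

import pyarrow as pa

offsets = pa.array([0, 2, 2], type=pa.int32())  # two map entries; second is empty
keys = pa.array(["a", "b"])
items = pa.array([1, 2])

kwargs = {}
if "mask" in signature(pa.MapArray.from_arrays).parameters:
    # Only available on newer PyArrow: mark the second entry as null instead
    # of silently turning it into an empty map.
    kwargs["mask"] = pa.array([False, True])

arr = pa.MapArray.from_arrays(offsets, keys, items, **kwargs)
print(arr)
{code}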






[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48302:
-
Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{if LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0"):}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.

  was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{if LooseVersion(pa.__version__) >= LooseVersion("1X.0.0"):}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.


> Null values in map columns of PyArrow tables are replaced with empty lists
> --
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to 
> {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
> with empty lists.
> The PySpark function where this happens is 
> {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.
> A possible fix for this will involve adding a {{mask}} argument to 
> {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
> PySpark will still support for a while) won't have this argument, we will 
> need to do a check like:
> {{if LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0"):}}
> or
> {{from inspect import signature}}
> {{"mask" in signature(pa.MapArray.from_arrays).parameters}}
> and only pass {{mask}} if that's true.






[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48302:
-
Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.

  was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{if LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0"):}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.


> Null values in map columns of PyArrow tables are replaced with empty lists
> --
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to 
> {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
> with empty lists.
> The PySpark function where this happens is 
> {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.
> A possible fix for this will involve adding a {{mask}} argument to 
> {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
> PySpark will still support for a while) won't have this argument, we will 
> need to do a check like:
> {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}}
> or
> {{from inspect import signature}}
> {{"mask" in signature(pa.MapArray.from_arrays).parameters}}
> and only pass {{mask}} if that's true.






[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48302:
-
Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
with empty lists.

The PySpark function where this happens is 
{{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
PySpark will still support for a while) won't have this argument, we will need 
to do a check like:

{{if LooseVersion(pa.__version__) >= LooseVersion("1X.0.0"):}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.

  was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{spark.createDataFrame()}}, null values in MapArray columns are replaced with 
empty lists.

The PySpark function where this happens is 
{{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which PySpark 
will still support for a while) won't have this argument, we will need to do a 
check like:

{{if LooseVersion(pa._{_}version{_}_) >= LooseVersion("1X.0.0"):}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.


> Null values in map columns of PyArrow tables are replaced with empty lists
> --
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to 
> {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced 
> with empty lists.
> The PySpark function where this happens is 
> {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.
> A possible fix for this will involve adding a {{mask}} argument to 
> {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which 
> PySpark will still support for a while) won't have this argument, we will 
> need to do a check like:
> {{if LooseVersion(pa.__version__) >= LooseVersion("1X.0.0"):}}
> or
> {{from inspect import signature}}
> {{"mask" in signature(pa.MapArray.from_arrays).parameters}}
> and only pass {{mask}} if that's true.






[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48302:
-
Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{spark.createDataFrame()}}, null values in MapArray columns are replaced with 
empty lists.

The PySpark function where this happens is 
{{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

A possible fix for this will involve adding a {{mask}} argument to 
{{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which PySpark 
will still support for a while) won't have this argument, we will need to do a 
check like:

{{if LooseVersion(pa._{_}version{_}_) >= LooseVersion("1X.0.0"):}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that's true.

  was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
spark.createDataFrame(), null values in MapArray columns are replaced with 
empty lists.

The PySpark function where this happens is pyspark.sql.pandas.types.
_check_arrow_array_timestamps_localize.
Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.


> Null values in map columns of PyArrow tables are replaced with empty lists
> --
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to 
> {{spark.createDataFrame()}}, null values in MapArray columns are replaced 
> with empty lists.
> The PySpark function where this happens is 
> {{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.
> A possible fix for this will involve adding a {{mask}} argument to 
> {{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which 
> PySpark will still support for a while) won't have this argument, we will 
> need to do a check like:
> {{if LooseVersion(pa._{_}version{_}_) >= LooseVersion("1X.0.0"):}}
> or
> {{from inspect import signature}}
> {{"mask" in signature(pa.MapArray.from_arrays).parameters}}
> and only pass {{mask}} if that's true.






[jira] [Updated] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48220:
-
Fix Version/s: 4.0.0

> Allow passing PyArrow Table to createDataFrame()
> 
>
> Key: SPARK-48220
> URL: https://issues.apache.org/jira/browse/SPARK-48220
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Input/Output, PySpark, SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table.
> It would be nice if we could also go in the opposite direction, enabling 
> users to create a Spark DataFrame from a PyArrow Table by passing the PyArrow 
> Table to {{spark.createDataFrame()}}.
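
A minimal sketch of the desired usage once this lands (column names and values are illustrative):

{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a PyArrow Table locally and hand it straight to createDataFrame().
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df = spark.createDataFrame(table)
df.show()
{code}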






[jira] [Updated] (SPARK-48294) Make nestedTypeMissingElementTypeError case insensitive

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48294:
---
Labels: pull-request-available  (was: )

> Make nestedTypeMissingElementTypeError case insensitive
> ---
>
> Key: SPARK-48294
> URL: https://issues.apache.org/jira/browse/SPARK-48294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Michael Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When incorrectly declaring a complex data type using nested types (ARRAY, MAP, 
> and STRUCT), the query fails with a match error rather than 
> `INCOMPLETE_TYPE_DEFINITION`. This is because the match is case sensitive.
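
A minimal repro sketch, assuming a default local session (table names are illustrative): both statements should raise INCOMPLETE_TYPE_DEFINITION, but before the fix the lower-case form hits an internal match error.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for ddl in (
    "CREATE TABLE t_upper (c ARRAY) USING parquet",  # upper case: clear error
    "CREATE TABLE t_lower (c array) USING parquet",  # lower case: match error before the fix
):
    try:
        spark.sql(ddl)
    except Exception as e:
        print(type(e).__name__, e)
{code}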






[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated SPARK-48302:
-
Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
spark.createDataFrame(), null values in MapArray columns are replaced with 
empty lists.

The PySpark function where this happens is pyspark.sql.pandas.types.
_check_arrow_array_timestamps_localize.
Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO mentioning SPARK-48302.

  was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
spark.createDataFrame(), null values in MapArray columns are replaced with 
empty lists.

The PySpark function where this happens is pyspark.sql.pandas.types.
_check_arrow_array_timestamps_localize.
Also see [https://github.com/apache/arrow/issues/41684].


> Null values in map columns of PyArrow tables are replaced with empty lists
> --
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Ian Cook
>Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to 
> spark.createDataFrame(), null values in MapArray columns are replaced with 
> empty lists.
> The PySpark function where this happens is pyspark.sql.pandas.types.
> _check_arrow_array_timestamps_localize.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.






[jira] [Created] (SPARK-48313) test

2024-05-16 Thread guihuawen (Jira)
guihuawen created SPARK-48313:
-

 Summary: test
 Key: SPARK-48313
 URL: https://issues.apache.org/jira/browse/SPARK-48313
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: guihuawen
 Fix For: 4.0.0


test






[jira] [Updated] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception

2024-05-16 Thread Sumit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Singh updated SPARK-48311:

Description: 
Steps to Reproduce 

1. Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, 
StringType
from datetime import datetime

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}
2. UDF registration 
{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(d):
    return d

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}
3. Adding UDF in grouping and agg
{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
  .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
udf2_name('col4').alias('col5')) 

result = 
(temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}
4. Result
{code:java}
result.show(5, False) {code}
*We get below error*
{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}

  was:
Steps to Reproduce 

1. Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, 
StringType
from datetime import datetime

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}
2. UDF registration 
{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(incentive_date):
    return incentive_date

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}
3. Adding UDF in grouping and agg
{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
  .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
udf2_name('col4').alias('col5')) 

result = 
(temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}
4. Result
{code:java}
result.show(5, False) {code}
*We get below error*
{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}


> Nested pythonUDF in groupBy and aggregate result in Binding Exception 
> --
>
> Key: SPARK-48311
> URL: https://issues.apache.org/jira/browse/SPARK-48311
> Project: Spark
>   

[jira] [Updated] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception

2024-05-16 Thread Sumit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Singh updated SPARK-48311:

Description: 
Steps to Reproduce 

1. Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, 
StringType
from datetime import datetime

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}
2. UDF registration 
{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(incentive_date):
    return incentive_date

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}
3. Adding UDF in grouping and agg
{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
  .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
udf2_name('col4').alias('col5')) 

result = 
(temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}
4. Result
{code:java}
result.show(5, False) {code}
*We get below error*
{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}

  was:
Steps to Reproduce 
 # Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, 
StringType
from datetime import datetime

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}

 # UDF registration 

{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(incentive_date):
    return incentive_date

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}

 # Adding UDF in grouping and agg

{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
  .select('*', udf1_name(F.col('col2')).alias('col4')).select('*', 
udf2_name('col4').alias('col5')) 

result = 
(temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6'))){code}

 # Result

{code:java}
result.show(5, False) {code}
*We get below error*

{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}


> Nested pythonUDF in groupBy and aggregate result in Binding Exception 
> --
>
> Key: SPARK-48311
> URL: https://issues.apache.org/jira/browse/SPARK-48311
>

[jira] [Created] (SPARK-48311) Nested pythonUDF in groupBy and aggregate result in Binding Exception

2024-05-16 Thread Sumit Singh (Jira)
Sumit Singh created SPARK-48311:
---

 Summary: Nested pythonUDF in groupBy and aggregate result in 
Binding Exception 
 Key: SPARK-48311
 URL: https://issues.apache.org/jira/browse/SPARK-48311
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.2
Reporter: Sumit Singh


Steps to Reproduce 
 # Data creation
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, 
StringType
from datetime import datetime

# Define the schema
schema = StructType([
    StructField("col1", LongType(), nullable=True),
    StructField("col2", TimestampType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])

# Define the data
data = [
    (1, datetime(2023, 5, 15, 12, 30), "Discount"),
    (2, datetime(2023, 5, 16, 16, 45), "Promotion"),
    (3, datetime(2023, 5, 17, 9, 15), "Coupon")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("temp_offers")

# Query the temporary table using SQL
# DISTINCT required to reproduce the issue. 
testDf = spark.sql("""
                    SELECT DISTINCT 
                    col1,
                    col2,
                    col3 FROM temp_offers
                    """) {code}

 # UDF registration 

{code:java}
import pyspark.sql.functions as F 
import pyspark.sql.types as T

#Creating udf functions 
def udf1(incentive_date):
    return incentive_date

def udf2(d):
    if d.isoweekday() in (1, 2, 3, 4):
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

udf1_name = F.udf(udf1, T.TimestampType())
udf2_name = F.udf(udf2, T.StringType()) {code}

 # Adding UDF in grouping and agg

{code:java}
groupBy_cols = ['col1', 'col4', 'col5', 'col3']
temp = testDf \
  .select('*', udf1_name(F.col('col2')).alias('col4')) \
  .select('*', udf2_name('col4').alias('col5'))

result = temp.groupBy(*groupBy_cols).agg(F.countDistinct('col5').alias('col6')) {code}

 # Result

{code:java}
result.show(5, False) {code}
*We get the error below:*

{code:java}
An error was encountered:
An error occurred while calling o1079.showString.
: java.lang.IllegalStateException: Couldn't find pythonUDF0#1108 in 
[col1#978L,groupingPythonUDF#1104,groupingPythonUDF#1105,col3#980,count(pythonUDF0#1108)#1080L]
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
 {code}
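
One way to sidestep the exception in this particular reproduction (not a general fix for the binding bug, and untested beyond this shape of query) is to express the derived grouping columns with built-in functions, so that no Python UDF output appears in the groupBy/aggregate:

{code:java}
# Hedged workaround sketch: udf1 above is an identity function and udf2 only
# classifies the ISO weekday, so both derived columns can be computed with
# built-in functions; dayofweek() returns 1 (Sunday) .. 7 (Saturday).
import pyspark.sql.functions as F

temp2 = (
    testDf
    .withColumn("col4", F.col("col2"))
    .withColumn(
        "col5",
        F.when(F.dayofweek("col4").isin(2, 3, 4, 5), "WEEKDAY").otherwise("WEEKEND"),
    )
)

result2 = (
    temp2.groupBy("col1", "col4", "col5", "col3")
         .agg(F.countDistinct("col5").alias("col6"))
)
result2.show(5, False)
{code}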



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48309) Stop am retry, in situations where some errors and retries may not be successful

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48309:
---
Labels: pull-request-available  (was: )

> Stop am retry, in situations where some errors and retries may not be 
> successful
> 
>
> Key: SPARK-48309
> URL: https://issues.apache.org/jira/browse/SPARK-48309
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 4.0.0
>Reporter: guihuawen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In YARN cluster mode, spark.yarn.maxAppAttempts will be configured. In our 
> production environment, it is configured as 2. If the first execution fails, 
> the AM will retry. However, in some scenarios, even a second attempt may 
> fail.
> For example:
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> test.test_x; line 1 pos 14;
> Project
> +- 'UnresolvedRelation [bigdata_qa, testx_x], [], false
>  
> Another example:
> Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The 
> NameSpace quota (directories and files) of directory /tmp/xxx_file/ is 
> exceeded: quota=100 file count=101
> Would it be more appropriate to capture these exceptions and stop 
> retrying?
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-48308:
---
Description: 
In 
[FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
 the schema of the data excluding partition columns is computed 2 times in a 
slightly different way:

 
{code:java}
val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionSet.contains) {code}
vs 
{code:java}
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains) {code}

  was:
In 
[FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
 the schema of the data excluding partition columns is computed 2 times in a 
slightly different way:

 
{code:java}
val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionSet.contains) {code}
 

vs 
{code:java}
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains) {code}
 

 


> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48308.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46619
[https://github.com/apache/spark/pull/46619]

> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48308:
---

Assignee: Johan Lasperas

> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48309) Stop am retry, in situations where some errors and retries may not be successful

2024-05-16 Thread guihuawen (Jira)
guihuawen created SPARK-48309:
-

 Summary: Stop am retry, in situations where some errors and 
retries may not be successful
 Key: SPARK-48309
 URL: https://issues.apache.org/jira/browse/SPARK-48309
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 4.0.0
Reporter: guihuawen
 Fix For: 4.0.0


In YARN cluster mode, spark.yarn.maxAppAttempts will be configured. In our 
production environment, it is configured as 2. If the first execution fails, the AM 
will retry. However, in some scenarios, even a second attempt may fail.

For example:

org.apache.spark.sql.AnalysisException: Table or view not found: 
test.test_x; line 1 pos 14;
Project
+- 'UnresolvedRelation [bigdata_qa, testx_x], [], false

Another example:

Caused by: org.apache.hadoop.hdfs.protocol.NSQuotaExceededException: The 
NameSpace quota (directories and files) of directory /tmp/xxx_file/ is 
exceeded: quota=100 file count=101

Would it be more appropriate to capture these exceptions and stop retrying?
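
As an illustration of the idea only (not the change proposed in the pull request), a small Scala sketch of how such failures could be classified as non-retryable before the AM decides whether to attempt again:

{code:java}
// Illustrative sketch, not the actual patch: classify exceptions that a
// second YARN application attempt cannot fix, so retrying can be skipped.
import org.apache.hadoop.hdfs.protocol.NSQuotaExceededException
import org.apache.spark.sql.AnalysisException

object RetryClassifier {
  /** Returns true when another application attempt cannot succeed. */
  def isNonRetryable(t: Throwable): Boolean = t match {
    case _: AnalysisException        => true  // e.g. table or view not found
    case _: NSQuotaExceededException => true  // HDFS namespace quota exceeded
    case other if other.getCause != null => isNonRetryable(other.getCause)  // walk the cause chain
    case _ => false
  }
}
{code}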

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48305) CurrentLike - Database/Schema, Catalog, User (all collations)

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48305:
---
Labels: pull-request-available  (was: )

> CurrentLike - Database/Schema, Catalog, User (all collations)
> -
>
> Key: SPARK-48305
> URL: https://issues.apache.org/jira/browse/SPARK-48305
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48301) Rename CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE to CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE

2024-05-16 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-48301.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46608
[https://github.com/apache/spark/pull/46608]

> Rename CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE to 
> CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE
> --
>
> Key: SPARK-48301
> URL: https://issues.apache.org/jira/browse/SPARK-48301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48308:
---
Labels: pull-request-available  (was: )

> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Johan Lasperas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johan Lasperas updated SPARK-48308:
---
Description: 
In 
[FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
 the schema of the data excluding partition columns is computed 2 times in a 
slightly different way:

 
{code:java}
val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionSet.contains) {code}
 

vs 
{code:java}
val readDataColumns = dataColumns
  .filterNot(partitionColumns.contains) {code}
 

 

  was:
In 
[FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
 the schema of the data excluding partition columns is computed 2 times in a 
slightly different way:

```

val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionSet.contains)

```

vs 

```

      val readDataColumns = dataColumns
        .filterNot(partitionColumns.contains)

```

This should be unified

 


> Unify getting data schema without partition columns in FileSourceStrategy
> -
>
> Key: SPARK-48308
> URL: https://issues.apache.org/jira/browse/SPARK-48308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
>Reporter: Johan Lasperas
>Priority: Trivial
>
> In 
> [FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
>  the schema of the data excluding partition columns is computed 2 times in a 
> slightly different way:
>  
> {code:java}
> val dataColumnsWithoutPartitionCols = 
> dataColumns.filterNot(partitionSet.contains) {code}
>  
> vs 
> {code:java}
> val readDataColumns = dataColumns
>   .filterNot(partitionColumns.contains) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48308) Unify getting data schema without partition columns in FileSourceStrategy

2024-05-16 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-48308:
--

 Summary: Unify getting data schema without partition columns in 
FileSourceStrategy
 Key: SPARK-48308
 URL: https://issues.apache.org/jira/browse/SPARK-48308
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.1
Reporter: Johan Lasperas


In 
[FileSourceStrategy,|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L191]
 the schema of the data excluding partition columns is computed 2 times in a 
slightly different way:

```

val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionSet.contains)

```

vs 

```

      val readDataColumns = dataColumns
        .filterNot(partitionColumns.contains)

```

This should be unified
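
For illustration, a hedged sketch of what the unified form could look like, reusing the identifiers from the snippets above (not necessarily the merged change):

{code:java}
// Sketch only: filter the partition columns out once, with a single
// containment check, and let every consumer reuse the same value.
val readDataColumns = dataColumns.filterNot(partitionSet.contains)
{code}

Using the AttributeSet for containment compares attributes by exprId rather than by full structural equality, which is presumably the subtle difference between the two existing variants.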

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48159) Datetime expressions (all collations)

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48159:
---
Labels: pull-request-available  (was: )

> Datetime expressions (all collations)
> -
>
> Key: SPARK-48159
> URL: https://issues.apache.org/jira/browse/SPARK-48159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48288) Add source data type to connector.Cast expression

2024-05-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48288.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46596
[https://github.com/apache/spark/pull/46596]

> Add source data type to connector.Cast expression
> -
>
> Key: SPARK-48288
> URL: https://issues.apache.org/jira/browse/SPARK-48288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Uros Stankovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, 
> V2ExpressionBuilder builds a connector.Cast expression from a catalyst.Cast 
> expression.
> The catalyst Cast carries the data type of its child expression, but the connector Cast does not.
> Since some casts are not allowed on external engines, we need to know both the source 
> and the target data types, so that we have finer granularity to block 
> unsupported casts.
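
For illustration only, a hypothetical shape of a connector-side cast that carries both types; the names below are invented for this sketch and are not the API added by the pull request:

{code:java}
// Hypothetical sketch only; not the actual org.apache.spark.sql.connector API.
import org.apache.spark.sql.connector.expressions.Expression
import org.apache.spark.sql.types.DataType

case class CastWithSourceType(
    expression: Expression,
    sourceDataType: DataType,   // data type of the child expression (the addition)
    targetDataType: DataType) {
  // A dialect could consult both types before pushing the cast to the engine.
  def isSupported(allowed: (DataType, DataType) => Boolean): Boolean =
    allowed(sourceDataType, targetDataType)
}
{code}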



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48238) Spark fail to start due to class o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter

2024-05-16 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846913#comment-17846913
 ] 

Cheng Pan commented on SPARK-48238:
---

[~dongjoon] [~HF] I opened [https://github.com/apache/spark/pull/46611] to 
address the YARN incompatibility issue by re-implementing a functionally 
equivalent Filter. Please let me know what you think about this approach.

> Spark fail to start due to class 
> o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
> ---
>
> Key: SPARK-48238
> URL: https://issues.apache.org/jira/browse/SPARK-48238
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Blocker
>  Labels: pull-request-available
>
> I tested the latest master branch, it failed to start on YARN mode
> {code:java}
> dev/make-distribution.sh --tgz -Phive,hive-thriftserver,yarn{code}
>  
> {code:java}
> $ bin/spark-sql --master yarn
> WARNING: Using incubator modules: jdk.incubator.vector
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2024-05-10 17:58:17 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2024-05-10 17:58:18 WARN Client: Neither spark.yarn.jars nor 
> spark.yarn.archive} is set, falling back to uploading libraries under 
> SPARK_HOME.
> 2024-05-10 17:58:25 ERROR SparkContext: Error initializing SparkContext.
> org.sparkproject.jetty.util.MultiException: Multiple exceptions
>     at 
> org.sparkproject.jetty.util.MultiException.ifExceptionThrow(MultiException.java:117)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:751)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:902)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:514) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2$adapted(SparkUI.scala:81)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:935) 
> ~[scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1$adapted(SparkUI.scala:79)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.ui.SparkUI.attachAllHandlers(SparkUI.scala:79) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext.$anonfun$new$31(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.SparkContext.$anonfun$new$31$adapted(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.SparkContext.(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2963) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1118)
>  ~[spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.getOrElse(Option.scala:201) [scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1112)
>  

[jira] [Resolved] (SPARK-48296) Codegen Support for `to_xml`

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48296.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46591
[https://github.com/apache/spark/pull/46591]

> Codegen Support for `to_xml`
> 
>
> Key: SPARK-48296
> URL: https://issues.apache.org/jira/browse/SPARK-48296
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48296) Codegen Support for `to_xml`

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48296:


Assignee: BingKun Pan

> Codegen Support for `to_xml`
> 
>
> Key: SPARK-48296
> URL: https://issues.apache.org/jira/browse/SPARK-48296
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48297) Char/Varchar breaks in TRANSFORM clause

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48297.
--
Fix Version/s: 3.4.4
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46603
[https://github.com/apache/spark/pull/46603]

> Char/Varchar breaks in TRANSFORM clause
> ---
>
> Key: SPARK-48297
> URL: https://issues.apache.org/jira/browse/SPARK-48297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.4, 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 3.5.2, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48305) CurrentLike - Database/Schema, Catalog, User (all collations)

2024-05-16 Thread Jira
Uroš Bojanić created SPARK-48305:


 Summary: CurrentLike - Database/Schema, Catalog, User (all 
collations)
 Key: SPARK-48305
 URL: https://issues.apache.org/jira/browse/SPARK-48305
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Uroš Bojanić






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48303) Reorganize `LogKey`

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48303:
---
Labels: pull-request-available  (was: )

> Reorganize `LogKey`
> ---
>
> Key: SPARK-48303
> URL: https://issues.apache.org/jira/browse/SPARK-48303
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48303) Reorganize `LogKey`

2024-05-16 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-48303:

Summary: Reorganize `LogKey`  (was: Organize `LogKey`)

> Reorganize `LogKey`
> ---
>
> Key: SPARK-48303
> URL: https://issues.apache.org/jira/browse/SPARK-48303
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48304) EquivalentExpressions Cannot update expression with use count less than 0

2024-05-16 Thread Jun li (Jira)
Jun li created SPARK-48304:
--

 Summary: EquivalentExpressions Cannot update expression with use 
count less than 0
 Key: SPARK-48304
 URL: https://issues.apache.org/jira/browse/SPARK-48304
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 3.4.0
Reporter: Jun li


If the order of the subexpressions differs between two otherwise equivalent 
expressions, Spark will fail.

reproduce (spark-shell):

 
{code:java}
Seq(("bob", 1, 2, 3)).toDF("name", "v1", "v2", "v3").createTempView("tmp");

spark.sql("select (CASE WHEN v1 > 0 THEN v1 * 4 + v2 + v3 + 3 ELSE v3 + v1 * 4 
+ v2 END) as v from tmp").show(); {code}
output:

 
{code:java}
java.lang.IllegalStateException: Cannot update expression: ((input[1, int, 
false] * 4) + input[2, int, false]) in map: Map(ExpressionEquals((input[1, int, 
false] * 4)) -> ExpressionStats((input[1, int, false] * 4))) with use count: -1
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprInMap(EquivalentExpressions.scala:85)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:198)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateCommonExprs(EquivalentExpressions.scala:128)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3(EquivalentExpressions.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3$adapted(EquivalentExpressions.scala:201)
  at scala.collection.immutable.List.foreach(List.scala:431)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
  at 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:188)
  at 
org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1(SubExprEvaluationRuntime.scala:90)
  at 
org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1$adapted(SubExprEvaluationRuntime.scala:90)
  at scala.collection.immutable.List.foreach(List.scala:431)
  at 
org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.proxyExpressions(SubExprEvaluationRuntime.scala:90)
  at 
org.apache.spark.sql.catalyst.expressions.ExpressionsEvaluator.prepareExpressions(ExpressionsEvaluator.scala:33)
  at 
org.apache.spark.sql.catalyst.expressions.ExpressionsEvaluator.prepareExpressions$(ExpressionsEvaluator.scala:27)
  at 
org.apache.spark.sql.catalyst.expressions.package$Projection.prepareExpressions(package.scala:71)
  at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:39)
  at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:36)
  at 
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$47.applyOrElse(Optimizer.scala:2159)
  at 
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$47.applyOrElse(Optimizer.scala:2156)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
  at 
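
If the failure indeed comes from subexpression elimination in the interpreted projection path (as the SubExprEvaluationRuntime frames above suggest), disabling that optimization may serve as a temporary mitigation while the bug is open; a hedged sketch for spark-shell:

{code:java}
// Possible temporary mitigation only, assuming the error originates in
// interpreted-mode subexpression elimination; this disables the optimization
// for the current session.
spark.conf.set("spark.sql.subexpressionElimination.enabled", "false")

spark.sql("""select (CASE WHEN v1 > 0 THEN v1 * 4 + v2 + v3 + 3
                     ELSE v3 + v1 * 4 + v2 END) as v from tmp""").show()
{code}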

[jira] [Updated] (SPARK-48238) Spark fail to start due to class o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48238:
---
Labels: pull-request-available  (was: )

> Spark fail to start due to class 
> o.a.h.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
> ---
>
> Key: SPARK-48238
> URL: https://issues.apache.org/jira/browse/SPARK-48238
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Blocker
>  Labels: pull-request-available
>
> I tested the latest master branch, it failed to start on YARN mode
> {code:java}
> dev/make-distribution.sh --tgz -Phive,hive-thriftserver,yarn{code}
>  
> {code:java}
> $ bin/spark-sql --master yarn
> WARNING: Using incubator modules: jdk.incubator.vector
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2024-05-10 17:58:17 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2024-05-10 17:58:18 WARN Client: Neither spark.yarn.jars nor 
> spark.yarn.archive} is set, falling back to uploading libraries under 
> SPARK_HOME.
> 2024-05-10 17:58:25 ERROR SparkContext: Error initializing SparkContext.
> org.sparkproject.jetty.util.MultiException: Multiple exceptions
>     at 
> org.sparkproject.jetty.util.MultiException.ifExceptionThrow(MultiException.java:117)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:751)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:902)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:514) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$2$adapted(SparkUI.scala:81)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617) 
> ~[scala-library-2.13.13.jar:?]
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:935) 
> ~[scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1(SparkUI.scala:81) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.ui.SparkUI.$anonfun$attachAllHandlers$1$adapted(SparkUI.scala:79)
>  ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.ui.SparkUI.attachAllHandlers(SparkUI.scala:79) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext.$anonfun$new$31(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.SparkContext.$anonfun$new$31$adapted(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.foreach(Option.scala:437) ~[scala-library-2.13.13.jar:?]
>     at org.apache.spark.SparkContext.(SparkContext.scala:690) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2963) 
> ~[spark-core_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1118)
>  ~[spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at scala.Option.getOrElse(Option.scala:201) [scala-library-2.13.13.jar:?]
>     at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1112)
>  [spark-sql_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>     at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:64)
>  [spark-hive-thriftserver_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]

[jira] [Created] (SPARK-48303) Organize `LogKey`

2024-05-16 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48303:
---

 Summary: Organize `LogKey`
 Key: SPARK-48303
 URL: https://issues.apache.org/jira/browse/SPARK-48303
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists

2024-05-16 Thread Ian Cook (Jira)
Ian Cook created SPARK-48302:


 Summary: Null values in map columns of PyArrow tables are replaced 
with empty lists
 Key: SPARK-48302
 URL: https://issues.apache.org/jira/browse/SPARK-48302
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Ian Cook


Because of a limitation in PyArrow, when PyArrow Tables are passed to 
spark.createDataFrame(), null values in MapArray columns are replaced with 
empty lists.

The PySpark function where this happens is 
pyspark.sql.pandas.types._check_arrow_array_timestamps_localize.
Also see [https://github.com/apache/arrow/issues/41684].
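
A minimal reproduction sketch, assuming a Spark 4.0 build whose createDataFrame() accepts a pyarrow.Table directly:

{code:java}
# Sketch only: one non-null entry and one null entry in a map<string, int64> column.
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

maps = pa.array([[("a", 1)], None], type=pa.map_(pa.string(), pa.int64()))
table = pa.table({"m": maps})

df = spark.createDataFrame(table)
# Reported bug: the second row surfaces as an empty map instead of NULL.
df.show()
{code}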



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48300) Codegen Support for `from_xml`

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48300:
---
Labels: pull-request-available  (was: )

> Codegen Support for `from_xml`
> --
>
> Key: SPARK-48300
> URL: https://issues.apache.org/jira/browse/SPARK-48300
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48301) Rename CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE to CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48301:
---
Labels: pull-request-available  (was: )

> Rename CREATE_FUNC_WITH_IF_NOT_EXISTS_AND_REPLACE to 
> CREATE_ROUTINE_WITH_IF_NOT_EXISTS_AND_REPLACE
> --
>
> Key: SPARK-48301
> URL: https://issues.apache.org/jira/browse/SPARK-48301
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48300) Codegen Support for `from_xml`

2024-05-16 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846830#comment-17846830
 ] 

BingKun Pan commented on SPARK-48300:
-

I will work on it.

 

> Codegen Support for `from_xml`
> --
>
> Key: SPARK-48300
> URL: https://issues.apache.org/jira/browse/SPARK-48300
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48300) Codegen Support for `from_xml`

2024-05-16 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48300:
---

 Summary: Codegen Support for `from_xml`
 Key: SPARK-48300
 URL: https://issues.apache.org/jira/browse/SPARK-48300
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48293) Add test for when ForeachBatchUserFuncException wraps interrupted exception due to query stop

2024-05-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-48293.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46601
[https://github.com/apache/spark/pull/46601]

> Add test for when ForeachBatchUserFuncException wraps interrupted exception 
> due to query stop
> -
>
> Key: SPARK-48293
> URL: https://issues.apache.org/jira/browse/SPARK-48293
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: B. Micheal Okutubo
>Assignee: B. Micheal Okutubo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add test for when ForeachBatchUserFuncException wraps interrupted exception 
> due to query stop



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48264) Upgrade `datasketches-java` to 6.0.0

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48264.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46563
[https://github.com/apache/spark/pull/46563]

> Upgrade `datasketches-java` to 6.0.0
> 
>
> Key: SPARK-48264
> URL: https://issues.apache.org/jira/browse/SPARK-48264
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47607) Add documentation for Structured logging framework

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47607.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46605
[https://github.com/apache/spark/pull/46605]

> Add documentation for Structured logging framework
> --
>
> Key: SPARK-47607
> URL: https://issues.apache.org/jira/browse/SPARK-47607
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47607) Add documentation for Structured logging framework

2024-05-16 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47607:


Assignee: Gengliang Wang

> Add documentation for Structured logging framework
> --
>
> Key: SPARK-47607
> URL: https://issues.apache.org/jira/browse/SPARK-47607
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org