[jira] [Updated] (SPARK-52088) Redesign ClosureCleaner Implementation Due to JDK-8309635's Removal of Old Core Reflection and Inability to Modify Private Final Fields in Hidden Classes

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52088:
---
Labels: pull-request-available  (was: )

>  Redesign ClosureCleaner Implementation Due to JDK-8309635's Removal of Old 
> Core Reflection and Inability to Modify Private Final Fields in Hidden Classes
> --
>
> Key: SPARK-52088
> URL: https://issues.apache.org/jira/browse/SPARK-52088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> The removal of the old core reflection implementation in 
> [JDK-8309635|https://bugs.openjdk.org/browse/JDK-8309635] poses a risk that 
> the workaround for SPARK-40729, setting 
> `-Djdk.reflect.useDirectMethodHandle=false` to re-enable the old core 
> reflection, will no longer work in the next Java LTS release (unless that 
> release reverts JDK-8309635). We may need to redesign the implementation of 
> `ClosureCleaner`.
> Currently, testing the `repl` module with Java 22 produces the following 
> error:
> ```
> build/sbt clean "repl/test"
> ```
> ```
> [info] - broadcast vars *** FAILED *** (1 second, 141 milliseconds)
> [info]   isContain was true Interpreter output contained 'Exception':
> [info]   Welcome to
> [info]         ____              __
> [info]        / __/__  ___ _____/ /__
> [info]       _\ \/ _ \/ _ `/ __/  '_/
> [info]      /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
> [info]         /_/
> [info]            
> [info]   Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 22.0.2)
> [info]   Type in expressions to have them evaluated.
> [info]   Type :help for more information.
> [info]   
> [info]   scala> 
> [info]   scala> var array: Array[Int] = Array(0, 0, 0, 0, 0)
> [info]   
> [info]   scala> val broadcastArray: 
> org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
> [info]   
> [info]   scala> java.lang.InternalError: java.lang.IllegalAccessException: 
> final field has no write access: $Lambda/0x060001ecedd8.arg$1/putField, 
> from class java.lang.Object (module java.base)
> [info]     at 
> java.base/jdk.internal.reflect.MethodHandleAccessorFactory.newFieldAccessor(MethodHandleAccessorFactory.java:207)
> [info]     at 
> java.base/jdk.internal.reflect.ReflectionFactory.newFieldAccessor(ReflectionFactory.java:144)
> [info]     at 
> java.base/java.lang.reflect.Field.acquireOverrideFieldAccessor(Field.java:1200)
> [info]     at 
> java.base/java.lang.reflect.Field.getOverrideFieldAccessor(Field.java:1169)
> [info]     at java.base/java.lang.reflect.Field.set(Field.java:836)
> [info]     at 
> org.apache.spark.util.ClosureCleaner$.setFieldAndIgnoreModifiers(ClosureCleaner.scala:564)
> [info]     at 
> org.apache.spark.util.ClosureCleaner$.cleanupScalaReplClosure(ClosureCleaner.scala:432)
> [info]     at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:257)
> [info]     at 
> org.apache.spark.util.SparkClosureCleaner$.clean(SparkClosureCleaner.scala:39)
> [info]     at org.apache.spark.SparkContext.clean(SparkContext.scala:2843)
> [info]     at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:425)
> [info]     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:417)
> [info]     at org.apache.spark.rdd.RDD.map(RDD.scala:424)
> [info]     ... 79 elided
> ...
> [info] Run completed in 35 seconds, 38 milliseconds.
> [info] Total number of tests run: 36
> [info] Suites: completed 3, aborted 0
> [info] Tests: succeeded 27, failed 9, canceled 0, ignored 0, pending 0
> [info] *** 9 TESTS FAILED ***
> [error] Failed tests:
> [error]     org.apache.spark.repl.SingletonReplSuite
> [error]     org.apache.spark.repl.ReplSuite
> ```
> I tried switching to `VarHandle` and `Unsafe#putObject`, but neither worked, 
> because the test cases involve modifying a `private final` field within a 
> hidden class.
>  
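> For reference, here is a minimal, self-contained sketch (not Spark source) of 
> the failure mode; the lambda shape and the field name (e.g. `arg$1`) are 
> illustrative:
> ```
> import java.lang.reflect.Field
> 
> object FinalFieldWriteDemo {
>   // The captured value becomes a private final field (e.g. arg$1) on the
>   // lambda's class, which is a hidden class on recent JDKs.
>   def capturing(s: String): Runnable = () => println(s)
> 
>   def main(args: Array[String]): Unit = {
>     val closure = capturing("captured")
>     val f: Field = closure.getClass.getDeclaredFields.head
>     f.setAccessible(true)
>     // With the old core reflection (or -Djdk.reflect.useDirectMethodHandle=false)
>     // this write succeeded; with method-handle-based reflection, the only
>     // implementation left after JDK-8309635, Field.set throws
>     // IllegalAccessException ("final field has no write access"), which
>     // surfaces as the InternalError in the stack trace above.
>     f.set(closure, "replaced")
>   }
> }
> ```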






[jira] [Updated] (SPARK-52325) Publish Apache Spark 3.5.6 to docker registry

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52325:
---
Labels: pull-request-available  (was: )

> Publish Apache Spark 3.5.6 to docker registry 
> --
>
> Key: SPARK-52325
> URL: https://issues.apache.org/jira/browse/SPARK-52325
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-52254) Adds a GitHub Actions workflow to convert RC to the official release

2025-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-52254:


Assignee: Hyukjin Kwon

> Adds a GitHub Actions workflow to convert RC to the official release
> 
>
> Key: SPARK-52254
> URL: https://issues.apache.org/jira/browse/SPARK-52254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-52254) Adds a GitHub Actions workflow to convert RC to the official release

2025-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-52254.
--
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 50974
[https://github.com/apache/spark/pull/50974]

> Adds a GitHub Actions workflow to convert RC to the official release
> 
>
> Key: SPARK-52254
> URL: https://issues.apache.org/jira/browse/SPARK-52254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>







[jira] [Updated] (SPARK-52312) Caching AppendData plan causes data to be inserted twice

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52312:
---
Labels: pull-request-available  (was: )

> Caching AppendData plan causes data to be inserted twice
> 
>
> Key: SPARK-52312
> URL: https://issues.apache.org/jira/browse/SPARK-52312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Tom van Bussel
>Priority: Major
>  Labels: pull-request-available
>
> We’ve identified an issue where a {{DataFrame}} created from an {{INSERT}} 
> SQL statement and then cached will cause the {{INSERT}} to be executed twice. 
> This happens because the logical plan for the {{INSERT}} ({{AppendData}}) 
> doesn’t extend the {{IgnoreCachedData}} trait, so it isn’t ignored during 
> caching as expected. As a result, the plan is cached and re-executed. We 
> should fix this by ensuring that plans used by {{INSERT}} all extend the 
> {{IgnoreCachedData}} trait.
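> A self-contained toy of the caching decision (the names mirror Spark's 
> {{LogicalPlan}}, {{IgnoreCachedData}}, and {{CacheManager}}, but this is a 
> sketch of the intended behavior, not Spark source):
> {code}
> // Marker trait: plans carrying side effects must never be cached/reused.
> trait LogicalPlan
> trait IgnoreCachedData extends LogicalPlan
> case class Project(cols: Seq[String]) extends LogicalPlan
> case class AppendData(table: String) extends LogicalPlan with IgnoreCachedData
> 
> object ToyCacheManager {
>   private var cached: List[LogicalPlan] = Nil
>   def cacheQuery(plan: LogicalPlan): Unit = plan match {
>     case _: IgnoreCachedData => () // INSERT-style plan: skip, never re-executed
>     case p                   => cached ::= p
>   }
>   def size: Int = cached.size
> }
> 
> object Demo extends App {
>   ToyCacheManager.cacheQuery(Project(Seq("a"))) // cached
>   ToyCacheManager.cacheQuery(AppendData("t"))   // skipped under the fix
>   assert(ToyCacheManager.size == 1)
> }
> {code}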






[jira] [Commented] (SPARK-52286) Publish Apache Spark 4.0.0 to docker registry

2025-05-27 Thread Yury Molchan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-52286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954309#comment-17954309
 ] 

Yury Molchan commented on SPARK-52286:
--

Hello

I am trying to run 4.0.0 on a Mac M1 and I am hitting an error when pulling 
the image.
The 4.0.0-preview2 docker image works well.

```
spark % docker pull spark:4.0.0-scala2.13-java17-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java17-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-scala2.13-java21-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java21-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-preview2-scala2.13-java21-python3-ubuntu
4.0.0-preview2-scala2.13-java21-python3-ubuntu: Pulling from library/spark
67b06617bd6b: Pulling fs layer 
```

> Publish Apache Spark 4.0.0 to docker registry
> -
>
> Key: SPARK-52286
> URL: https://issues.apache.org/jira/browse/SPARK-52286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Comment Edited] (SPARK-52286) Publish Apache Spark 4.0.0 to docker registry

2025-05-27 Thread Yury Molchan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-52286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954309#comment-17954309
 ] 

Yury Molchan edited comment on SPARK-52286 at 5/27/25 11:52 AM:


Hello

I am trying to run 4.0.0 on a Mac M1 and I am hitting an error when pulling 
the image.
The 4.0.0-preview2 docker image works well.

{code}
spark % docker pull spark:4.0.0-scala2.13-java17-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java17-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-scala2.13-java21-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java21-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-preview2-scala2.13-java21-python3-ubuntu
4.0.0-preview2-scala2.13-java21-python3-ubuntu: Pulling from library/spark
67b06617bd6b: Pulling fs layer 
{code}


was (Author: yurkom):
Hello

I am trying to run 4.0.0 on the Mac M1 and I am facing with the error to pull 
it.
4.0.0-preview2 docker image works well.

```
spark % docker pull spark:4.0.0-scala2.13-java17-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java17-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-scala2.13-java21-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java21-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-preview2-scala2.13-java21-python3-ubuntu
4.0.0-preview2-scala2.13-java21-python3-ubuntu: Pulling from library/spark
67b06617bd6b: Pulling fs layer 
```

> Publish Apache Spark 4.0.0 to docker registry
> -
>
> Key: SPARK-52286
> URL: https://issues.apache.org/jira/browse/SPARK-52286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Comment Edited] (SPARK-52286) Publish Apache Spark 4.0.0 to docker registry

2025-05-27 Thread Yury Molchan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-52286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954309#comment-17954309
 ] 

Yury Molchan edited comment on SPARK-52286 at 5/27/25 12:01 PM:


Hello

I am trying to run 4.0.0 on a Mac M1 and I am hitting an error when pulling 
the image.
The 4.0.0-preview2 docker image works well.
{code:java}
spark % docker pull spark:4.0.0-scala2.13-java17-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java17-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-scala2.13-java21-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java21-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-preview2-scala2.13-java21-python3-ubuntu
4.0.0-preview2-scala2.13-java21-python3-ubuntu: Pulling from library/spark
67b06617bd6b: Pulling fs layer 
{code}

Note that the unqualified image name was resolved against the 'library' 
namespace.

The following works:

{code}
spark % docker pull apache/spark:4.0.0-scala2.13-java17-python3-ubuntu
4.0.0-scala2.13-java17-python3-ubuntu: Pulling from apache/spark
67b06617bd6b: Pull complete 
{code}


was (Author: yurkom):
Hello

I am trying to run 4.0.0 on the Mac M1 and I am facing with the error to pull 
it.
4.0.0-preview2 docker image works well.

{code}
spark % docker pull spark:4.0.0-scala2.13-java17-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java17-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-scala2.13-java21-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java21-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-preview2-scala2.13-java21-python3-ubuntu
4.0.0-preview2-scala2.13-java21-python3-ubuntu: Pulling from library/spark
67b06617bd6b: Pulling fs layer 
{code}

> Publish Apache Spark 4.0.0 to docker registry
> -
>
> Key: SPARK-52286
> URL: https://issues.apache.org/jira/browse/SPARK-52286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-52324) move Spark docs to the release directory

2025-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-52324.
--
Fix Version/s: 4.1.0
   3.5.6
   4.0.1
   Resolution: Fixed

Issue resolved by pull request 51026
[https://github.com/apache/spark/pull/51026]

> move Spark docs to the release directory
> 
>
> Key: SPARK-52324
> URL: https://issues.apache.org/jira/browse/SPARK-52324
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0, 3.5.6, 4.0.1
>
>







[jira] [Created] (SPARK-52326) Add partitions related external catalog events

2025-05-27 Thread Xiang Li (Jira)
Xiang Li created SPARK-52326:


 Summary: Add partitions related external catalog events
 Key: SPARK-52326
 URL: https://issues.apache.org/jira/browse/SPARK-52326
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Xiang Li









[jira] [Updated] (SPARK-52326) Add partitions related external catalog events

2025-05-27 Thread Xiang Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated SPARK-52326:
-
External issue URL: https://github.com/apache/spark/pull/51030

> Add partitions related external catalog events
> --
>
> Key: SPARK-52326
> URL: https://issues.apache.org/jira/browse/SPARK-52326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Xiang Li
>Priority: Minor
>
> In 
> [ExternalCatalogWithListener|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala],
>  events are posted to all registered listeners for operations against 
> databases, tables, and functions, but operations against partitions do not 
> have events posted.
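> A sketch of what the missing events could look like, mirroring the existing 
> database/table/function event pattern; these class names are hypothetical and 
> do not exist in Spark yet:
> {code}
> // ExternalCatalogEvent is redeclared here only to keep the sketch
> // self-contained; Spark already defines it.
> trait ExternalCatalogEvent
> case class CreatePartitionsEvent(db: String, table: String,
>     specs: Seq[Map[String, String]]) extends ExternalCatalogEvent
> case class DropPartitionsEvent(db: String, table: String,
>     specs: Seq[Map[String, String]]) extends ExternalCatalogEvent
> case class RenamePartitionEvent(db: String, table: String,
>     from: Map[String, String], to: Map[String, String]) extends ExternalCatalogEvent
> {code}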






[jira] [Updated] (SPARK-52326) Add partitions related external catalog events

2025-05-27 Thread Xiang Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated SPARK-52326:
-
External issue URL:   (was: https://github.com/apache/spark/pull/51030)

> Add partitions related external catalog events
> --
>
> Key: SPARK-52326
> URL: https://issues.apache.org/jira/browse/SPARK-52326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Xiang Li
>Priority: Minor
>
> In 
> [ExternalCatalogWithListener|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala],
>  events are posted to all registered listeners for operations against 
> databases, tables, and functions, but operations against partitions do not 
> have events posted.






[jira] [Assigned] (SPARK-52324) move Spark docs to the release directory

2025-05-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-52324:


Assignee: Wenchen Fan

> move Spark docs to the release directory
> 
>
> Key: SPARK-52324
> URL: https://issues.apache.org/jira/browse/SPARK-52324
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-52267) Match field id in ParquetToSparkSchemaConverter

2025-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-52267.
-
Fix Version/s: 4.1.0
   4.0.1
   Resolution: Fixed

Issue resolved by pull request 50990
[https://github.com/apache/spark/pull/50990]

> Match field id in ParquetToSparkSchemaConverter
> ---
>
> Key: SPARK-52267
> URL: https://issues.apache.org/jira/browse/SPARK-52267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0, 4.0.1
>
>







[jira] [Assigned] (SPARK-52267) Match field id in ParquetToSparkSchemaConverter

2025-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-52267:
---

Assignee: Chenhao Li

> Match field id in ParquetToSparkSchemaConverter
> ---
>
> Key: SPARK-52267
> URL: https://issues.apache.org/jira/browse/SPARK-52267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-44728) Improve PySpark documentations

2025-05-27 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-44728:
--
Affects Version/s: 4.1.0

> Improve PySpark documentations
> --
>
> Key: SPARK-44728
> URL: https://issues.apache.org/jira/browse/SPARK-44728
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.5.0, 4.0.0, 4.1.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> An umbrella Jira ticket to improve the PySpark documentation.
>  
>  






[jira] [Resolved] (SPARK-52305) Refine the docstring for isnotnull, equal_null, nullif, nullifzero, nvl, nvl2, zeroifnull

2025-05-27 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-52305.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Issue resolved by pull request 51016
[https://github.com/apache/spark/pull/51016]

> Refine the docstring for isnotnull, equal_null, nullif, nullifzero, nvl, 
> nvl2, zeroifnull
> -
>
> Key: SPARK-52305
> URL: https://issues.apache.org/jira/browse/SPARK-52305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.1
>Reporter: Evan Wu
>Assignee: Evan Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>







[jira] [Assigned] (SPARK-52305) Refine the docstring for isnotnull, equal_null, nullif, nullifzero, nvl, nvl2, zeroifnull

2025-05-27 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-52305:
-

Assignee: Evan Wu

> Refine the docstring for isnotnull, equal_null, nullif, nullifzero, nvl, 
> nvl2, zeroifnull
> -
>
> Key: SPARK-52305
> URL: https://issues.apache.org/jira/browse/SPARK-52305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.1
>Reporter: Evan Wu
>Assignee: Evan Wu
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-52325) Publish Apache Spark 3.5.6 to docker registry

2025-05-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-52325:


 Summary: Publish Apache Spark 3.5.6 to docker registry 
 Key: SPARK-52325
 URL: https://issues.apache.org/jira/browse/SPARK-52325
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-52327) Glob based history provider

2025-05-27 Thread Gaurav Waghmare (Jira)
Gaurav Waghmare created SPARK-52327:
---

 Summary: Glob based history provider
 Key: SPARK-52327
 URL: https://issues.apache.org/jira/browse/SPARK-52327
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Gaurav Waghmare


Currently, the Spark history server runs with a single base directory whose 
immediate subdirectories hold the event logs of each application.

There are use cases, e.g. multi-tenancy, where for the purpose of logical 
separation the event logs are stored in separate directories at the tenant 
level. To support this, instead of the path of a single base directory, a glob 
matching the tenant directories could be provided and handled by a separate 
history provider, as sketched below.
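A hedged configuration sketch via SparkConf; the provider class name and glob 
support in {{spark.history.fs.logDirectory}} are assumptions of this proposal, 
not existing Spark behavior:
{code}
import org.apache.spark.SparkConf

object GlobHistoryConf {
  // Sketch only: GlobFsHistoryProvider is a hypothetical class name and the
  // glob in the log directory is the proposed behavior, not an existing one.
  // Equivalent entries could also go in spark-defaults.conf.
  val conf = new SparkConf()
    .set("spark.history.provider",
      "org.apache.spark.deploy.history.GlobFsHistoryProvider")
    .set("spark.history.fs.logDirectory", "s3a://logs/tenants/*/eventlogs")
}
{code}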






[jira] (SPARK-52286) Publish Apache Spark 4.0.0 to docker registry

2025-05-27 Thread Yury Molchan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-52286 ]


Yury Molchan deleted comment on SPARK-52286:
--

was (Author: yurkom):
Hello

I am trying to run 4.0.0 on the Mac M1 and I am facing with the error to pull 
it.
4.0.0-preview2 docker image works well.
{code:java}
spark % docker pull spark:4.0.0-scala2.13-java17-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java17-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-scala2.13-java21-python3-ubuntu
Error response from daemon: manifest for 
spark:4.0.0-scala2.13-java21-python3-ubuntu not found: manifest unknown: 
manifest unknown

spark % docker pull spark:4.0.0-preview2-scala2.13-java21-python3-ubuntu
4.0.0-preview2-scala2.13-java21-python3-ubuntu: Pulling from library/spark
67b06617bd6b: Pulling fs layer 
{code}

Be noted that the pull was performed as 'library'.

the following is working:

{code}
spark % docker pull apache/spark:4.0.0-scala2.13-java17-python3-ubuntu
4.0.0-scala2.13-java17-python3-ubuntu: Pulling from apache/spark
67b06617bd6b: Pull complete 
{code}

> Publish Apache Spark 4.0.0 to docker registry
> -
>
> Key: SPARK-52286
> URL: https://issues.apache.org/jira/browse/SPARK-52286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-52300) Catalog config overrides do not make it into UDTVF resolution

2025-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-52300.
-
Fix Version/s: 4.1.0
   4.0.1
   Resolution: Fixed

> Catalog config overrides do not make it into UDTVF resolution
> -
>
> Key: SPARK-52300
> URL: https://issues.apache.org/jira/browse/SPARK-52300
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0, 4.0.1
>
>
> When resolving SQL user-defined table-valued functions, the catalog options 
> are not registered correctly if the configurations were overridden after the 
> session was created (that is, not available as an override at Spark startup). 
> This is not a problem during view resolution.
>  
> This rift is unnecessary: the resolution rules should be consistent about 
> which SQL configurations are passed down during UDTVF resolution, mirroring 
> view resolution.
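> A hedged reproduction sketch; the catalog name {{cat}}, its option key, and 
> the table function {{my_sql_tvf}} are placeholders, not names from this 
> ticket:
> {code}
> import org.apache.spark.sql.SparkSession
> 
> object UdtvfConfRepro {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local[*]").getOrCreate()
>     // Overridden after the session exists: per this ticket, the override is
>     // not visible while the UDTVF is resolved, although view resolution
>     // does see it.
>     spark.conf.set("spark.sql.catalog.cat.some-option", "overridden")
>     spark.sql("SELECT * FROM my_sql_tvf(1)").show() // assumes the UDTVF exists
>   }
> }
> {code}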






[jira] [Resolved] (SPARK-52223) [SDP] Create spark connect API for SDP

2025-05-27 Thread Sandy Ryza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-52223.

Resolution: Fixed

> [SDP] Create spark connect API for SDP
> --
>
> Key: SPARK-52223
> URL: https://issues.apache.org/jira/browse/SPARK-52223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Aakash Japi
>Assignee: Aakash Japi
>Priority: Major
>  Labels: pull-request-available
>
> SDP is a Spark Connect-only feature. We need to add the following APIs to 
> cover the pipeline lifecycle:
>  # {{CreateDataflowGraph}} creates a new graph in the registry.
>  # {{DefineDataset}} and {{DefineFlow}} register elements with the created 
> graph. Datasets are the nodes of the dataflow graph and are either tables or 
> views; flows are the edges connecting them.
>  # {{StartRun}} starts a run, which is a single execution of a graph.
>  # {{StopRun}} stops an existing run, while {{DropPipeline}} stops any 
> current runs and drops the pipeline.
>  # {{PipelineCommand}} contains a oneof wrapping one of the above protos; 
> this is the interface exposed to the Spark Connect command itself.
> We also need to add the new {{PipelineCommand}} object to 
> {{ExecutePlanRequest}} and the {{PipelineCommand.Response}} to 
> {{ExecutePlanResponse}}.
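> A toy Scala ADT mirroring the proposed oneof (sketch only; the real 
> definitions are protobuf messages, and the field names below are guesses):
> {code}
> sealed trait PipelineCommandToy
> case class CreateDataflowGraph(name: String) extends PipelineCommandToy
> case class DefineDataset(graphId: String, name: String,
>     isTable: Boolean) extends PipelineCommandToy
> case class DefineFlow(graphId: String, source: String,
>     target: String) extends PipelineCommandToy
> case class StartRun(graphId: String) extends PipelineCommandToy
> case class StopRun(runId: String) extends PipelineCommandToy
> case class DropPipeline(graphId: String) extends PipelineCommandToy
> {code}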






[jira] [Updated] (SPARK-52329) Remove private[sql] tags for new transformWithState API

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52329:
---
Labels: pull-request-available  (was: )

> Remove private[sql] tags for new transformWithState API
> ---
>
> Key: SPARK-52329
> URL: https://issues.apache.org/jira/browse/SPARK-52329
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Remove private[sql] tags for new transformWithState API






[jira] [Updated] (SPARK-52315) Upgrade kubernetes-client version to 7.3.1

2025-05-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-52315:
--
Parent: SPARK-52205
Issue Type: Sub-task  (was: Bug)

> Upgrade kubernetes-client version to 7.3.1
> --
>
> Key: SPARK-52315
> URL: https://issues.apache.org/jira/browse/SPARK-52315
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: kubernetes-operator-0.3.0
>
>







[jira] [Updated] (SPARK-52328) Use `apache/spark-connect-swift:pi` image

2025-05-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-52328:
--
Summary: Use `apache/spark-connect-swift:pi` image  (was: Use 
`apache/spark-connect-swift:pi`)

> Use `apache/spark-connect-swift:pi` image
> -
>
> Key: SPARK-52328
> URL: https://issues.apache.org/jira/browse/SPARK-52328
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Updated] (SPARK-52328) Use `apache/spark-connect-swift:pi` image

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52328:
---
Labels: pull-request-available  (was: )

> Use `apache/spark-connect-swift:pi` image
> -
>
> Key: SPARK-52328
> URL: https://issues.apache.org/jira/browse/SPARK-52328
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-52326) Add partitions related external catalog events

2025-05-27 Thread Xiang Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated SPARK-52326:
-
Description: In 
[ExternalCatalogWithListener|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala],
 events are posted to all registered listeners for operations against 
databases, tables, and functions, but operations against partitions do not 
have events posted.

> Add partitions related external catalog events
> --
>
> Key: SPARK-52326
> URL: https://issues.apache.org/jira/browse/SPARK-52326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Xiang Li
>Priority: Minor
>
> In 
> [ExternalCatalogWithListener|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogWithListener.scala],
>  events are posted to all registered listeners for operations against 
> databases, tables, and functions, but operations against partitions do not 
> have events posted.






[jira] [Updated] (SPARK-52334) In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded.

2025-05-27 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-52334:

Description: 
When submitting a Spark job with the {{--files}} option and also calling 
{{SparkContext.addFile()}} for a file with the same name in the application 
code, Spark throws an exception due to a file registration conflict.

*Reproduction Steps:*
 # Submit a Spark application using {{spark-submit}} with the {{--files}} 
option:
{code:java}
bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
{code}
 

 # In the {{testDemo}} application code, call:

 #  

> In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to 
> reference the working directory after they are downloaded.
> ---
>
> Key: SPARK-52334
> URL: https://issues.apache.org/jira/browse/SPARK-52334
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 4.0.0, 3.5.5
>Reporter: Tongwei
>Priority: Major
>
> When submitting a Spark job with the {{--files}} option and also calling 
> {{SparkContext.addFile()}} for a file with the same name in the application 
> code, Spark throws an exception due to a file registration conflict.
> *Reproduction Steps:*
>  # Submit a Spark application using {{spark-submit}} with the {{--files}} 
> option:
> {code:java}
> bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
> {code}
>  
>  # In the {{testDemo}} application code, call:
>  #  






[jira] [Updated] (SPARK-52334) In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded.

2025-05-27 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-52334:

Description: 
When submitting a Spark job with the {{--files}} option and also calling 
{{SparkContext.addFile()}} for a file with the same name in the application 
code, Spark throws an exception due to a file registration conflict.

*Reproduction Steps:*
 # Submit a Spark application using {{spark-submit}} with the {{--files}} 
option:
{code:java}
bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
{code}

 # In the {{testDemo}} application code, call:
 #  

  was:
When submitting a Spark job with the {{--files}} option and also calling 
{{SparkContext.addFile()}} for a file with the same name in the application 
code, Spark throws an exception due to a file registration conflict.

*Reproduction Steps:*
 # Submit a Spark application using {{spark-submit}} with the {{--files}} 
option:
{code:java}
bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
{code}
 

 # In the {{testDemo}} application code, call:

 #  


> In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to 
> reference the working directory after they are downloaded.
> ---
>
> Key: SPARK-52334
> URL: https://issues.apache.org/jira/browse/SPARK-52334
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 4.0.0, 3.5.5
>Reporter: Tongwei
>Priority: Major
>
> When submitting a Spark job with the {{--files}} option and also calling 
> {{SparkContext.addFile()}} for a file with the same name in the application 
> code, Spark throws an exception due to a file registration conflict.
> *Reproduction Steps:*
>  # Submit a Spark application using {{spark-submit}} with the {{--files}} 
> option:
> {code:java}
> bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
> {code}
>  # In the {{testDemo}} application code, call:
>  #  






[jira] [Created] (SPARK-52334) In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded.

2025-05-27 Thread Tongwei (Jira)
Tongwei created SPARK-52334:
---

 Summary: In Kubernetes mode, update all files, jars, archiveFiles, 
and pyFiles to reference the working directory after they are downloaded.
 Key: SPARK-52334
 URL: https://issues.apache.org/jira/browse/SPARK-52334
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Spark Core
Affects Versions: 3.5.5, 4.0.0
Reporter: Tongwei









[jira] [Updated] (SPARK-52334) In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded.

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52334:
---
Labels: pull-request-available  (was: )

> In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to 
> reference the working directory after they are downloaded.
> ---
>
> Key: SPARK-52334
> URL: https://issues.apache.org/jira/browse/SPARK-52334
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 4.0.0, 3.5.5
>Reporter: Tongwei
>Priority: Major
>  Labels: pull-request-available
>
> When submitting a Spark job with the {{--files}} option and also calling 
> {{SparkContext.addFile()}} for a file with the same name in the application 
> code, Spark throws an exception ({_}the same code does not throw an error in 
> YARN mode{_}).
> *Reproduction Steps:*
> 1. Submit a Spark application using {{spark-submit}} with the {{--files}} 
> option:
> {code:java}
> bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
> {code}
> 2. In the {{testDemo}} application code, call:
> {code:java}
> sc.addFile("a.text", true) {code}
> Error msg:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: File a.text was already registered with a different path (old path = 
> /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = 
> /opt/spark/work-dir/a.text){code}






[jira] [Updated] (SPARK-52334) In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded.

2025-05-27 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-52334:

Description: 
When submitting a Spark job with the {{--files}} option and also calling 
{{SparkContext.addFile()}} for a file with the same name in the application 
code, Spark throws an exception ({_}the same code does not throw an error in 
YARN mode{_}).

*Reproduction Steps:*

1. Submit a Spark application using {{spark-submit}} with the {{--files}} 
option:
{code:java}
bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
{code}
2. In the {{testDemo}} application code, call:
{code:java}
sc.addFile("a.text", true) {code}

Error msg:
{code:java}
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: File a.text was already registered with a different path (old path = 
/tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = 
/opt/spark/work-dir/a.text){code}

  was:
When submitting a Spark job with the {{--files}} option and also calling 
{{SparkContext.addFile()}} for a file with the same name in the application 
code, Spark throws an exception due to a file registration conflict.

*Reproduction Steps:*
 # Submit a Spark application using {{spark-submit}} with the {{--files}} 
option:
{code:java}
bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
{code}

 # In the {{testDemo}} application code, call:
 #  


> In Kubernetes mode, update all files, jars, archiveFiles, and pyFiles to 
> reference the working directory after they are downloaded.
> ---
>
> Key: SPARK-52334
> URL: https://issues.apache.org/jira/browse/SPARK-52334
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 4.0.0, 3.5.5
>Reporter: Tongwei
>Priority: Major
>
> When submitting a Spark job with the {{--files}} option and also calling 
> {{SparkContext.addFile()}} for a file with the same name in the application 
> code, Spark throws an exception ({_}the same code does not throw an error in 
> YARN mode{_}).
> *Reproduction Steps:*
> 1. Submit a Spark application using {{spark-submit}} with the {{--files}} 
> option:
> {code:java}
> bin/spark-submit \ --files s3://bucket/a.text \ --class testDemo \ app.jar 
> {code}
> 2. In the {{testDemo}} application code, call:
> {code:java}
> sc.addFile("a.text", true) {code}
> Error msg:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: File a.text was already registered with a different path (old path = 
> /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = 
> /opt/spark/work-dir/a.text){code}






[jira] [Updated] (SPARK-52333) Squeeze protocol for timers (list on specific grouping key, and expiry timers)

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52333:
---
Labels: pull-request-available  (was: )

> Squeeze protocol for timers (list on specific grouping key, and expiry timers)
> --
>
> Key: SPARK-52333
> URL: https://issues.apache.org/jira/browse/SPARK-52333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 4.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> As we did for ListState and MapState, we found that inlining timers into the 
> proto message gives a large benefit for state interaction 
> (intercommunication). This ticket aims to make the same change for listing 
> timers on a grouping key and for expiry timers.
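> A toy illustration of the intended shape (not the actual proto messages): all 
> matching timers are inlined into a single response instead of one round trip 
> per timer.
> {code}
> object TimerProtocolToy {
>   // Toy types; the real definitions are protobuf messages.
>   case class Timer(groupingKey: String, expiryMs: Long)
>   case class ListTimersResponse(timers: Seq[Timer]) // payload inlined once
> 
>   def expiredTimers(all: Seq[Timer], nowMs: Long): ListTimersResponse =
>     ListTimersResponse(all.filter(_.expiryMs <= nowMs)) // one message total
> }
> {code}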






[jira] [Resolved] (SPARK-33537) Hive Metastore filter pushdown improvement

2025-05-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-33537.
-
Resolution: Fixed

> Hive Metastore filter pushdown improvement
> --
>
> Key: SPARK-33537
> URL: https://issues.apache.org/jira/browse/SPARK-33537
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> This is an umbrella ticket to track Hive Metastore filter pushdown 
> improvements. It includes:
> 1. Date type push down
> 2. Like push down
> 3. InSet pushdown improvement
> and other fixes.






[jira] [Created] (SPARK-52328) Use `apache/spark-connect-swift:pi`

2025-05-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-52328:
-

 Summary: Use `apache/spark-connect-swift:pi`
 Key: SPARK-52328
 URL: https://issues.apache.org/jira/browse/SPARK-52328
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: kubernetes-operator-0.3.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-52328) Use `apache/spark-connect-swift:pi` image

2025-05-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-52328.
---
Fix Version/s: kubernetes-operator-0.3.0
   Resolution: Fixed

Issue resolved by pull request 230
[https://github.com/apache/spark-kubernetes-operator/pull/230]

> Use `apache/spark-connect-swift:pi` image
> -
>
> Key: SPARK-52328
> URL: https://issues.apache.org/jira/browse/SPARK-52328
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: kubernetes-operator-0.3.0
>
>







[jira] [Assigned] (SPARK-52328) Use `apache/spark-connect-swift:pi` image

2025-05-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-52328:
-

Assignee: Dongjoon Hyun

> Use `apache/spark-connect-swift:pi` image
> -
>
> Key: SPARK-52328
> URL: https://issues.apache.org/jira/browse/SPARK-52328
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-52270) User guide for native plotting

2025-05-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-52270:


Assignee: Xinrong Meng

> User guide for native plotting
> --
>
> Key: SPARK-52270
> URL: https://issues.apache.org/jira/browse/SPARK-52270
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.1.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-52264) Test divide-by-zero behavior with more numeric data types

2025-05-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-52264.
--
  Assignee: Xinrong Meng
Resolution: Resolved

Resolved by https://github.com/apache/spark/pull/50988

> Test divide-by-zero behavior with more numeric data types
> -
>
> Key: SPARK-52264
> URL: https://issues.apache.org/jira/browse/SPARK-52264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark, Tests
>Affects Versions: 4.1.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-52270) User guide for native plotting

2025-05-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-52270.
--
Resolution: Resolved

Resolved by https://github.com/apache/spark/pull/50992

> User guide for native plotting
> --
>
> Key: SPARK-52270
> URL: https://issues.apache.org/jira/browse/SPARK-52270
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.1.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Created] (SPARK-52331) Adjust test for promotion from float32 to float64 during division

2025-05-27 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-52331:


 Summary: Adjust test for promotion from float32 to float64 during 
division
 Key: SPARK-52331
 URL: https://issues.apache.org/jira/browse/SPARK-52331
 Project: Spark
  Issue Type: Sub-task
  Components: PS, Tests
Affects Versions: 4.1.0
Reporter: Xinrong Meng









[jira] [Created] (SPARK-52332) Fix promotion from float32 to float64 during division

2025-05-27 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-52332:


 Summary: Fix promotion from float32 to float64 during division
 Key: SPARK-52332
 URL: https://issues.apache.org/jira/browse/SPARK-52332
 Project: Spark
  Issue Type: Sub-task
  Components: PS
Affects Versions: 4.1.0
Reporter: Xinrong Meng









[jira] [Updated] (SPARK-52331) Adjust test for promotion from float32 to float64 during division

2025-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-52331:
---
Labels: pull-request-available  (was: )

> Adjust test for promotion from float32 to float64 during division
> -
>
> Key: SPARK-52331
> URL: https://issues.apache.org/jira/browse/SPARK-52331
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.1.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-52332) Fix promotion from float32 to float64 during division

2025-05-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-52332:
-
Description: 
{code:python}
>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>>
>>> import pandas as pd
>>> import numpy as np
>>> pdf = pd.DataFrame(
...     {
...         "a": [1.0, -1.0, 0.0, np.nan],
...         "b": [0.0, 0.0, 0.0, 0.0],
...     },
...     dtype=np.float32,
... )
>>>
>>> psdf = ps.from_pandas(pdf)
>>>
>>> psdf["a"] / psdf["b"]
0    inf
1   -inf
2    NaN
3    NaN
dtype: float64
>>>
>>> pdf["a"] / pdf["b"]
0    inf
1   -inf
2    NaN
3    NaN
dtype: float32
{code}

> Fix promotion from float32 to float64 during division
> -
>
> Key: SPARK-52332
> URL: https://issues.apache.org/jira/browse/SPARK-52332
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.1.0
>Reporter: Xinrong Meng
>Priority: Major
>
> {{>>> ps.set_option("compute.fail_on_ansi_mode", False)}}
> {{>>> spark.conf.set("spark.sql.ansi.enabled", False)}}
> {{>>> }}
> {{>>> import pandas as pd}}
> {{>>> import numpy as np}}
> {{>>> pdf = pd.DataFrame(}}
> {{...     {}}
> {{...         "a": [1.0, -1.0, 0.0, np.nan],}}
> {{...         "b": [0.0, 0.0, 0.0, 0.0],}}
> {{...     },}}
> {{...     dtype=np.float32,}}
> {{... )}}
> {{>>> }}
> {{>>> psdf = ps.from_pandas(pdf)}}
> {{>>> }}
> {{>>> psdf["a"] / psdf["b"]}}
> {{0    inf                                                                    
>     }}
> {{1   -inf}}
> {{2    NaN}}
> {{3    NaN}}
> {{dtype: float64}}
> {{>>> }}
> {{>>> pdf["a"] / pdf["b"]}}
> {{0    inf}}
> {{1   -inf}}
> {{2    NaN}}
> {{3    NaN}}
> {{dtype: float32}}
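For reference, plain pandas preserves the float32 dtype through the division, which is the behavior this ticket targets for pandas-on-Spark. A minimal, self-contained check of that baseline (assuming only pandas and NumPy; no Spark-side code is implied):

```
import numpy as np
import pandas as pd

# In plain pandas, float32 / float32 stays float32 (NumPy promotion rules);
# division by zero yields inf / -inf / NaN rather than raising.
a = pd.Series([1.0, -1.0, 0.0, np.nan], dtype=np.float32)
b = pd.Series([0.0, 0.0, 0.0, 0.0], dtype=np.float32)

result = a / b
print(result.dtype)     # float32 -- the dtype the pandas-on-Spark result should match
print(result.tolist())  # [inf, -inf, nan, nan]
```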






[jira] [Resolved] (SPARK-52331) Adjust test for promotion from float32 to float64 during division

2025-05-27 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-52331.
--
  Assignee: Xinrong Meng
Resolution: Resolved

Resolved by https://github.com/apache/spark/pull/51035

> Adjust test for promotion from float32 to float64 during division
> -
>
> Key: SPARK-52331
> URL: https://issues.apache.org/jira/browse/SPARK-52331
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS, Tests
>Affects Versions: 4.1.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-52330) SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-27 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-52330:
--
Description: 
The SPIP proposes to add a new execution mode called “*Real-time Mode*” in 
Spark Structured Streaming that significantly lowers end-to-end latency for 
processing streams of data.

Our goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already use – 
so existing streaming queries can run in this new ultra-low-latency mode by 
simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power *real-time applications* (like 
instant anomaly alerts or live personalization) that today cannot meet their 
latency requirements with Spark’s current streaming engine.

 

SPIP doc: 
[https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]

  was:
We propose to add a *real-time mode* in Spark Structured Streaming that 
significantly lowers end-to-end latency for processing streams of data.

Our goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already use – 
so existing streaming queries can run in this new ultra-low-latency mode by 
simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power *real-time applications* (like 
instant anomaly alerts or live personalization) that today cannot meet their 
latency requirements with Spark’s current streaming engine.

 

SPIP doc: 
[https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]


> SPIP: Real-Time Mode in Apache Spark Structured Streaming
> -
>
> Key: SPARK-52330
> URL: https://issues.apache.org/jira/browse/SPARK-52330
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 4.1.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> The SPIP proposes to add a new execution mode called “*Real-time Mode*” in 
> Spark Structured Streaming that significantly lowers end-to-end latency for 
> processing streams of data.
> Our goal is to make Spark capable of handling streaming jobs that need 
> results *almost immediately (within O(100) milliseconds)*. We want to achieve 
> this *without changing the high-level DataFrame/Dataset API* that users 
> already use – so existing streaming queries can run in this new 
> ultra-low-latency mode by simply turning it on, without rewriting their logic.
> In short, we’re trying to enable Spark to power *real-time applications* 
> (like instant anomaly alerts or live personalization) that today cannot meet 
> their latency requirements with Spark’s current streaming engine.
>  
> SPIP doc: 
> [https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]






[jira] [Created] (SPARK-52330) SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-27 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-52330:
-

 Summary: SPIP: Real-Time Mode in Apache Spark Structured Streaming
 Key: SPARK-52330
 URL: https://issues.apache.org/jira/browse/SPARK-52330
 Project: Spark
  Issue Type: Umbrella
  Components: Structured Streaming
Affects Versions: 4.1.0
Reporter: Boyang Jerry Peng


We propose to add a *real-time mode* in Spark Structured Streaming that 
significantly lowers end-to-end latency for processing streams of data. Our 
goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already 
use – so existing streaming queries can run in this new ultra-low-latency mode 
by simply turning it on, without rewriting their logic. In short, we’re trying 
to enable Spark to power *real-time applications* (like instant anomaly alerts 
or live personalization) that today cannot meet their latency requirements with 
Spark’s current streaming engine.

 

SPIP doc: 
https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing
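To make the “turn it on without rewriting logic” goal concrete, here is a 
hedged sketch of an ordinary Structured Streaming query where only a mode 
switch would change. The config key `spark.sql.streaming.realTimeMode.enabled` 
is a hypothetical placeholder for illustration; the SPIP doc, not this sketch, 
defines the actual API.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()

# Hypothetical switch -- illustrative name only; the SPIP defines the real one.
spark.conf.set("spark.sql.streaming.realTimeMode.enabled", "true")

# Unchanged high-level DataFrame logic: source, transformation, sink.
events = spark.readStream.format("rate").load()
alerts = events.filter(events.value % 100 == 0)  # stand-in for anomaly detection

query = alerts.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```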






[jira] [Updated] (SPARK-52330) SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-27 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-52330:
--
Description: 
The SPIP proposes a new execution mode called “*Real-time Mode*” in Spark 
Structured Streaming that significantly lowers end-to-end latency for 
processing streams of data.

Our goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already use – 
so existing streaming queries can run in this new ultra-low-latency mode by 
simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power *real-time applications* (like 
instant anomaly alerts or live personalization) that today cannot meet their 
latency requirements with Spark’s current streaming engine.

 

SPIP doc: 
[https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]

  was:
The SPIP proposes to add a new execution mode called “*Real-time Mode*” in 
Spark Structured Streaming that significantly lowers end-to-end latency for 
processing streams of data.

Our goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already use – 
so existing streaming queries can run in this new ultra-low-latency mode by 
simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power *real-time applications* (like 
instant anomaly alerts or live personalization) that today cannot meet their 
latency requirements with Spark’s current streaming engine.

 

SPIP doc: 
[https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]


> SPIP: Real-Time Mode in Apache Spark Structured Streaming
> -
>
> Key: SPARK-52330
> URL: https://issues.apache.org/jira/browse/SPARK-52330
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 4.1.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> The SPIP proposes a new execution mode called “*Real-time Mode*” in 
> Spark Structured Streaming that significantly lowers end-to-end latency for 
> processing streams of data.
> Our goal is to make Spark capable of handling streaming jobs that need 
> results *almost immediately (within O(100) milliseconds)*. We want to achieve 
> this *without changing the high-level DataFrame/Dataset API* that users 
> already use – so existing streaming queries can run in this new 
> ultra-low-latency mode by simply turning it on, without rewriting their logic.
> In short, we’re trying to enable Spark to power *real-time applications* 
> (like instant anomaly alerts or live personalization) that today cannot meet 
> their latency requirements with Spark’s current streaming engine.
>  
> SPIP doc: 
> [https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]






[jira] [Updated] (SPARK-52330) SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-27 Thread Boyang Jerry Peng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Jerry Peng updated SPARK-52330:
--
Description: 
We propose to add a *real-time mode* in Spark Structured Streaming that 
significantly lowers end-to-end latency for processing streams of data.

Our goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already use – 
so existing streaming queries can run in this new ultra-low-latency mode by 
simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power *real-time applications* (like 
instant anomaly alerts or live personalization) that today cannot meet their 
latency requirements with Spark’s current streaming engine.

 

SPIP doc: 
[https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]

  was:
We propose to add a *real-time mode* in Spark Structured Streaming that 
significantly lowers end-to-end latency for processing streams of data. Our 
goal is to make Spark capable of handling streaming jobs that need results 
*almost immediately (within O(100) milliseconds)*. We want to achieve this 
*without changing the high-level DataFrame/Dataset API* that users already 
use – so existing streaming queries can run in this new ultra-low-latency mode 
by simply turning it on, without rewriting their logic. In short, we’re trying 
to enable Spark to power *real-time applications* (like instant anomaly alerts 
or live personalization) that today cannot meet their latency requirements with 
Spark’s current streaming engine.

 

SPIP doc: 
https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing


> SPIP: Real-Time Mode in Apache Spark Structured Streaming
> -
>
> Key: SPARK-52330
> URL: https://issues.apache.org/jira/browse/SPARK-52330
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 4.1.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> We propose to add a *real-time mode* in Spark Structured Streaming that 
> significantly lowers end-to-end latency for processing streams of data.
> Our goal is to make Spark capable of handling streaming jobs that need 
> results *almost immediately (within O(100) milliseconds)*. We want to achieve 
> this *without changing the high-level DataFrame/Dataset API* that users 
> already use – so existing streaming queries can run in this new 
> ultra-low-latency mode by simply turning it on, without rewriting their logic.
> In short, we’re trying to enable Spark to power *real-time applications* 
> (like instant anomaly alerts or live personalization) that today cannot meet 
> their latency requirements with Spark’s current streaming engine.
>  
> SPIP doc: 
> [https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing]






[jira] [Updated] (SPARK-52327) Glob based history provider

2025-05-27 Thread Gaurav Waghmare (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Waghmare updated SPARK-52327:

Description: 
Currently, the Spark history server runs against a single base directory whose 
immediate subdirectories hold the event logs for each application.

There are use cases, e.g. multi-tenancy, where for logical separation the event 
logs are stored in separate per-tenant directories. To support this, instead of 
the path of a single base directory, a glob for the tenant directories could be 
provided and used in a separate history provider similar to 
`org.apache.spark.deploy.history.FsHistoryProvider`.

  was:
Currently, the Spark history server runs against a single base directory whose 
immediate subdirectories hold the event logs for each application.

There are use cases, e.g. multi-tenancy, where for logical separation the event 
logs are stored in separate per-tenant directories. To support this, instead of 
the path of a single base directory, a glob for the tenant directories could be 
provided and used in a separate history provider.


> Glob based history provider
> ---
>
> Key: SPARK-52327
> URL: https://issues.apache.org/jira/browse/SPARK-52327
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gaurav Waghmare
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently, the Spark history server runs against a single base directory 
> whose immediate subdirectories hold the event logs for each application.
> There are use cases, e.g. multi-tenancy, where for logical separation the 
> event logs are stored in separate per-tenant directories. To support this, 
> instead of the path of a single base directory, a glob for the tenant 
> directories could be provided and used in a separate history provider similar 
> to `org.apache.spark.deploy.history.FsHistoryProvider`.
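A configuration sketch of the idea follows. `spark.history.provider` and 
`spark.history.fs.logDirectory` are existing history-server settings; the 
provider class name `GlobFsHistoryProvider` is hypothetical and only 
illustrates where a glob-aware implementation would plug in:

```
# spark-defaults.conf for the history server (sketch, not a shipped feature)

# Hypothetical glob-aware provider, plugged in via the existing provider knob:
spark.history.provider        org.apache.spark.deploy.history.GlobFsHistoryProvider

# One event-log directory per tenant, matched by a glob instead of one base dir:
spark.history.fs.logDirectory hdfs:///spark-logs/tenant-*/events
```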






[jira] [Created] (SPARK-52329) Remove private[sql] tags for new transformWithState API

2025-05-27 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-52329:
--

 Summary: Remove private[sql] tags for new transformWithState API
 Key: SPARK-52329
 URL: https://issues.apache.org/jira/browse/SPARK-52329
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0, 4.1.0
Reporter: Anish Shrigondekar


Remove private[sql] tags for new transformWithState API






[jira] [Assigned] (SPARK-52313) Correctly resolve reference data type for Views with default collation

2025-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-52313:
---

Assignee: Marko Ilic

> Correctly resolve reference data type for Views with default collation
> --
>
> Key: SPARK-52313
> URL: https://issues.apache.org/jira/browse/SPARK-52313
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Marko Ilic
>Assignee: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-52313) Correctly resolve reference data type for Views with default collation

2025-05-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-52313.
-
Fix Version/s: 4.1.0
   4.0.1
   Resolution: Fixed

Issue resolved by pull request 51023
[https://github.com/apache/spark/pull/51023]

> Correctly resolve reference data type for Views with default collation
> --
>
> Key: SPARK-52313
> URL: https://issues.apache.org/jira/browse/SPARK-52313
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.1.0
>Reporter: Marko Ilic
>Assignee: Marko Ilic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0, 4.0.1
>
>







[jira] [Assigned] (SPARK-52329) Remove private[sql] tags for new transformWithState API

2025-05-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-52329:


Assignee: Anish Shrigondekar

> Remove private[sql] tags for new transformWithState API
> ---
>
> Key: SPARK-52329
> URL: https://issues.apache.org/jira/browse/SPARK-52329
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Remove private[sql] tags for new transformWithState API






[jira] [Resolved] (SPARK-52329) Remove private[sql] tags for new transformWithState API

2025-05-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-52329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-52329.
--
Fix Version/s: 4.1.0
   4.0.1
   Resolution: Fixed

Issue resolved by pull request 51033
[https://github.com/apache/spark/pull/51033]

> Remove private[sql] tags for new transformWithState API
> ---
>
> Key: SPARK-52329
> URL: https://issues.apache.org/jira/browse/SPARK-52329
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0, 4.1.0
>Reporter: Anish Shrigondekar
>Assignee: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0, 4.0.1
>
>
> Remove private[sql] tags for new transformWithState API






[jira] [Commented] (SPARK-52333) Squeeze protocol for timers (list on specific grouping key, and expiry timers)

2025-05-27 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-52333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954468#comment-17954468
 ] 

Jungtaek Lim commented on SPARK-52333:
--

Going to submit a PR for this, probably today.

> Squeeze protocol for timers (list on specific grouping key, and expiry timers)
> --
>
> Key: SPARK-52333
> URL: https://issues.apache.org/jira/browse/SPARK-52333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 4.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> As we did for ListState and MapState, we found that inlining timers into the 
> proto message gives a large benefit for state interaction 
> (intercommunication). This ticket aims to apply the same change to listing 
> timers for a grouping key and to expiry timers.






[jira] [Created] (SPARK-52333) Squeeze protocol for timers (list on specific grouping key, and expiry timers)

2025-05-27 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-52333:


 Summary: Squeeze protocol for timers (list on specific grouping 
key, and expiry timers)
 Key: SPARK-52333
 URL: https://issues.apache.org/jira/browse/SPARK-52333
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Structured Streaming
Affects Versions: 4.1.0
Reporter: Jungtaek Lim


As we did for ListState and MapState, we found that inlining timers into the 
proto message gives a large benefit for state interaction (intercommunication). 
This ticket aims to apply the same change to listing timers for a grouping key 
and to expiry timers.
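A sketch of the intended message shape (names are illustrative, not the actual 
messages in Spark's state-server proto): inlining means the expiry timestamps 
come back as a repeated field in a single response instead of one round trip 
per timer.

```
// Illustrative .proto sketch only; the real Spark message names differ.
syntax = "proto3";

message ListTimersResponse {
  // Timers inlined for the current grouping key: a single response
  // replaces per-timer round trips between the JVM and the Python worker.
  repeated int64 expiry_timestamp_ms = 1;
}
```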


