[jira] [Reopened] (SPARK-48505) Simplify the implementation of Utils#isG1GC

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reopened SPARK-48505:
--

> Simplify the implementation of Utils#isG1GC
> ---
>
> Key: SPARK-48505
> URL: https://issues.apache.org/jira/browse/SPARK-48505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48533) Add test for cached schema

2024-06-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48533:


Assignee: Ruifeng Zheng

> Add test for cached schema
> --
>
> Key: SPARK-48533
> URL: https://issues.apache.org/jira/browse/SPARK-48533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48533) Add test for cached schema

2024-06-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48533.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46871
[https://github.com/apache/spark/pull/46871]

> Add test for cached schema
> --
>
> Key: SPARK-48533
> URL: https://issues.apache.org/jira/browse/SPARK-48533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-48534) Support interruptOperation in streaming queries

2024-06-04 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-48534:


 Summary: Support interruptOperation in streaming queries
 Key: SPARK-48534
 URL: https://issues.apache.org/jira/browse/SPARK-48534
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


Similar to https://issues.apache.org/jira/browse/SPARK-48485, but we should 
also add interruptOperation.
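
For context, a minimal usage sketch (my own illustration, not code from any 
PR): it assumes a PySpark Connect session on Spark 4.0 where SPARK-48485 has 
already made interruptTag/interruptAll cover streaming queries; the remote 
URL, formats, and tag name are placeholders.
{code:python}
# Hedged sketch of the tag-based interrupts extended to streaming queries by
# SPARK-48485; interruptOperation(op_id) would be the per-operation
# counterpart this ticket asks for.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

spark.addTag("rate-stream")  # tag operations started from this thread
query = (
    spark.readStream.format("rate").load()
         .writeStream.format("console").start()
)

spark.interruptTag("rate-stream")  # interrupt everything carrying the tag
spark.interruptAll()               # or interrupt every operation in the session

# interruptOperation("<operation id>") would analogously target a single
# operation once streaming queries can be addressed by their operation id.
{code}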






[jira] [Updated] (SPARK-48422) Serialize all data at once may cause MemoryError

2024-06-04 Thread ZhouYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhouYang updated SPARK-48422:
-
Description: 
In worker.py, there is a function called process() where the iterator loads 
all the data at once:
{code:java}
def process():
       iterator = deserializer.load_stream(infile)
       serializer.dump_stream(func(split_index, iterator), outfile){code}
This can cause a MemoryError when working on large-scale data; I have indeed 
encountered this situation, as shown below:
{code:java}
MemoryError at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203) at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244) at 
org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162) at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
 at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89) at 
org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:355) 
at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1721)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:353) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 2024-05-22 16:50:03,173 INFO 
org.apache.spark.scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 (TID 
21, saturndatanode3, executor 2, partition 0, ANY, 5075 bytes) 2024-05-22 
16:50:03,174 INFO org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
stage 5.0 (TID 19) on saturndatanode3, executor 2: 
org.apache.spark.api.python.PythonException (Traceback (most recent call last): 
File "xx/spark/python/lib/pyspark.zip/pyspark/worker.py", line 200, in main 
process() File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
195, in process serializer.dump_stream(func(split_index, iterator), outfile) 
File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in 
 func = lambda _, it: map(mapper, it) File "", line 1, in 
 File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 73, 
in  return lambda *a: f(*a){code}
I did some tests by adding memory-monitoring code, and found that this code 
takes up a lot of memory during execution:
{code:python}
import resource

start_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the beginning: {start_memory} KB")

iterator = deserializer.load_stream(infile)
serializer.dump_stream(func(split_index, iterator), outfile)

end_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the end: {end_memory} KB")

memory_difference = end_memory - start_memory
print(f"Memory usage changes: {memory_difference} KB"){code}
Can I process the data in the iterator in batches as below?
{code:python}
def process():
    iterator = deserializer.load_stream(infile)

    def batched_func(iterator, func, serializer, outfile):
        batch = []
        count = 0
        for item in iterator:
            batch.append(item)
            count += 1
            # Process the data in the iterator in batches, one entry at a time.
            if count >= 1:
                serializer.dump_stream(func(split_index, batch), outfile)
                batch = []
                count = 0
        # Flush any remaining items once the iterator is exhausted.
        if batch:
            serializer.dump_stream(func(split_index, batch), outfile)

    batched_func(iterator, func, serializer, outfile){code}
I tested with the code above, and it works well with lower memory usage each 
time.
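
As an aside, here is a minimal alternative sketch of the same batching idea 
(my own illustration, not the actual worker.py change): it assumes the 
surrounding worker.py names (deserializer, serializer, func, split_index, 
infile, outfile) are in scope, and BATCH_SIZE is an arbitrary illustrative 
value.
{code:python}
from itertools import islice

BATCH_SIZE = 100  # illustrative; trades peak memory for serialization overhead

def process_batched():
    iterator = deserializer.load_stream(infile)
    while True:
        # Materialize at most BATCH_SIZE records at a time instead of the
        # whole stream, keeping peak memory bounded.
        batch = list(islice(iterator, BATCH_SIZE))
        if not batch:
            break
        serializer.dump_stream(func(split_index, batch), outfile)
{code}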

 

  was:
In worker.py, there is a function called process(), the iterator loads all data 
at once
{code:java}
def process():
       iterator = deserializer.load_stream(infile)
       serializer.dump_stream(func(split_index, iterator), outfile){code}
It will cause MemoryError when working on large scale data, For the reason that 
I have indeed encountered this situation as below:
{code:java}

[jira] [Updated] (SPARK-48422) Serialize all data at once may cause MemoryError

2024-06-04 Thread ZhouYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhouYang updated SPARK-48422:
-
Summary: Serialize all data at once may cause MemoryError  (was: Using 
lambda may cause MemoryError)

> Serialize all data at once may cause MemoryError
> 
>
> Key: SPARK-48422
> URL: https://issues.apache.org/jira/browse/SPARK-48422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: ZhouYang
>Priority: Critical
>
> In worker.py, there is a function called process(), the iterator loads all 
> data at once
> {code:java}
> def process():
>        iterator = deserializer.load_stream(infile)
>        serializer.dump_stream(func(split_index, iterator), outfile){code}
> It will cause MemoryError when working on large scale data, For the reason 
> that I have indeed encountered this situation as below:
> {code:java}
> MemoryError at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203) at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244) 
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162) at 
> org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
>  at 
> org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89) at 
> org.apache.spark.scheduler.Task.run(Task.scala:109) at 
> org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:355) 
> at java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1721)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:353) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) 2024-05-22 16:50:03,173 INFO 
> org.apache.spark.scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 
> (TID 21, saturndatanode3, executor 2, partition 0, ANY, 5075 bytes) 
> 2024-05-22 16:50:03,174 INFO org.apache.spark.scheduler.TaskSetManager: Lost 
> task 1.0 in stage 5.0 (TID 19) on saturndatanode3, executor 2: 
> org.apache.spark.api.python.PythonException (Traceback (most recent call 
> last): File "xx/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 200, in main process() File 
> "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 195, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) File 
> "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in  
> func = lambda _, it: map(mapper, it) File "", line 1, in  
> File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 73, in 
>  return lambda *a: f(*a){code}
> I did some tests by adding memory monitor code, I found that this code takes 
> up a lot of memory during execution:
> {code:java}
> start_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
> print(f"Memory usage at the beginning: {start_memory} KB")iterator = 
> deserializer.load_stream(infile)
> serializer.dump_stream(func(split_index, iterator), outfile)end_memory = 
> resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
> print(f"Memory usage at the end: {end_memory} KB")memory_difference = 
> end_memory - start_memory
> print(f"Memory usage changes:{memory_difference} KB"){code}
> Can I process the data in the iterator in batches as below?
> {code:java}
> def process():
>     iterator = deserializer.load_stream(infile)
>     def batched_func(iterator, func, serializer, outputfile):
>         batch = []
>         count = 0
>         for item in iterator:
>             batch.append(item)
>             count += 1
>             // Process the data in the iterator in batches, with 1 
> entries each time.
>             if count >= 1:
>                 serializer.dump_stream(func(split_index, batch), outfile)
>                 batch = []
>            

[jira] [Updated] (SPARK-48533) Add test for cached schema

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48533:
---
Labels: pull-request-available  (was: )

> Add test for cached schema
> --
>
> Key: SPARK-48533
> URL: https://issues.apache.org/jira/browse/SPARK-48533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48533) Add test for cached schema

2024-06-04 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48533:
-

 Summary: Add test for cached schema
 Key: SPARK-48533
 URL: https://issues.apache.org/jira/browse/SPARK-48533
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Updated] (SPARK-48532) Upgrade maven plugin to latest version

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48532:
---
Labels: pull-request-available  (was: )

> Upgrade maven plugin to latest version
> --
>
> Key: SPARK-48532
> URL: https://issues.apache.org/jira/browse/SPARK-48532
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48523) Add `grpc_max_message_size ` description to `client-connection-string.md`

2024-06-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48523.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46862
[https://github.com/apache/spark/pull/46862]

> Add `grpc_max_message_size ` description to `client-connection-string.md`
> -
>
> Key: SPARK-48523
> URL: https://issues.apache.org/jira/browse/SPARK-48523
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48523) Add `grpc_max_message_size ` description to `client-connection-string.md`

2024-06-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48523:


Assignee: BingKun Pan

> Add `grpc_max_message_size ` description to `client-connection-string.md`
> -
>
> Key: SPARK-48523
> URL: https://issues.apache.org/jira/browse/SPARK-48523
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48485) Support interruptTag and interruptAll in streaming queries

2024-06-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48485.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46819
[https://github.com/apache/spark/pull/46819]

> Support interruptTag and interruptAll in streaming queries
> --
>
> Key: SPARK-48485
> URL: https://issues.apache.org/jira/browse/SPARK-48485
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Spark Connect's interrupt API does not interrupt streaming queries. We should 
> support them.






[jira] [Updated] (SPARK-48495) Document planned approach to shredding

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48495:
---
Labels: pull-request-available  (was: )

> Document planned approach to shredding
> --
>
> Key: SPARK-48495
> URL: https://issues.apache.org/jira/browse/SPARK-48495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: David Cashman
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48307) InlineCTE should keep not-inlined relations in the original WithCTE node

2024-06-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48307.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46617
[https://github.com/apache/spark/pull/46617]

> InlineCTE should keep not-inlined relations in the original WithCTE node
> 
>
> Key: SPARK-48307
> URL: https://issues.apache.org/jira/browse/SPARK-48307
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Resolved] (SPARK-48528) Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` versions

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48528.
---
Fix Version/s: kubernetes-operator-0.1.0
 Assignee: Dongjoon Hyun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark-kubernetes-operator/pull/14

> Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` 
> versions
> ---
>
> Key: SPARK-48528
> URL: https://issues.apache.org/jira/browse/SPARK-48528
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: kubernetes-operator-0.1.0
>
>







[jira] [Resolved] (SPARK-48531) Fix `Black` target version to Python 3.9

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48531.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46867
[https://github.com/apache/spark/pull/46867]

> Fix `Black` target version to Python 3.9
> 
>
> Key: SPARK-48531
> URL: https://issues.apache.org/jira/browse/SPARK-48531
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48531) Fix `Black` target version to Python 3.9

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48531:
-

Assignee: Dongjoon Hyun

> Fix `Black` target version to Python 3.9
> 
>
> Key: SPARK-48531
> URL: https://issues.apache.org/jira/browse/SPARK-48531
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48531) Fix `Black` target version to Python 3.9

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48531:
---
Labels: pull-request-available  (was: )

> Fix `Black` target version to Python 3.9
> 
>
> Key: SPARK-48531
> URL: https://issues.apache.org/jira/browse/SPARK-48531
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48531) Fix `Black` target version to Python 3.9

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48531:
--
Parent: SPARK-44111
Issue Type: Sub-task  (was: Improvement)

> Fix `Black` target version to Python 3.9
> 
>
> Key: SPARK-48531
> URL: https://issues.apache.org/jira/browse/SPARK-48531
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Created] (SPARK-48531) Fix `Black` target version to Python 3.9

2024-06-04 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48531:
-

 Summary: Fix `Black` target version to Python 3.9
 Key: SPARK-48531
 URL: https://issues.apache.org/jira/browse/SPARK-48531
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-48528) Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` versions

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48528:
--
Summary: Refine K8s Operator `merge_spark_pr.py` to use 
`kubernetes-operator-x.y.z` versions  (was: Refine K8s Operator 
`merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version only)

> Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` 
> versions
> ---
>
> Key: SPARK-48528
> URL: https://issues.apache.org/jira/browse/SPARK-48528
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48530) [M0] Support for local variables

2024-06-04 Thread David Milicevic (Jira)
David Milicevic created SPARK-48530:
---

 Summary: [M0] Support for local variables
 Key: SPARK-48530
 URL: https://issues.apache.org/jira/browse/SPARK-48530
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


At the moment, variables in SQL scripts create session variables. We don't 
want this; we want variables to be treated as local (scoped to the 
block/compound).

 

To achieve this, we probably need to wait for label support. Once we have it, 
we can prefix variable names with their labels to distinguish variables that 
share a name, and only then reuse the session-variable mechanism to store 
values under such composed names.

If the block/compound doesn't have a label, we should generate one 
automatically (a GUID or something similar).
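
As a toy illustration of the composed-name idea (plain Python for readability, 
not the Spark Core implementation; all names here are made up):
{code:python}
import uuid

def compound_label(user_label=None):
    # Use the user-provided label, or auto-generate one (GUID-like) for
    # unlabeled blocks/compounds.
    return user_label if user_label is not None else f"anon_{uuid.uuid4().hex}"

def qualified_variable_name(label, name):
    # Prefixing with the label keeps same-named variables from different
    # blocks distinct when stored via the session-variable mechanism.
    return f"{label}.{name}"

outer = compound_label("outer_block")
inner = compound_label()  # unlabeled block gets a generated label
print(qualified_variable_name(outer, "x"))  # outer_block.x
print(qualified_variable_name(inner, "x"))  # anon_<guid>.x
{code}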






[jira] [Updated] (SPARK-48528) Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version only

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48528:
---
Labels: pull-request-available  (was: )

> Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` 
> version only
> ---
>
> Key: SPARK-48528
> URL: https://issues.apache.org/jira/browse/SPARK-48528
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48382) Add controller / reconciler module to operator

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48382:
-

Assignee: Zhou JIANG

> Add controller / reconciler module to operator
> --
>
> Key: SPARK-48382
> URL: https://issues.apache.org/jira/browse/SPARK-48382
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-48529) [M0] Support for labels

2024-06-04 Thread David Milicevic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852119#comment-17852119
 ] 

David Milicevic commented on SPARK-48529:
-

[~milan.dankovic] is working on designing this.

> [M0] Support for labels
> ---
>
> Key: SPARK-48529
> URL: https://issues.apache.org/jira/browse/SPARK-48529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for labels to SQL parser.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]






[jira] [Updated] (SPARK-48529) [M0] Support for labels

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48529:

Description: 
Add support for labels to SQL parser.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]

  was:Add support for labels to SQL parser.


> [M0] Support for labels
> ---
>
> Key: SPARK-48529
> URL: https://issues.apache.org/jira/browse/SPARK-48529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for labels to SQL parser.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]






[jira] [Created] (SPARK-48529) [M0] Support for labels

2024-06-04 Thread David Milicevic (Jira)
David Milicevic created SPARK-48529:
---

 Summary: [M0] Support for labels
 Key: SPARK-48529
 URL: https://issues.apache.org/jira/browse/SPARK-48529
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Add support for labels to SQL parser.






[jira] [Created] (SPARK-48528) Refine K8s Operator `merge_spark_pr.py` to use `kubernetes-operator-x.y.z` version only

2024-06-04 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-48528:
-

 Summary: Refine K8s Operator `merge_spark_pr.py` to use 
`kubernetes-operator-x.y.z` version only
 Key: SPARK-48528
 URL: https://issues.apache.org/jira/browse/SPARK-48528
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: kubernetes-operator-0.1.0
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-48527) [M0] Thrift investigation

2024-06-04 Thread David Milicevic (Jira)
David Milicevic created SPARK-48527:
---

 Summary: [M0] Thrift investigation
 Key: SPARK-48527
 URL: https://issues.apache.org/jira/browse/SPARK-48527
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


Some notebook modes (SQL Warehouse, what else?) execute SQL commands through 
the SQL Gateway and Thrift stacks.

We need to:
 - Figure out why SQL script execution is failing in these cases.

 - Understand the SQL Gateway + Thrift stack better, so we can more easily 
propose a design for the new API(s) we are going to introduce in the future.

 

For more details, design doc can be found in parent Jira item.






[jira] [Updated] (SPARK-48526) Allow passing custom sink to StreamTest::testStream

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48526:
---
Labels: pull-request-available  (was: )

> Allow passing custom sink to StreamTest::testStream
> ---
>
> Key: SPARK-48526
> URL: https://issues.apache.org/jira/browse/SPARK-48526
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Johan Lasperas
>Priority: Trivial
>  Labels: pull-request-available
>
> The testing helpers for streaming don't allow providing a custom sink, this 
> is limiting in (at least) two ways:
>  * A sink can't be reused across multiple calls to `testStream`, e.g. when 
> canceling and resuming streaming
>  * A custom sink implementation other than `MemorySink` can't be provided. A 
> use case here is for example to test the Delta streaming sink by wrapping it 
> in a MemorySink interface and passing it to the test framework.






[jira] [Created] (SPARK-48526) Allow passing custom sink to StreamTest::testStream

2024-06-04 Thread Johan Lasperas (Jira)
Johan Lasperas created SPARK-48526:
--

 Summary: Allow passing custom sink to StreamTest::testStream
 Key: SPARK-48526
 URL: https://issues.apache.org/jira/browse/SPARK-48526
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Johan Lasperas


The testing helpers for streaming don't allow providing a custom sink; this is 
limiting in (at least) two ways:
 * A sink can't be reused across multiple calls to `testStream`, e.g. when 
canceling and resuming streaming
 * A custom sink implementation other than `MemorySink` can't be provided. A 
use case here is for example to test the Delta streaming sink by wrapping it in 
a MemorySink interface and passing it to the test framework.






[jira] [Updated] (SPARK-48377) [M1] Multiple results API - sqlScript()

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48377:

Summary: [M1] Multiple results API - sqlScript()  (was: Multiple results 
API - sqlScript())

> [M1] Multiple results API - sqlScript()
> ---
>
> Key: SPARK-48377
> URL: https://issues.apache.org/jira/browse/SPARK-48377
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> For now:
>  * Write an API proposal
>  ** The API itself should be fine, but we need to figure out what the result 
> set should look like, i.e. in what format we return multiple DataFrames.
>  ** The result set should be compatible with CALL and EXECUTE IMMEDIATE as 
> well.
>  * Figure out how the API will propagate down the Spark Connect stack 
> (depends on SPARK-48452 investigation)
>  
> Probably to be separated into multiple subtasks in the future.






[jira] [Updated] (SPARK-48375) [M1] Support for SIGNAL statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48375:

Summary: [M1] Support for SIGNAL statement  (was: Support for SIGNAL 
statement)

> [M1] Support for SIGNAL statement
> -
>
> Key: SPARK-48375
> URL: https://issues.apache.org/jira/browse/SPARK-48375
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]






[jira] [Updated] (SPARK-48456) [M1] Performance benchmark

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48456:

Summary: [M1] Performance benchmark  (was: Performance benchmark)

> [M1] Performance benchmark
> --
>
> Key: SPARK-48456
> URL: https://issues.apache.org/jira/browse/SPARK-48456
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Performance parity is officially an M2 requirement, but by the end of M0 I 
> think we should start doing some perf benchmarks to figure out where we 
> stand at the beginning and whether we need to change something right from 
> the start before we get to work on more complex stuff.






[jira] [Updated] (SPARK-48376) [M1] Support for ITERATE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48376:

Summary: [M1] Support for ITERATE statement  (was: Support for ITERATE 
statement)

> [M1] Support for ITERATE statement
> --
>
> Key: SPARK-48376
> URL: https://issues.apache.org/jira/browse/SPARK-48376
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for ITERATE statement in WHILE (and other) loops to SQL scripting 
> parser & interpreter.
> This is the same functionality as CONTINUE in other languages.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].






[jira] [Updated] (SPARK-48326) Use the official Apache Spark 4.0.0-preview1

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48326:
--
Fix Version/s: kubernetes-operator-0.1.0
   (was: 4.0.0)

> Use the official Apache Spark 4.0.0-preview1
> 
>
> Key: SPARK-48326
> URL: https://issues.apache.org/jira/browse/SPARK-48326
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: kubernetes-operator-0.1.0
>
>







[jira] [Updated] (SPARK-48349) [M1] Support for debugging

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48349:

Summary: [M1] Support for debugging  (was: Support for debugging)

> [M1] Support for debugging
> --
>
> Key: SPARK-48349
> URL: https://issues.apache.org/jira/browse/SPARK-48349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> TBD.
> Probably to be separated into multiple subtasks.






[jira] [Updated] (SPARK-48326) Use the official Apache Spark 4.0.0-preview1

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48326:
--
Summary: Use the official Apache Spark 4.0.0-preview1  (was: Upgrade 
submission worker base Spark version to 4.0.0-preview2)

> Use the official Apache Spark 4.0.0-preview1
> 
>
> Key: SPARK-48326
> URL: https://issues.apache.org/jira/browse/SPARK-48326
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48455) [M1] Public documentation

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48455:

Summary: [M1] Public documentation  (was: Public documentation)

> [M1] Public documentation
> -
>
> Key: SPARK-48455
> URL: https://issues.apache.org/jira/browse/SPARK-48455
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> I guess this shouldn't be anything revolutionary, just a basic doc with SQL 
> Scripting grammar and functions explained properly.
>  
> We might want to sync with Serge about this to figure out if he has any 
> thoughts before we start working on it.






[jira] [Updated] (SPARK-48453) [M1] Support for PRINT/TRACE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48453:

Summary: [M1] Support for PRINT/TRACE statement  (was: Support for 
PRINT/TRACE statement)

> [M1] Support for PRINT/TRACE statement
> --
>
> Key: SPARK-48453
> URL: https://issues.apache.org/jira/browse/SPARK-48453
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> This is not defined in the [Ref 
> Spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit#heading=h.4cz970y1mk93], 
> but during the POC we figured out that it might be useful.
> We still need to figure out the details when we get to it, because the 
> propagation to the client and the UI on the client side might not be 
> trivial; this needs further investigation.






[jira] [Updated] (SPARK-48525) [M0] Private documentation

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48525:

Summary: [M0] Private documentation  (was: Private documentation)

> [M0] Private documentation
> --
>
> Key: SPARK-48525
> URL: https://issues.apache.org/jira/browse/SPARK-48525
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> We do need some form of documentation for Private Preview - e.g. we used a 
> PDF doc for Collations.






[jira] [Updated] (SPARK-48356) [M1] Support for FOR statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48356:

Summary: [M1] Support for FOR statement  (was: Support for FOR statement)

> [M1] Support for FOR statement
> --
>
> Key: SPARK-48356
> URL: https://issues.apache.org/jira/browse/SPARK-48356
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].






[jira] [Updated] (SPARK-48358) [M1] Support for REPEAT statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48358:

Summary: [M1] Support for REPEAT statement  (was: Support for REPEAT 
statement)

> [M1] Support for REPEAT statement
> -
>
> Key: SPARK-48358
> URL: https://issues.apache.org/jira/browse/SPARK-48358
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].






[jira] [Updated] (SPARK-48357) [M1] Support for LOOP statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Summary: [M1] Support for LOOP statement  (was: Support for LOOP statement)

> [M1] Support for LOOP statement
> ---
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for LOOP statement.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].






[jira] [Updated] (SPARK-48388) [M0] Fix SET behavior for scripts

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48388:

Summary: [M0] Fix SET behavior for scripts  (was: Fix SET behavior for 
scripts)

> [M0] Fix SET behavior for scripts
> -
>
> Key: SPARK-48388
> URL: https://issues.apache.org/jira/browse/SPARK-48388
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> By the standard, SET is used to set variable values in SQL scripts.
> On our end, SET is configured to work with some Hive configs, so the grammar 
> is a bit messed up; for that reason it was decided to use SET VAR instead 
> of SET to work with SQL variables.
> This is not standard-compliant, and we should figure out a way to use 
> SET for SQL variables and forbid setting Hive configs from SQL scripts.
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Updated] (SPARK-48457) [M0] Testing and operational readiness

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48457:

Summary: [M0] Testing and operational readiness  (was: Testing and 
operational readiness)

> [M0] Testing and operational readiness
> --
>
> Key: SPARK-48457
> URL: https://issues.apache.org/jira/browse/SPARK-48457
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> We are basically doing this as we are developing the feature. This work item 
> should serve as a checkpoint by the end of M0 to figure out if we have 
> covered everything.
>  
> Testing itself is clearly defined.
> For the operational readiness part, we still have to figure out what exactly 
> we can do in the case of SQL scripting. It's a really straightforward feature, 
> and public documentation should serve well enough for most of the issues we 
> might encounter. But we should probably think about:
>  * Some KPI indicators.
>  * Telemetry.
>  * Something else?






[jira] [Updated] (SPARK-48346) [M0] Support for IF ELSE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48346:

Summary: [M0] Support for IF ELSE statement  (was: Support for IF ELSE 
statement)

> [M0] Support for IF ELSE statement
> --
>
> Key: SPARK-48346
> URL: https://issues.apache.org/jira/browse/SPARK-48346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for IF ELSE statements to SQL scripting parser & interpreter:
>  * IF
>  * IF / ELSE
>  * IF / ELSE IF / ELSE
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Assigned] (SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-48326:
-

Assignee: Dongjoon Hyun

> Upgrade submission worker base Spark version to 4.0.0-preview2
> --
>
> Key: SPARK-48326
> URL: https://issues.apache.org/jira/browse/SPARK-48326
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48347) [M0] Support for WHILE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48347:

Summary: [M0] Support for WHILE statement  (was: Support for WHILE 
statement)

> [M0] Support for WHILE statement
> 
>
> Key: SPARK-48347
> URL: https://issues.apache.org/jira/browse/SPARK-48347
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for WHILE statements to SQL scripting parser & interpreter.
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Updated] (SPARK-48355) [M1] Support for CASE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48355:

Summary: [M1] Support for CASE statement  (was: Support for CASE statement)

> [M1] Support for CASE statement
> ---
>
> Key: SPARK-48355
> URL: https://issues.apache.org/jira/browse/SPARK-48355
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]






[jira] [Updated] (SPARK-48452) [M0] Spark Connect investigation

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48452:

Summary: [M0] Spark Connect investigation  (was: Spark Connect 
investigation)

> [M0] Spark Connect investigation
> 
>
> Key: SPARK-48452
> URL: https://issues.apache.org/jira/browse/SPARK-48452
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Some notebook modes, VS code extension, etc. execute SQL commands through 
> Spark Connect.
> We need to:
> - Figure out the exceptions that we are getting in the Spark Connect stack 
> for SQL scripts.
> - Understand the Spark Connect stack better, so we can more easily propose 
> a design for the new API(s) we are going to introduce in the future.
>  
> For more details, design doc can be found in parent Jira item.






[jira] [Resolved] (SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2

2024-06-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-48326.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 13
[https://github.com/apache/spark-kubernetes-operator/pull/13]

> Upgrade submission worker base Spark version to 4.0.0-preview2
> --
>
> Key: SPARK-48326
> URL: https://issues.apache.org/jira/browse/SPARK-48326
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48348) [M0] Support for LEAVE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48348:

Summary: [M0] Support for LEAVE statement  (was: Support for LEAVE 
statement)

> [M0] Support for LEAVE statement
> 
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
> parser & interpreter.
> This is the same functionality as BREAK in other languages.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48345) [M0] Checks for variable declarations

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48345:

Summary: [M0] Checks for variable declarations  (was: Checks for variable 
declarations)

> [M0] Checks for variable declarations
> -
>
> Key: SPARK-48345
> URL: https://issues.apache.org/jira/browse/SPARK-48345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add checks to the parser (visitBatchBody() in AstBuilder) for variable 
> declarations, based on a passed-in flag:
>  * Variables can be declared only at the beginning of the compound.
>  * Throw an exception when an invalid variable declaration is encountered.
>  
> For more details, design doc can be found in parent Jira item.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48353) [M0] Support for TRY/CATCH statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48353:

Summary: [M0] Support for TRY/CATCH statement  (was: Support for TRY/CATCH 
statement)

> [M0] Support for TRY/CATCH statement
> 
>
> Key: SPARK-48353
> URL: https://issues.apache.org/jira/browse/SPARK-48353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec.|https://docs.google.com/document/d/1_UCvU3dYdcniV66akT1K6huWX4g7jpXDKaoPRDSZr2E/edit#heading=h.4cz970y1mk93]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48350) [M0] Support for exceptions thrown from parser/interpreter

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48350:

Summary: [M0] Support for exceptions thrown from parser/interpreter  (was: 
Support for exceptions thrown from parser/interpreter)

> [M0] Support for exceptions thrown from parser/interpreter
> --
>
> Key: SPARK-48350
> URL: https://issues.apache.org/jira/browse/SPARK-48350
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> In general, add support for SQL scripting related exceptions.
> By the time someone starts working on this item, some exception support might 
> already exist - check if it needs refactoring.
>  
> Keep in mind that for some (all?) exceptions we might need to know which 
> line(s) in the script are responsible for them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48343) [M0] Interpreter support

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48343:

Summary: [M0] Interpreter support  (was: Interpreter support)

> [M0] Interpreter support
> 
>
> Key: SPARK-48343
> URL: https://issues.apache.org/jira/browse/SPARK-48343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Implement interpreter for SQL scripting:
>  * Interpreter
>  * Interpreter testing
> For more details, design doc can be found in parent Jira item.
> Update design doc accordingly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48344) [M0] Changes to sql() API

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48344:

Summary: [M0] Changes to sql() API  (was: Changes to sql() API)

> [M0] Changes to sql() API
> -
>
> Key: SPARK-48344
> URL: https://issues.apache.org/jira/browse/SPARK-48344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Implement changes to sql() API to support SQL script execution:
>  * SparkSession changes
>  * sql() API changes - iterate through the script, but return only the last 
> DataFrame
>  * Spark Config flag to enable/disable SQL scripting in sql() API
>  * E2E testing
>  
> For more details, design doc can be found in parent Jira item.
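A minimal usage sketch of what the changes above imply, assuming a spark-shell style session where {{spark}} is a SparkSession; the config key and the script grammar shown here are assumptions based on this description and the design doc, not the finished API:

{code:scala}
// Hypothetical sketch only: the flag name and the exact script grammar are
// assumptions; they illustrate the intended behavior, not the final API.
spark.conf.set("spark.sql.scripting.enabled", "true")  // assumed enable flag

// sql() is expected to iterate through the whole script but hand back only the
// result of the last statement.
val result = spark.sql(
  """BEGIN
    |  DECLARE cnt INT DEFAULT 0;
    |  SET cnt = cnt + 1;
    |  SELECT cnt;
    |END""".stripMargin)

result.show()  // only the final SELECT's result should surface here
{code}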



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48342) [M0] Parser support

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48342:

Summary: [M0] Parser support  (was: Parser support)

> [M0] Parser support
> ---
>
> Key: SPARK-48342
> URL: https://issues.apache.org/jira/browse/SPARK-48342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Implement the parser for SQL scripting, with all supporting changes for the 
> upcoming interpreter implementation and future extensions of the parser:
>  * Parser - support only compound statements
>  * Parser testing
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48525) Private documentation

2024-06-04 Thread David Milicevic (Jira)
David Milicevic created SPARK-48525:
---

 Summary: Private documentation
 Key: SPARK-48525
 URL: https://issues.apache.org/jira/browse/SPARK-48525
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: David Milicevic


We do need some form of documentation for Private Preview - e.g. we used a PDF 
doc for Collations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48455) Public documentation

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48455:

Description: 
I guess this shouldn't be anything revolutionary, just a basic doc with SQL 
Scripting grammar and functions explained properly.

 

We might want to sync with Serge about this to figure out if he has any 
thoughts before we start working on it.

  was:
Public documentation is officially Milestone 1 requirement, but I think we 
should start working on this even during Milestone 0.


I guess this shouldn't be anything revolutionary, just a basic doc with SQL 
Scripting grammar and functions explained properly.

 

We might want to sync with Serge about this to figure out if he has any 
thoughts before we start working on it.


> Public documentation
> 
>
> Key: SPARK-48455
> URL: https://issues.apache.org/jira/browse/SPARK-48455
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> I guess this shouldn't be anything revolutionary, just a basic doc with SQL 
> Scripting grammar and functions explained properly.
>  
> We might want to sync with Serge about this to figure out if he has any 
> thoughts before we start working on it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LOOP statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Description: 
Add support for LOOP statement.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Details TBD.

Maybe split to multiple items?

 

LEAVE should be the equivalent to BREAK?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].


> Support for LOOP statement
> --
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for LOOP statement.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48326) Upgrade submission worker base Spark version to 4.0.0-preview2

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48326:
---
Labels: pull-request-available  (was: )

> Upgrade submission worker base Spark version to 4.0.0-preview2
> --
>
> Key: SPARK-48326
> URL: https://issues.apache.org/jira/browse/SPARK-48326
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48357) Support for LOOP statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48357:

Summary: Support for LOOP statement  (was: Support for LOOP and LEAVE 
statements)

> Support for LOOP statement
> --
>
> Key: SPARK-48357
> URL: https://issues.apache.org/jira/browse/SPARK-48357
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Details TBD.
> Maybe split to multiple items?
>  
> LEAVE should be the equivalent to BREAK?
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48376) Support for ITERATE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48376:

Description: 
Add support for ITERATE statement in WHILE (and other) loops to SQL scripting 
parser & interpreter.

This is the same functionality as CONTINUE in other languages.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Details TBD.

Maybe split to multiple items?

 

ITERATE should be the equivalent to CONTINUE?

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].


> Support for ITERATE statement
> -
>
> Key: SPARK-48376
> URL: https://issues.apache.org/jira/browse/SPARK-48376
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for ITERATE statement in WHILE (and other) loops to SQL scripting 
> parser & interpreter.
> This is the same functionality as CONTINUE in other languages.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].
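As a rough illustration of the intended behavior (labels and statement syntax follow the SQL/PSM-style reference spec linked above and are not confirmed against the final grammar), a loop that uses ITERATE to skip iterations could look like this when run through a scripting-enabled sql() call:

{code:scala}
// Illustrative sketch only; label placement and statement forms are assumptions
// taken from the SQL reference spec, not the finalized Spark grammar.
spark.sql(
  """BEGIN
    |  DECLARE i INT DEFAULT 0;
    |  DECLARE total INT DEFAULT 0;
    |  loop_label: WHILE i < 10 DO
    |    SET i = i + 1;
    |    IF i % 2 = 0 THEN
    |      ITERATE loop_label;   -- behaves like CONTINUE: skip even values of i
    |    END IF;
    |    SET total = total + i;
    |  END WHILE loop_label;
    |  SELECT total;             -- sum of the odd values 1..9, i.e. 25
    |END""".stripMargin)
{code}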



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48348) Support for LEAVE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48348:

Description: 
Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
parser & interpreter.

This is the same functionality as BREAK in other languages.

 

For more details:
 * Design doc in parent Jira item.
 * [SQL ref 
spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].

  was:
Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
parser & interpreter.

This is the same functionality as BREAK in other languages.

 

For more details, design doc can be found in parent Jira item.


> Support for LEAVE statement
> ---
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
> parser & interpreter.
> This is the same functionality as BREAK in other languages.
>  
> For more details:
>  * Design doc in parent Jira item.
>  * [SQL ref 
> spec|https://docs.google.com/document/d/1cpSuR3KxRuTSJ4ZMQ73FJ4_-hjouNNU2zfI4vri6yhs/edit].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48348) Support for LEAVE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48348:

Description: 
Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
parser & interpreter.

This is the same functionality as BREAK in other languages.

 

For more details, design doc can be found in parent Jira item.

  was:
Add support for BREAK and CONTINUE statements in WHILE loops to SQL scripting 
parser & interpreter.

 

For more details, design doc can be found in parent Jira item.


> Support for LEAVE statement
> ---
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for LEAVE statement in WHILE (and other) loops to SQL scripting 
> parser & interpreter.
> This is the same functionality as BREAK in other languages.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48348) Support for LEAVE statement

2024-06-04 Thread David Milicevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Milicevic updated SPARK-48348:

Summary: Support for LEAVE statement  (was: Support for BREAK and CONTINUE 
statements)

> Support for LEAVE statement
> ---
>
> Key: SPARK-48348
> URL: https://issues.apache.org/jira/browse/SPARK-48348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Priority: Major
>
> Add support for BREAK and CONTINUE statements in WHILE loops to SQL scripting 
> parser & interpreter.
>  
> For more details, design doc can be found in parent Jira item.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48524) Semantic equality of Not, IsNull and IsNotNull expressions is incorrect

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48524:
---
Labels: pull-request-available  (was: )

> Semantic equality of Not, IsNull and IsNotNull expressions is incorrect
> 
>
> Key: SPARK-48524
> URL: https://issues.apache.org/jira/browse/SPARK-48524
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.3
>Reporter: Thomas Powell
>Priority: Major
>  Labels: pull-request-available
>
> Not(IsNull) should be semantically equal to IsNotNull, and vice versa.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48524) Semantic equality of Not, IsNull and IsNotNull expressions is incorrect

2024-06-04 Thread Thomas Powell (Jira)
Thomas Powell created SPARK-48524:
-

 Summary: Semantic equality of Not, IsNull and IsNotNull 
expressions is incorrect
 Key: SPARK-48524
 URL: https://issues.apache.org/jira/browse/SPARK-48524
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.3
Reporter: Thomas Powell


Not(IsNull) should be semantically equal to IsNotNull, and vice versa.
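A small sketch of the equivalence this issue asks for, written against Catalyst's internal expression API (the attribute below is just a placeholder); today these checks are not reported as equal, and the request is that they return true:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, IsNotNull, IsNull, Not}
import org.apache.spark.sql.types.IntegerType

// Placeholder attribute; only the expression shapes matter for semantic equality.
val a = AttributeReference("a", IntegerType)()

// Expected to be true once the issue is addressed.
val notNullEquivalence = Not(IsNull(a)).semanticEquals(IsNotNull(a))
val isNullEquivalence  = Not(IsNotNull(a)).semanticEquals(IsNull(a))
{code}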



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48521) Repartition, sort and partitionBy not working together

2024-06-04 Thread Alvaro Berdonces (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alvaro Berdonces updated SPARK-48521:
-
Description: 
Hi, we are having some problems writing sorted CSVs using Spark 3.5.1.

Example data:
[Parquet 1M records from 
flights|https://www.tablab.app/datasets/sample/parquet?datatable-source=demo-flights-1m]

 

Example code:
{code:scala}
val df = spark.read.parquet("Flights 1m.parquet").withColumn("partition_col", 
lit("2024")).localCheckpoint
df.repartition(1).sort("FL_DATE", 
"DISTANCE").write.mode("overwrite").partitionBy("partition_col").csv("repartition_order")
 {code}
 

Running the previous example using Spark 3.3.4 writes a single file ordered by 
FL_DATE and DISTANCE fields inside the folder partition_col=2024.

On the other hand, when using Spark 3.5.1 it returns the same number of files as 
there are cores in the executors (4 in my case, using 2 executors with 2 cores 
each), and the rows inside are not sorted.
We can see that after repartition(1) and before the sort, Spark adds another 
repartition(200) stage because of the default shuffle partitions value, and AQE 
then coalesces the small partitions.

 

Spark 3.3.4 plan:
{code:java}
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (8)
+- AdaptiveSparkPlan (7)
   +- == Final Plan ==
      * Sort (4)
      +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
         +- Exchange (2)
            +- * Scan ExistingRDD (1)
   +- == Initial Plan ==
      Sort (6)
      +- Exchange (5)
         +- Scan ExistingRDD (1) {code}
 

Spark 3.5.1 plan:
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
   Execute InsertIntoHadoopFsRelationCommand (9)
   +- WriteFiles (8)
      +- * Sort (7)
         +- AQEShuffleRead (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
               +- Exchange (4)
                  +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
                     +- Exchange (2)
                        +- * Scan ExistingRDD (1)
+- == Initial Plan ==
   Execute InsertIntoHadoopFsRelationCommand (14)
   +- WriteFiles (13)
      +- Sort (12)
         +- Exchange (11)
            +- Exchange (10)
               +- Scan ExistingRDD (1) {code}
 

  was:
Hi, we are having some problems writing sorted csv’s using Spark 3.5.1.

Example data:
[Parquet 1M records from 
flights|https://www.tablab.app/datasets/sample/parquet?datatable-source=demo-flights-1m]

 

Example code:
{code:scala}
val df = spark.read.parquet("Flights 1m.parquet").withColumn("partition_col", 
lit("2024")).localCheckpoint
df.repartition(1).sort("FL_DATE", 
"DISTANCE").write.mode("overwrite").partitionBy("partition_col").csv("repartition_order")
 {code}
 

Running previous example using Spark 3.3.4 writes a single file ordered by 
FL_DATE and DISTANCE fields inside the folder partition_col=2024.

On the other hand when using Spark 3.5.1 it returns same number of files as 
cores in the executors (4 in my case using 2 executors with 2 cores each) and 
rows are not sorted inside.
We can see that after repartition(1) and before the sort Spark adds another 
stage repartition(200) because of the default shuffle partitions value, and 
then AQE coalesce small partitions.

 

Spark 3.3.4 plan:
{code:java}
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (8)
+- AdaptiveSparkPlan (7)
   +- == Final Plan ==
      * Sort (4)
      +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
         +- Exchange (2)
            +- * Scan ExistingRDD (1)
   +- == Initial Plan ==
      Sort (6)
      +- Exchange (5)
         +- Scan ExistingRDD (1) {code}
 

Spark 3.5.1 plan:
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
   Execute InsertIntoHadoopFsRelationCommand (9)
   +- WriteFiles (8)
      +- * Sort (7)
         +- AQEShuffleRead (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
               +- Exchange (4)
                  +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
                     +- Exchange (2)
                        +- * Scan ExistingRDD (1)
+- == Initial Plan ==
   Execute InsertIntoHadoopFsRelationCommand (14)
   +- WriteFiles (13)
      +- Sort (12)
         +- Exchange (11)
            +- Exchange (10)
               +- Scan ExistingRDD (1) {code}
 

 

 

 


> Repartition, sort and partitionBy not working together
> --
>
> Key: SPARK-48521
> URL: https://issues.apache.org/jira/browse/SPARK-48521
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Alvaro Berdonces
>Priority: Major
>
> Hi, we are having some 

[jira] [Assigned] (SPARK-48522) Update Stream Library to 2.9.8

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48522:


Assignee: Kent Yao

> Update Stream Library to 2.9.8
> --
>
> Key: SPARK-48522
> URL: https://issues.apache.org/jira/browse/SPARK-48522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48522) Update Stream Library to 2.9.8

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48522.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46861
[https://github.com/apache/spark/pull/46861]

> Update Stream Library to 2.9.8
> --
>
> Key: SPARK-48522
> URL: https://issues.apache.org/jira/browse/SPARK-48522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48506) Compression codec short names are case insensitive except for event logging

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48506:


Assignee: Kent Yao

> Compression codec short names are case insensitive except for event logging
> ---
>
> Key: SPARK-48506
> URL: https://issues.apache.org/jira/browse/SPARK-48506
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48506) Compression codec short names are case insensitive except for event logging

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48506.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46847
[https://github.com/apache/spark/pull/46847]

> Compression codec short names are case insensitive except for event logging
> ---
>
> Key: SPARK-48506
> URL: https://issues.apache.org/jira/browse/SPARK-48506
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.3, 3.2.4, 3.5.1, 3.3.4, 3.4.3
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-22876:
---
Labels: bulk-closed pull-request-available  (was: bulk-closed)

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed, pull-request-available
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to make a long-running 
> application avoid stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 20 
> times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found that in the source file 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there is a ShutdownHook that checks the attempt id against maxAppAttempts; 
> if attempt id >= maxAppAttempts, it will try to unregister the application 
> and the application will finish.
> Is this an expected design or a bug?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45101) Spark UI: A stage is still active even when all of its tasks have succeeded

2024-06-04 Thread RickyMa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852025#comment-17852025
 ] 

RickyMa commented on SPARK-45101:
-

No. It's just a Spark SQL query.

> Spark UI: A stage is still active even when all of its tasks have succeeded
> ---
>
> Key: SPARK-45101
> URL: https://issues.apache.org/jira/browse/SPARK-45101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: RickyMa
>Priority: Critical
> Attachments: 1.png, 2.png, 3.png
>
>
> In the stage UI, we can see all the tasks' statuses are SUCCESS.
> But the stage is still marked as active.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48523) Add `grpc_max_message_size ` description to `client-connection-string.md`

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48523:
---
Labels: pull-request-available  (was: )

> Add `grpc_max_message_size ` description to `client-connection-string.md`
> -
>
> Key: SPARK-48523
> URL: https://issues.apache.org/jira/browse/SPARK-48523
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48523) Add `grpc_max_message_size ` description to `client-connection-string.md`

2024-06-04 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-48523:

Summary: Add `grpc_max_message_size ` description to 
`client-connection-string.md`  (was: Add `grpc_max_message_size ` to 
`client-connection-string.md`)

> Add `grpc_max_message_size ` description to `client-connection-string.md`
> -
>
> Key: SPARK-48523
> URL: https://issues.apache.org/jira/browse/SPARK-48523
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48519) Upgrade jetty to 11.0.21

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48519.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46843
[https://github.com/apache/spark/pull/46843]

> Upgrade jetty to 11.0.21
> 
>
> Key: SPARK-48519
> URL: https://issues.apache.org/jira/browse/SPARK-48519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> * https://github.com/jetty/jetty.project/releases/tag/jetty-11.0.21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48519) Upgrade jetty to 11.0.21

2024-06-04 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48519:


Assignee: Yang Jie

> Upgrade jetty to 11.0.21
> 
>
> Key: SPARK-48519
> URL: https://issues.apache.org/jira/browse/SPARK-48519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> * https://github.com/jetty/jetty.project/releases/tag/jetty-11.0.21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48519) Upgrade jetty to 11.0.21

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48519:
---
Labels: pull-request-available  (was: )

> Upgrade jetty to 11.0.21
> 
>
> Key: SPARK-48519
> URL: https://issues.apache.org/jira/browse/SPARK-48519
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> * https://github.com/jetty/jetty.project/releases/tag/jetty-11.0.21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48523) Add `grpc_max_message_size ` to `client-connection-string.md`

2024-06-04 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48523:
---

 Summary: Add `grpc_max_message_size ` to 
`client-connection-string.md`
 Key: SPARK-48523
 URL: https://issues.apache.org/jira/browse/SPARK-48523
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Documentation
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48518) Make LZF compression be able to run in parallel

2024-06-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48518:


Assignee: Kent Yao

> Make LZF compression be able to run in parallel
> ---
>
> Key: SPARK-48518
> URL: https://issues.apache.org/jira/browse/SPARK-48518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48518) Make LZF compression be able to run in parallel

2024-06-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48518.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46858
[https://github.com/apache/spark/pull/46858]

> Make LZF compression be able to run in parallel
> ---
>
> Key: SPARK-48518
> URL: https://issues.apache.org/jira/browse/SPARK-48518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48522) Update Stream Library to 2.9.8

2024-06-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48522:
---
Labels: pull-request-available  (was: )

> Update Stream Library to 2.9.8
> --
>
> Key: SPARK-48522
> URL: https://issues.apache.org/jira/browse/SPARK-48522
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48522) Update Stream Library to 2.9.8

2024-06-04 Thread Kent Yao (Jira)
Kent Yao created SPARK-48522:


 Summary: Update Stream Library to 2.9.8
 Key: SPARK-48522
 URL: https://issues.apache.org/jira/browse/SPARK-48522
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48173) CheckAnalysis should see the entire query plan

2024-06-04 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-48173:

Fix Version/s: 3.5.2

> CheckAnalysis should see the entire query plan
> -
>
> Key: SPARK-48173
> URL: https://issues.apache.org/jira/browse/SPARK-48173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48512) Refactor Python tests

2024-06-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-48512.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46852
[https://github.com/apache/spark/pull/46852]

> Refactor Python tests
> -
>
> Key: SPARK-48512
> URL: https://issues.apache.org/jira/browse/SPARK-48512
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48521) Repartition, sort and partitionBy not working together

2024-06-04 Thread Alvaro Berdonces (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alvaro Berdonces updated SPARK-48521:
-
Description: 
Hi, we are having some problems writing sorted CSVs using Spark 3.5.1.

Example data:
[Parquet 1M records from 
flights|https://www.tablab.app/datasets/sample/parquet?datatable-source=demo-flights-1m]

 

Example code:
{code:scala}
val df = spark.read.parquet("Flights 1m.parquet").withColumn("partition_col", 
lit("2024")).localCheckpoint
df.repartition(1).sort("FL_DATE", 
"DISTANCE").write.mode("overwrite").partitionBy("partition_col").csv("repartition_order")
 {code}
 

Running the previous example using Spark 3.3.4 writes a single file ordered by 
FL_DATE and DISTANCE fields inside the folder partition_col=2024.

On the other hand, when using Spark 3.5.1 it returns the same number of files as 
there are cores in the executors (4 in my case, using 2 executors with 2 cores 
each), and the rows inside are not sorted.
We can see that after repartition(1) and before the sort, Spark adds another 
repartition(200) stage because of the default shuffle partitions value, and AQE 
then coalesces the small partitions.

 

Spark 3.3.4 plan:
{code:java}
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (8)
+- AdaptiveSparkPlan (7)
   +- == Final Plan ==
      * Sort (4)
      +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
         +- Exchange (2)
            +- * Scan ExistingRDD (1)
   +- == Initial Plan ==
      Sort (6)
      +- Exchange (5)
         +- Scan ExistingRDD (1) {code}
 

Spark 3.5.1 plan:
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
   Execute InsertIntoHadoopFsRelationCommand (9)
   +- WriteFiles (8)
      +- * Sort (7)
         +- AQEShuffleRead (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
               +- Exchange (4)
                  +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
                     +- Exchange (2)
                        +- * Scan ExistingRDD (1)
+- == Initial Plan ==
   Execute InsertIntoHadoopFsRelationCommand (14)
   +- WriteFiles (13)
      +- Sort (12)
         +- Exchange (11)
            +- Exchange (10)
               +- Scan ExistingRDD (1) {code}
 

 

 

 

  was:
Hi, we are having some problems writing sorted csv’s using Spark 3.5.1.

Example data:
[Parquet 1M records from 
flights|https://www.tablab.app/datasets/sample/parquet?datatable-source=demo-flights-1m]

 

Example code:
{code:scala}
val df = spark.read.parquet("Flights 1m.parquet").withColumn("partition_col", 
lit("2024")).localCheckpoint
df.repartition(1).sort("FL_DATE", 
"DISTANCE").write.mode("overwrite").partitionBy("partition_col").csv("repartition_order")
 {code}
 

Running previous example using Spark 3.3.4 writes a single file ordered by 
FL_DATE and DISTANCE fields inside the folder partition_col=2024.

On the other hand when using Spark 3.5.1 it returns same number of files as 
cores in the executors (4 in my case using 2 executors with 2 cores each) and 
rows are not sorted inside.
We can see that after repartition(1) and before the sort Spark adds another 
stage repartition(200) because of the default shuffle partitions value, and 
then AQE coalesce small partitions.

 

Spark 3.3.4 plan:
{code:java}
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (8)
+- AdaptiveSparkPlan (7)
   +- == Final Plan ==
      * Sort (4)
      +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
         +- Exchange (2)
            +- * Scan ExistingRDD (1)
   +- == Initial Plan ==
      Sort (6)
      +- Exchange (5)
         +- Scan ExistingRDD (1) {code}
 

Spark 3.5.1 plan:

 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
   Execute InsertIntoHadoopFsRelationCommand (9)
   +- WriteFiles (8)
      +- * Sort (7)
         +- AQEShuffleRead (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
               +- Exchange (4)
                  +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
                     +- Exchange (2)
                        +- * Scan ExistingRDD (1)
+- == Initial Plan ==
   Execute InsertIntoHadoopFsRelationCommand (14)
   +- WriteFiles (13)
      +- Sort (12)
         +- Exchange (11)
            +- Exchange (10)
               +- Scan ExistingRDD (1) {code}
 

 

 

 


> Repartition, sort and partitionBy not working together
> --
>
> Key: SPARK-48521
> URL: https://issues.apache.org/jira/browse/SPARK-48521
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Alvaro Berdonces
>Priority: Major
>
> Hi, we are 

[jira] [Created] (SPARK-48521) Repartition, sort and partitionBy not working together

2024-06-04 Thread Alvaro Berdonces (Jira)
Alvaro Berdonces created SPARK-48521:


 Summary: Repartition, sort and partitionBy not working together
 Key: SPARK-48521
 URL: https://issues.apache.org/jira/browse/SPARK-48521
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Spark Core, SQL
Affects Versions: 3.5.1, 3.5.0
Reporter: Alvaro Berdonces


Hi, we are having some problems writing sorted CSVs using Spark 3.5.1.

Example data:
[Parquet 1M records from 
flights|https://www.tablab.app/datasets/sample/parquet?datatable-source=demo-flights-1m]

 

Example code:
{code:scala}
val df = spark.read.parquet("Flights 1m.parquet").withColumn("partition_col", 
lit("2024")).localCheckpoint
df.repartition(1).sort("FL_DATE", 
"DISTANCE").write.mode("overwrite").partitionBy("partition_col").csv("repartition_order")
 {code}
 

Running the previous example using Spark 3.3.4 writes a single file ordered by 
FL_DATE and DISTANCE fields inside the folder partition_col=2024.

On the other hand, when using Spark 3.5.1 it returns the same number of files as 
there are cores in the executors (4 in my case, using 2 executors with 2 cores 
each), and the rows inside are not sorted.
We can see that after repartition(1) and before the sort, Spark adds another 
repartition(200) stage because of the default shuffle partitions value, and AQE 
then coalesces the small partitions.

 

Spark 3.3.4 plan:
{code:java}
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (8)
+- AdaptiveSparkPlan (7)
   +- == Final Plan ==
      * Sort (4)
      +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
         +- Exchange (2)
            +- * Scan ExistingRDD (1)
   +- == Initial Plan ==
      Sort (6)
      +- Exchange (5)
         +- Scan ExistingRDD (1) {code}
 

Spark 3.5.1 plan:

 
{code:java}
== Physical Plan ==
AdaptiveSparkPlan (15)
+- == Final Plan ==
   Execute InsertIntoHadoopFsRelationCommand (9)
   +- WriteFiles (8)
      +- * Sort (7)
         +- AQEShuffleRead (6)
            +- ShuffleQueryStage (5), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
               +- Exchange (4)
                  +- ShuffleQueryStage (3), Statistics(sizeInBytes=76.3 MiB, 
rowCount=1.00E+6)
                     +- Exchange (2)
                        +- * Scan ExistingRDD (1)
+- == Initial Plan ==
   Execute InsertIntoHadoopFsRelationCommand (14)
   +- WriteFiles (13)
      +- Sort (12)
         +- Exchange (11)
            +- Exchange (10)
               +- Scan ExistingRDD (1) {code}
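Two things that may be worth trying while triaging this, sketched below; neither is a confirmed fix, and the config key is only an assumption about where the extra exchange around the write comes from:

{code:scala}
// 1) Assumed-relevant knob: rule out the planned-write rule (3.4+) as the source
//    of the additional exchange/sort inserted before WriteFiles.
spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "false")

// 2) Sort within the single partition instead of using a global sort, so the
//    ordering does not depend on a shuffle that AQE may split or coalesce.
df.repartition(1)
  .sortWithinPartitions("FL_DATE", "DISTANCE")
  .write.mode("overwrite")
  .partitionBy("partition_col")
  .csv("repartition_order")
{code}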
 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48520) spark-sql does not support using Spark Connect

2024-06-04 Thread David Sisson (Jira)
David Sisson created SPARK-48520:


 Summary: spark-sql does not support using Spark Connect
 Key: SPARK-48520
 URL: https://issues.apache.org/jira/browse/SPARK-48520
 Project: Spark
  Issue Type: Bug
  Components: Connect, SQL
Affects Versions: 3.5.1
Reporter: David Sisson


Similar to spark-shell (for Scala), specifying a Spark Connect option results in 
a "master URL must be set in your configuration" error.

 

Sample execution:

{{SPARK_REMOTE=sc://localhost spark-sql}}

 

Another attempt at setting the same value:

{{spark-sql --remote localhost:50051}}

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38862) Let consumers provide their own method for Authentication for The REST Submission Server

2024-06-04 Thread Jack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack updated SPARK-38862:
-
Component/s: Documentation

> Let consumers provide their own method for Authentication for The REST 
> Submission Server
> 
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, Spark Submit
>Affects Versions: 3.4.0, 4.0.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, pull-request-available, rest, spark, 
> spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> Alternatively, one could implement a custom proxy to provide this 
> authentication check, but upon investigation this option is rejected by the 
> spark master as-is today.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide so that 
> if an admin needed some more advanced authentication check, such as RBAC et 
> al, it could be facilitated without the need for writing a complete custom 
> proxy layer; although it could be argued that just having some basic built-in 
> layer available, e.g. RestSubmissionBasicAuthenticator, would be preferable.
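Purely as an illustration of the kind of pluggable hook the description argues for, a minimal sketch follows; the trait and class names are hypothetical and do not exist in Spark today:

{code:scala}
import java.util.Base64

// Hypothetical interface; nothing like this is currently exposed by Spark.
trait RestSubmissionAuthenticator {
  /** Return true if the request's Authorization header carries valid credentials. */
  def authenticate(authorizationHeader: Option[String]): Boolean
}

// A basic username/password check, along the lines suggested at the end of the
// description above.
class RestSubmissionBasicAuthenticator(user: String, password: String)
    extends RestSubmissionAuthenticator {
  private val expected =
    "Basic " + Base64.getEncoder.encodeToString(s"$user:$password".getBytes("UTF-8"))
  override def authenticate(authorizationHeader: Option[String]): Boolean =
    authorizationHeader.contains(expected)
}
{code}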



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38862) Let consumers provide their own method for Authentication for The REST Submission Server

2024-06-04 Thread Jack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack updated SPARK-38862:
-
Affects Version/s: 4.0.0

> Let consumers provide their own method for Authentication for The REST 
> Submission Server
> 
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.4.0, 4.0.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, pull-request-available, rest, spark, 
> spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> Alternatively, one could implement a custom proxy to provide this 
> authentication check, but upon investigation this option is rejected by the 
> spark master as-is today.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide so that 
> if an admin needed some more advanced authentication check, such as RBAC et 
> al, it could be facilitated without the need for writing a complete custom 
> proxy layer; although it could be argued that just having some basic built-in 
> layer available, e.g. RestSubmissionBasicAuthenticator, would be preferable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38862) Let consumers provide their own method for Authentication for The REST Submission Server

2024-06-04 Thread Jack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack updated SPARK-38862:
-
Description: 
[Spark documentation|https://spark.apache.org/docs/latest/security.html] states 
that

??The REST Submission Server and the MesosClusterDispatcher do not support 
authentication. You should ensure that all network access to the REST API & 
MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
restricted to hosts that are trusted to submit jobs.??

Whilst it is true that we can use network policies to restrict access to our 
exposed submission endpoint, it would be preferable to at least also allow some 
primitive form of authentication at a global level, whether this is by some 
token provided to the runtime environment or is a "system user" using basic 
authentication of a username/password combination - I am not strictly 
opinionated and I think either would suffice.

Alternatively, one could implement a custom proxy to provide this 
authentication check, but upon investigation this option is rejected by the 
spark master as-is today.

I would imagine that whatever solution is agreed for a first phase, a custom 
authenticator may be something we want a user to be able to provide so that if 
an admin needed some more advanced authentication check, such as RBAC et al, it 
could be facilitated without the need for writing a complete custom proxy 
layer; although it could be argued just some basic built in layer being 
available; eg. RestSubmissionBasicAuthenticator could be preferable. 

  was:
[Spark documentation|https://spark.apache.org/docs/latest/security.html] states 
that

??The REST Submission Server and the MesosClusterDispatcher do not support 
authentication. You should ensure that all network access to the REST API & 
MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
restricted to hosts that are trusted to submit jobs.??

Whilst it is true that we can use network policies to restrict access to our 
exposed submission endpoint, it would be preferable to at least also allow some 
primitive form of authentication at a global level, whether this is by some 
token provided to the runtime environment or is a "system user" using basic 
authentication of a username/password combination - I am not strictly 
opinionated and I think either would suffice.

I appreciate that one could implement a custom proxy to provide this 
authentication check, but it seems like a common use case that others may 
benefit from to be able to authenticate against the rest submission endpoint, 
and by implementing this capability as an optionally configurable aspect of 
Spark itself, we can utilise the existing server to provide this check.

I would imagine that whatever solution is agreed for a first phase, a custom 
authenticator may be something we want a user to be able to provide so that if 
an admin needed some more advanced authentication check, such as RBAC et al, it 
could be facilitated without the need for writing a complete custom proxy 
layer; but I do feel there should be some basic built in available; eg. 
RestSubmissionBasicAuthenticator.


> Let consumers provide their own method for Authentication for The REST 
> Submission Server
> 
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.4.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, pull-request-available, rest, spark, 
> spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> Alternatively, one could implement a custom proxy to provide this 
> authentication check, but upon investigation this option is rejected by the 
> spark master as-is today.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide so that 
> if an admin needed some more advanced authentication check, such as RBAC et 
> al, it could be facilitated without the need for writing a complete custom 
> proxy layer; although it could be argued that just having some basic built-in 
> layer available, e.g. RestSubmissionBasicAuthenticator, could be preferable. 

[jira] [Commented] (SPARK-38862) Let consumers provide their own method for Authentication for The REST Submission Server

2024-06-04 Thread Jack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851948#comment-17851948
 ] 

Jack commented on SPARK-38862:
--

After going back and forth with myself on this Jira, the approach taken in the 
linked PR is to give consumers a way to declare that they will set up a secure 
gateway outside of Spark itself. This feels like the best starting point since 
it is very simple, and there are so many potential requirements around 
front-door, user and app authentication that doing them all well inside Spark 
would bloat it unnecessarily.

Practically speaking, this allows somebody to opt in and tell Spark that they 
will spin up something like Nginx collocated with the master node, keep the 
master ports protected in a private network space, and proxy all requests to 
the REST server via this gateway. If they are on the same node/IP, I've found 
Spark avoids assigning any ports that are already claimed - although it would 
be great if any interested party could validate that assumption.

In essence, it means you can then enable the other spark.authenticate options 
and take control of this area yourself.

[~dongjoon], please let me know if you would like me to provide some example 
configurations or an example architecture for this solution in the docs. Right 
now I've followed the general feel of the security documentation, which is to 
let users interpret and make sure they understand things themselves rather 
than being overly prescriptive. What a secure gateway actually is probably 
means different things to different people.

I've tried to keep the implementation open for extension without 
over-anticipating what built-in auth might look like in the future. The new 
code is almost entirely private to the master, so I see no need for 
evolving/unstable annotations. I feel this solution is fairly comprehensive 
for this use case while maintaining backward compatibility.

I based this on master/v4.0.
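
As a rough illustration of the gateway idea, the sketch below shows a client 
submitting through such a proxy. The gateway host, the credentials, the JSON 
body and even the /v1/submissions/create path are placeholders rather than 
anything this PR adds; the point is only that the credential check happens at 
the proxy, in front of the master's REST port, not inside Spark.

{code:scala}
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64

object SubmitThroughGateway {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint: an Nginx (or similar) gateway collocated with the
    // master, which authenticates callers and proxies to the REST submission
    // server that stays on a private network.
    val url = new URL("https://gateway.example.com/v1/submissions/create")

    // Credentials are checked by the gateway only; Spark never sees them.
    val credentials = Base64.getEncoder.encodeToString(
      "ci-user:ci-password".getBytes(StandardCharsets.UTF_8))

    // Placeholder body; a real CreateSubmissionRequest JSON would go here.
    val body = """{"action": "CreateSubmissionRequest"}"""

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"Basic $credentials")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(body.getBytes(StandardCharsets.UTF_8))
    out.close()

    // 401 here means the gateway rejected the credentials before anything
    // reached the Spark master's REST submission port.
    println(s"Gateway responded with HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}
{code}

The same shape works for any front door (basic auth, OAuth, mTLS, an 
RBAC-aware proxy), since Spark only ever sees traffic the gateway has already 
let through.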

> Let consumers provide their own method for Authentication for The REST 
> Submission Server
> 
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.4.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, pull-request-available, rest, spark, 
> spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> I appreciate that one could implement a custom proxy to provide this 
> authentication check, but it seems like a common use case that others may 
> benefit from to be able to authenticate against the rest submission endpoint, 
> and by implementing this capability as an optionally configurable aspect of 
> Spark itself, we can utilise the existing server to provide this check.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide so that 
> if an admin needed some more advanced authentication check, such as RBAC et 
> al, it could be facilitated without the need for writing a complete custom 
> proxy layer; but I do feel there should be some basic built in available; eg. 
> RestSubmissionBasicAuthenticator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38862) Let consumers provide their own method for Authentication for The REST Submission Server

2024-06-04 Thread Jack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack updated SPARK-38862:
-
Summary: Let consumers provide their own method for Authentication for The 
REST Submission Server  (was: Basic Authentication or Token Based 
Authentication for The REST Submission Server)

> Let consumers provide their own method for Authentication for The REST 
> Submission Server
> 
>
> Key: SPARK-38862
> URL: https://issues.apache.org/jira/browse/SPARK-38862
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.4.0
>Reporter: Jack
>Priority: Major
>  Labels: authentication, pull-request-available, rest, spark, 
> spark-submit, submit
>
> [Spark documentation|https://spark.apache.org/docs/latest/security.html] 
> states that
> ??The REST Submission Server and the MesosClusterDispatcher do not support 
> authentication. You should ensure that all network access to the REST API & 
> MesosClusterDispatcher (port 6066 and 7077 respectively by default) are 
> restricted to hosts that are trusted to submit jobs.??
> Whilst it is true that we can use network policies to restrict access to our 
> exposed submission endpoint, it would be preferable to at least also allow 
> some primitive form of authentication at a global level, whether this is by 
> some token provided to the runtime environment or is a "system user" using 
> basic authentication of a username/password combination - I am not strictly 
> opinionated and I think either would suffice.
> I appreciate that one could implement a custom proxy to provide this 
> authentication check, but it seems like a common use case that others may 
> benefit from to be able to authenticate against the rest submission endpoint, 
> and by implementing this capability as an optionally configurable aspect of 
> Spark itself, we can utilise the existing server to provide this check.
> I would imagine that whatever solution is agreed for a first phase, a custom 
> authenticator may be something we want a user to be able to provide so that 
> if an admin needed some more advanced authentication check, such as RBAC et 
> al, it could be facilitated without the need for writing a complete custom 
> proxy layer; but I do feel there should be some basic built in available; eg. 
> RestSubmissionBasicAuthenticator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48505) Simplify the implementation of Utils#isG1GC

2024-06-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-48505.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46783
[https://github.com/apache/spark/pull/46783]

> Simplify the implementation of Utils#isG1GC
> ---
>
> Key: SPARK-48505
> URL: https://issues.apache.org/jira/browse/SPARK-48505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
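
For context, a check like Utils#isG1GC only needs to ask the running JVM which 
collector it is using. Below is a minimal sketch of one common way to do that; 
it is not necessarily the approach the linked PR takes, and the object name is 
made up for illustration.

{code:scala}
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

object G1Check {
  // The G1 collector registers GC MXBeans whose names start with "G1",
  // e.g. "G1 Young Generation" and "G1 Old Generation".
  def isG1GC: Boolean =
    ManagementFactory.getGarbageCollectorMXBeans.asScala
      .exists(_.getName.startsWith("G1"))

  def main(args: Array[String]): Unit =
    println(s"Running under G1: $isG1GC")
}
{code}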




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48505) Simplify the implementation of Utils#isG1GC

2024-06-04 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-48505:


Assignee: Yang Jie

> Simplify the implementation of Utils#isG1GC
> ---
>
> Key: SPARK-48505
> URL: https://issues.apache.org/jira/browse/SPARK-48505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48519) Upgrade jetty to 11.0.21

2024-06-04 Thread Yang Jie (Jira)
Yang Jie created SPARK-48519:


 Summary: Upgrade jetty to 11.0.21
 Key: SPARK-48519
 URL: https://issues.apache.org/jira/browse/SPARK-48519
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie


* https://github.com/jetty/jetty.project/releases/tag/jetty-11.0.21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


