[jira] [Updated] (SPARK-48422) Using lambda may cause MemoryError
[ https://issues.apache.org/jira/browse/SPARK-48422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhouYang updated SPARK-48422: - Description: In worker.py, there is a function called process() in which the iterator loads all data at once:
{code:java}
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile){code}
This causes a MemoryError when working on large-scale data; I have indeed encountered this situation, as shown below:
{code:java}
MemoryError
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1721)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:353)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2024-05-22 16:50:03,173 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 (TID 21, saturndatanode3, executor 2, partition 0, ANY, 5075 bytes)
2024-05-22 16:50:03,174 INFO org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 5.0 (TID 19) on saturndatanode3, executor 2: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "xx/spark/python/lib/pyspark.zip/pyspark/worker.py", line 200, in main
    process()
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 195, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in
    func = lambda _, it: map(mapper, it)
  File "", line 1, in
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 73, in
    return lambda *a: f(*a){code}
I did some tests by adding memory-monitoring code and found that this code takes up a lot of memory during execution:
{code:java}
start_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the beginning: {start_memory} KB")
iterator = deserializer.load_stream(infile)
serializer.dump_stream(func(split_index, iterator), outfile)
end_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the end: {end_memory} KB")
memory_difference = end_memory - start_memory
print(f"Memory usage changes: {memory_difference} KB"){code}
Can I process the data in the iterator in batches, as below?
{code:java}
def process():
    iterator = deserializer.load_stream(infile)

    def batched_func(iterator, func, serializer, outfile):
        batch = []
        count = 0
        for item in iterator:
            batch.append(item)
            count += 1
            # Process the data from the iterator in batches, with 1 entry each time.
            if count >= 1:
                serializer.dump_stream(func(split_index, batch), outfile)
                batch = []
                count = 0
        if batch:
            serializer.dump_stream(func(split_index, batch), outfile)

    batched_func(iterator, func, serializer, outfile){code}
I tested with the code above, and it works well with lower memory usage each time. was: In worker.py, there is a function called process(), the iterator loads all data at once {code:java} def process(): iterator = deserializer.load_stream(infile) serializer.dump_stream(func(split_index, iterator), outfile){code} It will cause MemoryError when working on large scale data, For the reason that I have indeed encountered this situation as below: {code:jav
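The batching proposal above can also be sketched more compactly with itertools.islice. This is only an illustrative, standalone version of the idea from the description, not the actual worker.py patch; the function name and the batch_size default are assumptions.
{code:python}
from itertools import islice

def dump_in_batches(iterator, func, serializer, outfile, split_index, batch_size=1000):
    # Feed the mapper fixed-size slices of the input instead of materializing
    # the whole partition at once; batch_size here is an arbitrary example value.
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        serializer.dump_stream(func(split_index, batch), outfile)
{code}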
[jira] [Updated] (SPARK-48488) Fix methods `log[info|warning|error]` in SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-48488: Summary: Fix methods `log[info|warning|error]` in SparkSubmit (was: Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`) > Fix methods `log[info|warning|error]` in SparkSubmit > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48447. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46791 [https://github.com/apache/spark/pull/46791] > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yuchen Liu >Assignee: Yuchen Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > We should ensure that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed, to prevent customers from > instantiating arbitrary classes of objects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
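The check described above lands in Spark's Scala code; purely as an illustration of the pattern, here is a Python analogue (all names are hypothetical) that validates the configured class before calling its constructor.
{code:python}
import importlib

class StateStoreProvider:
    """Stand-in for the real provider base class."""

def create_provider(class_name: str) -> StateStoreProvider:
    module_name, _, cls_name = class_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), cls_name)
    # Reject anything that does not extend StateStoreProvider instead of
    # blindly instantiating whatever class name the configuration supplies.
    if not (isinstance(cls, type) and issubclass(cls, StateStoreProvider)):
        raise TypeError(f"{class_name} does not extend StateStoreProvider")
    return cls()
{code}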
[jira] [Assigned] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48447: Assignee: Yuchen Liu > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yuchen Liu >Assignee: Yuchen Liu >Priority: Major > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > We should ensure that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed, to prevent customers from > instantiating arbitrary classes of objects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-48488: -- Assignee: BingKun Pan > Restore the original logic of methods `log[info|warning|error]` in > `SparkSubmit` > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48383) In KafkaOffsetReader, partition mismatch should not be an assertion
[ https://issues.apache.org/jira/browse/SPARK-48383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48383. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46692 [https://github.com/apache/spark/pull/46692] > In KafkaOffsetReader, partition mismatch should not be an assertion > --- > > Key: SPARK-48383 > URL: https://issues.apache.org/jira/browse/SPARK-48383 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > In KafkaOffsetReader, we assert that startOffsets has the same topic > partition list as the assigned one. However, if the user changes the topic's > partitions while the query is running, they will hit the assertion. Instead, > they should see an exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
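The change itself is in the Scala KafkaOffsetReader, but the intent (replace an internal assertion with a descriptive, user-facing error when the fetched offsets no longer cover the assigned partitions) can be illustrated with a small Python-style sketch; all names here are hypothetical.
{code:python}
def check_start_offsets(assigned_partitions, start_offsets):
    # Before: assert set(start_offsets) == set(assigned_partitions)
    # After: raise a descriptive error, since the user may have changed the
    # topic's partitions while the query was running.
    missing = set(assigned_partitions) - set(start_offsets)
    if missing:
        raise RuntimeError(
            "Topic partitions changed while the query was running; "
            f"no starting offset for {sorted(missing)}")
{code}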
[jira] [Assigned] (SPARK-48383) In KafkaOffsetReader, partition mismatch should not be an assertion
[ https://issues.apache.org/jira/browse/SPARK-48383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48383: Assignee: Siying Dong > In KafkaOffsetReader, partition mismatch should not be an assertion > --- > > Key: SPARK-48383 > URL: https://issues.apache.org/jira/browse/SPARK-48383 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > > In KafkaOffsetReader, we assert that startOffsets has the same topic > partition list as the assigned one. However, if the user changes the topic's > partitions while the query is running, they will hit the assertion. Instead, > they should see an exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-48488. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46822 [https://github.com/apache/spark/pull/46822] > Restore the original logic of methods `log[info|warning|error]` in > `SparkSubmit` > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48505) Simplify the implementation of Utils#isG1GC
[ https://issues.apache.org/jira/browse/SPARK-48505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48505: --- Labels: pull-request-available (was: ) > Simplify the implementation of Utils#isG1GC > --- > > Key: SPARK-48505 > URL: https://issues.apache.org/jira/browse/SPARK-48505 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48505) Simplify the implementation of Utils#isG1GC
Yang Jie created SPARK-48505: Summary: Simplify the implementation of Utils#isG1GC Key: SPARK-48505 URL: https://issues.apache.org/jira/browse/SPARK-48505 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48433) Upgrade `checkstyle` to 10.17.0
[ https://issues.apache.org/jira/browse/SPARK-48433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-48433. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46763 [https://github.com/apache/spark/pull/46763] > Upgrade `checkstyle` to 10.17.0 > --- > > Key: SPARK-48433 > URL: https://issues.apache.org/jira/browse/SPARK-48433 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48487) Update License & Notice according to the dependency changes
[ https://issues.apache.org/jira/browse/SPARK-48487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48487. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46821 [https://github.com/apache/spark/pull/46821] > Update License & Notice according to the dependency changes > --- > > Key: SPARK-48487 > URL: https://issues.apache.org/jira/browse/SPARK-48487 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48487) Update License & Notice according to the dependency changes
[ https://issues.apache.org/jira/browse/SPARK-48487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-48487: Assignee: Kent Yao > Update License & Notice according to the dependency changes > --- > > Key: SPARK-48487 > URL: https://issues.apache.org/jira/browse/SPARK-48487 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48504) Parent Window class for Spark Connect and Spark Classic
[ https://issues.apache.org/jira/browse/SPARK-48504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48504: --- Labels: pull-request-available (was: ) > Parent Window class for Spark Connect and Spark Classic > --- > > Key: SPARK-48504 > URL: https://issues.apache.org/jira/browse/SPARK-48504 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48504) Parent Window class for Spark Connect and Spark Classic
Ruifeng Zheng created SPARK-48504: - Summary: Parent Window class for Spark Connect and Spark Classic Key: SPARK-48504 URL: https://issues.apache.org/jira/browse/SPARK-48504 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48496) Use static regex Pattern instances in common/utils JavaUtils
[ https://issues.apache.org/jira/browse/SPARK-48496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48496. -- Fix Version/s: 4.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/46829 > Use static regex Pattern instances in common/utils JavaUtils > > > Key: SPARK-48496 > URL: https://issues.apache.org/jira/browse/SPARK-48496 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Some methods in JavaUtils.java are recompiling regexes on every invocation; > we should instead store a single cached Pattern. > This is a minor perf. issue that I spotted in the context of other profiling. > Not a huge bottleneck in the grand scheme of things, but simple and > straightforward to fix. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
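The underlying fix is the usual "compile once, reuse the Pattern" idiom: java.util.regex.Pattern.compile does no implicit caching, so calling it inside a hot method repeats the work on every invocation. A minimal Python analogue of the before/after shape (the regex and function names are illustrative, not the actual JavaUtils code):
{code:python}
import re

# Before: the pattern is built inside the hot function on every call.
def parse_byte_string_slow(s: str):
    return re.match(r"([0-9]+)\s*([a-z]+)?", s.lower().strip()).groups()

# After: compile once at module load and reuse the compiled pattern object.
_BYTE_STRING_RE = re.compile(r"([0-9]+)\s*([a-z]+)?")

def parse_byte_string_fast(s: str):
    return _BYTE_STRING_RE.match(s.lower().strip()).groups()
{code}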
[jira] [Resolved] (SPARK-48489) Throw a user-facing error when reading invalid schema from text DataSource
[ https://issues.apache.org/jira/browse/SPARK-48489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48489. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46823 [https://github.com/apache/spark/pull/46823] > Throw a user-facing error when reading invalid schema from text DataSource > --- > > Key: SPARK-48489 > URL: https://issues.apache.org/jira/browse/SPARK-48489 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Stefan Bukorovic >Assignee: Stefan Bukorovic >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > The text DataSource produces a table schema with only one column, but it is > possible to try to create a table with a schema that has multiple columns. > Currently, when a user tries this, an assert in the code fails and throws an > internal Spark error. We should throw a better user-facing error. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
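A hedged PySpark repro sketch of the situation described above (the path is made up, and the exact point where the old assert fired is an assumption): the text source only ever produces a single string column, so a multi-column user schema is invalid.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Two columns requested from a source that only yields one `value` string
# column; this mismatch used to surface as an internal error rather than a
# user-facing one.
schema = StructType([
    StructField("a", StringType()),
    StructField("b", StringType()),
])
spark.read.schema(schema).text("/tmp/example.txt").show()
{code}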
[jira] [Assigned] (SPARK-48489) Throw a user-facing error when reading invalid schema from text DataSource
[ https://issues.apache.org/jira/browse/SPARK-48489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48489: Assignee: Stefan Bukorovic > Throw a user-facing error when reading invalid schema from text DataSource > --- > > Key: SPARK-48489 > URL: https://issues.apache.org/jira/browse/SPARK-48489 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Stefan Bukorovic >Assignee: Stefan Bukorovic >Priority: Minor > Labels: pull-request-available > > The text DataSource produces a table schema with only one column, but it is > possible to try to create a table with a schema that has multiple columns. > Currently, when a user tries this, an assert in the code fails and throws an > internal Spark error. We should throw a better user-facing error. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48422) Using lambda may cause MemoryError
[ https://issues.apache.org/jira/browse/SPARK-48422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhouYang updated SPARK-48422: - Description: In worker.py, there is a function called process() in which the iterator loads all data at once:
{code:java}
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile){code}
This causes a MemoryError when working on large-scale data; I have indeed encountered this situation, as shown below:
{code:java}
MemoryError
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1721)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:353)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2024-05-22 16:50:03,173 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 (TID 21, saturndatanode3, executor 2, partition 0, ANY, 5075 bytes)
2024-05-22 16:50:03,174 INFO org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 5.0 (TID 19) on saturndatanode3, executor 2: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "xx/spark/python/lib/pyspark.zip/pyspark/worker.py", line 200, in main
    process()
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 195, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in
    func = lambda _, it: map(mapper, it)
  File "", line 1, in
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 73, in
    return lambda *a: f(*a){code}
I did some tests by adding memory-monitoring code and found that this code takes up a lot of memory during execution:
{code:java}
start_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the beginning: {start_memory} KB")
iterator = deserializer.load_stream(infile)
serializer.dump_stream(func(split_index, iterator), outfile)
end_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the end: {end_memory} KB")
memory_difference = end_memory - start_memory
print(f"Memory usage changes: {memory_difference} KB"){code}
was: In worker.py, there is a function called wrap_udf(f, return_type), lambda is used in its body as below: {code:java} def wrap_udf(f, return_type): if return_type.needConversion(): toInternal = return_type.toInternal return lambda *a: toInternal(f(*a)) else: return lambda *a: f(*a){code} Is there any possibility that it will cause MemoryError when function f is used to work on large scale data?For the reason that I have indeed encountered this situation as below: {code:java} MemoryError at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:79
[jira] [Assigned] (SPARK-48374) Support additional PyArrow Table column types
[ https://issues.apache.org/jira/browse/SPARK-48374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48374: Assignee: Ian Cook > Support additional PyArrow Table column types > - > > Key: SPARK-48374 > URL: https://issues.apache.org/jira/browse/SPARK-48374 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > > SPARK-48220 adds support for passing a PyArrow Table to > {{{}createDataFrame(){}}}, but there are a few PyArrow column types that are > not yet supported: > * fixed-size binary > * fixed-size list > * large list > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48374) Support additional PyArrow Table column types
[ https://issues.apache.org/jira/browse/SPARK-48374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48374. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46688 [https://github.com/apache/spark/pull/46688] > Support additional PyArrow Table column types > - > > Key: SPARK-48374 > URL: https://issues.apache.org/jira/browse/SPARK-48374 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-48220 adds support for passing a PyArrow Table to > {{{}createDataFrame(){}}}, but there are a few PyArrow column types that are > not yet supported: > * fixed-size binary > * fixed-size list > * large list > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
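For reference, the column types listed above can be constructed like this (a small sketch assuming pyarrow is installed; the table and column names are illustrative) before handing the Table to spark.createDataFrame():
{code:python}
import pyarrow as pa

# One column for each PyArrow type the ticket lists as not yet supported.
table = pa.table({
    "fixed_size_binary": pa.array([b"ab", b"cd"], type=pa.binary(2)),
    "fixed_size_list": pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int32(), 2)),
    "large_list": pa.array([[1], [2, 3]], type=pa.large_list(pa.int64())),
})
# df = spark.createDataFrame(table)  # supported once this change is in place
{code}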
[jira] [Updated] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
[ https://issues.apache.org/jira/browse/SPARK-48503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48503: --- Labels: pull-request-available (was: ) > Scalar subquery with group-by and non-equality predicate incorrectly allowed, > wrong results > --- > > Key: SPARK-48503 > URL: https://issues.apache.org/jira/browse/SPARK-48503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > Labels: pull-request-available > > This query is not legal and should give an error, but instead we incorrectly > allow it and it returns wrong results. > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1); > create table y(y1 int, y2 int); > insert into y values (2, 2), (3, 3); > select *, (select count(*) from y where y1 > x1 group by y1) from x; {code} > It returns two rows, even though there's only one row of x. > The correct result is an error: more than one row returned by a subquery used > as an expression (as seen in postgres for example) > > This is a longstanding bug. The bug is in CheckAnalysis in > {{{}checkAggregateInScalarSubquery{}}}. It allows grouping columns that are > present in correlation predicates, but doesn’t check whether those predicates > are equalities - because when that code was written, non-equality > correlation wasn’t allowed. Therefore, it looks like this bug has existed > since non-equality correlation was added (~2 years ago). > > Various other expressions that are not equi-joins between the inner and outer > fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. > Another bugged case is if the correlation condition is an equality but it's > under another operator like an OUTER JOIN or UNION. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
[ https://issues.apache.org/jira/browse/SPARK-48503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-48503: -- Parent: SPARK-35553 Issue Type: Sub-task (was: Bug) > Scalar subquery with group-by and non-equality predicate incorrectly allowed, > wrong results > --- > > Key: SPARK-48503 > URL: https://issues.apache.org/jira/browse/SPARK-48503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > This query is not legal and should give an error, but instead we incorrectly > allow it and it returns wrong results. > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1); > create table y(y1 int, y2 int); > insert into y values (2, 2), (3, 3); > select *, (select count(*) from y where y1 > x1 group by y1) from x; {code} > It returns two rows, even though there's only one row of x. > The correct result is an error: more than one row returned by a subquery used > as an expression (as seen in postgres for example) > > This is a longstanding bug. The bug is in CheckAnalysis in > {{{}checkAggregateInScalarSubquery{}}}. It allows grouping columns that are > present in correlation predicates, but doesn’t check whether those predicates > are equalities - because when that code was written, non-equality > correlation wasn’t allowed. Therefore, it looks like this bug has existed > since non-equality correlation was added (~2 years ago). > > Various other expressions that are not equi-joins between the inner and outer > fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. > Another bugged case is if the correlation condition is an equality but it's > under another operator like an OUTER JOIN or UNION. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
[ https://issues.apache.org/jira/browse/SPARK-48501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-48501: -- Description: Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw an error {{{}Correlated scalar subqueries must be aggregated{}}}. This check is often too restrictive, there are many cases where it should actually be runnable even though we don’t know it - e.g. unique keys or functional dependencies might ensure that there's only one row. To handle these cases, it’s better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws exception on >=2 rows input, a “single join” operator that throws an exception when >= 2 rows match, or something similar. There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid (e.g. SPARK-48503, SPARK-18504). Doing the check at runtime would help avoid those bugs. Current workarounds: Users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or users can add {{limit 1}} was: Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw an error {{{}Correlated scalar subqueries must be aggregated{}}}. This check is often too restrictive, there are many cases where it should actually be runnable even though we don’t know it - e.g. unique keys or functional dependencies might ensure that there's only one row. To handle these cases, it’s better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws exception on >=2 rows input, a “single join” operator that throws an exception when >= 2 rows match, or something similar. There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid. Doing the check at runtime would help avoid those bugs. Current workarounds: Users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or users can add {{limit 1}} > Loosen `correlated scalar subqueries must be aggregated` error by doing > runtime check for scalar subqueries output rowcount > --- > > Key: SPARK-48501 > URL: https://issues.apache.org/jira/browse/SPARK-48501 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw > an error {{{}Correlated scalar subqueries must be aggregated{}}}. > This check is often too restrictive, there are many cases where it should > actually be runnable even though we don’t know it - e.g. unique keys or > functional dependencies might ensure that there's only one row. > To handle these cases, it’s better to do the check at runtime instead of > statically. This could be implemented as a special aggregate operator that > throws exception on >=2 rows input, a “single join” operator that throws an > exception when >= 2 rows match, or something similar. > There are also cases where we were incorrectly allowing queries before that > returned wrong results, and should have been rejected as invalid (e.g. > SPARK-48503, SPARK-18504). Doing the check at runtime would help avoid those > bugs. 
> Current workarounds: Users can add an aggregate like {{any_value()}} or > {{first()}} to the output of the subquery, or users can add {{limit 1}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
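The workarounds mentioned in the last paragraph, written out against the x/y tables from the SPARK-48503 repro (a sketch that assumes those tables already exist and an active SparkSession):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Aggregate the subquery output so it is provably single-row...
spark.sql("SELECT *, (SELECT any_value(y2) FROM y WHERE y1 > x1) FROM x").show()

# ...or cap the subquery output explicitly with LIMIT 1.
spark.sql("SELECT *, (SELECT y2 FROM y WHERE y1 > x1 LIMIT 1) FROM x").show()
{code}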
[jira] [Created] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
Jack Chen created SPARK-48503: - Summary: Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results Key: SPARK-48503 URL: https://issues.apache.org/jira/browse/SPARK-48503 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen This query is not legal and should give an error, but instead we incorrectly allow it and it returns wrong results. {code:java} create table x(x1 int, x2 int); insert into x values (1, 1); create table y(y1 int, y2 int); insert into y values (2, 2), (3, 3); select *, (select count(*) from y where y1 > x1 group by y1) from x; {code} It returns two rows, even though there's only one row of x. The correct result is an error: more than one row returned by a subquery used as an expression (as seen in postgres for example) This is a longstanding bug. The bug is in CheckAnalysis in {{{}checkAggregateInScalarSubquery{}}}. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities - because when that code was written, non-equality correlation wasn’t allowed. Therefore, it looks like this bug has existed since non-equality correlation was added (~2 years ago). Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. Another bugged case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48502) shift functions do not support column as second argument
[ https://issues.apache.org/jira/browse/SPARK-48502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frederik Paradis updated SPARK-48502: - Description: In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support the numBits argument as a column, only as a literal. The SQL function does support this. (was: In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support the numBits argument as a column, only as literal. The SQL function does support this.) > shift functions do not support column as second argument > > > Key: SPARK-48502 > URL: https://issues.apache.org/jira/browse/SPARK-48502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Frederik Paradis >Priority: Major > > In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support > the numBits argument as a column, only as a literal. The SQL function does > support this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
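A short PySpark illustration of the limitation and a workaround (column and table names are made up; the expr() path works because, as noted above, the SQL function does accept a column for the second argument):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(8, 1), (16, 2)], ["value", "bits"])

# Supported today: numBits given as a Python int literal.
df.select(F.shiftright("value", 1)).show()

# numBits taken from a column is what this ticket asks for; until then, the
# SQL expression form can be used instead.
df.select(F.expr("shiftright(value, bits)")).show()
{code}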
[jira] [Created] (SPARK-48502) shift functions do not support column as second argument
Frederik Paradis created SPARK-48502: Summary: shift functions do not support column as second argument Key: SPARK-48502 URL: https://issues.apache.org/jira/browse/SPARK-48502 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.1 Reporter: Frederik Paradis In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support the numBits argument as a column, only as literal. The SQL function does support this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
Jack Chen created SPARK-48501: - Summary: Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount Key: SPARK-48501 URL: https://issues.apache.org/jira/browse/SPARK-48501 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw an error {{{}Correlated scalar subqueries must be aggregated{}}}. This check is often too restrictive, there are many cases where it should actually be runnable even though we don’t know it - e.g. unique keys or functional dependencies might ensure that there's only one row. To handle these cases, it’s better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws exception on >=2 rows input, a “single join” operator that throws an exception when >= 2 rows match, or something similar. There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid. Doing the check at runtime would help avoid those bugs. Current workarounds: Users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or users can add {{limit 1}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36115) Handle the COUNT bug for correlated IN/EXISTS subquery
[ https://issues.apache.org/jira/browse/SPARK-36115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen resolved SPARK-36115. --- Resolution: Fixed This was fixed by https://github.com/apache/spark/pull/43111 > Handle the COUNT bug for correlated IN/EXISTS subquery > -- > > Key: SPARK-36115 > URL: https://issues.apache.org/jira/browse/SPARK-36115 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > Correlated IN/EXISTS subqueries are also subject to the COUNT bug which is > not handled. > {code:sql} > create view t1(c1, c2) as values (0, 1), (1, 2) > create view t2(c1, c2) as values (0, 2), (0, 3) > -- Example 1: IN subquery > select * from t1 where c1 in (select count(*) + 1 from t2 where t1.c1 = t2.c1) > -- Correct answer: (1, 2) > +---+---+ > |c1 |c2 | > +---+---+ > +---+---+ > -- Example 2: EXISTS subquery > select * from t1 where exists (select count(*) from t2 where t1.c1 = t2.c1) > -- Correct answer: [(0, 1), (1, 2)] > +---+---+ > |c1 |c2 | > +---+---+ > |0 |1 | > +---+---+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48500) On the client side, there is no information about the exception that caused the job to fail
Sergey Kotlov created SPARK-48500: - Summary: On the client side, there is no information about the exception that caused the job to fail Key: SPARK-48500 URL: https://issues.apache.org/jira/browse/SPARK-48500 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.1 Reporter: Sergey Kotlov When loading a table into BigQuery using the [BigQuery connector|https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example], the Spark Connect client does not receive information about the exception causing the problem. Example: {code:java} spark.table("testds.test_table") .write .format("bigquery") .mode("overwrite") .option("project", "example-analytics") .option("table", "testds.test_table") .save() {code} When running with Spark Connect, in the logs on the client side I see only: {code:java} Uncaught exception in main job thread org.apache.spark.SparkException: org.sparkproject.io.grpc.StatusRuntimeException: INTERNAL: Failed to write to BigQuery at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:113) at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41) at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:52) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at org.apache.spark.sql.connect.client.WrappedCloseableIterator.foreach(CloseableIterator.scala:30) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at org.apache.spark.sql.connect.client.WrappedCloseableIterator.to(CloseableIterator.scala:30) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at org.apache.spark.sql.connect.client.WrappedCloseableIterator.toBuffer(CloseableIterator.scala:30) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:552) at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) at com.example.testds.SparkTest$.main(SparkTest.scala:39) at com.example.testds.SparkTest.main(SparkTest.scala) End of uncaught exception {code} If the same code is run in a separate spark application, the cause of the error is there. 
{code:java} Uncaught exception in main job thread java.lang.RuntimeException: Failed to write to BigQuery at shadow.example.bigquery.com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:93) at shadow.example.bigquery.com.google.cloud.spark.bigquery.BigQueryInsertableRelation.insert(BigQueryInsertableRelation.scala:43) at shadow.example.bigquery.com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:113) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transf
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Parent: SPARK-44111 Issue Type: Sub-task (was: Bug) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Parent: (was: SPARK-44111) Issue Type: Bug (was: Sub-task) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48302: --- Labels: pull-request-available (was: ) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Summary: Preserve nulls in map columns in PyArrow Tables (was: Null values in map columns of PyArrow tables are replaced with empty lists) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables containing MapArray columns with nested fields or timestamps are passed to {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa._{_}version{_}_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa._{_}version{_}_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa._{_}version{_}_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. A possible fix for this will involve adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47466) Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
[ https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-47466: - Component/s: Connect Input/Output SQL > Add PySpark DataFrame method to return iterator of PyArrow RecordBatches > > > Key: SPARK-47466 > URL: https://issues.apache.org/jira/browse/SPARK-47466 > Project: Spark > Issue Type: Improvement > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 3.5.1 >Reporter: Ian Cook >Priority: Major > > As a follow-up to SPARK-47365: > {{toArrow()}} is useful when the data is relatively small. For larger data, > the best way to return the contents of a PySpark DataFrame in Arrow format is > to return an iterator of [PyArrow > RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
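As a usage sketch of what this improvement would enable (assuming a PySpark build that includes {{toArrow()}} from SPARK-47365; the batch-iterator method name below is hypothetical, since the final API name depends on the implementation):
{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Small results: toArrow() (SPARK-47365) returns a single PyArrow Table,
# so the whole result must fit in driver memory at once.
table: pa.Table = df.toArrow()

# Large results: the idea here is an iterator of RecordBatches, so the
# driver only holds one batch at a time.
for batch in df.toArrowBatches():  # hypothetical method name
    assert isinstance(batch, pa.RecordBatch)
    # ... process one batch at a time ...
{code}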
[jira] [Updated] (SPARK-47466) Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
[ https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-47466: - Affects Version/s: 4.0.0 > Add PySpark DataFrame method to return iterator of PyArrow RecordBatches > > > Key: SPARK-47466 > URL: https://issues.apache.org/jira/browse/SPARK-47466 > Project: Spark > Issue Type: Improvement > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Priority: Major > > As a follow-up to SPARK-47365: > {{toArrow()}} is useful when the data is relatively small. For larger data, > the best way to return the contents of a PySpark DataFrame in Arrow format is > to return an iterator of [PyArrow > RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Language: Python > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > A possible fix for this will involve adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which > PySpark will still support for a while) won't have this argument, we will > need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()
[ https://issues.apache.org/jira/browse/SPARK-48220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48220: Assignee: Ian Cook > Allow passing PyArrow Table to createDataFrame() > > > Key: SPARK-48220 > URL: https://issues.apache.org/jira/browse/SPARK-48220 > Project: Spark > Issue Type: Sub-task > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > > SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table. > It would be nice if we could also go in the opposite direction, enabling > users to create a Spark DataFrame from a PyArrow Table by passing the PyArrow > Table to {{spark.createDataFrame()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()
[ https://issues.apache.org/jira/browse/SPARK-48220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48220. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46529 [https://github.com/apache/spark/pull/46529] > Allow passing PyArrow Table to createDataFrame() > > > Key: SPARK-48220 > URL: https://issues.apache.org/jira/browse/SPARK-48220 > Project: Spark > Issue Type: Sub-task > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table. > It would be nice if we could also go in the opposite direction, enabling > users to create a Spark DataFrame from a PyArrow Table by passing the PyArrow > Table to {{spark.createDataFrame()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
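A minimal usage sketch of the resolved feature (assuming PySpark 4.0.0, where pull request 46529 landed):
{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a PyArrow Table directly ...
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["a", "b", "c"]),
})

# ... and pass it straight to createDataFrame(), as enabled by this issue.
df = spark.createDataFrame(table)
df.show()

# The reverse direction (SPARK-47365): Spark DataFrame back to a PyArrow Table.
round_tripped = df.toArrow()
{code}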