[jira] [Updated] (SPARK-48422) Using lambda may cause MemoryError
[ https://issues.apache.org/jira/browse/SPARK-48422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhouYang updated SPARK-48422: - Description: In worker.py, there is a function called process() in which the iterator loads all data at once:
{code:java}
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile){code}
This causes a MemoryError when working on large-scale data; I have indeed encountered this situation, as shown below:
{code:java}
MemoryError
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1721)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:353)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2024-05-22 16:50:03,173 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 (TID 21, saturndatanode3, executor 2, partition 0, ANY, 5075 bytes)
2024-05-22 16:50:03,174 INFO org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 5.0 (TID 19) on saturndatanode3, executor 2: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "xx/spark/python/lib/pyspark.zip/pyspark/worker.py", line 200, in main
    process()
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 195, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in
    func = lambda _, it: map(mapper, it)
  File "", line 1, in
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 73, in
    return lambda *a: f(*a){code}
I did some tests by adding memory-monitoring code and found that this code takes up a lot of memory during execution:
{code:java}
start_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the beginning: {start_memory} KB")
iterator = deserializer.load_stream(infile)
serializer.dump_stream(func(split_index, iterator), outfile)
end_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the end: {end_memory} KB")
memory_difference = end_memory - start_memory
print(f"Memory usage changes: {memory_difference} KB"){code}
Can I process the data in the iterator in batches, as below?
{code:java}
def process():
    iterator = deserializer.load_stream(infile)

    def batched_func(iterator, func, serializer, outfile):
        batch = []
        count = 0
        for item in iterator:
            batch.append(item)
            count += 1
            # Process the data from the iterator in batches, with 1 entry each time.
            if count >= 1:
                serializer.dump_stream(func(split_index, batch), outfile)
                batch = []
                count = 0
        if batch:
            serializer.dump_stream(func(split_index, batch), outfile)

    batched_func(iterator, func, serializer, outfile){code}
I tested with the code above, and it works well with lower memory usage each time. was: In worker.py, there is a function called process(), the iterator loads all data at once {code:java} def process(): iterator = deserializer.load_stream(infile) serializer.dump_stream(func(split_index, iterator), outfile){code} It will cause MemoryError when working on large scale data, For the reason that I have indeed encountered this situation as below: {code:jav
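The batching proposal above can also be sketched more compactly with itertools.islice. This is only an illustrative, standalone version of the idea from the description, not the actual worker.py patch; the function name and the batch_size default are assumptions.
{code:python}
from itertools import islice

def dump_in_batches(iterator, func, serializer, outfile, split_index, batch_size=1000):
    # Feed the mapper fixed-size slices of the input instead of materializing
    # the whole partition at once; batch_size here is an arbitrary example value.
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        serializer.dump_stream(func(split_index, batch), outfile)
{code}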
[jira] [Updated] (SPARK-48488) Fix methods `log[info|warning|error]` in SparkSubmit
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-48488: Summary: Fix methods `log[info|warning|error]` in SparkSubmit (was: Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`) > Fix methods `log[info|warning|error]` in SparkSubmit > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48447. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46791 [https://github.com/apache/spark/pull/46791] > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yuchen Liu >Assignee: Yuchen Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > We should ensure that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed, to prevent customers from > instantiating arbitrary classes of objects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
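The check described above lands in Spark's Scala code; purely as an illustration of the pattern, here is a Python analogue (all names are hypothetical) that validates the configured class before calling its constructor.
{code:python}
import importlib

class StateStoreProvider:
    """Stand-in for the real provider base class."""

def create_provider(class_name: str) -> StateStoreProvider:
    module_name, _, cls_name = class_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), cls_name)
    # Reject anything that does not extend StateStoreProvider instead of
    # blindly instantiating whatever class name the configuration supplies.
    if not (isinstance(cls, type) and issubclass(cls, StateStoreProvider)):
        raise TypeError(f"{class_name} does not extend StateStoreProvider")
    return cls()
{code}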
[jira] [Assigned] (SPARK-48447) Check state store provider class before invoking the constructor
[ https://issues.apache.org/jira/browse/SPARK-48447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48447: Assignee: Yuchen Liu > Check state store provider class before invoking the constructor > > > Key: SPARK-48447 > URL: https://issues.apache.org/jira/browse/SPARK-48447 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Yuchen Liu >Assignee: Yuchen Liu >Priority: Major > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > We should ensure that only classes > [extending|https://github.com/databricks/runtime/blob/1440e77ab54c40981066c22ec759bdafc0683e76/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L73] > {{StateStoreProvider}} can be constructed, to prevent customers from > instantiating arbitrary classes of objects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-48488: -- Assignee: BingKun Pan > Restore the original logic of methods `log[info|warning|error]` in > `SparkSubmit` > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48383) In KafkaOffsetReader, partition mismatch should not be an assertion
[ https://issues.apache.org/jira/browse/SPARK-48383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-48383. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46692 [https://github.com/apache/spark/pull/46692] > In KafkaOffsetReader, partition mismatch should not be an assertion > --- > > Key: SPARK-48383 > URL: https://issues.apache.org/jira/browse/SPARK-48383 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > In KafkaOffsetReader, we assert that startOffsets has the same topic > partition list as the assigned one. However, if the user changes the topic's > partitions while the query is running, they will hit the assertion. Instead, > they should see an exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
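The change itself is in the Scala KafkaOffsetReader, but the intent (replace an internal assertion with a descriptive, user-facing error when the fetched offsets no longer cover the assigned partitions) can be illustrated with a small Python-style sketch; all names here are hypothetical.
{code:python}
def check_start_offsets(assigned_partitions, start_offsets):
    # Before: assert set(start_offsets) == set(assigned_partitions)
    # After: raise a descriptive error, since the user may have changed the
    # topic's partitions while the query was running.
    missing = set(assigned_partitions) - set(start_offsets)
    if missing:
        raise RuntimeError(
            "Topic partitions changed while the query was running; "
            f"no starting offset for {sorted(missing)}")
{code}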
[jira] [Assigned] (SPARK-48383) In KafkaOffsetReader, partition mismatch should not be an assertion
[ https://issues.apache.org/jira/browse/SPARK-48383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-48383: Assignee: Siying Dong > In KafkaOffsetReader, partition mismatch should not be an assertion > --- > > Key: SPARK-48383 > URL: https://issues.apache.org/jira/browse/SPARK-48383 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > > In KafkaOffsetReader, we assert that startOffsets has the same topic > partition list as the assigned one. However, if the user changes the topic's > partitions while the query is running, they will hit the assertion. Instead, > they should see an exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48488) Restore the original logic of methods `log[info|warning|error]` in `SparkSubmit`
[ https://issues.apache.org/jira/browse/SPARK-48488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-48488. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46822 [https://github.com/apache/spark/pull/46822] > Restore the original logic of methods `log[info|warning|error]` in > `SparkSubmit` > > > Key: SPARK-48488 > URL: https://issues.apache.org/jira/browse/SPARK-48488 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48505) Simplify the implementation of Utils#isG1GC
[ https://issues.apache.org/jira/browse/SPARK-48505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48505: --- Labels: pull-request-available (was: ) > Simplify the implementation of Utils#isG1GC > --- > > Key: SPARK-48505 > URL: https://issues.apache.org/jira/browse/SPARK-48505 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48505) Simplify the implementation of Utils#isG1GC
Yang Jie created SPARK-48505: Summary: Simplify the implementation of Utils#isG1GC Key: SPARK-48505 URL: https://issues.apache.org/jira/browse/SPARK-48505 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48433) Upgrade `checkstyle` to 10.17.0
[ https://issues.apache.org/jira/browse/SPARK-48433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-48433. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46763 [https://github.com/apache/spark/pull/46763] > Upgrade `checkstyle` to 10.17.0 > --- > > Key: SPARK-48433 > URL: https://issues.apache.org/jira/browse/SPARK-48433 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48487) Update License & Notice according to the dependency changes
[ https://issues.apache.org/jira/browse/SPARK-48487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48487. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46821 [https://github.com/apache/spark/pull/46821] > Update License & Notice according to the dependency changes > --- > > Key: SPARK-48487 > URL: https://issues.apache.org/jira/browse/SPARK-48487 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48487) Update License & Notice according to the dependency changes
[ https://issues.apache.org/jira/browse/SPARK-48487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-48487: Assignee: Kent Yao > Update License & Notice according to the dependency changes > --- > > Key: SPARK-48487 > URL: https://issues.apache.org/jira/browse/SPARK-48487 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48504) Parent Window class for Spark Connect and Spark Classic
[ https://issues.apache.org/jira/browse/SPARK-48504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48504: --- Labels: pull-request-available (was: ) > Parent Window class for Spark Connect and Spark Classic > --- > > Key: SPARK-48504 > URL: https://issues.apache.org/jira/browse/SPARK-48504 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48504) Parent Window class for Spark Connect and Spark Classic
Ruifeng Zheng created SPARK-48504: - Summary: Parent Window class for Spark Connect and Spark Classic Key: SPARK-48504 URL: https://issues.apache.org/jira/browse/SPARK-48504 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48496) Use static regex Pattern instances in common/utils JavaUtils
[ https://issues.apache.org/jira/browse/SPARK-48496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48496. -- Fix Version/s: 4.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/46829 > Use static regex Pattern instances in common/utils JavaUtils > > > Key: SPARK-48496 > URL: https://issues.apache.org/jira/browse/SPARK-48496 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Some methods in JavaUtils.java are recompiling regexes on every invocation; > we should instead store a single cached Pattern. > This is a minor perf. issue that I spotted in the context of other profiling. > Not a huge bottleneck in the grand scheme of things, but simple and > straightforward to fix. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
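The underlying fix is the usual "compile once, reuse the Pattern" idiom: java.util.regex.Pattern.compile does no implicit caching, so calling it inside a hot method repeats the work on every invocation. A minimal Python analogue of the before/after shape (the regex and function names are illustrative, not the actual JavaUtils code):
{code:python}
import re

# Before: the pattern is built inside the hot function on every call.
def parse_byte_string_slow(s: str):
    return re.match(r"([0-9]+)\s*([a-z]+)?", s.lower().strip()).groups()

# After: compile once at module load and reuse the compiled pattern object.
_BYTE_STRING_RE = re.compile(r"([0-9]+)\s*([a-z]+)?")

def parse_byte_string_fast(s: str):
    return _BYTE_STRING_RE.match(s.lower().strip()).groups()
{code}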
[jira] [Resolved] (SPARK-48489) Throw a user-facing error when reading invalid schema from text DataSource
[ https://issues.apache.org/jira/browse/SPARK-48489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48489. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46823 [https://github.com/apache/spark/pull/46823] > Throw a user-facing error when reading invalid schema from text DataSource > --- > > Key: SPARK-48489 > URL: https://issues.apache.org/jira/browse/SPARK-48489 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Stefan Bukorovic >Assignee: Stefan Bukorovic >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > The text DataSource produces a table schema with only one column, but it is > possible to try to create a table with a schema that has multiple columns. > Currently, when a user tries this, an assert in the code fails and throws an > internal Spark error. We should throw a better user-facing error. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
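A hedged PySpark repro sketch of the situation described above (the path is made up, and the exact point where the old assert fired is an assumption): the text source only ever produces a single string column, so a multi-column user schema is invalid.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Two columns requested from a source that only yields one `value` string
# column; this mismatch used to surface as an internal error rather than a
# user-facing one.
schema = StructType([
    StructField("a", StringType()),
    StructField("b", StringType()),
])
spark.read.schema(schema).text("/tmp/example.txt").show()
{code}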
[jira] [Assigned] (SPARK-48489) Throw a user-facing error when reading invalid schema from text DataSource
[ https://issues.apache.org/jira/browse/SPARK-48489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48489: Assignee: Stefan Bukorovic > Throw a user-facing error when reading invalid schema from text DataSource > --- > > Key: SPARK-48489 > URL: https://issues.apache.org/jira/browse/SPARK-48489 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Stefan Bukorovic >Assignee: Stefan Bukorovic >Priority: Minor > Labels: pull-request-available > > The text DataSource produces a table schema with only one column, but it is > possible to try to create a table with a schema that has multiple columns. > Currently, when a user tries this, an assert in the code fails and throws an > internal Spark error. We should throw a better user-facing error. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48422) Using lambda may cause MemoryError
[ https://issues.apache.org/jira/browse/SPARK-48422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhouYang updated SPARK-48422: - Description: In worker.py, there is a function called process() in which the iterator loads all data at once:
{code:java}
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile){code}
This causes a MemoryError when working on large-scale data; I have indeed encountered this situation, as shown below:
{code:java}
MemoryError
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:89)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner$$anon$2.run(Executor.scala:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1721)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:353)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2024-05-22 16:50:03,173 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.1 in stage 5.0 (TID 21, saturndatanode3, executor 2, partition 0, ANY, 5075 bytes)
2024-05-22 16:50:03,174 INFO org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 5.0 (TID 19) on saturndatanode3, executor 2: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "xx/spark/python/lib/pyspark.zip/pyspark/worker.py", line 200, in main
    process()
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 195, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in
    func = lambda _, it: map(mapper, it)
  File "", line 1, in
  File "x/spark/python/lib/pyspark.zip/pyspark/worker.py", line 73, in
    return lambda *a: f(*a){code}
I did some tests by adding memory-monitoring code and found that this code takes up a lot of memory during execution:
{code:java}
start_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the beginning: {start_memory} KB")
iterator = deserializer.load_stream(infile)
serializer.dump_stream(func(split_index, iterator), outfile)
end_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Memory usage at the end: {end_memory} KB")
memory_difference = end_memory - start_memory
print(f"Memory usage changes: {memory_difference} KB"){code}
was: In worker.py, there is a function called wrap_udf(f, return_type), lambda is used in its body as below: {code:java} def wrap_udf(f, return_type): if return_type.needConversion(): toInternal = return_type.toInternal return lambda *a: toInternal(f(*a)) else: return lambda *a: f(*a){code} Is there any possibility that it will cause MemoryError when function f is used to work on large scale data?For the reason that I have indeed encountered this situation as below: {code:java} MemoryError at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:244) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:162) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:79
[jira] [Assigned] (SPARK-48374) Support additional PyArrow Table column types
[ https://issues.apache.org/jira/browse/SPARK-48374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48374: Assignee: Ian Cook > Support additional PyArrow Table column types > - > > Key: SPARK-48374 > URL: https://issues.apache.org/jira/browse/SPARK-48374 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > > SPARK-48220 adds support for passing a PyArrow Table to > {{{}createDataFrame(){}}}, but there are a few PyArrow column types that are > not yet supported: > * fixed-size binary > * fixed-size list > * large list > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48374) Support additional PyArrow Table column types
[ https://issues.apache.org/jira/browse/SPARK-48374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48374. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46688 [https://github.com/apache/spark/pull/46688] > Support additional PyArrow Table column types > - > > Key: SPARK-48374 > URL: https://issues.apache.org/jira/browse/SPARK-48374 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-48220 adds support for passing a PyArrow Table to > {{{}createDataFrame(){}}}, but there are a few PyArrow column types that are > not yet supported: > * fixed-size binary > * fixed-size list > * large list > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
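For reference, the column types listed above can be constructed like this (a small sketch assuming pyarrow is installed; the table and column names are illustrative) before handing the Table to spark.createDataFrame():
{code:python}
import pyarrow as pa

# One column for each PyArrow type the ticket lists as not yet supported.
table = pa.table({
    "fixed_size_binary": pa.array([b"ab", b"cd"], type=pa.binary(2)),
    "fixed_size_list": pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int32(), 2)),
    "large_list": pa.array([[1], [2, 3]], type=pa.large_list(pa.int64())),
})
# df = spark.createDataFrame(table)  # supported once this change is in place
{code}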
[jira] [Updated] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
[ https://issues.apache.org/jira/browse/SPARK-48503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48503: --- Labels: pull-request-available (was: ) > Scalar subquery with group-by and non-equality predicate incorrectly allowed, > wrong results > --- > > Key: SPARK-48503 > URL: https://issues.apache.org/jira/browse/SPARK-48503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > Labels: pull-request-available > > This query is not legal and should give an error, but instead we incorrectly > allow it and it returns wrong results. > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1); > create table y(y1 int, y2 int); > insert into y values (2, 2), (3, 3); > select *, (select count(*) from y where y1 > x1 group by y1) from x; {code} > It returns two rows, even though there's only one row of x. > The correct result is an error: more than one row returned by a subquery used > as an expression (as seen in postgres for example) > > This is a longstanding bug. The bug is in CheckAnalysis in > {{{}checkAggregateInScalarSubquery{}}}. It allows grouping columns that are > present in correlation predicates, but doesn’t check whether those predicates > are equalities - because when that code was written, non-equality > correlation wasn’t allowed. Therefore, it looks like this bug has existed > since non-equality correlation was added (~2 years ago). > > Various other expressions that are not equi-joins between the inner and outer > fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. > Another bugged case is if the correlation condition is an equality but it's > under another operator like an OUTER JOIN or UNION. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
[ https://issues.apache.org/jira/browse/SPARK-48503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-48503: -- Parent: SPARK-35553 Issue Type: Sub-task (was: Bug) > Scalar subquery with group-by and non-equality predicate incorrectly allowed, > wrong results > --- > > Key: SPARK-48503 > URL: https://issues.apache.org/jira/browse/SPARK-48503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > This query is not legal and should give an error, but instead we incorrectly > allow it and it returns wrong results. > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1); > create table y(y1 int, y2 int); > insert into y values (2, 2), (3, 3); > select *, (select count(*) from y where y1 > x1 group by y1) from x; {code} > It returns two rows, even though there's only one row of x. > The correct result is an error: more than one row returned by a subquery used > as an expression (as seen in postgres for example) > > This is a longstanding bug. The bug is in CheckAnalysis in > {{{}checkAggregateInScalarSubquery{}}}. It allows grouping columns that are > present in correlation predicates, but doesn’t check whether those predicates > are equalities - because when that code was written, non-equality > correlation wasn’t allowed. Therefore, it looks like this bug has existed > since non-equality correlation was added (~2 years ago). > > Various other expressions that are not equi-joins between the inner and outer > fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. > Another bugged case is if the correlation condition is an equality but it's > under another operator like an OUTER JOIN or UNION. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
[ https://issues.apache.org/jira/browse/SPARK-48501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-48501: -- Description: Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw an error {{{}Correlated scalar subqueries must be aggregated{}}}. This check is often too restrictive, there are many cases where it should actually be runnable even though we don’t know it - e.g. unique keys or functional dependencies might ensure that there's only one row. To handle these cases, it’s better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws exception on >=2 rows input, a “single join” operator that throws an exception when >= 2 rows match, or something similar. There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid (e.g. SPARK-48503, SPARK-18504). Doing the check at runtime would help avoid those bugs. Current workarounds: Users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or users can add {{limit 1}} was: Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw an error {{{}Correlated scalar subqueries must be aggregated{}}}. This check is often too restrictive, there are many cases where it should actually be runnable even though we don’t know it - e.g. unique keys or functional dependencies might ensure that there's only one row. To handle these cases, it’s better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws exception on >=2 rows input, a “single join” operator that throws an exception when >= 2 rows match, or something similar. There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid. Doing the check at runtime would help avoid those bugs. Current workarounds: Users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or users can add {{limit 1}} > Loosen `correlated scalar subqueries must be aggregated` error by doing > runtime check for scalar subqueries output rowcount > --- > > Key: SPARK-48501 > URL: https://issues.apache.org/jira/browse/SPARK-48501 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw > an error {{{}Correlated scalar subqueries must be aggregated{}}}. > This check is often too restrictive, there are many cases where it should > actually be runnable even though we don’t know it - e.g. unique keys or > functional dependencies might ensure that there's only one row. > To handle these cases, it’s better to do the check at runtime instead of > statically. This could be implemented as a special aggregate operator that > throws exception on >=2 rows input, a “single join” operator that throws an > exception when >= 2 rows match, or something similar. > There are also cases where we were incorrectly allowing queries before that > returned wrong results, and should have been rejected as invalid (e.g. > SPARK-48503, SPARK-18504). Doing the check at runtime would help avoid those > bugs. 
> Current workarounds: Users can add an aggregate like {{any_value()}} or > {{first()}} to the output of the subquery, or users can add {{limit 1}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
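The workarounds mentioned in the last paragraph, written out against the x/y tables from the SPARK-48503 repro (a sketch that assumes those tables already exist and an active SparkSession):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Aggregate the subquery output so it is provably single-row...
spark.sql("SELECT *, (SELECT any_value(y2) FROM y WHERE y1 > x1) FROM x").show()

# ...or cap the subquery output explicitly with LIMIT 1.
spark.sql("SELECT *, (SELECT y2 FROM y WHERE y1 > x1 LIMIT 1) FROM x").show()
{code}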
[jira] [Created] (SPARK-48503) Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results
Jack Chen created SPARK-48503: - Summary: Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results Key: SPARK-48503 URL: https://issues.apache.org/jira/browse/SPARK-48503 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen This query is not legal and should give an error, but instead we incorrectly allow it and it returns wrong results. {code:java} create table x(x1 int, x2 int); insert into x values (1, 1); create table y(y1 int, y2 int); insert into y values (2, 2), (3, 3); select *, (select count(*) from y where y1 > x1 group by y1) from x; {code} It returns two rows, even though there's only one row of x. The correct result is an error: more than one row returned by a subquery used as an expression (as seen in postgres for example) This is a longstanding bug. The bug is in CheckAnalysis in {{{}checkAggregateInScalarSubquery{}}}. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities - because when that code was written, non-equality correlation wasn’t allowed. Therefore, it looks like this bug has existed since non-equality correlation was added (~2 years ago). Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. Another bugged case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48502) shift functions do not support column as second argument
[ https://issues.apache.org/jira/browse/SPARK-48502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frederik Paradis updated SPARK-48502: - Description: In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support the numBits argument as a column, only as a literal. The SQL function does support this. (was: In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support the numBits argument as a column, only as literal. The SQL function does support this.) > shift functions do not support column as second argument > > > Key: SPARK-48502 > URL: https://issues.apache.org/jira/browse/SPARK-48502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Frederik Paradis >Priority: Major > > In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support > the numBits argument as a column, only as a literal. The SQL function does > support this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
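A short PySpark illustration of the limitation and a workaround (column and table names are made up; the expr() path works because, as noted above, the SQL function does accept a column for the second argument):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(8, 1), (16, 2)], ["value", "bits"])

# Supported today: numBits given as a Python int literal.
df.select(F.shiftright("value", 1)).show()

# numBits taken from a column is what this ticket asks for; until then, the
# SQL expression form can be used instead.
df.select(F.expr("shiftright(value, bits)")).show()
{code}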
[jira] [Created] (SPARK-48502) shift functions do not support column as second argument
Frederik Paradis created SPARK-48502: Summary: shift functions do not support column as second argument Key: SPARK-48502 URL: https://issues.apache.org/jira/browse/SPARK-48502 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.5.1 Reporter: Frederik Paradis In PySpark, the shift* functions (shiftright/shiftleft, etc.) do not support the numBits argument as a column, only as literal. The SQL function does support this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48501) Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount
Jack Chen created SPARK-48501: - Summary: Loosen `correlated scalar subqueries must be aggregated` error by doing runtime check for scalar subqueries output rowcount Key: SPARK-48501 URL: https://issues.apache.org/jira/browse/SPARK-48501 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen Currently if a scalar subquery’s result isn’t aggregated or limit 1, we throw an error {{{}Correlated scalar subqueries must be aggregated{}}}. This check is often too restrictive, there are many cases where it should actually be runnable even though we don’t know it - e.g. unique keys or functional dependencies might ensure that there's only one row. To handle these cases, it’s better to do the check at runtime instead of statically. This could be implemented as a special aggregate operator that throws exception on >=2 rows input, a “single join” operator that throws an exception when >= 2 rows match, or something similar. There are also cases where we were incorrectly allowing queries before that returned wrong results, and should have been rejected as invalid. Doing the check at runtime would help avoid those bugs. Current workarounds: Users can add an aggregate like {{any_value()}} or {{first()}} to the output of the subquery, or users can add {{limit 1}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36115) Handle the COUNT bug for correlated IN/EXISTS subquery
[ https://issues.apache.org/jira/browse/SPARK-36115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen resolved SPARK-36115. --- Resolution: Fixed This was fixed by https://github.com/apache/spark/pull/43111 > Handle the COUNT bug for correlated IN/EXISTS subquery > -- > > Key: SPARK-36115 > URL: https://issues.apache.org/jira/browse/SPARK-36115 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > Correlated IN/EXISTS subqueries are also subject to the COUNT bug which is > not handled. > {code:sql} > create view t1(c1, c2) as values (0, 1), (1, 2) > create view t2(c1, c2) as values (0, 2), (0, 3) > -- Example 1: IN subquery > select * from t1 where c1 in (select count(*) + 1 from t2 where t1.c1 = t2.c1) > -- Correct answer: (1, 2) > +---+---+ > |c1 |c2 | > +---+---+ > +---+---+ > -- Example 2: EXISTS subquery > select * from t1 where exists (select count(*) from t2 where t1.c1 = t2.c1) > -- Correct answer: [(0, 1), (1, 2)] > +---+---+ > |c1 |c2 | > +---+---+ > |0 |1 | > +---+---+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48500) On the client side, there is no information about the exception that caused the job to fail
Sergey Kotlov created SPARK-48500: - Summary: On the client side, there is no information about the exception that caused the job to fail Key: SPARK-48500 URL: https://issues.apache.org/jira/browse/SPARK-48500 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.1 Reporter: Sergey Kotlov When loading a table into BigQuery using the [BigQuery connector|https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example], the Spark Connect client does not receive information about the exception causing the problem. Example: {code:java} spark.table("testds.test_table") .write .format("bigquery") .mode("overwrite") .option("project", "example-analytics") .option("table", "testds.test_table") .save() {code} When running with Spark Connect, in the logs on the client side I see only: {code:java} Uncaught exception in main job thread org.apache.spark.SparkException: org.sparkproject.io.grpc.StatusRuntimeException: INTERNAL: Failed to write to BigQuery at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:113) at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41) at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:52) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at org.apache.spark.sql.connect.client.WrappedCloseableIterator.foreach(CloseableIterator.scala:30) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at org.apache.spark.sql.connect.client.WrappedCloseableIterator.to(CloseableIterator.scala:30) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at org.apache.spark.sql.connect.client.WrappedCloseableIterator.toBuffer(CloseableIterator.scala:30) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:552) at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) at com.example.testds.SparkTest$.main(SparkTest.scala:39) at com.example.testds.SparkTest.main(SparkTest.scala) End of uncaught exception {code} If the same code is run in a separate spark application, the cause of the error is there. 
{code:java} Uncaught exception in main job thread java.lang.RuntimeException: Failed to write to BigQuery at shadow.example.bigquery.com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:93) at shadow.example.bigquery.com.google.cloud.spark.bigquery.BigQueryInsertableRelation.insert(BigQueryInsertableRelation.scala:43) at shadow.example.bigquery.com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:113) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transf
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Parent: SPARK-44111 Issue Type: Sub-task (was: Bug) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Parent: (was: SPARK-44111) Issue Type: Bug (was: Sub-task) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48302: --- Labels: pull-request-available (was: ) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > Labels: pull-request-available > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Preserve nulls in map columns in PyArrow Tables
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Summary: Preserve nulls in map columns in PyArrow Tables (was: Null values in map columns of PyArrow tables are replaced with empty lists) > Preserve nulls in map columns in PyArrow Tables > --- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables containing MapArray columns with nested fields or timestamps are passed to {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables containing MapArray > columns with nested fields or timestamps are passed to > {{{}spark.createDataFrame(){}}}, null values in the MapArray columns are > replaced with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa._{_}version{_}_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa._{_}version{_}_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa._{_}version{_}_) >= LooseVersion("17.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("17.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Description: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. [Update] A fix for this has been implemented in PyArrow in [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. was: Because of a limitation in PyArrow, when PyArrow Tables are passed to {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced with empty lists. The PySpark function where this happens is {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. Also see [https://github.com/apache/arrow/issues/41684]. See the skipped tests and the TODO mentioning SPARK-48302. A possible fix for this will involve adding a {{mask}} argument to {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like: {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0")}} or {{from inspect import signature}} {{"mask" in signature(pa.MapArray.from_arrays).parameters}} and only pass {{mask}} if that's true. > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > [Update] A fix for this has been implemented in PyArrow in > [https://github.com/apache/arrow/pull/41757] by adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. This will be released in PyArrow 17.0.0. > Since older versions of PyArrow (which PySpark will still support for a > while) won't have this argument, we will need to do a check like: > {{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47466) Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
[ https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-47466: - Component/s: Connect Input/Output SQL > Add PySpark DataFrame method to return iterator of PyArrow RecordBatches > > > Key: SPARK-47466 > URL: https://issues.apache.org/jira/browse/SPARK-47466 > Project: Spark > Issue Type: Improvement > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 3.5.1 >Reporter: Ian Cook >Priority: Major > > As a follow-up to SPARK-47365: > {{toArrow()}} is useful when the data is relatively small. For larger data, > the best way to return the contents of a PySpark DataFrame in Arrow format is > to return an iterator of [PyArrow > RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
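As a usage sketch of what this improvement would enable (assuming a PySpark build that includes {{toArrow()}} from SPARK-47365; the batch-iterator method name below is hypothetical, since the final API name depends on the implementation):
{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Small results: toArrow() (SPARK-47365) returns a single PyArrow Table,
# so the whole result must fit in driver memory at once.
table: pa.Table = df.toArrow()

# Large results: the idea here is an iterator of RecordBatches, so the
# driver only holds one batch at a time.
for batch in df.toArrowBatches():  # hypothetical method name
    assert isinstance(batch, pa.RecordBatch)
    # ... process one batch at a time ...
{code}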
[jira] [Updated] (SPARK-47466) Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
[ https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-47466: - Affects Version/s: 4.0.0 > Add PySpark DataFrame method to return iterator of PyArrow RecordBatches > > > Key: SPARK-47466 > URL: https://issues.apache.org/jira/browse/SPARK-47466 > Project: Spark > Issue Type: Improvement > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Priority: Major > > As a follow-up to SPARK-47365: > {{toArrow()}} is useful when the data is relatively small. For larger data, > the best way to return the contents of a PySpark DataFrame in Arrow format is > to return an iterator of [PyArrow > RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48302) Null values in map columns of PyArrow tables are replaced with empty lists
[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Cook updated SPARK-48302: - Language: Python > Null values in map columns of PyArrow tables are replaced with empty lists > -- > > Key: SPARK-48302 > URL: https://issues.apache.org/jira/browse/SPARK-48302 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ian Cook >Priority: Major > > Because of a limitation in PyArrow, when PyArrow Tables are passed to > {{{}spark.createDataFrame(){}}}, null values in MapArray columns are replaced > with empty lists. > The PySpark function where this happens is > {{{}pyspark.sql.pandas.types._check_arrow_array_timestamps_localize{}}}. > Also see [https://github.com/apache/arrow/issues/41684]. > See the skipped tests and the TODO mentioning SPARK-48302. > A possible fix for this will involve adding a {{mask}} argument to > {{{}pa.MapArray.from_arrays{}}}. But since older versions of PyArrow (which > PySpark will still support for a while) won't have this argument, we will > need to do a check like: > {{LooseVersion(pa.\_\_version\_\_) >= LooseVersion("1X.0.0")}} > or > {{from inspect import signature}} > {{"mask" in signature(pa.MapArray.from_arrays).parameters}} > and only pass {{mask}} if that's true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()
[ https://issues.apache.org/jira/browse/SPARK-48220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48220: Assignee: Ian Cook > Allow passing PyArrow Table to createDataFrame() > > > Key: SPARK-48220 > URL: https://issues.apache.org/jira/browse/SPARK-48220 > Project: Spark > Issue Type: Sub-task > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > > SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table. > It would be nice if we could also go in the opposite direction, enabling > users to create a Spark DataFrame from a PyArrow Table by passing the PyArrow > Table to {{spark.createDataFrame()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48220) Allow passing PyArrow Table to createDataFrame()
[ https://issues.apache.org/jira/browse/SPARK-48220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48220. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46529 [https://github.com/apache/spark/pull/46529] > Allow passing PyArrow Table to createDataFrame() > > > Key: SPARK-48220 > URL: https://issues.apache.org/jira/browse/SPARK-48220 > Project: Spark > Issue Type: Sub-task > Components: Connect, Input/Output, PySpark, SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Ian Cook >Assignee: Ian Cook >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPARK-47365 added support for returning a Spark DataFrame as a PyArrow Table. > It would be nice if we could also go in the opposite direction, enabling > users to create a Spark DataFrame from a PyArrow Table by passing the PyArrow > Table to {{spark.createDataFrame()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
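A minimal usage sketch of the resolved feature (assuming PySpark 4.0.0, where pull request 46529 landed):
{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a PyArrow Table directly ...
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "name": pa.array(["a", "b", "c"]),
})

# ... and pass it straight to createDataFrame(), as enabled by this issue.
df = spark.createDataFrame(table)
df.show()

# The reverse direction (SPARK-47365): Spark DataFrame back to a PyArrow Table.
round_tripped = df.toArrow()
{code}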