Nan Zhu created SPARK-49189:
-------------------------------

             Summary: pyspark auto batching serialization leads to job crash
                 Key: SPARK-49189
                 URL: https://issues.apache.org/jira/browse/SPARK-49189
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.3.0, 3.2.0
            Reporter: Nan Zhu


The auto-batching mechanism in PySpark's serialization can lead to job crashes.

 

The logic keeps increasing the batch size whenever it can:
[https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L269-L285]
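For reference, a simplified, paraphrased sketch of that doubling loop (the function and helper names below are illustrative stand-ins, not the real PySpark API; the actual logic lives in AutoBatchedSerializer.dump_stream and the length-prefixed write in _write_with_length of the linked serializers.py):

```
import itertools
import pickle
import struct

MAX_FRAME = (1 << 31) - 1   # PySpark refuses to write frames larger than ~2 GiB

def write_with_length(data, stream):
    # Mirrors the guard that raises in the traceback below
    # (_write_with_length in serializers.py).
    if len(data) > MAX_FRAME:
        raise ValueError("can not serialize object larger than 2G")
    stream.write(struct.pack("!i", len(data)))
    stream.write(data)

def dump_stream_sketch(iterator, stream, best_size=1 << 16):
    # Simplified rendering of the auto-batching loop: pickle a batch, write it,
    # and double the batch size as long as each written chunk stays "small".
    batch = 1
    it = iter(iterator)
    while True:
        vs = list(itertools.islice(it, batch))
        if not vs:
            break
        data = pickle.dumps(vs, protocol=pickle.HIGHEST_PROTOCOL)
        write_with_length(data, stream)   # where the 2G ValueError surfaces
        if len(data) < best_size:
            batch *= 2                    # grow aggressively while chunks are small
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2                   # only shrinks after a big chunk was already written
```

The batch size only shrinks after an oversized chunk has already been written, so a sudden jump in per-record size has no chance to be caught before the write.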

 

However, this logic is vulnerable when the total serialized size of the objects in a batch exceeds 2 GiB, and as a result we hit the following error:

```
File "/databricks/spark/python/pyspark/worker.py", line 1876, in main
    process()
File "/databricks/spark/python/pyspark/worker.py", line 1868, in process
    serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 308, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
File "/databricks/spark/python/pyspark/serializers.py", line 158, in dump_stream
    self._write_with_length(obj, stream)
File "/databricks/spark/python/pyspark/serializers.py", line 172, in _write_with_length
    raise ValueError("can not serialize object larger than 2G")
ValueError: can not serialize object larger than 2G
```
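
For illustration, a hedged sketch of the kind of workload that can trip this (record counts and sizes are made up, not a verified reproduction): many tiny records first let the batch size double into the tens of thousands, and the later, much larger records then land in a single batch whose pickled size goes past 2 GiB.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("auto-batching-2g-sketch").getOrCreate()
sc = spark.sparkContext

def make_record(i):
    # ~20k tiny records let the batch double up to roughly 16k elements;
    # the following ~150 KiB records can then end up in one >2 GiB batch.
    return i if i < 20_000 else b"x" * (150 * 1024)

rdd = sc.parallelize(range(40_000), numSlices=1).map(make_record)
rdd.collect()   # can fail with "can not serialize object larger than 2G"
```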


