Nan Zhu created SPARK-49189:
-------------------------------

             Summary: pyspark auto batching serialization leads to job crash
                 Key: SPARK-49189
                 URL: https://issues.apache.org/jira/browse/SPARK-49189
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.3.0, 3.2.0
            Reporter: Nan Zhu
The auto-batching mechanism of serialization can crash PySpark jobs. The logic increases the batch size whenever the previous serialized batch stayed small: https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L269-L285

However, this logic is vulnerable when the total serialized size of a batch grows past 2G, and as a result the job fails with:

```
  File "/databricks/spark/python/pyspark/worker.py", line 1876, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 1868, in process
    serializer.dump_stream(out_iter, outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 308, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 158, in dump_stream
    self._write_with_length(obj, stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 172, in _write_with_length
    raise ValueError("can not serialize object larger than 2G")
ValueError: can not serialize object larger than 2G
```
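For illustration, here is a simplified, self-contained sketch of the doubling behaviour (not the exact Spark code; the function name `auto_batched_dump`, the shrunken `max_frame`, and the toy row sizes are assumptions for the demo). The batch count is tuned while rows are small; when larger rows arrive later in the iterator, that same batch count can produce a single serialized frame that exceeds the signed-32-bit length limit, which is where the real worker raises the 2G error:

```python
import io
import itertools
import pickle
import struct


def auto_batched_dump(iterator, stream, best_size=1 << 16, max_frame=1 << 31):
    """Toy sketch of auto-batched dump_stream: the batch size doubles while
    serialized batches stay below best_size, so a batch count grown on small
    rows can later be applied to large rows and overflow the frame limit."""
    batch = 1
    iterator = iter(iterator)
    while True:
        values = list(itertools.islice(iterator, batch))
        if not values:
            break
        payload = pickle.dumps(values)
        if len(payload) >= max_frame:
            # Corresponds to _write_with_length refusing frames >= 2 GiB.
            raise ValueError("can not serialize object larger than 2G")
        stream.write(struct.pack("!i", len(payload)))
        stream.write(payload)
        if len(payload) < best_size:
            batch *= 2            # grow while batches look small
        elif len(payload) > best_size * 10 and batch > 1:
            batch //= 2           # shrink only after an oversized batch was already written


if __name__ == "__main__":
    # 20k tiny rows let the batch double into the thousands; the 4 KiB rows
    # that follow are then packed with that same batch count.  max_frame is
    # shrunk from 2 GiB to 8 MiB here purely so the demo runs quickly.
    rows = itertools.chain(
        (b"x" * 8 for _ in range(20_000)),
        (b"y" * 4096 for _ in range(10_000)),
    )
    auto_batched_dump(rows, io.BytesIO(), max_frame=8 * 1024 * 1024)
```

Running the sketch raises the same ValueError once a batch of the larger rows is serialized, which mirrors the reported crash: the batch size is only adjusted after a batch has been serialized, so the first oversized batch is already too late.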