Ziqi Liu created SPARK-40622:
--------------------------------

             Summary: Result of a single task in collect() must fit in 2GB
                 Key: SPARK-40622
                 URL: https://issues.apache.org/jira/browse/SPARK-40622
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 3.3.0
            Reporter: Ziqi Liu


When collecting results, the data from a single partition/task is serialized through a byte array or a ByteBuffer (which is backed by a byte array as well), so it is subject to the maximum size of a Java array (for a byte array, about 2GB).
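
For context, this is a hard JVM limit rather than a Spark configuration issue: array lengths are Int-indexed, so a single byte[] can never exceed roughly Int.MaxValue bytes, and ByteArrayOutputStream throws OutOfMemoryError once its required capacity overflows Int. A minimal, Spark-independent sketch of the limit (illustrative only):
{code:java}
// Plain Scala/JVM illustration of the ceiling that collect() runs into; not Spark code.
object ArrayLimitDemo {
  def main(args: Array[String]): Unit = {
    // A JVM array is indexed by Int, so its length can never exceed Int.MaxValue (~2GB for byte[]).
    val maxBytes: Long = Int.MaxValue.toLong
    println(s"Largest expressible byte[]: $maxBytes bytes (~2GB)")

    // ByteArrayOutputStream grows a single backing byte[]; once the required capacity
    // overflows Int it throws OutOfMemoryError instead of allocating a larger buffer.
    // (A real run writing >2GB is omitted here since it needs a correspondingly large heap.)
  }
}
{code}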

 

Constructing a single partition larger than 2GB and collecting it easily reproduces the issue:
{code:java}
// genData is assumed to be a registered UDF that produces roughly 1MB of data per row,
// so the single partition (numPartitions = 1) holds about 3GB in total.
val df = spark.range(0, 3000, 1, 1).selectExpr("id", "genData(id, 1000000) as data")

withSQLConf("spark.databricks.driver.localMaxResultSize" -> "4g") {
  withSQLConf("spark.sql.useChunkedBuffer" -> "true") {
    df.queryExecution.executedPlan.executeCollect()
  }
} {code}
This hits an OOM error thrown from 
[https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L125]

 

Consider using ChunkedByteBuffer to replace the byte array in order to bypass this limit.
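
A minimal sketch of that direction (an assumption about the shape of a fix, not the actual patch): Spark already ships org.apache.spark.util.io.ChunkedByteBuffer and ChunkedByteBufferOutputStream, which back one logical buffer with multiple smaller ByteBuffers, so the total size is bounded by memory rather than by Int.MaxValue. Both classes are private[spark], so the sketch assumes it lives under the org.apache.spark namespace; serializeToChunks and the 4MB chunk size are illustrative names/values, not existing APIs.
{code:java}
package org.apache.spark.sketch  // ChunkedByteBuffer is private[spark], so stay inside org.apache.spark

import java.nio.ByteBuffer

import org.apache.spark.util.io.{ChunkedByteBuffer, ChunkedByteBufferOutputStream}

object ChunkedCollectSketch {
  // Stream serialized rows into fixed-size chunks instead of one contiguous byte[];
  // each chunk stays well under the 2GB array ceiling while the total can exceed it.
  def serializeToChunks(
      serializedRows: Iterator[Array[Byte]],
      chunkSize: Int = 4 * 1024 * 1024): ChunkedByteBuffer = {
    val out = new ChunkedByteBufferOutputStream(chunkSize, ByteBuffer.allocate)
    serializedRows.foreach(out.write)
    out.close()
    out.toChunkedByteBuffer
  }
}
{code}
On the driver side, the chunked result could then be consumed through a stream view of the buffer rather than being materialized as one contiguous array.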


