Bruce Robbins created SPARK-40403:
-------------------------------------

             Summary: Negative size in error message when unsafe array is too big
                 Key: SPARK-40403
                 URL: https://issues.apache.org/jira/browse/SPARK-40403
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Bruce Robbins
When initializing an overly large unsafe array via {{UnsafeArrayWriter#initialize}}, {{BufferHolder#grow}} may report an error message with a negative size, e.g.:
{noformat}
java.lang.IllegalArgumentException: Cannot grow BufferHolder by size -2115263656 because the size is negative
{noformat}
(Note: as far as I can tell, this is not related to SPARK-39608, despite having the same symptom.)

When calculating the initial size in bytes needed for the array, {{UnsafeArrayWriter#initialize}} uses an int expression, which can overflow. The initialize method then passes the negative size to {{BufferHolder#grow}}, which complains about the negative size.

Example (the following will run just fine on a 16GB laptop, despite the large driver memory setting):
{noformat}
bin/spark-sql --driver-memory 22g --master "local[1]"

create or replace temp view data1 as
select 0 as key, id as val from range(0, 268271216);

create or replace temp view data2 as
select key as lkey, collect_list(val) as bigarray
from data1
group by key;

-- the below cache forces Spark to create unsafe rows
cache lazy table data2;

select count(*) from data2;
{noformat}
After a few minutes, {{UnsafeArrayWriter#initialize}} will throw the following exception:
{noformat}
java.lang.IllegalArgumentException: Cannot grow BufferHolder by size -2115263656 because the size is negative
	at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.serialize(collect.scala:73)
	at org.apache.spark.sql.catalyst.expressions.aggregate.Collect.serialize(collect.scala:37)
{noformat}
This query was going to fail anyway, but the message makes it look like a bug in Spark rather than a user problem. {{UnsafeArrayWriter#initialize}} should calculate the size using a long expression and fail if it exceeds {{Integer.MAX_VALUE}}, showing the actual initial size in the error message.
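For reference, a minimal standalone sketch of the arithmetic involved (this is not the actual Spark source; it assumes the array header is 8 bytes for the element count plus one null-tracking bit per element, rounded up to whole 8-byte words, and the class and variable names are illustrative). With the 268271216-element long array from the repro above, the int expression wraps to exactly the -2115263656 seen in the error message, while the long-based variant keeps the real size and could fail with a meaningful message:
{noformat}
// OverflowSketch.java -- illustrative only, not the Spark implementation.
public class OverflowSketch {
  public static void main(String[] args) {
    int numElements = 268271216; // elements in the collected array from the repro
    int elementSize = 8;         // bytes per long element

    // Assumed header layout: 8 bytes for the element count plus one null bit
    // per element, rounded up to whole 8-byte words.
    int headerInBytes = 8 + ((numElements + 63) / 64) * 8;

    // int arithmetic: silently wraps around to a negative value.
    int totalAsInt = headerInBytes + elementSize * numElements;
    System.out.println(totalAsInt);   // prints -2115263656

    // Suggested direction: compute in long and fail with the real size.
    long totalAsLong = (long) headerInBytes + (long) elementSize * numElements;
    System.out.println(totalAsLong);  // prints 2179703640
    if (totalAsLong > Integer.MAX_VALUE) {
      throw new UnsupportedOperationException(
        "Cannot initialize unsafe array of " + totalAsLong +
        " bytes because it exceeds " + Integer.MAX_VALUE);
    }
  }
}
{noformat}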