Ala Luszczak created SPARK-24133:
------------------------------------

             Summary: Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
                 Key: SPARK-24133
                 URL: https://issues.apache.org/jira/browse/SPARK-24133
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Ala Luszczak


ColumnVectors store string data in one big byte array. Since the array size is 
capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store more 
than 2GB of string data.

However, since Parquet files commonly contain large blobs stored as strings, 
and ColumnVectors by default hold 4096 values, it is entirely possible to go 
past that limit.
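
For illustration only, here is a minimal reproduction sketch. The path, row count, and per-value size are assumptions rather than details from this report: with roughly 1 MB per value, a default 4096-value batch needs about 4 GB of contiguous string storage, which is past the limit described above.

{code:java}
import org.apache.spark.sql.SparkSession;

public class Spark24133Repro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("SPARK-24133 repro").getOrCreate();

    // 4096 rows of ~1 MB strings: a single default ColumnarBatch would need
    // ~4 GB of contiguous string storage, well past the ~2 GB byte[] limit.
    spark.range(4096)
        .selectExpr("repeat('x', 1048576) AS s")
        .write().mode("overwrite").parquet("/tmp/spark-24133-repro");

    // Reading the file back with the vectorized Parquet reader may then fail
    // with java.lang.ArrayIndexOutOfBoundsException as described below.
    spark.read().parquet("/tmp/spark-24133-repro").count();
  }
}
{code}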

In such cases the required capacity overflows int and a negative capacity is 
requested from WritableColumnVector.reserve(). The call silently succeeds, 
because the negative requested capacity compares as smaller than the capacity 
already allocated, and consequently java.lang.ArrayIndexOutOfBoundsException 
is thrown when the reader actually attempts to put the data into the array.
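
As a minimal sketch of the arithmetic behind the negative capacity (the variable names and sizes are illustrative, not taken from the reader code), the required byte count is tracked as an int, so once it passes Integer.MAX_VALUE it wraps around to a negative value:

{code:java}
public class NegativeCapacityDemo {
  public static void main(String[] args) {
    int valuesInBatch = 3000;       // well within the default 4096 values per batch
    int bytesPerValue = 1 << 20;    // ~1 MB of string data per value
    // 3000 * 1048576 = 3,145,728,000, which does not fit in an int ...
    int requiredCapacity = valuesInBatch * bytesPerValue;
    // ... so it wraps around and prints -1149239296: the "negative capacity"
    // that reserve() then treats as already satisfied.
    System.out.println(requiredCapacity);
  }
}
{code}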

This behavior is hard for users to troubleshoot. Spark should instead check 
for a negative requested capacity in WritableColumnVector.reserve() and throw 
a more informative error, instructing the user to reduce the ColumnarBatch size.
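
A possible shape for such a check is sketched below. This is only an illustration of the proposal, not an actual patch: the error message wording is invented, and the capacity field and growth logic stand in for whatever WritableColumnVector.reserve() currently does.

{code:java}
// Sketch only: a guard at the top of WritableColumnVector.reserve().
public void reserve(int requiredCapacity) {
  if (requiredCapacity < 0) {
    // The requested capacity overflowed int. Fail here with an actionable
    // message instead of letting a later array write throw
    // java.lang.ArrayIndexOutOfBoundsException.
    throw new RuntimeException("Cannot reserve " + requiredCapacity + " bytes of column " +
        "data. This usually means a single ColumnarBatch contains more than 2GB of string " +
        "data; consider reducing the batch size or disabling the vectorized reader.");
  } else if (requiredCapacity > capacity) {
    // ... existing growth logic, unchanged ...
  }
}
{code}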


