Adam Binford created SPARK-42347:
------------------------------------

             Summary: Arrow string and binary vectors only support 1 GiB
                 Key: SPARK-42347
                 URL: https://issues.apache.org/jira/browse/SPARK-42347
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Adam Binford


Since Arrow 10.0.0, BaseVariableWidthVector (the parent for string and binary 
vectors), only supports expanding up to 1 GiB through the safe interfaces, 
which Spark uses, instead of 2 GiB previously. This is due to 
[https://github.com/apache/arrow/pull/13815.] I added a comment in there but 
haven't got any responses yet, will make an issue in Arrow as well.

Basically whenever you try to add data beyond 1 GiB, the vector will try to 
double itself to the next power of two, which would be {{{}2147483648{}}}, 
which is greater than {{Integer.MAX_VALUE}} which is {{{}2147483647{}}}, thus 
throwing a {{{}OversizedAllocationException{}}}.

See [https://github.com/apache/spark/pull/39572#issuecomment-1383195213] and 
the comment above for how I recreated to show this was now the case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to