[ https://issues.apache.org/jira/browse/SPARK-42347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684212#comment-17684212 ]

Adam Binford commented on SPARK-42347:
--------------------------------------

[https://github.com/apache/spark/pull/39572] is a potential workaround that lets users 
enable the large variable-width vectors when they hit this limit.

> Arrow string and binary vectors only support 1 GiB
> --------------------------------------------------
>
>                 Key: SPARK-42347
>                 URL: https://issues.apache.org/jira/browse/SPARK-42347
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Adam Binford
>            Priority: Major
>
> Since Arrow 10.0.0, BaseVariableWidthVector (the parent class for string and binary 
> vectors) only supports expanding up to 1 GiB through the safe interfaces, which 
> Spark uses, instead of 2 GiB previously. This is due to 
> [https://github.com/apache/arrow/pull/13815]. I added a comment there but 
> haven't gotten any responses yet; I will file an issue in Arrow as well.
> Basically, whenever you try to add data beyond 1 GiB, the vector tries to 
> double itself to the next power of two, which would be {{2147483648}}. That is 
> greater than {{Integer.MAX_VALUE}} ({{2147483647}}), so an 
> {{OversizedAllocationException}} is thrown.
> See [https://github.com/apache/spark/pull/39572#issuecomment-1383195213] and 
> the comment above it for how I reproduced this to show it is now the case.
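The overflow arithmetic described above can be sketched in plain Java. This is not Arrow's actual reallocation code, just a minimal illustration of why doubling past 1 GiB trips the 2 GiB `int` ceiling:

```java
public class DoublingOverflowDemo {
    public static void main(String[] args) {
        // Vector buffer capacity at the 1 GiB safe limit.
        long oneGiB = 1L << 30;                 // 1073741824 bytes

        // Growing past that point doubles capacity to the next power of two.
        long doubled = oneGiB * 2;              // 2147483648 bytes

        // The doubled size no longer fits in a signed 32-bit int,
        // which is the condition that raises OversizedAllocationException.
        System.out.println(doubled);                       // 2147483648
        System.out.println(Integer.MAX_VALUE);             // 2147483647
        System.out.println(doubled > Integer.MAX_VALUE);   // true
    }
}
```

In other words, with the safe limit lowered to 1 GiB, the very first doubling beyond it lands exactly one byte past `Integer.MAX_VALUE`.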



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
