[ https://issues.apache.org/jira/browse/SPARK-42347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684212#comment-17684212 ]
Adam Binford commented on SPARK-42347:
--------------------------------------

[https://github.com/apache/spark/pull/39572] is a potential workaround that allows enabling the large variable-width vectors when users hit this limit.

> Arrow string and binary vectors only support 1 GiB
> --------------------------------------------------
>
>                 Key: SPARK-42347
>                 URL: https://issues.apache.org/jira/browse/SPARK-42347
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Adam Binford
>            Priority: Major
>
> Since Arrow 10.0.0, BaseVariableWidthVector (the parent class for string and binary
> vectors) only supports expanding up to 1 GiB through the safe interfaces, which
> Spark uses, instead of the previous 2 GiB. This is due to
> [https://github.com/apache/arrow/pull/13815]. I added a comment there but
> haven't received any responses yet; I will file an issue in Arrow as well.
> Basically, whenever you try to add data beyond 1 GiB, the vector tries to
> double its capacity to the next power of two, which would be {{2147483648}}.
> That is greater than {{Integer.MAX_VALUE}} ({{2147483647}}), so an
> {{OversizedAllocationException}} is thrown.
> See [https://github.com/apache/spark/pull/39572#issuecomment-1383195213] and
> the comment above it for how I reproduced this to show it is now the case.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
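The overflow arithmetic described in the quoted issue can be sketched in plain Java. This is a hypothetical illustration of the power-of-two growth behavior, not Arrow's actual `BaseVariableWidthVector` code; the class and method names here are invented for the example.

```java
public class VectorGrowthSketch {
    // Smallest power of two >= n, mirroring the buffer-doubling strategy
    // the issue describes (illustrative only, not Arrow's implementation).
    static long nextPowerOfTwo(long n) {
        long size = 1;
        while (size < n) {
            size <<= 1;
        }
        return size;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30; // 1073741824 bytes, the new safe limit
        // Requesting even one byte past 1 GiB doubles the buffer to 2 GiB...
        long grown = nextPowerOfTwo(oneGiB + 1);
        System.out.println(grown);                     // 2147483648
        // ...which no longer fits in a signed 32-bit capacity, which is
        // why the safe allocation path rejects it with an
        // OversizedAllocationException.
        System.out.println(grown > Integer.MAX_VALUE); // true
    }
}
```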