[ https://issues.apache.org/jira/browse/SPARK-39979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676610#comment-17676610 ]
Adam Binford commented on SPARK-39979:
--------------------------------------

We have encountered this as well and I'm working on a fix. This is a limitation of using the BaseVariableWidthVector-based classes for string and binary data rather than the BaseLargeVariableWidthVector-based types. A single string or binary vector cannot hold more than 2 GiB of data, because it stores an offset to each value and the offsets are stored as 4-byte integers. The large types use 8-byte offsets instead, which removes the limitation. The options are:
 * Add a config you can set to use the large types instead of the normal types
 * Just use the large types for everything, at the expense of an additional 4 bytes per record in the vector for storing offsets (not sure if there are additional overheads)
Let me know if there are any thoughts or opinions on which route to go.

> IndexOutOfBoundsException on groupby + apply pandas grouped map udf function
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-39979
>                 URL: https://issues.apache.org/jira/browse/SPARK-39979
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: yaniv oren
>            Priority: Major
>
> I'm grouping on a relatively small number of groups, with large group sizes.
> Working with pyarrow version 2.0.0; machine memory is 64 GiB.
> I'm getting the following error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 387 in stage 162.0 failed 4 times, most recent failure: Lost task 387.3 in stage 162.0 (TID 29957) (ip-172-21-129-187.eu-west-1.compute.internal executor 71): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
> 	at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
> 	at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:890)
> 	at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1087)
> 	at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:251)
> 	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:130)
> 	at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:95)
> 	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:92)
> 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
> 	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:103)
> 	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:435)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2031)
> 	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)
> {code}
> Why do I hit this 2 GB limit? According to SPARK-34588 this is supported; perhaps it is related to SPARK-34020.
> Please assist.
> Note:
> Is it related to the usage of BaseVariableWidthVector and not BaseLargeVariableWidthVector?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)