I was using Arrow with Spark + Python, and when I try some pandas UDAF functions I get this error:
```
org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
	at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
	at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
	at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
	at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
```

I was initially getting an out-of-memory error. Estimating theoretically (with no compression), I realized that the pandas DataFrame it would try to create would be ~8 GB (~20 million records, each ~400 bytes). I have increased my executor memory to 20 GB per executor, but now I get this error from Arrow instead. I'm looking for some pointers so I can understand this issue better.

Here's what I am trying. I have two tables with string columns where the strings always have a fixed length:

*Table 1*:

```
id: integer
char_column1: string (length = 30)
char_column2: string (length = 40)
char_column3: string (length = 10)
...
```

In total, the char columns in table 1 hold ~250 characters per row.

*Table 2*:

```
id: integer
char_column1: string (length = 50)
char_column2: string (length = 3)
char_column3: string (length = 4)
...
```

In total, the char columns in table 2 hold ~150 characters per row.

I am joining these tables by id. In my current dataset I have filtered the data so that only id = 1 exists. Table 1 has ~400 records for id = 1 and table 2 has ~50k records for it, so the total number of records after joining is 400 * 50k = 20 * 10^6. Each row has ~400 bytes (250 + 150), so the overall size is 20 * 10^6 * 400 = 8 * 10^9 bytes ≈ 8 GB.

Even with a 20 GB executor, this does not work. Is there some data duplication happening internally? What is the estimated RAM I need to give for this to work? (A rough sketch of what I'm running, and the size math, are below.)
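For context, since I haven't pasted my actual job: here's a minimal sketch of the *shape* of what I'm running. The table names, the `process_group` body, and the schema handling are placeholders for illustration, not my real code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Spark 2.x Arrow switch

# Placeholder inputs, already filtered down to the single key
# (assume the non-key column names are distinct across the two tables):
table1 = spark.table("table1").filter("id = 1")  # ~400 rows, ~250 chars of strings per row
table2 = spark.table("table2").filter("id = 1")  # ~50k rows, ~150 chars of strings per row

joined = table1.join(table2, "id")  # ~400 * 50k = 20 million rows, all with id = 1

@pandas_udf(joined.schema, PandasUDFType.GROUPED_MAP)
def process_group(pdf):
    # Placeholder for the real per-group pandas logic.
    return pdf

# Since only id = 1 exists, the whole 20M-row join result forms a single
# group, all of which is serialized through Arrow to one Python worker.
result = joined.groupBy("id").apply(process_group)
result.count()
```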
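And here is my back-of-envelope size estimate as plain arithmetic (string payload only, ignoring pandas/Arrow overhead):

```python
# Raw string bytes of the joined DataFrame, uncompressed.
rows_t1 = 400              # table1 rows for id = 1
rows_t2 = 50000            # table2 rows for id = 1
bytes_per_row = 250 + 150  # ~characters of string data per joined row

joined_rows = rows_t1 * rows_t2            # 20,000,000
total_bytes = joined_rows * bytes_per_row  # 8,000,000,000 -> ~8 GB
print("%d rows, ~%.1f GB of raw string data" % (joined_rows, total_bytes / 1e9))
```

Thanks for reading,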