I am using Arrow with Spark + Python, and when I try some pandas UDAF functions I get this error:

org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer
at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)

I was initially getting an insufficient-memory error, and worked out that (with no compression) the pandas DataFrame it would try to create would be ~8GB (~20 million records, each ~400 bytes). I have increased my executor memory to 20GB per executor, but am now getting this error from Arrow instead.
Looking for some pointers so I can understand this issue better.
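
For reference, this is roughly how I am running it (a minimal sketch; values are approximate and the Arrow flag is the standard spark.sql.execution.arrow.enabled setting):

from pyspark.sql import SparkSession

# Approximate session config; the real job has more cluster-specific settings.
spark = (
    SparkSession.builder
    .appName("pandas-udaf-test")                          # placeholder name
    .config("spark.executor.memory", "20g")               # raised after the initial OOM
    .config("spark.sql.execution.arrow.enabled", "true")  # Arrow-based transfer to Python
    .getOrCreate()
)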

Here's what I am trying. I have 2 tables with string columns where the
strings always have a fixed length:
*Table 1*:
    id: integer
    char_column1: string (length = 30)
    char_column2: string (length = 40)
    char_column3: string (length = 10)
    ...
In total, the char columns in table1 add up to ~250 characters per row.

*Table 2*:
    id: integer
    char_column1: string (length = 50)
    char_column2: string (length = 3)
    char_column3: string (length = 4)
    ...
In total, the char columns in table2 add up to ~150 characters per row.
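
For concreteness, the two schemas expressed in PySpark terms (only the first few columns shown; column names are placeholders and the fixed-length strings are plain StringType):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Simplified version of table1: char columns total ~250 characters per row.
table1_schema = StructType([
    StructField("id", IntegerType()),
    StructField("char_column1", StringType()),  # length 30
    StructField("char_column2", StringType()),  # length 40
    StructField("char_column3", StringType()),  # length 10
    # ... more char columns
])

# Simplified version of table2: char columns total ~150 characters per row.
table2_schema = StructType([
    StructField("id", IntegerType()),
    StructField("char_column1", StringType()),  # length 50
    StructField("char_column2", StringType()),  # length 3
    StructField("char_column3", StringType()),  # length 4
    # ... more char columns
])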

I am joining these tables by ID. In my current dataset, I have filtered my
data so only id=1 exists.
Table1 has ~400 records for id=1 and table2 has 50k records for id=1.
Hence, the total number of records after joining (table1_join2) = 400 * 50k = 20*10^6 records.
Each row has ~400 bytes (250 + 150) => overall memory = 8*10^9 bytes => ~8GB
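
What I am running is roughly the sketch below (the table names and the grouped-map pandas UDF are stand-ins; the real UDAF does more work than echoing its input):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Placeholder reads; in reality the two tables come from storage.
table1 = spark.table("table1")   # ~400 rows for id=1, ~250 chars of strings per row
table2 = spark.table("table2")   # ~50k rows for id=1, ~150 chars of strings per row

# After filtering to id=1, the join yields ~400 * 50k = 20 * 10^6 rows,
# each ~400 bytes of string data, i.e. ~8 * 10^9 bytes (~8GB) uncompressed.
joined = table1.join(table2, on="id")

# Stand-in for my pandas UDAF: the whole group for one id arrives as a single
# pandas DataFrame in the Python worker.
@pandas_udf(joined.schema, PandasUDFType.GROUPED_MAP)
def per_id(pdf):
    return pdf

result = joined.groupBy("id").apply(per_id)
result.count()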

Now, when I try an executor with 20GB of RAM, it does not work.
Is there some data duplication happening internally? How much RAM should I
estimate I need to give for this to work?
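
A related question: does the Arrow batch size play a role here? E.g. would a setting along these lines change anything for the pandas-UDAF path (I have not tried it yet):

# Default is 10000 records per Arrow batch; I am not sure whether this applies
# to grouped pandas UDFs or only to toPandas()/scalar pandas UDFs.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")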

Thanks for reading,
