Hi,
From the error it looks like this might potentially be some sort of
integer overflow, but it is hard to say. Could you try to get a minimal
reproduction of the error [1], and open a JIRA Issue [2] with it?
Thanks,
Micah
[1] https://stackoverflow.com/help/mcve
[2] https://issues.apache.or
Hi, any help on this would be much appreciated.
I've not been able to figure out any reason for this to happen yet.
On Sat, Mar 2, 2019, 11:50 Abdeali Kothari wrote:
> Hi Li Jin, thanks for the note.
>
> I get this error only for larger data - when I reduce the number of
> records or the number o
Hi Li Jin, thanks for the note.
I get this error only for larger data - when I reduce the number of records
or the number of columns in my data it all works fine - so if it is a binary
incompatibility, it should be something related to large data.
I am using Spark 2.3.1 on Amazon EMR for this testing
The 2G limit that Uwe mentioned definitely exists; Spark currently serializes
each group as a single RecordBatch.
The "pyarrow.lib.ArrowIOError: read length must be positive or -1" is
strange, I think Spark is on an older version of the Java side (0.10 for
Spark 2.4 and 0.8 for Spark 2.3). I forgot
Forgot to mention: The above testing is with 0.11.1
I tried 0.12.1 as you suggested, and am getting the
OversizedAllocationException with the 80-char column, and "read length must
be positive or -1" without that. So, both issues are reproducible with
pyarrow 0.12.1.
On Sat, Mar 2, 2019 at
That was spot on!
I had 3 columns with 80 characters each => 80 * 21*10^6 ≈ 1.68*10^9 bytes ≈ 1.56 GB per column.
I removed these columns and replaced each with 10 DoubleType columns (so it
would still be 80 bytes of data per row) - and this error didn't come up anymore.
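For reference, a quick back-of-the-envelope check of that estimate (Python used purely as a calculator, with the figures from above):

    rows = 21_000_000          # ~21 million records
    bytes_per_value = 80       # 80-character ASCII strings, ~1 byte per character

    total_bytes = rows * bytes_per_value   # 1,680,000,000 bytes
    print(total_bytes / 2**30)             # ~1.56 GiB for one such column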
I also removed all the other columns and just kept 1 column with
80cha
There is currently the limitation that a column in a single RecordBatch can
only hold 2G on the Java side. We work around this by splitting the DataFrame
under the hood into multiple RecordBatches. I'm not familiar with the
Spark<->Arrow code but I guess that in this case, the Spark code can onl
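To illustrate the idea outside of Spark (a pyarrow-side sketch with made-up data, not the actual Spark<->Arrow code path; max_chunksize is the keyword argument in recent pyarrow versions):

    import pandas as pd
    import pyarrow as pa

    # Made-up data: 1 million rows with an 80-character string column.
    df = pd.DataFrame({"id": range(1_000_000), "payload": ["x" * 80] * 1_000_000})

    table = pa.Table.from_pandas(df)
    # Cap each RecordBatch at 100k rows so no single batch (and no single
    # column buffer) has to hold the whole column at once.
    batches = table.to_batches(max_chunksize=100_000)
    print(len(batches), "batches of", batches[0].num_rows, "rows")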
Is there a limitation that a single column cannot be more than 1-2G?
One of my columns definitely would be around 1.5GB of memory.
I cannot split my DF into more partitions as I have only 1 ID and I'm
grouping by that ID.
So, the UDAF would only run on a single pandas DF.
I do have a requirement to
Hello Abdeali,
a problem here could be that a single column of your DataFrame is using more
than 2GB of RAM (possibly also just 1GB). Try splitting your DataFrame into more
partitions before applying the UDAF.
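As a rough illustration (the input path, column name, and partition count below are placeholders, not your actual job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://example-bucket/example-data")  # placeholder input

    # Spread the rows over more, smaller partitions so each Arrow batch sent
    # to the Python workers stays well below the Java-side limit.
    df = df.repartition(400)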
Cheers
Uwe
On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> I was using arro
I was using Arrow with Spark + Python, and when trying some pandas UDAF
functions I am getting this error:
org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand
the buffer
at
org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
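For context, the setup is along these lines - a simplified sketch, not the actual job; the schema, column names, and data below are made up, and the real data is far larger:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real data: one group id plus an 80-character string column.
    df = spark.createDataFrame(
        pd.DataFrame({"id": [1] * 1_000, "text": ["x" * 80] * 1_000})
    )

    @pandas_udf("id long, n long", PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        # Each group arrives as one pandas DataFrame; on the Java side the
        # whole group is serialized as a single Arrow RecordBatch.
        return pd.DataFrame({"id": [pdf["id"].iloc[0]], "n": [len(pdf)]})

    df.groupby("id").apply(summarize).show()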