Hi,
From the error it looks like this might potentially be some sort of
integer overflow, but it is hard to say. Could you try to get a minimal
reproduction of the error [1], and open a JIRA Issue [2] with it?
Thanks,
Micah
[1] https://stackoverflow.com/help/mcve
[2] https://issues.apache.or
Hi, any help on this would be much appreciated.
I've not been able to figure out any reason for this to happen yet.
On Sat, Mar 2, 2019, 11:50 Abdeali Kothari wrote:
> Hi Li Jin, thanks for the note.
>
> I get this error only for larger data - when I reduce the number of
> records or the number o
Hi Li Jin, thanks for the note.
I get this error only for larger data - when I reduce the number of records
or the number of columns in my data it all works fine - so if it is a binary
incompatibility, it should be something related to large data.
I am using Spark 2.3.1 on Amazon EMR for this testing
The 2G limit that Uwe mentioned definitely exists; Spark currently serializes
each group as a single RecordBatch.
The "pyarrow.lib.ArrowIOError: read length must be positive or -1" is
strange, I think Spark is on an older version of the Java side (0.10 for
Spark 2.4 and 0.8 for Spark 2.3). I forgot
Forgot to mention: The above testing is with 0.11.1
I tried 0.12.1 as you suggested, and am getting the
OversizedAllocationException with the 80-char column, and "read length must
be positive or -1" without that. So, both issues are reproducible with
pyarrow 0.12.1.
On Sat, Mar 2, 2019 at
That was spot on!
I had 3 columns with 80 characters each => 80 * 21*10^6 ≈ 1.68*10^9 bytes ≈ 1.56 GB per column.
I removed these columns and replaced each with 10 DoubleType columns (so it
would still be 80 bytes of data per row) - and this error didn't come up anymore.
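For reference, a quick back-of-the-envelope check of that estimate (Python used purely as a calculator, with the figures from above):

    rows = 21_000_000          # ~21 million records
    bytes_per_value = 80       # 80-character ASCII strings, ~1 byte per character

    total_bytes = rows * bytes_per_value   # 1,680,000,000 bytes
    print(total_bytes / 2**30)             # ~1.56 GiB for one such column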
I also removed all the other columns and just kept 1 column with
80cha
There is currently the limitation that a column in a single RecordBatch can
only hold 2G on the Java side. We work around this by splitting the DataFrame
under the hood into multiple RecordBatches. I'm not familiar with the
Spark<->Arrow code but I guess that in this case, the Spark code can onl
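To illustrate the idea outside of Spark (a pyarrow-side sketch with made-up data, not the actual Spark<->Arrow code path; max_chunksize is the keyword argument in recent pyarrow versions):

    import pandas as pd
    import pyarrow as pa

    # Made-up data: 1 million rows with an 80-character string column.
    df = pd.DataFrame({"id": range(1_000_000), "payload": ["x" * 80] * 1_000_000})

    table = pa.Table.from_pandas(df)
    # Cap each RecordBatch at 100k rows so no single batch (and no single
    # column buffer) has to hold the whole column at once.
    batches = table.to_batches(max_chunksize=100_000)
    print(len(batches), "batches of", batches[0].num_rows, "rows")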
Is there a limitation that a single column cannot be more than 1-2G?
One of my columns definitely would be around 1.5GB of memory.
I cannot split my DF into more partitions as I have only 1 ID and I'm
grouping by that ID.
So, the UDAF would only run on a single pandas DF.
I do have a requirement to
Hello Abdeali,
a problem here could be that a single column of your DataFrame is using more
than 2GB of RAM (possibly also just 1GB). Try splitting your DataFrame into more
partitions before applying the UDAF.
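As a rough illustration (the input path, column name, and partition count below are placeholders, not your actual job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://example-bucket/example-data")  # placeholder input

    # Spread the rows over more, smaller partitions so each Arrow batch sent
    # to the Python workers stays well below the Java-side limit.
    df = df.repartition(400)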
Cheers
Uwe
On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> I was using arro
I was using Arrow with Spark + Python, and when trying some pandas UDAF
functions I am getting this error:
org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand
the buffer
at
org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
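For context, the setup is along these lines - a simplified sketch, not the actual job; the schema, column names, and data below are made up, and the real data is far larger:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real data: one group id plus an 80-character string column.
    df = spark.createDataFrame(
        pd.DataFrame({"id": [1] * 1_000, "text": ["x" * 80] * 1_000})
    )

    @pandas_udf("id long, n long", PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        # Each group arrives as one pandas DataFrame; on the Java side the
        # whole group is serialized as a single Arrow RecordBatch.
        return pd.DataFrame({"id": [pdf["id"].iloc[0]], "n": [len(pdf)]})

    df.groupby("id").apply(summarize).show()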