OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
I was using Arrow with Spark + Python, and when I try some pandas-UDAF
functions I get this error:

org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand
the buffer
at
org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
at
org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
at
org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
at
org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
at
org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
at
org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
at
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
at
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
at
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
at
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
at
org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
at
org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)

I was initially getting an "insufficient RAM" error, and theoretically
(with no compression) I worked out that the pandas DataFrame it would try
to create would be ~8 GB (21 million records, each having ~400 bytes). I
have increased my executor memory to 20 GB per executor, but am now
getting this error from Arrow instead.
I'm looking for some pointers so I can understand this issue better.

Here's what I am trying. I have 2 tables with string columns where the
strings always have a fixed length:
*Table 1*:
id: integer
   char_column1: string (length = 30)
   char_column2: string (length = 40)
   char_column3: string (length = 10)
   ...
In total, in table1, the char-columns have ~250 characters

*Table 2*:
id: integer
   char_column1: string (length = 50)
   char_column2: string (length = 3)
   char_column3: string (length = 4)
   ...
In total, in table2, the char-columns have ~150 characters

I am joining these tables by ID. In my current dataset, I have filtered my
data so only id=1 exists.
Table1 has ~400 records for id=1 and table2 has 50k records for id=1.
Hence, total number of records (after joining) for table1_join2 = 400 * 50k
= 20*10^6 records
Each row has ~400 bytes (150 + 250) => overall memory = 8*10^9 bytes => ~8 GB
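
As a back-of-the-envelope check (plain Python, using only the numbers above;
real Arrow buffers add 4-byte offsets per string value and validity bitmaps
on top of this):

    rows_table1 = 400          # records in table1 for id=1
    rows_table2 = 50_000       # records in table2 for id=1
    bytes_per_row = 250 + 150  # summed fixed string lengths from both tables

    joined_rows = rows_table1 * rows_table2      # 20,000,000 rows
    total_bytes = joined_rows * bytes_per_row    # 8,000,000,000 bytes
    print(f"{joined_rows:,} rows, ~{total_bytes / 1e9:.1f} GB of raw string data")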

Now, when I try an executor with 20 GB RAM, it does not work.
Is there some data duplication happening internally? What should be the
estimated RAM I need to give for this to work?

Thanks for reading,


Re: [ANNOUNCE] New Arrow committer: Paddy Horan

2019-03-01 Thread Krisztián Szűcs
Congrats Paddy!

On Fri, Mar 1, 2019 at 3:19 AM Chao Sun  wrote:

> Congratulations Paddy!
>
> On Thu, Feb 28, 2019 at 5:52 PM paddy horan 
> wrote:
>
> > Thanks All,
> >
> > I am honored to be a part of such a great, talented community.
> >
> > P
> >
> > 
> > From: Renjie Liu 
> > Sent: Thursday, February 28, 2019 7:24 PM
> > To: dev@arrow.apache.org; emkornfi...@gmail.com
> > Subject: Re: [ANNOUNCE] New Arrow committer: Paddy Horan
> >
> > Congrats!
> >
> > Micah Kornfield  于 2019年3月1日周五 上午7:26写道:
> >
> > > Congrats!
> > >
> > > On Thu, Feb 28, 2019 at 3:14 PM Bryan Cutler 
> wrote:
> > >
> > > > Congratulations Paddy!
> > > >
> > > > On Thu, Feb 28, 2019 at 7:14 AM Wes McKinney 
> > > wrote:
> > > >
> > > > > Welcome Paddy and thank you!
> > > > >
> > > > >
> > > > > On Thu, Feb 28, 2019 at 4:29 AM Uwe L. Korn 
> > wrote:
> > > > > >
> > > > > > On behalf of the Arrow PMC, I'm happy to announce that Paddy has
> > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > > >
> > > > > > Welcome, and thank you for your contributions!
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Chao Sun

2019-03-01 Thread Krisztián Szűcs
Congrats Chao!

On Fri, Mar 1, 2019 at 3:19 AM Chao Sun  wrote:

> Thanks everyone. Looking forward to contributing more!
>
> Chao
>
> On Thu, Feb 28, 2019 at 4:24 PM Renjie Liu 
> wrote:
>
> > Congrats!
> >
> > Micah Kornfield  于 2019年3月1日周五 上午7:26写道:
> >
> > > Congrats!
> > >
> > > On Thu, Feb 28, 2019 at 3:02 PM Bryan Cutler 
> wrote:
> > >
> > > > Congratulations Chao!
> > > >
> > > > On Thu, Feb 28, 2019 at 9:27 AM Neville Dipale <
> nevilled...@gmail.com>
> > > > wrote:
> > > >
> > > > > Congratulations Chao and Paddy! I'm loving the increase in velocity
> > on
> > > > the
> > > > > Rust side
> > > > >
> > > > > On Thu, 28 Feb 2019, 17:17 Wes McKinney, 
> > wrote:
> > > > >
> > > > > > thank you Chao, and welcome!
> > > > > >
> > > > > > On Thu, Feb 28, 2019 at 6:18 AM paddy horan <
> > paddyho...@hotmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Congrats Chao!
> > > > > > >
> > > > > > > Get Outlook for iOS
> > > > > > > 
> > > > > > > From: Uwe L. Korn 
> > > > > > > Sent: Thursday, February 28, 2019 5:29 AM
> > > > > > > To: dev@arrow.apache.org
> > > > > > > Subject: [ANNOUNCE] New Arrow committer: Chao Sun
> > > > > > >
> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Chao has
> > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > > > >
> > > > > > > Welcome, and thank you for your contributions!
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Paddy Horan

2019-03-01 Thread Andy Grove
Congratulations, Paddy! Great to have you here.

On Fri, Mar 1, 2019 at 8:45 AM Krisztián Szűcs 
wrote:

> Congrats Paddy!
>
> On Fri, Mar 1, 2019 at 3:19 AM Chao Sun  wrote:
>
> > Congratulations Paddy!
> >
> > On Thu, Feb 28, 2019 at 5:52 PM paddy horan 
> > wrote:
> >
> > > Thanks All,
> > >
> > > I am honored to be a part of such a great, talented community.
> > >
> > > P
> > >
> > > 
> > > From: Renjie Liu 
> > > Sent: Thursday, February 28, 2019 7:24 PM
> > > To: dev@arrow.apache.org; emkornfi...@gmail.com
> > > Subject: Re: [ANNOUNCE] New Arrow committer: Paddy Horan
> > >
> > > Congrats!
> > >
> > > Micah Kornfield  于 2019年3月1日周五 上午7:26写道:
> > >
> > > > Congrats!
> > > >
> > > > On Thu, Feb 28, 2019 at 3:14 PM Bryan Cutler 
> > wrote:
> > > >
> > > > > Congratulations Paddy!
> > > > >
> > > > > On Thu, Feb 28, 2019 at 7:14 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > Welcome Paddy and thank you!
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 28, 2019 at 4:29 AM Uwe L. Korn 
> > > wrote:
> > > > > > >
> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Paddy has
> > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > > > >
> > > > > > > Welcome, and thank you for your contributions!
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Chao Sun

2019-03-01 Thread Andy Grove
Congratulations, Chao! Great to have you here.

On Fri, Mar 1, 2019 at 8:45 AM Krisztián Szűcs 
wrote:

> Congrats Chao!
>
> On Fri, Mar 1, 2019 at 3:19 AM Chao Sun  wrote:
>
> > Thanks everyone. Looking forward to contributing more!
> >
> > Chao
> >
> > On Thu, Feb 28, 2019 at 4:24 PM Renjie Liu 
> > wrote:
> >
> > > Congrats!
> > >
> > > Micah Kornfield  于 2019年3月1日周五 上午7:26写道:
> > >
> > > > Congrats!
> > > >
> > > > On Thu, Feb 28, 2019 at 3:02 PM Bryan Cutler 
> > wrote:
> > > >
> > > > > Congratulations Chao!
> > > > >
> > > > > On Thu, Feb 28, 2019 at 9:27 AM Neville Dipale <
> > nevilled...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Congratulations Chao and Paddy! I'm loving the increase in
> velocity
> > > on
> > > > > the
> > > > > > Rust side
> > > > > >
> > > > > > On Thu, 28 Feb 2019, 17:17 Wes McKinney, 
> > > wrote:
> > > > > >
> > > > > > > thank you Chao, and welcome!
> > > > > > >
> > > > > > > On Thu, Feb 28, 2019 at 6:18 AM paddy horan <
> > > paddyho...@hotmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Congrats Chao!
> > > > > > > >
> > > > > > > > Get Outlook for iOS
> > > > > > > > 
> > > > > > > > From: Uwe L. Korn 
> > > > > > > > Sent: Thursday, February 28, 2019 5:29 AM
> > > > > > > > To: dev@arrow.apache.org
> > > > > > > > Subject: [ANNOUNCE] New Arrow committer: Chao Sun
> > > > > > > >
> > > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Chao has
> > > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > > > > >
> > > > > > > > Welcome, and thank you for your contributions!
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Nightly binary packages

2019-03-01 Thread Krisztián Szűcs
On Wed, Feb 27, 2019 at 9:30 PM Kouhei Sutou  wrote:

> Hi,
>
> > - How should we handle the signing procedure? Simply omit it?
>
> For .deb and .rpm, we need to sign them to install them by
> apt/yum.
>
> We should use a GPG key only for nightly for this
> purpose. We should not use GPG keys in
> https://dist.apache.org/repos/dist/release/arrow/KEYS for
> this purpose.
>
> We can share the GPG key for nightly with PMC members safely
> by encrypting it with the GPG keys in
> https://dist.apache.org/repos/dist/release/arrow/KEYS.
>
> We can use the GPG key for nightly on Travis CI by
> encrypting the GPG key:
> https://docs.travis-ci.com/user/encryption-keys/

Thanks Kou for the clarification!

>
>
> > - May we host the nightlies under the Apache Bintray account?
> > - Do we want to use JFrog Artifactory over Bintray?
> >   If so, should we set it up [2], or does Apache have one already?
>
> If we can use both, Bintray is better. Because we already
> use Bintray for release and RC packages.
>
> If we use Bintray, we can test our upload script.
>
Yep, that is my intention: upload the artifacts to Bintray via the release
script. Additionally, we could use Artifactory to serve the Bintray binaries
the way PyPI does (somehow we need to make the wheels available for
installing via pip).

>
> We need to remove old nightly packages periodically.
> I think that keeping the last 7 days is enough.
>
> We can do this by just deleting a version for old nightly
> packages on Bintray:
> https://bintray.com/docs/api/#url_delete_version

Sounds good.
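
For concreteness, a rough sketch of that pruning (Python with requests; the
apache/arrow/arrow-nightly subject/repo/package layout, the date-based version
names, and the credentials handling are assumptions, not something we've
decided on):

    import datetime
    import requests

    SUBJECT, REPO, PACKAGE = "apache", "arrow", "arrow-nightly"  # hypothetical layout
    AUTH = ("bintray-user", "bintray-api-key")                   # placeholder credentials

    # Delete nightly versions (named YYYY-MM-DD) older than seven days.
    cutoff = datetime.date.today() - datetime.timedelta(days=7)
    for age in range(1, 31):
        version = str(cutoff - datetime.timedelta(days=age))
        url = (f"https://api.bintray.com/packages/"
               f"{SUBJECT}/{REPO}/{PACKAGE}/versions/{version}")
        resp = requests.delete(url, auth=AUTH)
        if resp.status_code not in (200, 404):  # 404: that day had no nightly
            print(f"failed to delete {version}: {resp.status_code}")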

>
>
> I think that we should create a version such as "2019-02-28"
> for each nightly.
>
 I wouldn't touch the versioning for now.

>
>
> Thanks,
> --
> kou
>
> In 
>   "Nightly binary packages" on Mon, 25 Feb 2019 22:51:29 +0100,
>   Krisztián Szűcs  wrote:
>
> > Hi,
> >
> > We currently have nightly package builds under my GitHub account,
> > which is not really visible. It would be great to make them available
> > for developer purposes, and additionally it would exercise the binary
> > packaging scripts too.
> > The nightly packages are produced the same way as documented in the
> > release management guide, except that they are not uploaded to Bintray.
> >
> > I can set up a cron job to upload the nightly packages to Bintray
> > under `-nightly`-suffixed directories (similar to how `-rc` packages
> > are stored [1]); however, I have a couple of questions:
> > - How should we handle the signing procedure? Simply omit it?
> > - May we host the nightlies under the Apache Bintray account?
> > - Do we want to use JFrog Artifactory over Bintray?
> >   If so, should we set it up [2], or does Apache have one already?
> >
> > Regards, Krisztian
> >
> > [1] https://bintray.com/beta/#/apache/arrow?tab=packages
> > [2] https://jfrog.com/open-source/#artifactory
>


[jira] [Created] (ARROW-4727) [Rust] Implement ability to check if two schemas are the same

2019-03-01 Thread Andy Grove (JIRA)
Andy Grove created ARROW-4727:
-

 Summary: [Rust] Implement ability to check if two schemas are the 
same
 Key: ARROW-4727
 URL: https://issues.apache.org/jira/browse/ARROW-4727
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.12.0
Reporter: Andy Grove
 Fix For: 0.13.0


When creating a RecordBatch it would be desirable to ensure that all batches
have the same schema and that the schema matches the one defined for the
RecordBatch. We currently have no way to compare two schemas to see if they
are equivalent.
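
For reference, the Python bindings already expose this kind of check; the Rust
API would want something analogous (e.g. implementing or deriving PartialEq on
Schema). A small pyarrow sketch of the behaviour being asked for, with
illustrative field names:

{code:python}
import pyarrow as pa

schema_a = pa.schema([pa.field("id", pa.int32()), pa.field("name", pa.string())])
schema_b = pa.schema([pa.field("id", pa.int32()), pa.field("name", pa.string())])
schema_c = pa.schema([pa.field("id", pa.int64())])

assert schema_a.equals(schema_b)       # structurally equal
assert not schema_a.equals(schema_c)   # different fields/types
{code}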



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Uwe L. Korn
Hello Abdeali,

A problem here could be that a single column of your DataFrame is using more
than 2 GB of RAM (possibly even just 1 GB). Try splitting your DataFrame into
more partitions before applying the UDAF.
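
Roughly along these lines (a sketch only; `df`, the column names, and the
grouped-map UDF below are placeholders for whatever your real job does):

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Placeholder grouped-map UDF standing in for the real UDAF.
    @pandas_udf("id long, result double", PandasUDFType.GROUPED_MAP)
    def my_udaf(pdf):
        # pdf is a pandas DataFrame holding one whole group
        return pdf[["id"]].drop_duplicates().assign(result=float(len(pdf)))

    # Split the DataFrame into more partitions before applying the UDAF.
    result = df.repartition(200).groupBy("id").apply(my_udaf)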

Cheers
Uwe

On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> I was using arrow with spark+python and when I'm trying some pandas-UDAF
> functions I am getting this error:
> 
> org.apache.arrow.vector.util.OversizedAllocationException: Unable to 
> expand
> the buffer
> at
> org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
> at
> org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
> at
> org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
> at
> org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
> at
> org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
> at
> org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> at
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
> at
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
> at
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> 
> I was initially getting a RAM is insufficient error - and theoretically
> (with no compression) realized that the pandas DataFrame it would try to
> create would be ~8GB (21million records with each record having ~400
> bytes). I have increased my executor memory to be 20GB per executor, but am
> now getting this error from Arrow.
> Looking for some pointers so I can understand this issue better.
> 
> Here's what I am trying. I have 2 tables with string columns where the
> strings always have a fixed length:
> *Table 1*:
> id: integer
>char_column1: string (length = 30)
>char_column2: string (length = 40)
>char_column3: string (length = 10)
>...
> In total, in table1, the char-columns have ~250 characters
> 
> *Table 2*:
> id: integer
>char_column1: string (length = 50)
>char_column2: string (length = 3)
>char_column3: string (length = 4)
>...
> In total, in table2, the char-columns have ~150 characters
> 
> I am joining these tables by ID. In my current dataset, I have filtered my
> data so only id=1 exists.
> Table1 has ~400 records for id=1 and table2 has 50k records for id=1.
> Hence, total number of records (after joining) for table1_join2 = 400 * 50k
> = 20*10^6 records
> Each row has ~400bytes (150+250) => overall memory = 8*10^9 bytes => ~8GB
> 
> Now, when I try an executor with 20GB RAM, it does not work.
> Is there some data duplicity happening internally ? What should be the
> estimated RAM I need to give for this to work ?
> 
> Thanks for reading,
>


[jira] [Created] (ARROW-4728) [Javascript] Failing test Table#assign with a zero-length Null column round-trips through serialization

2019-03-01 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4728:
-

 Summary: [Javascript] Failing test Table#assign with a zero-length 
Null column round-trips through serialization
 Key: ARROW-4728
 URL: https://issues.apache.org/jira/browse/ARROW-4728
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.12.1
Reporter: Francois Saint-Jacques
 Fix For: 0.13.0


See https://travis-ci.org/apache/arrow/jobs/500414242#L1002
{code:javascript}
  ● Table#serialize() › Table#assign with an empty table round-trips through 
serialization
expect(received).toBe(expected) // Object.is equality
Expected: 86
Received: 41
  91 | const source = table1.assign(Table.empty());
  92 | expect(source.numCols).toBe(table1.numCols);
> 93 | expect(source.length).toBe(table1.length);
 |   ^
  94 | const result = Table.from(source.serialize());
  95 | expect(result).toEqualTable(source);
  96 | expect(result.schema.metadata.get('foo')).toEqual('bar');
  at Object.test (test/unit/table/serialize-tests.ts:93:35)
  ● Table#serialize() › Table#assign with a zero-length Null column round-trips 
through serialization
expect(received).toBe(expected) // Object.is equality
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
Also just created https://issues.apache.org/jira/browse/ARROW-4728

On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura 
wrote:

>
>
> > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou  wrote:
> >
> >
> > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit :
> >>
> >>
> >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou 
> wrote:
> >>>
> >>> On Tue, 26 Feb 2019 13:39:08 -0600
> >>> Wes McKinney  wrote:
>  hi folks,
> 
>  We haven't had a green build on master for about 5 days now (the last
>  one was February 21). Has anyone else been paying attention to this?
>  It seems we should start cataloging which tests and build environments
>  are the most flaky and see if there's anything we can do to reduce the
>  flakiness. Since we are dependent on anaconda.org for build toolchain
>  packages, it's hard to control for the 500 timeouts that occur there,
>  but I'm seeing other kinds of routine flakiness.
> >>>
> >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684 ?
> >>
> >> ARROW-4684 seems to be failing consistently in travis CI.
> >>
> >> Can I merge a change if this is the only CI failure ?
> >
> > Yes, you can.
>
> Thanks !
>
> >
> > Regards
> >
> > Antoine.
>
>


[jira] [Created] (ARROW-4729) [C++] Improve buffer symbolic index

2019-03-01 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4729:
-

 Summary: [C++] Improve buffer symbolic index
 Key: ARROW-4729
 URL: https://issues.apache.org/jira/browse/ARROW-4729
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.12.1
Reporter: Francois Saint-Jacques


The array data `buffers` vector is indexed differently depending on the Array
type. This feature would expose static constexpr named variables for the
buffer indices.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Flaky Travis CI builds on master

2019-03-01 Thread Micah Kornfield
Moving away from the tactical for a minute, I think being able to track
these over time would be useful.  I can think of a couple of high level
approaches and I was wondering what others think.

1.  Use tags appropriately in JIRA and try to generate a report from that (a
sketch of what this could look like follows below).
2.  Create a new Confluence page to try to log each time these occur (and the
root cause).
3.  A separate spreadsheet someplace (e.g. a Google Sheet).

Thoughts?
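
For option 1, something along these lines could produce the report (Python;
the "flaky-test" label name is made up here, we'd substitute whatever label we
agree on):

    import requests

    # Query ASF JIRA for Arrow issues carrying a (hypothetical) "flaky-test" label.
    JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"
    jql = 'project = ARROW AND labels = "flaky-test" ORDER BY created DESC'

    resp = requests.get(JIRA_SEARCH, params={"jql": jql, "fields": "summary,status"})
    resp.raise_for_status()
    for issue in resp.json()["issues"]:
        f = issue["fields"]
        print(f'{issue["key"]}: [{f["status"]["name"]}] {f["summary"]}')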

-Micah


On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Also just created https://issues.apache.org/jira/browse/ARROW-4728
>
> On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura 
> wrote:
>
> >
> >
> > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou 
> wrote:
> > >
> > >
> > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit :
> > >>
> > >>
> > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou 
> > wrote:
> > >>>
> > >>> On Tue, 26 Feb 2019 13:39:08 -0600
> > >>> Wes McKinney  wrote:
> >  hi folks,
> > 
> >  We haven't had a green build on master for about 5 days now (the
> last
> >  one was February 21). Has anyone else been paying attention to this?
> >  It seems we should start cataloging which tests and build
> environments
> >  are the most flaky and see if there's anything we can do to reduce
> the
> >  flakiness. Since we are dependent on anaconda.org for build
> toolchain
> >  packages, it's hard to control for the 500 timeouts that occur
> there,
> >  but I'm seeing other kinds of routine flakiness.
> > >>>
> > >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684 ?
> > >>
> > >> ARROW-4684 seems to be failing consistently in travis CI.
> > >>
> > >> Can I merge a change if this is the only CI failure ?
> > >
> > > Yes, you can.
> >
> > Thanks !
> >
> > >
> > > Regards
> > >
> > > Antoine.
> >
> >
>


Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Is there a limitation that a single column cannot be more than 1-2 GB?
One of my columns definitely would be around 1.5 GB in memory.

I cannot split my DF into more partitions, as I have only one ID and I'm
grouping by that ID, so the UDAF would only run on a single pandas DataFrame.
I do have a requirement to make a very large DF for this UDAF (8 GB, as I
mentioned above) - I'm trying to figure out what I need to do here to make
this work.
Increasing RAM, etc. is no issue (I understand I'd need huge executors, as
I have a huge data requirement). But I'm trying to figure out how much to
actually provision - because 20 GB of RAM for the executor is also erroring
out, where I thought ~10 GB would have been enough.



On Fri, Mar 1, 2019 at 10:25 PM Uwe L. Korn  wrote:

> Hello Abdeali,
>
> a problem could here be that a single column of your dataframe is using
> more than 2GB of RAM (possibly also just 1G). Try splitting your DataFrame
> into more partitions before applying the UDAF.
>
> Cheers
> Uwe
>
> On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> > I was using arrow with spark+python and when I'm trying some pandas-UDAF
> > functions I am getting this error:
> >
> > org.apache.arrow.vector.util.OversizedAllocationException: Unable to
> > expand
> > the buffer
> > at
> >
> org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
> > at
> >
> org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
> > at
> >
> org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
> > at
> >
> org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
> > at
> >
> org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
> > at
> >
> org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
> > at
> >
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
> > at
> >
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> > at
> >
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> > at
> >
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
> > at
> >
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
> > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
> > at
> >
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> >
> > I was initially getting a RAM is insufficient error - and theoretically
> > (with no compression) realized that the pandas DataFrame it would try to
> > create would be ~8GB (21million records with each record having ~400
> > bytes). I have increased my executor memory to be 20GB per executor, but
> am
> > now getting this error from Arrow.
> > Looking for some pointers so I can understand this issue better.
> >
> > Here's what I am trying. I have 2 tables with string columns where the
> > strings always have a fixed length:
> > *Table 1*:
> > id: integer
> >char_column1: string (length = 30)
> >char_column2: string (length = 40)
> >char_column3: string (length = 10)
> >...
> > In total, in table1, the char-columns have ~250 characters
> >
> > *Table 2*:
> > id: integer
> >char_column1: string (length = 50)
> >char_column2: string (length = 3)
> >char_column3: string (length = 4)
> >...
> > In total, in table2, the char-columns have ~150 characters
> >
> > I am joining these tables by ID. In my current dataset, I have filtered
> my
> > data so only id=1 exists.
> > Table1 has ~400 records for id=1 and table2 has 50k records for id=1.
> > Hence, total number of records (after joining) for table1_join2 = 400 *
> 50k
> > = 20*10^6 records
> > Each row has ~400bytes (150+250) => overall memory = 8*10^9 bytes => ~8GB
> >
> > Now, when I try an executor with 20GB RAM, it does not work.
> > Is there some data duplicity happening internally ? What should be the
> > estimated RAM I need to give for this to work ?
> >
> > Thanks for reading,
> >
>


Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Uwe L. Korn
There is currently a limitation that a column in a single RecordBatch can
only hold 2 GB on the Java side. We work around this by splitting the DataFrame
under the hood into multiple RecordBatches. I'm not familiar with the
Spark<->Arrow code, but I guess that in this case the Spark code can only
handle a single RecordBatch.

It is probably best to construct a https://stackoverflow.com/help/mcve and
create an issue with the Spark project. Most likely this is not a bug in Arrow
but just requires a somewhat more complicated implementation around the Arrow
libs.

Still, please have a look at the exact size of your columns. We support 2 GB
per column; if it is only 1.5 GB, then there is probably a rounding error in
Arrow. Alternatively, you might be in luck and the following patch
https://github.com/apache/arrow/commit/bfe6865ba8087a46bd7665679e48af3a77987cef,
which is part of Apache Arrow 0.12, already fixes your problem.
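
One rough way to check those column sizes up front (plain pandas; this
approximates the Arrow data buffer of a string column as the summed UTF-8
length plus 4 offset bytes per value, so it is only an estimate, and the frame
below is a stand-in mirroring the numbers in this thread):

    import pandas as pd

    def approx_arrow_string_column_bytes(series):
        """Rough size of the Arrow data + offset buffers for a string column."""
        data = int(series.dropna().map(lambda s: len(str(s).encode("utf-8"))).sum())
        offsets = 4 * (len(series) + 1)  # int32 offsets, one per value plus one
        return data + offsets

    # Stand-in data: 21 million rows of an 80-character string.
    df = pd.DataFrame({"char_column1": ["x" * 80] * 21_000_000})
    size = approx_arrow_string_column_bytes(df["char_column1"])
    print(f"~{size / 2**30:.2f} GiB (the Java side currently caps a buffer at ~2 GiB)")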

Uwe

On Fri, Mar 1, 2019, at 6:48 PM, Abdeali Kothari wrote:
> Is there a limitation that a single column cannot be more than 1-2G ?
> One of my columns definitely would be around 1.5GB of memory.
> 
> I cannot split my DF into more partitions as I have only 1 ID and I'm
> grouping by that ID.
> So, the UDAF would only run on a single pandasDF
> I do have a requirement to make a very large DF for this UDAF (8GB as i
> mentioned above) - trying to figure out what I need to do here to make this
> work.
> Increasing RAM, etc. is no issue (i understand I'd need huge executors as I
> have a huge data requirement). But trying to figure out how much to
> actually get - cause 20GB of RAM for the executor is also erroring out
> where I thought ~10GB would have been enough
> 
> 
> 
> On Fri, Mar 1, 2019 at 10:25 PM Uwe L. Korn  wrote:
> 
> > Hello Abdeali,
> >
> > a problem could here be that a single column of your dataframe is using
> > more than 2GB of RAM (possibly also just 1G). Try splitting your DataFrame
> > into more partitions before applying the UDAF.
> >
> > Cheers
> > Uwe
> >
> > On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> > > I was using arrow with spark+python and when I'm trying some pandas-UDAF
> > > functions I am getting this error:
> > >
> > > org.apache.arrow.vector.util.OversizedAllocationException: Unable to
> > > expand
> > > the buffer
> > > at
> > >
> > org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
> > > at
> > >
> > org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
> > > at
> > >
> > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
> > > at
> > >
> > org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
> > > at
> > >
> > org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
> > > at
> > >
> > org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:87)
> > > at
> > >
> > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply$mcV$sp(ArrowPythonRunner.scala:84)
> > > at
> > >
> > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> > > at
> > >
> > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2$$anonfun$writeIteratorToStream$1.apply(ArrowPythonRunner.scala:75)
> > > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> > > at
> > >
> > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$2.writeIteratorToStream(ArrowPythonRunner.scala:95)
> > > at
> > >
> > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
> > > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
> > > at
> > >
> > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> > >
> > > I was initially getting a RAM is insufficient error - and theoretically
> > > (with no compression) realized that the pandas DataFrame it would try to
> > > create would be ~8GB (21million records with each record having ~400
> > > bytes). I have increased my executor memory to be 20GB per executor, but
> > am
> > > now getting this error from Arrow.
> > > Looking for some pointers so I can understand this issue better.
> > >
> > > Here's what I am trying. I have 2 tables with string columns where the
> > > strings always have a fixed length:
> > > *Table 1*:
> > > id: integer
> > >char_column1: string (length = 30)
> > >char_column2: string (length = 40)
> > >char_column3: string (length = 10)
> > >...
> > > In total, in table1, the char-columns have ~250 characters
> > >
> > > *Table 2*:
> > > id: integer
> > >char_column1: string (length = 50)
> > >char_column2: string (length = 3)
> > >char_column3: string (length = 4)
> > >...
> > > In total, in table2, the char-columns have ~150 charac

[jira] [Created] (ARROW-4730) [C++] Add docker-compose entry for testing Fedora build with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4730:
--

 Summary: [C++] Add docker-compose entry for testing Fedora build 
with system packages
 Key: ARROW-4730
 URL: https://issues.apache.org/jira/browse/ARROW-4730
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


To better support people on Fedora and also show what is still missing to get
Arrow packaged into Fedora, add an entry to the docker-compose.yml that builds
on Fedora.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4731) [C++] Add docker-compose entry for testing Ubuntu Xenial build with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4731:
--

 Summary: [C++] Add docker-compose entry for testing Ubuntu Xenial 
build with system packages
 Key: ARROW-4731
 URL: https://issues.apache.org/jira/browse/ARROW-4731
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


To better support people on Ubuntu and also show what is still missing to get
Arrow packaged into Ubuntu, add an entry to the docker-compose.yml that builds
on Ubuntu Xenial.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4732) [C++] Add docker-compose entry for testing Debian Testing build with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4732:
--

 Summary: [C++] Add docker-compose entry for testing Debian Testing 
build with system packages
 Key: ARROW-4732
 URL: https://issues.apache.org/jira/browse/ARROW-4732
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


To better support people on Debian and also show what is still missing to get
Arrow packaged into Debian, add an entry to the docker-compose.yml that builds
on Debian Testing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4733) [C++] Add CI entry that builds without the conda-forge toolchain but with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4733:
--

 Summary: [C++] Add CI entry that builds without the conda-forge 
toolchain but with system packages
 Key: ARROW-4733
 URL: https://issues.apache.org/jira/browse/ARROW-4733
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


Instead of using the conda-forge toolchain to provide parts of the dependencies
while compiling with a system compiler, utilise the system packages to build
and test the C++ implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4734) [Go] Add option to write a header for CSV writer

2019-03-01 Thread Anson Qian (JIRA)
Anson Qian created ARROW-4734:
-

 Summary: [Go] Add option to write a header for CSV writer
 Key: ARROW-4734
 URL: https://issues.apache.org/jira/browse/ARROW-4734
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Anson Qian






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4735) [Go] Benchmark strconv.Format vs. fmt.Sprintf for CSV writer

2019-03-01 Thread Anson Qian (JIRA)
Anson Qian created ARROW-4735:
-

 Summary: [Go] Benchmark strconv.Format vs. fmt.Sprintf for CSV 
writer
 Key: ARROW-4735
 URL: https://issues.apache.org/jira/browse/ARROW-4735
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Anson Qian


We need to test strconv.Format{Bool,Float,Int,Uint} instead of fmt.Sprintf and
see if we can improve write performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4736) [Go] Optimize memory usage for CSV writer

2019-03-01 Thread Anson Qian (JIRA)
Anson Qian created ARROW-4736:
-

 Summary: [Go] Optimize memory usage for CSV writer
 Key: ARROW-4736
 URL: https://issues.apache.org/jira/browse/ARROW-4736
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Anson Qian


Perhaps not for this PR, but depending on the number of rows and columns the
record contains, this may be a very large allocation and a very big memory
chunk. It could be more interesting performance-wise to write n rows at a time
instead of everything in one big chunk.

Also, to reduce the memory pressure on the GC, we should probably (re)use the
slice-of-slices of strings.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
That was spot on!
I had 3 columns with 80 characters each => 80 * 21*10^6 ≈ 1.68*10^9 bytes
(~1.56 GiB) per column.
I removed these columns and replaced each with 10 doubleType columns (so it
would still be 80 bytes of data per row) - and this error didn't come up
anymore.
I also removed all the other columns and just kept one column with
80 characters - I got the error again.

I'll make a simpler example and report it to Spark - as I guess these
columns need some special handling.
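
Something like this is the shape of the reproduction I have in mind (Spark 2.x
grouped-map API; the sizes mirror the numbers above and the UDF is a
do-nothing placeholder):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # One group (id=1) of ~21M rows, each with an 80-character string:
    # roughly 80 * 21e6 ≈ 1.68e9 bytes in a single string column for the group.
    df = (spark.range(21_000_000).toDF("row_id")
          .withColumn("id", F.lit(1).cast("long"))
          .withColumn("char_column", F.lit("x" * 80)))

    @pandas_udf("id long, n long", PandasUDFType.GROUPED_MAP)
    def count_rows(pdf):
        return pdf[["id"]].drop_duplicates().assign(n=len(pdf))

    # The whole id=1 group is shipped to Python as a single Arrow record
    # batch, which is where the oversized string buffer shows up.
    df.groupBy("id").apply(count_rows).show()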

Now, when I run - I get a different error:
19/03/01 20:16:49 WARN TaskSetManager: Lost task 108.0 in stage 8.0 (TID
12, ip-172-31-10-249.us-west-2.compute.internal, executor 1):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
line 230, in main
process()
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
line 225, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
line 260, in dump_stream
for series in iterator:
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
line 279, in load_stream
for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in
pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

Again, any pointers on what this means and what it indicates would be
really useful for me.

Thanks for the replies!


On Fri, Mar 1, 2019 at 11:26 PM Uwe L. Korn  wrote:

> There is currently the limitation that a column in a single RecordBatch
> can only hold 2G on the Java side. We work around this by splitting the
> DataFrame under the hood into multiple RecordBatches. I'm not familiar with
> the Spark<->Arrow code but I guess that in this case, the Spark code can
> only handle a single RecordBatch.
>
> Probably it is best to construct a https://stackoverflow.com/help/mcve
> and create an issue with the Spark project. Most likely this is not a bug
> in Arrow but just requires a bit more complicated implementation around the
> Arrow libs.
>
> Still, please have a look at the exact size of your columns. We support 2G
> per column, if it is only 1.5G, then there is probably a rounding error in
> the Arrow. Alternatively, you might also be in luck that the following
> patch
> https://github.com/apache/arrow/commit/bfe6865ba8087a46bd7665679e48af3a77987cef
> which is part of Apache Arrow 0.12 already fixes your problem.
>
> Uwe
>
> On Fri, Mar 1, 2019, at 6:48 PM, Abdeali Kothari wrote:
> > Is there a limitation that a single column cannot be more than 1-2G ?
> > One of my columns definitely would be around 1.5GB of memory.
> >
> > I cannot split my DF into more partitions as I have only 1 ID and I'm
> > grouping by that ID.
> > So, the UDAF would only run on a single pandasDF
> > I do have a requirement to make a very large DF for this UDAF (8GB as i
> > mentioned above) - trying to figure out what I need to do here to make
> this
> > work.
> > Increasing RAM, etc. is no issue (i understand I'd need huge executors
> as I
> > have a huge data requirement). But trying to figure out how much to
> > actually get - cause 20GB of RAM for the executor is also erroring out
> > where I thought ~10GB would have been enough
> >
> >
> >
> > On Fri, Mar 1, 2019 at 10:25 PM Uwe L. Korn  wrote:
> >
> > > Hello Abdeali,
> > >
> > > a problem could here be that a single column of your dataframe is using
> > > more than 2GB of RAM (possibly also just 1G). Try splitting your
> DataFrame
> > > into more partitions before applying the UDAF.
> > >
> > > Cheers
> > > Uwe
> > >
> > > On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
> > > > I was using arrow with spark+python and when I'm trying some
> pandas-UDAF
> > > > functions I am getting this error:
> > > >
> > > > org.apache.arrow.vector.util.OversizedAllocationException: Unable to
> > > > expand
> > > > the buffer
> > > > at
> > > >
> > >
> org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
> > > > at
> > > >
> > >
> org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1188)
> > > > at
> > > >
> > >
> org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026)
> > > > at
> > > >
> > >
> org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:256)
> > > > at
> > > >
> > >
> org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:122)
> > > > at
> > > >
> > >
> org.ap

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Forgot to mention: the above testing was with pyarrow 0.11.1.
I tried 0.12.1 as you suggested - and am still getting the
OversizedAllocationException with the 80-char column, and the "read
length must be positive or -1" error without it. So both issues are
reproducible with pyarrow 0.12.1.

On Sat, Mar 2, 2019 at 1:57 AM Abdeali Kothari 
wrote:

> That was spot on!
> I had 3 columns with 80characters => 80*21*10^6 = 1.56 bytes
> I removed these columns and replaced each with 10 doubleType columns (so
> it would still be 80 bytes of data) - and this error didn't come up anymore.
> I also removed all the other columns and just kept 1 column with
> 80characters - I got the error again.
>
> I'll make a simpler example and report it to spark - as I guess these
> columns would need some special handling.
>
> Now, when I run - I get a different error:
> 19/03/01 20:16:49 WARN TaskSetManager: Lost task 108.0 in stage 8.0 (TID
> 12, ip-172-31-10-249.us-west-2.compute.internal, executor 1):
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
>   File
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
> line 230, in main
> process()
>   File
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
> line 225, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
> line 260, in dump_stream
> for series in iterator:
>   File
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
> line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
>
> Again, any pointers on what this means and what it indicates would be
> really useful for me.
>
> Thanks for the replies!
>
>
> On Fri, Mar 1, 2019 at 11:26 PM Uwe L. Korn  wrote:
>
>> There is currently the limitation that a column in a single RecordBatch
>> can only hold 2G on the Java side. We work around this by splitting the
>> DataFrame under the hood into multiple RecordBatches. I'm not familiar with
>> the Spark<->Arrow code but I guess that in this case, the Spark code can
>> only handle a single RecordBatch.
>>
>> Probably it is best to construct a https://stackoverflow.com/help/mcve
>> and create an issue with the Spark project. Most likely this is not a bug
>> in Arrow but just requires a bit more complicated implementation around the
>> Arrow libs.
>>
>> Still, please have a look at the exact size of your columns. We support
>> 2G per column, if it is only 1.5G, then there is probably a rounding error
>> in the Arrow. Alternatively, you might also be in luck that the following
>> patch
>> https://github.com/apache/arrow/commit/bfe6865ba8087a46bd7665679e48af3a77987cef
>> which is part of Apache Arrow 0.12 already fixes your problem.
>>
>> Uwe
>>
>> On Fri, Mar 1, 2019, at 6:48 PM, Abdeali Kothari wrote:
>> > Is there a limitation that a single column cannot be more than 1-2G ?
>> > One of my columns definitely would be around 1.5GB of memory.
>> >
>> > I cannot split my DF into more partitions as I have only 1 ID and I'm
>> > grouping by that ID.
>> > So, the UDAF would only run on a single pandasDF
>> > I do have a requirement to make a very large DF for this UDAF (8GB as i
>> > mentioned above) - trying to figure out what I need to do here to make
>> this
>> > work.
>> > Increasing RAM, etc. is no issue (i understand I'd need huge executors
>> as I
>> > have a huge data requirement). But trying to figure out how much to
>> > actually get - cause 20GB of RAM for the executor is also erroring out
>> > where I thought ~10GB would have been enough
>> >
>> >
>> >
>> > On Fri, Mar 1, 2019 at 10:25 PM Uwe L. Korn  wrote:
>> >
>> > > Hello Abdeali,
>> > >
>> > > a problem could here be that a single column of your dataframe is
>> using
>> > > more than 2GB of RAM (possibly also just 1G). Try splitting your
>> DataFrame
>> > > into more partitions before applying the UDAF.
>> > >
>> > > Cheers
>> > > Uwe
>> > >
>> > > On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote:
>> > > > I was using arrow with spark+python and when I'm trying some
>> pandas-UDAF
>> > > > functions I am getting this error:
>> > > >
>> > > > org.apache.arrow.vector.util.OversizedAllocationException: Unable to
>> > > > expand
>> > > > the buffer
>> > > > at
>> > > >
>> > >
>> org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
>> > > > 

[jira] [Created] (ARROW-4737) [C#] tests are not running in CI

2019-03-01 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-4737:
---

 Summary: [C#] tests are not running in CI
 Key: ARROW-4737
 URL: https://issues.apache.org/jira/browse/ARROW-4737
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt


 The C# tests are not running in CI because the filtering logic needs to be 
updated.

For example see 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/22671460/job/nk1nn59k5njie720

{quote}Build started
git clone -q https://github.com/apache/arrow.git C:\projects\arrow
git fetch -q origin +refs/pull/3662/merge:
git checkout -qf FETCH_HEAD
Running Install scripts
python ci\detect-changes.py > generated_changes.bat
Affected files: [u'csharp/src/Apache.Arrow/Field.Builder.cs', 
u'csharp/src/Apache.Arrow/Schema.Builder.cs', 
u'csharp/test/Apache.Arrow.Tests/SchemaBuilderTests.cs', 
u'csharp/test/Apache.Arrow.Tests/TypeTests.cs']
Affected topics:
{'c_glib': False,
 'cpp': False,
 'dev': False,
 'docs': False,
 'go': False,
 'integration': False,
 'java': False,
 'js': False,
 'python': False,
 'r': False,
 'ruby': False,
 'rust': False,
 'site': False}
call generated_changes.bat
call ci\appveyor-filter-changes.bat
===
=== No C++ or Python changes, exiting job
===
Build was forcibly terminated
Build success{quote}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
I agree with adding a tag/label for this and even marking the failure as
critical.


On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield 
wrote:

> Moving away from the tactical for a minute, I think being able to track
> these over time would be useful.  I can think of a couple of high level
> approaches and I was wondering what others think.
>
> 1.  Use tags appropriately in JIRA and try to generate a report from that.
> 2.  Create a new confluence page to try to log each time these occur (and
> route cause).
> 3.  A separate spreadsheet someplace (e.g. Google Sheet).
>
> Thoughts?
>
> -Micah
>
>
> On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > Also just created https://issues.apache.org/jira/browse/ARROW-4728
> >
> > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura 
> > wrote:
> >
> > >
> > >
> > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou 
> > wrote:
> > > >
> > > >
> > > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit :
> > > >>
> > > >>
> > > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou 
> > > wrote:
> > > >>>
> > > >>> On Tue, 26 Feb 2019 13:39:08 -0600
> > > >>> Wes McKinney  wrote:
> > >  hi folks,
> > > 
> > >  We haven't had a green build on master for about 5 days now (the
> > last
> > >  one was February 21). Has anyone else been paying attention to
> this?
> > >  It seems we should start cataloging which tests and build
> > environments
> > >  are the most flaky and see if there's anything we can do to reduce
> > the
> > >  flakiness. Since we are dependent on anaconda.org for build
> > toolchain
> > >  packages, it's hard to control for the 500 timeouts that occur
> > there,
> > >  but I'm seeing other kinds of routine flakiness.
> > > >>>
> > > >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684 ?
> > > >>
> > > >> ARROW-4684 seems to be failing consistently in travis CI.
> > > >>
> > > >> Can I merge a change if this is the only CI failure ?
> > > >
> > > > Yes, you can.
> > >
> > > Thanks !
> > >
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
> > >
> >
>


Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Li Jin
The 2 GB limit that Uwe mentioned definitely exists; Spark currently
serializes each group as a single RecordBatch.

The "pyarrow.lib.ArrowIOError: read length must be positive or -1" error is
strange. I think Spark is on an older version of the Java side (0.10 for
Spark 2.4 and 0.8 for Spark 2.3); I forget whether there is binary
incompatibility between those versions and pyarrow 0.12.
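
For what it's worth, the chunking Uwe described is easy to see on the Python
side (a standalone sketch with made-up sizes, not what Spark itself does
internally):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"id": [1] * 1_000_000,
                       "char_column": ["x" * 80] * 1_000_000})

    # Slice the frame so that no single RecordBatch column approaches the
    # Java-side 2 GB buffer limit (the chunk size here is arbitrary).
    rows_per_batch = 100_000
    batches = [
        pa.RecordBatch.from_pandas(df.iloc[i:i + rows_per_batch],
                                   preserve_index=False)
        for i in range(0, len(df), rows_per_batch)
    ]
    print(len(batches), "record batches instead of one oversized batch")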

On Fri, Mar 1, 2019 at 3:32 PM Abdeali Kothari 
wrote:

> Forgot to mention: The above testing is with 0.11.1
> I tried 0.12.1 as you suggested - and am getting the
> OversizedAllocationException with the 80char column. And getting read
> length must be positive or -1 without that. So, both the issues are
> reproducible with pyarrow 0.12.1
>
> On Sat, Mar 2, 2019 at 1:57 AM Abdeali Kothari 
> wrote:
>
> > That was spot on!
> > I had 3 columns with 80characters => 80*21*10^6 = 1.56 bytes
> > I removed these columns and replaced each with 10 doubleType columns (so
> > it would still be 80 bytes of data) - and this error didn't come up
> anymore.
> > I also removed all the other columns and just kept 1 column with
> > 80characters - I got the error again.
> >
> > I'll make a simpler example and report it to spark - as I guess these
> > columns would need some special handling.
> >
> > Now, when I run - I get a different error:
> > 19/03/01 20:16:49 WARN TaskSetManager: Lost task 108.0 in stage 8.0 (TID
> > 12, ip-172-31-10-249.us-west-2.compute.internal, executor 1):
> > org.apache.spark.api.python.PythonException: Traceback (most recent call
> > last):
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
> > line 230, in main
> > process()
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
> > line 225, in process
> > serializer.dump_stream(func(split_index, iterator), outfile)
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
> > line 260, in dump_stream
> > for series in iterator:
> >   File
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
> > line 279, in load_stream
> > for batch in reader:
> >   File "pyarrow/ipc.pxi", line 265, in __iter__
> >   File "pyarrow/ipc.pxi", line 281, in
> > pyarrow.lib._RecordBatchReader.read_next_batch
> >   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowIOError: read length must be positive or -1
> >
> > Again, any pointers on what this means and what it indicates would be
> > really useful for me.
> >
> > Thanks for the replies!
> >
> >
> > On Fri, Mar 1, 2019 at 11:26 PM Uwe L. Korn  wrote:
> >
> >> There is currently the limitation that a column in a single RecordBatch
> >> can only hold 2G on the Java side. We work around this by splitting the
> >> DataFrame under the hood into multiple RecordBatches. I'm not familiar
> with
> >> the Spark<->Arrow code but I guess that in this case, the Spark code can
> >> only handle a single RecordBatch.
> >>
> >> Probably it is best to construct a https://stackoverflow.com/help/mcve
> >> and create an issue with the Spark project. Most likely this is not a
> bug
> >> in Arrow but just requires a bit more complicated implementation around
> the
> >> Arrow libs.
> >>
> >> Still, please have a look at the exact size of your columns. We support
> >> 2G per column, if it is only 1.5G, then there is probably a rounding
> error
> >> in the Arrow. Alternatively, you might also be in luck that the
> following
> >> patch
> >>
> https://github.com/apache/arrow/commit/bfe6865ba8087a46bd7665679e48af3a77987cef
> >> which is part of Apache Arrow 0.12 already fixes your problem.
> >>
> >> Uwe
> >>
> >> On Fri, Mar 1, 2019, at 6:48 PM, Abdeali Kothari wrote:
> >> > Is there a limitation that a single column cannot be more than 1-2G ?
> >> > One of my columns definitely would be around 1.5GB of memory.
> >> >
> >> > I cannot split my DF into more partitions as I have only 1 ID and I'm
> >> > grouping by that ID.
> >> > So, the UDAF would only run on a single pandasDF
> >> > I do have a requirement to make a very large DF for this UDAF (8GB as
> i
> >> > mentioned above) - trying to figure out what I need to do here to make
> >> this
> >> > work.
> >> > Increasing RAM, etc. is no issue (i understand I'd need huge executors
> >> as I
> >> > have a huge data requirement). But trying to figure out how much to
> >> > actually get - cause 20GB of RAM for the executor is also erroring out
> >> > where I thought ~10GB would have been enough
> >> >
> >> >
> >> >
> >> > On Fri, Mar 1, 2019 at 10:25 PM Uwe L. Korn  wrote:
> >> >
> >> > > Hello Abdeali,
> >> > >
> >> > > a problem could here be that a single column of your

Re: Flaky Travis CI builds on master

2019-03-01 Thread Wes McKinney
We could create a page on the wiki that shows all open and resolved
issues relating to unexpected CI / build failures. Would someone like
to give this a go? There are probably many historical issues that can
be tagged with the label

On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques
 wrote:
>
> I agree with adding a tag/label for this and even marking the failure as
> critical.
>
>
> On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield 
> wrote:
>
> > Moving away from the tactical for a minute, I think being able to track
> > these over time would be useful.  I can think of a couple of high level
> > approaches and I was wondering what others think.
> >
> > 1.  Use tags appropriately in JIRA and try to generate a report from that.
> > 2.  Create a new confluence page to try to log each time these occur (and
> > route cause).
> > 3.  A separate spreadsheet someplace (e.g. Google Sheet).
> >
> > Thoughts?
> >
> > -Micah
> >
> >
> > On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> > > Also just created https://issues.apache.org/jira/browse/ARROW-4728
> > >
> > > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura 
> > > wrote:
> > >
> > > >
> > > >
> > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou 
> > > wrote:
> > > > >
> > > > >
> > > > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit :
> > > > >>
> > > > >>
> > > > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou 
> > > > wrote:
> > > > >>>
> > > > >>> On Tue, 26 Feb 2019 13:39:08 -0600
> > > > >>> Wes McKinney  wrote:
> > > >  hi folks,
> > > > 
> > > >  We haven't had a green build on master for about 5 days now (the
> > > last
> > > >  one was February 21). Has anyone else been paying attention to
> > this?
> > > >  It seems we should start cataloging which tests and build
> > > environments
> > > >  are the most flaky and see if there's anything we can do to reduce
> > > the
> > > >  flakiness. Since we are dependent on anaconda.org for build
> > > toolchain
> > > >  packages, it's hard to control for the 500 timeouts that occur
> > > there,
> > > >  but I'm seeing other kinds of routine flakiness.
> > > > >>>
> > > > >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684 ?
> > > > >>
> > > > >> ARROW-4684 seems to be failing consistently in travis CI.
> > > > >>
> > > > >> Can I merge a change if this is the only CI failure ?
> > > > >
> > > > > Yes, you can.
> > > >
> > > > Thanks !
> > > >
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > >
> > > >
> > >
> >
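
If the JIRA-label route (option 1 in Micah's list above) is taken, the report
itself is easy to script against the public JIRA instance. A minimal sketch
using the "jira" Python package; the label name "flaky-test" is only a
placeholder, since no actual label has been agreed on in this thread:

    # Sketch: pull Arrow issues carrying a (placeholder) "flaky-test" label.
    # Substitute whatever label the project actually settles on.
    from jira import JIRA  # pip install jira

    client = JIRA(server="https://issues.apache.org/jira")
    issues = client.search_issues(
        'project = ARROW AND labels = "flaky-test" ORDER BY created DESC',
        maxResults=100,
    )
    for issue in issues:
        print(issue.key, issue.fields.status.name, "-", issue.fields.summary)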


Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
I'll take this.

On Fri, Mar 1, 2019 at 3:55 PM Wes McKinney  wrote:

> We could create a page on the wiki that shows all open and resolved
> issues relating to unexpected CI / build failures. Would someone like
> to give this a go? There are probably many historical issues that can
> be tagged with the label
>
> On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques
>  wrote:
> >
> > I agree with adding a tag/label for this and even marking the failure as
> > critical.
> >
> >
> > On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield 
> > wrote:
> >
> > > Moving away from the tactical for a minute, I think being able to track
> > > these over time would be useful.  I can think of a couple of high level
> > > approaches and I was wondering what others think.
> > >
> > > 1.  Use tags appropriately in JIRA and try to generate a report from
> that.
> > > 2.  Create a new confluence page to try to log each time these occur
> (and
> > > root cause).
> > > 3.  A separate spreadsheet someplace (e.g. Google Sheet).
> > >
> > > Thoughts?
> > >
> > > -Micah
> > >
> > >
> > > On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
> > > fsaintjacq...@gmail.com> wrote:
> > >
> > > > Also just created https://issues.apache.org/jira/browse/ARROW-4728
> > > >
> > > > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura <
> ravin...@dremio.com>
> > > > wrote:
> > > >
> > > > >
> > > > >
> > > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou 
> > > > wrote:
> > > > > >
> > > > > >
> > > > > > On 28/02/2019 at 07:53, Ravindra Pindikura wrote:
> > > > > >>
> > > > > >>
> > > > > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou <
> solip...@pitrou.net>
> > > > > wrote:
> > > > > >>>
> > > > > >>> On Tue, 26 Feb 2019 13:39:08 -0600
> > > > > >>> Wes McKinney  wrote:
> > > > >  hi folks,
> > > > > 
> > > > >  We haven't had a green build on master for about 5 days now
> (the
> > > > last
> > > > >  one was February 21). Has anyone else been paying attention to
> > > this?
> > > > >  It seems we should start cataloging which tests and build
> > > > environments
> > > > >  are the most flaky and see if there's anything we can do to
> reduce
> > > > the
> > > > >  flakiness. Since we are dependent on anaconda.org for build
> > > > toolchain
> > > > >  packages, it's hard to control for the 500 timeouts that occur
> > > > there,
> > > > >  but I'm seeing other kinds of routine flakiness.
> > > > > >>>
> > > > > >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684 ?
> > > > > >>
> > > > > >> ARROW-4684 seems to be failing consistently in travis CI.
> > > > > >>
> > > > > >> Can I merge a change if this is the only CI failure ?
> > > > > >
> > > > > > Yes, you can.
> > > > >
> > > > > Thanks !
> > > > >
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > >
> > > > >
> > > >
> > >
>


[jira] [Created] (ARROW-4738) [JS] NullVector should include a null data buffer

2019-03-01 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4738:
--

 Summary: [JS] NullVector should include a null data buffer
 Key: ARROW-4738
 URL: https://issues.apache.org/jira/browse/ARROW-4738
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


Arrow C++ and pyarrow expect NullVectors to include a null data buffer, so 
ArrowJS should write one into the buffer layout.
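
For anyone cross-checking from the Python side, the layout pyarrow produces for
an all-null column can be inspected directly. A small sketch of that check
(this is only the reference behaviour the JS writer is being compared against,
not a proposed fix):

    # Inspect the buffers pyarrow attaches to an all-null (NullType) array,
    # i.e. the layout the ArrowJS writer needs to match.
    import pyarrow as pa

    arr = pa.array([None, None, None])   # inferred as NullType
    print(arr.type)                      # null
    print(arr.buffers())                 # per-buffer layout of the null array

    # Round-trip through a record batch to compare with what ArrowJS emits
    # for a null column.
    batch = pa.RecordBatch.from_arrays([arr], ["n"])
    print(batch.schema)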



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Format] Redundant information in Time type?

2019-03-01 Thread Wes McKinney
As I recall there might have been the desire to permit 64-bit
representation of SECOND and MILLI time values, but I would opt for
YAGNI (until we actually do) and deprecate this bit width field in
Schema.fbs (we shouldn't outright remove it -- for backwards
compatibility -- unless it's actively causing a problem)

- Wes

On Wed, Feb 27, 2019 at 11:09 PM Micah Kornfield  wrote:
>
> In the Flatbuffers schema, what is the purpose of the bit width in "table
> Time" [1]? Based on the documentation it sounds like the bit width is fully
> determined by TimeUnit. In other cases (e.g. Date) we don't have a similar field.
>
> Thanks,
> Micah
>
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L148
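
For what it's worth, the Python bindings already treat the width as implied by
the unit, which is what makes the extra field look redundant. A quick
illustration (the exact exception type may differ between pyarrow versions):

    # TimeUnit already pins down the physical width at the API level:
    # time32 accepts only second/millisecond, time64 only micro/nanosecond.
    import pyarrow as pa

    print(pa.time32("s"), pa.time32("ms"))    # 32-bit storage types
    print(pa.time64("us"), pa.time64("ns"))   # 64-bit storage types

    try:
        pa.time32("ns")                       # unit/width mismatch is rejected
    except ValueError as exc:
        print("rejected:", exc)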


Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
Could someone give me write/edit access to confluence?

Thank you,
François

On Fri, Mar 1, 2019 at 3:55 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> I'll take this.
>
> On Fri, Mar 1, 2019 at 3:55 PM Wes McKinney  wrote:
>
>> We could create a page on the wiki that shows all open and resolved
>> issues relating to unexpected CI / build failures. Would someone like
>> to give this a go? There are probably many historical issues that can
>> be tagged with the label
>>
>> On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques
>>  wrote:
>> >
>> > I agree with adding a tag/label for this and even marking the failure as
>> > critical.
>> >
>> >
>> > On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield 
>> > wrote:
>> >
>> > > Moving away from the tactical for a minute, I think being able to
>> track
>> > > these over time would be useful.  I can think of a couple of high
>> level
>> > > approaches and I was wondering what others think.
>> > >
>> > > 1.  Use tags appropriately in JIRA and try to generate a report from
>> that.
>> > > 2.  Create a new confluence page to try to log each time these occur
>> (and
>> > > root cause).
>> > > 3.  A separate spreadsheet someplace (e.g. Google Sheet).
>> > >
>> > > Thoughts?
>> > >
>> > > -Micah
>> > >
>> > >
>> > > On Fri, Mar 1, 2019 at 8:55 AM Francois Saint-Jacques <
>> > > fsaintjacq...@gmail.com> wrote:
>> > >
>> > > > Also just created https://issues.apache.org/jira/browse/ARROW-4728
>> > > >
>> > > > On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura <
>> ravin...@dremio.com>
>> > > > wrote:
>> > > >
>> > > > >
>> > > > >
>> > > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou
>> > > > wrote:
>> > > > > >
>> > > > > >
>> > > > > > On 28/02/2019 at 07:53, Ravindra Pindikura wrote:
>> > > > > >>
>> > > > > >>
>> > > > > >>> On Feb 27, 2019, at 1:48 AM, Antoine Pitrou <
>> solip...@pitrou.net>
>> > > > > wrote:
>> > > > > >>>
>> > > > > >>> On Tue, 26 Feb 2019 13:39:08 -0600
>> > > > > >>> Wes McKinney  wrote:
>> > > > >  hi folks,
>> > > > > 
>> > > > >  We haven't had a green build on master for about 5 days now
>> (the
>> > > > last
>> > > > >  one was February 21). Has anyone else been paying attention
>> to
>> > > this?
>> > > > >  It seems we should start cataloging which tests and build
>> > > > environments
>> > > > >  are the most flaky and see if there's anything we can do to
>> reduce
>> > > > the
>> > > > >  flakiness. Since we are dependent on anaconda.org for build
>> > > > toolchain
>> > > > >  packages, it's hard to control for the 500 timeouts that
>> occur
>> > > > there,
>> > > > >  but I'm seeing other kinds of routine flakiness.
>> > > > > >>>
>> > > > > >>> Isn't it https://issues.apache.org/jira/browse/ARROW-4684 ?
>> > > > > >>
>> > > > > >> ARROW-4684 seems to be failing consistently in travis CI.
>> > > > > >>
>> > > > > >> Can I merge a change if this is the only CI failure ?
>> > > > > >
>> > > > > > Yes, you can.
>> > > > >
>> > > > > Thanks !
>> > > > >
>> > > > > >
>> > > > > > Regards
>> > > > > >
>> > > > > > Antoine.
>> > > > >
>> > > > >
>> > > >
>> > >
>>
>


Re: Boost and manylinux CI builds

2019-03-01 Thread Ravindra Pindikura
Thanks Uwe.

For the record (in case someone needs to do it again), these are the steps:

1. Make the change in build_boost.sh

2. Set up an account on quay.io and link it to your GitHub account

3. In quay.io, add a new repository using:

A. Link to GitHub repository push  
B. Trigger the build on changes to a specific branch (e.g. myquay) of the repo (e.g. 
pravindra/arrow)
C. Set Dockerfile location to "/python/manylinux1/Dockerfile-x86_64_base”
D. Set Context location to "/python/manylinux1”

4. Push the change (from step 1) to the branch specified in step 3B.

This should trigger a build on quay.io; the build takes about
2 hrs to finish.

5. Add a tag “latest” to the build after step 4 finishes, and save the URL of the
build (e.g. quay.io/pravindra/arrow_manylinux1_x86_64_base:latest)

6. In your arrow PR,

- include the change from step 1.
- update travis_script_manylinux.sh to point to the location from step 5.

Thanks & regards,
Ravindra.

> On Feb 27, 2019, at 3:02 PM, Uwe L. Korn  wrote:
> 
> Hello Ravindra,
> 
> The simplest thing would be for you to open a pull request; I can then pick this 
> up and push it to my personal fork. Then a new image is built on quay.io. 
> Otherwise, you can also activate quay.io on your fork to get the docker image 
> to build.
> 
> Uwe
> 
> On Wed, Feb 27, 2019, at 8:41 AM, Krisztián Szűcs wrote:
>> Hi Ravindra!
>> 
>> You'll need to rebuild the docker image and change this line accordingly:
>> https://github.com/apache/arrow/blob/master/ci/travis_script_manylinux.sh#L57
>> 
>> On Wed, Feb 27, 2019 at 8:29 AM Ravindra Pindikura 
>> wrote:
>> 
>>> Hi,
>>> 
>>> I added an include for a boost header file in gandiva. This compiles on
>>> Ubuntu/Mac/Windows, but fails in the manylinux CI entry.
>>> 
>>> I’m getting a compilation failure :
>>> 
>>> https://travis-ci.org/apache/arrow/jobs/498718755
>>> /arrow/cpp/src/gandiva/decimal_xlarge.cc:29:44: fatal error:
>>> boost/multiprecision/cpp_int.hpp: No such file or directory
>>> #include "boost/multiprecision/cpp_int.hpp"
>>> ^
>>> compilation terminated.
>>> 
>>> 
>>> 
>>> @xhocy and @kszucs pointed out the manylinux1 image has a very minimal
>>> boost, and doesn’t include the multi precision files. So, the script that
>>> builds boost for manylinux1 needs to be updated for this.
>>> 
>>> 
>>> 
>>> https://github.com/apache/arrow/blob/master/python/manylinux1/scripts/build_boost.sh#L38
>>> 
>>> After making the change, the manylinux1 build still fails with the same error
>>> :(.
>>> 
>>> https://travis-ci.org/apache/arrow/jobs/498847622
>>> 
>>> Looks like the CI run downloads a prebuilt docker image. Do I need to
>>> update the docker image? If yes, can you please point out the instructions
>>> for this ?
>>> 
>>> Thanks & regards,
>>> Ravindra.
>> 



Re: [C++] Help with windows build failure

2019-03-01 Thread Micah Kornfield
Just to finish off this thread: Antoine's advice was spot on (need to pass
Debug and Static to b2).  There was still another build issue with double
precision, but I was able to bypass it by disabling the specific test that was
failing.



On Tue, Feb 26, 2019 at 3:49 AM Antoine Pitrou  wrote:

>
> > On 26/02/2019 at 05:42, Micah Kornfield wrote:
> > The issue I'm blocked on is getting boost installed properly.  I've
> > included all of the steps I've run below, if anyone has some thoughts or
> > the magical script to build and install the appropriate boost libraries
> > appropriate for the Static_Crt_Build i would greatly appreciate it.
> >
> > With a Windows 10 MSVC 2017 VM:
> > Download and install cmake and Anaconda3 via visual installers.
> > Download Boost 1.67 and extract it.
> > Run "Developer Command Prompt for MSVC 2017" from the start menu.
> > 1.  CD to the boost directory
> > 2.  Run: .\bootstrap.bat
> > 3.  Run: .\b2.exe
> > 4.  run: .\b2.exe install
>
> Does this also build the libraries in debug mode?
> According to
> https://www.boost.org/doc/libs/1_69_0/more/getting_started/windows.html,
> you can "choose a specific build variant by adding release or debug to
> the command line".
>
> Regards
>
> Antoine.
>


Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Hi Li Jin, thanks for the note.

I get this error only for larger data - when I reduce the number of records
or the number of columns in my data it all works fine - so if it is binary
incompatibility it should be something related to large data.
I am using Spark 2.3.1 on Amazon EMR for this testing.
https://github.com/apache/spark/blob/v2.3.1/pom.xml#L192 seems to indicate
arrow version is 0.8 for this.

I installed pyarrow-0.8.0 in the python environment on my cluster with pip
and I am still getting this error.
The stacktrace is very similar, just some lines moved in the pxi files:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most
recent call last):
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/worker.py",
line 230, in main
process()
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/worker.py",
line 225, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/serializers.py",
line 260, in dump_stream
for series in iterator:
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0018/container_1551469777576_0018_01_02/pyspark.zip/pyspark/serializers.py",
line 279, in load_stream
for batch in reader:
  File "pyarrow/ipc.pxi", line 268, in __iter__
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:70278)
  File "pyarrow/ipc.pxi", line 284, in
pyarrow.lib._RecordBatchReader.read_next_batch
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:70534)
  File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8345)
pyarrow.lib.ArrowIOError: read length must be positive or -1

Other notes:
 - My data is just integers, strings, and doubles. No complex types like
arrays/maps/etc.
 - I don't have any NULL/None values in my data
 - Increasing executor-memory for spark does not seem to help here

As always: any thoughts or notes would be great so I can get some pointers
in which direction to debug.
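
One hypothesis worth checking, consistent with Li Jin's note (quoted below)
that Spark currently serializes each group as a single RecordBatch: the id=1
group here is roughly 8GB, so if any length field on the Java-to-Python stream
path is a signed 32-bit integer it would wrap to a negative value, which is
what "read length must be positive or -1" complains about. A quick sketch of
that arithmetic (a guess at the mechanism, not a confirmed root cause):

    # Take the ~8GB single-group DataFrame mentioned earlier in this thread
    # and show how its size reads back through a signed 32-bit length field.
    # (Hypothetical mechanism - not a confirmed root cause.)
    import struct

    group_bytes = 8_000_000_000      # ~8GB for the one id=1 group
    int32_max = 2**31 - 1

    print(f"group size ~ {group_bytes / 2**30:.1f} GiB, "
          f"int32 max ~ {int32_max / 2**30:.1f} GiB")

    # Reinterpret the low 32 bits as a signed int, as a naive reader would:
    wrapped = struct.unpack("<i", struct.pack("<I", group_bytes & 0xFFFFFFFF))[0]
    print(f"length seen through a signed 32-bit field: {wrapped}")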



On Sat, Mar 2, 2019 at 2:24 AM Li Jin  wrote:

> The 2G limit that Uwe mentioned definitely exists; Spark serializes each
> group as a single RecordBatch currently.
>
> The "pyarrow.lib.ArrowIOError: read length must be positive or -1" is
> strange, I think Spark is on an older version of the Java side (0.10 for
> Spark 2.4 and 0.8 for Spark 2.3). I forgot whether there is binary
> incompatibility between these versions and pyarrow 0.12.
>
> On Fri, Mar 1, 2019 at 3:32 PM Abdeali Kothari 
> wrote:
>
> > Forgot to mention: The above testing is with 0.11.1
> > I tried 0.12.1 as you suggested - and am getting the
> > OversizedAllocationException with the 80char column. And getting read
> > length must be positive or -1 without that. So, both the issues are
> > reproducible with pyarrow 0.12.1
> >
> > > On Sat, Mar 2, 2019 at 1:57 AM Abdeali Kothari
> > wrote:
> >
> > > That was spot on!
> > > I had 3 columns with 80 characters => 80*21*10^6 bytes ≈ 1.56 GiB
> > > I removed these columns and replaced each with 10 doubleType columns
> (so
> > > it would still be 80 bytes of data) - and this error didn't come up
> > anymore.
> > > I also removed all the other columns and just kept 1 column with
> > > 80characters - I got the error again.
> > >
> > > I'll make a simpler example and report it to spark - as I guess these
> > > columns would need some special handling.
> > >
> > > Now, when I run - I get a different error:
> > > 19/03/01 20:16:49 WARN TaskSetManager: Lost task 108.0 in stage 8.0
> (TID
> > > 12, ip-172-31-10-249.us-west-2.compute.internal, executor 1):
> > > org.apache.spark.api.python.PythonException: Traceback (most recent
> call
> > > last):
> > >   File
> > >
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
> > > line 230, in main
> > > process()
> > >   File
> > >
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/worker.py",
> > > line 225, in process
> > > serializer.dump_stream(func(split_index, iterator), outfile)
> > >   File
> > >
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
> > > line 260, in dump_stream
> > > for series in iterator:
> > >   File
> > >
> >
> "/mnt/yarn/usercache/hadoop/appcache/application_1551469777576_0010/container_1551469777576_0010_01_02/pyspark.zip/pyspark/serializers.py",
> > > line 279, in load_stream
> > > for batch in reader:
> > >   File "pyarrow/ipc.pxi", line 265, in __iter__
> > >   File "pyarrow/ipc.pxi", line 281, in
> > > pyarrow.lib._RecordBatchRead