Re: [C++][Discuss] Approaches for SIMD optimizations

2020-06-11 Thread Micah Kornfield
Hi Frank,
Are the performance numbers you published for the baseline directly from
master?  I'd like to look at this over the next few days to see if I can
figure out what is going on.

To all:
I'd like to make sure we flesh out the things to consider in general, for a
path forward.

My take on this is we should still prefer writing code in this order:
1.  Plain-old C++
2.  SIMD wrapper library (my preference would be toward something that is
going to be standardized eventually, to limit third-party dependencies.  I think
the counter argument here is if any of the libraries mentioned above has much
better feature coverage of advanced instruction sets).  Please chime in if
there are other things to consider.  We should have some rubric for when
to make use of the library (i.e. what performance gain we get on a
workload); a small sketch contrasting options 1 and 2 follows after this list.
3.  Native CPU intrinsics.  We should develop a rubric for when to accept
PRs for this.  This should include:
   1.  Performance gain.
   2.  General popularity of the architecture.
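
To make the trade-off between 1 and 2 concrete, here is a minimal sketch of
the same sum kernel written both ways.  This assumes the Parallelism TS v2
<experimental/simd> wrapper (shipped with recent libstdc++, also available via
the standalone std-simd/Vc implementations); the function names are
illustrative only, not existing Arrow kernels:

#include <cstdint>
#include <experimental/simd>

namespace stdx = std::experimental;

// 1. Plain C++: left entirely to the auto-vectorizer.
float SumPlain(const float* values, int64_t n) {
  float sum = 0.0f;
  for (int64_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}

// 2. SIMD wrapper: explicit vector accumulator, still portable C++.
float SumWrapper(const float* values, int64_t n) {
  using V = stdx::native_simd<float>;
  V acc = 0.0f;
  int64_t i = 0;
  for (; i + static_cast<int64_t>(V::size()) <= n; i += V::size()) {
    acc += V(&values[i], stdx::element_aligned);  // unaligned vector load
  }
  float sum = stdx::reduce(acc);                  // horizontal sum of the lanes
  for (; i < n; ++i) sum += values[i];            // scalar tail
  return sum;
}

In principle, 2 stays about as readable as 1 while making the vectorization
explicit (the accumulator is a vector, so it does not depend on the
auto-vectorizer reassociating float additions), and it retargets to wider
registers with a rebuild.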

For dynamic dispatch:
I think we should probably continue down the path of building our own.  I
looked more at libsimdpp's implementation and it might be something we can
use for guidance, but as it stands, it doesn't seem to have hooks based on
CPU manufacturer, which would be a requirement for BMI2 intrinsics (a rough
sketch of what such a hook could look like is below).  The alternative would
be to ban BMI2 intrinsics from the code (which might not be a bad idea to
limit complexity in general).
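
For illustration only, here is a minimal sketch of what a vendor-aware hook
could look like, using GCC/Clang's __builtin_cpu_* helpers.  The kernel names
are hypothetical (a PEXT-style bit extraction stands in for any BMI2-dependent
primitive); this is not code from the PR:

#include <cstdint>
#include <cstdio>
#include <immintrin.h>

// Portable fallback with the same contract as PEXT: pack the bits of `word`
// selected by `mask` into the low bits of the result.
static uint64_t ExtractBitsScalar(uint64_t word, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t out_bit = 1; mask != 0; mask &= mask - 1) {
    if (word & (mask & -mask)) result |= out_bit;  // test lowest remaining mask bit
    out_bit <<= 1;
  }
  return result;
}

// BMI2 variant; the target attribute lets us use the intrinsic without
// compiling the whole translation unit with -mbmi2.
__attribute__((target("bmi2")))
static uint64_t ExtractBitsBmi2(uint64_t word, uint64_t mask) {
  return _pext_u64(word, mask);
}

using ExtractBitsFn = uint64_t (*)(uint64_t, uint64_t);

// Resolve once at startup.  PEXT is architecturally available whenever BMI2
// is reported, but it is microcoded and very slow on pre-Zen-3 AMD parts,
// which is why the vendor check matters on top of the feature flag.
static ExtractBitsFn ResolveExtractBits() {
  __builtin_cpu_init();
  if (__builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel")) {
    return ExtractBitsBmi2;
  }
  return ExtractBitsScalar;
}

static const ExtractBitsFn kExtractBits = ResolveExtractBits();

int main() {
  uint64_t r = kExtractBits(0xF0F0F0F0ULL, 0xFF00FF00ULL);
  std::printf("%llx\n", static_cast<unsigned long long>(r));  // prints f0f0
  return 0;
}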

Thoughts?

Thanks,
Micah

On Wed, Jun 10, 2020 at 8:35 PM Du, Frank  wrote:

> Thanks Jed.
>
> I collected some data on my setup: gcc 7.5.0, Ubuntu 18.04.4 LTS, SSE
> build (-msse4.2).
>
> [Unroll baseline]
> for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
>   for (int64_t k = 0; k < kRoundFactor; k++) {
>     sum_rounded[k] += values[i + k];
>   }
> }
> SumKernelFloat/32768/0     2.91 us    2.90 us    239992
> bytes_per_second=10.5063G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0    1.89 us    1.89 us    374470
> bytes_per_second=16.1847G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0      11.6 us    11.6 us    60329
> bytes_per_second=2.63274G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0     6.98 us    6.98 us    100293
> bytes_per_second=4.3737G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0     3.89 us    3.88 us    180423
> bytes_per_second=7.85862G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0     1.86 us    1.85 us    380477
> bytes_per_second=16.4536G/s null_percent=0 size=32.768k
>
> [#pragma omp simd reduction(+:sum)]
> #pragma omp simd reduction(+:sum)
> for (int64_t i = 0; i < n; i++)
>     sum += values[i];
> SumKernelFloat/32768/0     2.97 us    2.96 us    235686
> bytes_per_second=10.294G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0    2.97 us    2.97 us    236456
> bytes_per_second=10.2875G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0      11.7 us    11.7 us    60006
> bytes_per_second=2.61643G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0     5.47 us    5.47 us    127999
> bytes_per_second=5.58002G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0     2.42 us    2.41 us    290635
> bytes_per_second=12.6485G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0     1.82 us    1.82 us    386749
> bytes_per_second=16.7733G/s null_percent=0 size=32.768k
>
> [SSE intrinsic]
> SumKernelFloat/32768/0     2.24 us    2.24 us    310914
> bytes_per_second=13.6335G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0    1.43 us    1.43 us    486642
> bytes_per_second=21.3266G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0      6.93 us    6.92 us    100720
> bytes_per_second=4.41046G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0     3.14 us    3.14 us    222803
> bytes_per_second=9.72931G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0     2.11 us    2.11 us    331388
> bytes_per_second=14.4907G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0     1.32 us    1.32 us    532964
> bytes_per_second=23.0728G/s null_percent=0 size=32.768k
>
> I tried tweaking the kRoundFactor, using some unrolling based on omp simd,
> and building with clang-8, but unfortunately I could never get the results
> up to the intrinsic version. The generated assembly all uses SIMD
> instructions, with only small differences such as the instruction sequences
> or the xmm registers used. What the compiler does under the hood is still a
> mystery to me.
>
> Thanks,
> Frank
>
> -Original Message-
> From: Jed Brown 
> Sent: Thursday, June 11, 2020 1:58 AM
> To: Du, Frank ; dev@arrow.apache.org
> Subject: RE: [C++][Discuss] Approaches for SIMD optimizations
>
> "Du, Frank"  writes:
>
> > The PR I committed provides basic support for runtime dispatching. I
> > agree that the compiler should generate good vec

Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread Fan Liya
Dear all,

I want to thank you all for all your kind help.
It is a great honor to work with you in this great community.
I hope we can contribute more and make the community better.

Best,
Liya Fan

On Fri, Jun 12, 2020 at 12:02 PM Ji Liu  wrote:

> Thanks everyone for the warm welcome!
> It's a great honor for me to be a committer. Looking forward to
> contributing more to the community.
>
> Thanks,
> Ji Liu
>
>
> paddy horan  wrote on Fri, Jun 12, 2020 at 8:52 AM:
>
> > Congrats!
> >
> > 
> > From: Micah Kornfield 
> > Sent: Thursday, June 11, 2020 12:59:32 PM
> > To: dev 
> > Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
> >
> > Congratulations!
> >
> > On Thu, Jun 11, 2020 at 9:32 AM David Li  wrote:
> >
> > > Congrats Ji  & Liya!
> > >
> > > David
> > >
> > > On 6/11/20, siddharth teotia  wrote:
> > > > Congratulations!
> > > >
> > > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > > > 
> > > > wrote:
> > > >
> > > >> Congratulations, both!
> > > >>
> > > >> Neal
> > > >>
> > > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney 
> > > wrote:
> > > >>
> > > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and
> > Liya
> > > >> > Fan have been invited to be Arrow committers and they have both
> > > >> > accepted.
> > > >> >
> > > >> > Welcome, and thank you for your contributions!
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > *Best Regards,*
> > > > *SIDDHARTH TEOTIA*
> > > > *2008C6PS540G*
> > > > *BITS PILANI- GOA CAMPUS*
> > > >
> > > > *+91 87911 75932*
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread Ji Liu
Thanks everyone for the warm welcome!
It's a great honor for me to be a committer. Looking forward to
contributing more to the community.

Thanks,
Ji Liu


paddy horan  wrote on Fri, Jun 12, 2020 at 8:52 AM:

> Congrats!
>
> 
> From: Micah Kornfield 
> Sent: Thursday, June 11, 2020 12:59:32 PM
> To: dev 
> Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
>
> Congratulations!
>
> On Thu, Jun 11, 2020 at 9:32 AM David Li  wrote:
>
> > Congrats Ji  & Liya!
> >
> > David
> >
> > On 6/11/20, siddharth teotia  wrote:
> > > Congratulations!
> > >
> > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > > 
> > > wrote:
> > >
> > >> Congratulations, both!
> > >>
> > >> Neal
> > >>
> > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney 
> > wrote:
> > >>
> > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and
> Liya
> > >> > Fan have been invited to be Arrow committers and they have both
> > >> > accepted.
> > >> >
> > >> > Welcome, and thank you for your contributions!
> > >> >
> > >>
> > >
> > >
> > > --
> > > *Best Regards,*
> > > *SIDDHARTH TEOTIA*
> > > *2008C6PS540G*
> > > *BITS PILANI- GOA CAMPUS*
> > >
> > > *+91 87911 75932*
> > >
> >
>


Re: Help with Java PR backlog

2020-06-11 Thread Fan Liya
I would like to help with the review.
I will spend some time on it late today.

Best,
Liya Fan


On Fri, Jun 12, 2020 at 9:56 AM Wes McKinney  wrote:

> hi folks,
>
> There are a number of Java PRs that seem like they are close to being in
> a merge-ready state; could we try to get the Java backlog mostly
> closed out before the next release (in a few weeks)?
>
> Thanks
> Wes
>


Help with Java PR backlog

2020-06-11 Thread Wes McKinney
hi folks,

There are a number of Java PRs that seem like they are close to being in
a merge-ready state; could we try to get the Java backlog mostly
closed out before the next release (in a few weeks)?

Thanks
Wes


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread paddy horan
Congrats!


From: Micah Kornfield 
Sent: Thursday, June 11, 2020 12:59:32 PM
To: dev 
Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

Congratulations!

On Thu, Jun 11, 2020 at 9:32 AM David Li  wrote:

> Congrats Ji  & Liya!
>
> David
>
> On 6/11/20, siddharth teotia  wrote:
> > Congratulations!
> >
> > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > 
> > wrote:
> >
> >> Congratulations, both!
> >>
> >> Neal
> >>
> >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney 
> wrote:
> >>
> >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
> >> > Fan have been invited to be Arrow committers and they have both
> >> > accepted.
> >> >
> >> > Welcome, and thank you for your contributions!
> >> >
> >>
> >
> >
> > --
> > *Best Regards,*
> > *SIDDHARTH TEOTIA*
> > *2008C6PS540G*
> > *BITS PILANI- GOA CAMPUS*
> >
> > *+91 87911 75932*
> >
>


[jira] [Created] (ARROW-9112) [R] Update autobrew script location

2020-06-11 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9112:
--

 Summary: [R] Update autobrew script location
 Key: ARROW-9112
 URL: https://issues.apache.org/jira/browse/ARROW-9112
 Project: Apache Arrow
  Issue Type: Task
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Jeroen is moving it to a different location.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9111) csv.read_csv progress bar

2020-06-11 Thread Jeff Hammerbacher (Jira)
Jeff Hammerbacher created ARROW-9111:


 Summary: csv.read_csv progress bar
 Key: ARROW-9111
 URL: https://issues.apache.org/jira/browse/ARROW-9111
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.17.1
Reporter: Jeff Hammerbacher


When reading a very large csv file, it would be nice to see some diagnostic
output from pyarrow.
[readr|https://readr.tidyverse.org/reference/read_delim.html] has a
`progress` parameter, for example. [tqdm|https://github.com/tqdm/tqdm] is
often used in the Python community to provide this functionality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9110) [C++] Fix CPU cache size detection on macOS

2020-06-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9110:
--

 Summary: [C++] Fix CPU cache size detection on macOS
 Key: ARROW-9110
 URL: https://issues.apache.org/jira/browse/ARROW-9110
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Running certain benchmarks on macOS never finishes because CpuInfo detects the RAM
size as the size of the L1 cache.
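
For reference, a minimal sketch of what explicit cache-size detection could
look like on macOS, via sysctlbyname (this is an assumption about the shape of
the fix, not the actual patch):

```
#include <sys/sysctl.h>
#include <cstdint>
#include <cstdio>

// Query an integer sysctl; returns 0 if the key is unavailable.
static int64_t GetSysctlInt64(const char* name) {
  int64_t value = 0;
  size_t len = sizeof(value);
  if (sysctlbyname(name, &value, &len, nullptr, 0) != 0) return 0;
  return value;
}

int main() {
  std::printf("L1d: %lld bytes\n", static_cast<long long>(GetSysctlInt64("hw.l1dcachesize")));
  std::printf("L2:  %lld bytes\n", static_cast<long long>(GetSysctlInt64("hw.l2cachesize")));
  std::printf("L3:  %lld bytes\n", static_cast<long long>(GetSysctlInt64("hw.l3cachesize")));
  return 0;
}
```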



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9109) [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9109:
-

 Summary: [Python][Packaging] Enable S3 support in manylinux wheels
 Key: ARROW-9109
 URL: https://issues.apache.org/jira/browse/ARROW-9109
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Packaging, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread Micah Kornfield
Congratulations!

On Thu, Jun 11, 2020 at 9:32 AM David Li  wrote:

> Congrats Ji  & Liya!
>
> David
>
> On 6/11/20, siddharth teotia  wrote:
> > Congratulations!
> >
> > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > 
> > wrote:
> >
> >> Congratulations, both!
> >>
> >> Neal
> >>
> >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney 
> wrote:
> >>
> >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
> >> > Fan have been invited to be Arrow committers and they have both
> >> > accepted.
> >> >
> >> > Welcome, and thank you for your contributions!
> >> >
> >>
> >
> >
> > --
> > *Best Regards,*
> > *SIDDHARTH TEOTIA*
> > *2008C6PS540G*
> > *BITS PILANI- GOA CAMPUS*
> >
> > *+91 87911 75932*
> >
>


Re: [DISCUSS] Move JIRA notifications to separate mailing list?

2020-06-11 Thread Wes McKinney
jira@ was just created, so I think you can go ahead and request the
notification scheme changes. Thanks for doing this.

On Wed, Jun 10, 2020, 11:14 PM Wes McKinney  wrote:

> I just requested jira@arrow.a.o be created
>
> On Wed, Jun 10, 2020, 11:06 PM Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
>
>> I like it. That has some symmetry with how we handle github notifications:
>> github@ for everything, commits@ for just commits.
>>
>> I looked into creating new mailing lists, and it appears that it's not
>> available to all PMC members, only the PMC chair:
>> https://selfserve.apache.org/mail.html. But once that is done, I'm happy to
>> handle working with INFRA on updating what goes where.
>>
>> Neal
>>
>>
>> On Wed, Jun 10, 2020 at 8:16 PM Wes McKinney  wrote:
>>
>> > Here's my proposal:
>> >
>> > * Create new mailing list j...@arrow.apache.org and move all JIRA
>> > activity to that list (what currently goes to issues@)
>> > * Send new issues notifications to issues@arrow.a.o. Stop sending
>> > these e-mails to dev@
>> > * Encourage dev@ subscribers to subscribe to issues@arrow.a.o
>> >
>> > Absent dissent I would suggest going ahead and asking INFRA to do
>> > this. Note that any PMC member can create the new jira@ mailing list
>> > (do this first, don't ask INFRA to do it)
>> >
>> > On Mon, Jun 8, 2020 at 2:33 PM Wes McKinney 
>> wrote:
>> > >
>> > > I'm openly not very sympathetic toward people who don't take time to
>> > > set up e-mail filters but I support having two e-mail lists:
>> > >
>> > > * One having new issues only. I think that active developers need to
>> > > see new issues to create awareness of what others are doing in the
>> > > project, so I think we should really encourage people to subscribe to
>> > > this list (and set up an e-mail filter if they don't want the e-mails
>> > > coming into their inbox). While I think having less "noise" on dev@ is
>> > > a good thing (even though it's only "noise" if you don't set up e-mail
>> > > filters) I'm concerned that this action will decrease developer
>> > > engagement in the project. There are of course other ways [1] to
>> > > subscribe to the JIRA activity feed if getting notifications in Slack
>> > > or Zulip is your thing.
>> > > * One having all JIRA traffic (i.e. what is currently at
>> > > https://lists.apache.org/list.html?iss...@arrow.apache.org)
>> > >
>> > > [1]: https://github.com/ursa-labs/jira-zulip-bridge
>> > >
>> > > > On Mon, Jun 8, 2020 at 1:57 PM Antoine Pitrou  wrote:
>> > > >
>> > > >
>> > > > I would welcome a separate list, but only with notifications of new
>> > > > JIRA issues.  I am not interested in generic JIRA traffic.
>> > > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> > > >
>> > > >
>> > > > On 08/06/2020 at 20:46, Neal Richardson wrote:
>> > > > > And if you're like me, and this message got filtered out of your
>> > > > > inbox because it is from dev@ and contains "JIRA" in the subject,
>> > > > > well, maybe that demonstrates the problem ;)
>> > > > >
>> > > > > On Mon, Jun 8, 2020 at 11:43 AM Neal Richardson <
>> > > > > neal.p.richard...@gmail.com> wrote:
>> > > > >
>> > > > >> Hi all,
>> > > > >> I've noticed that some other Apache projects have a separate
>> > > > >> mailing list for JIRA notifications (Spark, for example, has
>> > > > >> iss...@spark.apache.org). The result is that the dev@ mailing list
>> > > > >> is focused on actual discussion threads (like this!), votes, and
>> > > > >> other official business. Would we be interested in doing the same?
>> > > > >>
>> > > > >> In my opinion, the status quo is not great. The dev@ archives (
>> > > > >> https://lists.apache.org/list.html?dev@arrow.apache.org) aren't
>> > > > >> that readable/browseable to me, and if I want to see what's going
>> > > > >> on in JIRA, I go to JIRA. In fact, the first thing I/we recommend
>> > > > >> to people signing up for the mailing list is to set up email
>> > > > >> filters to exclude the JIRA noise. Having a separate mailing list
>> > > > >> will make it easier for people to manage their own information
>> > > > >> streams.
>> > > > >>
>> > > > >> The counterargument is that moving JIRA traffic to a separate
>> > > > >> mailing list, requiring an additional subscribe action, might mean
>> > > > >> that developers miss out on things like new issues being created.
>> > > > >> I'm not personally worried about this because I suspect that many
>> > > > >> of us already aren't using the mailing list to stay on top of JIRA
>> > > > >> issues, and that those who want the JIRA stream in their email can
>> > > > >> easily opt-in (subscribe). But I'm interested in the community's
>> > > > >> opinions on this.
>> > > > >>
>> > > > >> Thoughts?
>> > > > >>
>> > > > >> Neal
>> > > > >>
>> > > > >
>> >
>>
>


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread David Li
Congrats Ji  & Liya!

David

On 6/11/20, siddharth teotia  wrote:
> Congratulations!
>
> On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> 
> wrote:
>
>> Congratulations, both!
>>
>> Neal
>>
>> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney  wrote:
>>
>> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
>> > Fan have been invited to be Arrow committers and they have both
>> > accepted.
>> >
>> > Welcome, and thank you for your contributions!
>> >
>>
>
>
> --
> *Best Regards,*
> *SIDDHARTH TEOTIA*
> *2008C6PS540G*
> *BITS PILANI- GOA CAMPUS*
>
> *+91 87911 75932*
>


[jira] [Created] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9108:
-

 Summary: [C++][Dataset] Add Parquet Statistics conversion for 
timestamp columns
 Key: ARROW-9108
 URL: https://issues.apache.org/jira/browse/ARROW-9108
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9107) [C++][Dataset] Time-based types support

2020-06-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9107:
-

 Summary: [C++][Dataset] Time-based types support
 Key: ARROW-9107
 URL: https://issues.apache.org/jira/browse/ARROW-9107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


We lack support for date/timestamp partitions and the corresponding predicate
pushdown rules. Timestamp columns are usually the most important predicate in
OLAP-style queries, so we need to support this transparently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9106) [C++] Add C++ foundation to ease file transcoding

2020-06-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9106:
-

 Summary: [C++] Add C++ foundation to ease file transcoding
 Key: ARROW-9106
 URL: https://issues.apache.org/jira/browse/ARROW-9106
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


In some situations (e.g. reading a Windows-produced CSV file), the user might
need to transcode data before ingesting it into Arrow. Rather than building
transcoding into C++ (which would require a library of encodings), we could
delegate it to the bindings as needed, by providing a generic InputStream
facility.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: python plasma client get_buffers behavior

2020-06-11 Thread saurabh pratap singh
Sorry, typo in the above email.
I have to call del python_object to make sure that the underlying
PlasmaBuffer is released from plasma, so that the release is not delayed
until python_object is reassigned on the next iteration of the poll loop.

On Thu, Jun 11, 2020 at 8:49 PM saurabh pratap singh <
saurabh.cs...@gmail.com> wrote:

> Figured it out I have to call del obj to make sure the underlying
> PlasmaBuffer is called.
>
> On Wed, Jun 10, 2020 at 9:17 PM saurabh pratap singh <
> saurabh.cs...@gmail.com> wrote:
>
>> Hi
>>
>> We are using the python plasma client to do a get_buffers for arrow tables
>> created by java in plasma.
>>
>> The python plasma client basically polls on a queue and does a get_buffers
>> on the object ids returned from the queue.
>> What I have observed, in the context of the plasma object table entries for
>> those object ids, is that get_buffers will first increment the ref count
>> by 1 and then there is an implicit release call which decreases the ref
>> count again.
>>
>> But when there are no more entries in the queue, I see that a few object ids
>> still have a lingering reference count in plasma w.r.t. get_buffers, and
>> there was no "implicit" release for that get call like the previous ones.
>>
>> Is this expected?
>> Is there any way I can handle this and make an explicit release for such
>> object ids as well?
>>
>> Thanks
>>
>


Re: python plasma client get_buffers behavior

2020-06-11 Thread saurabh pratap singh
Figured it out I have to call del obj to make sure the underlying
PlasmaBuffer is called.

On Wed, Jun 10, 2020 at 9:17 PM saurabh pratap singh <
saurabh.cs...@gmail.com> wrote:

> Hi
>
> We are using the python plasma client to do a get_buffers for arrow tables
> created by java in plasma.
>
> The python plasma client basically polls on a queue and does a get_buffers
> on the object ids returned from the queue.
> What I have observed, in the context of the plasma object table entries for
> those object ids, is that get_buffers will first increment the ref count
> by 1 and then there is an implicit release call which decreases the ref
> count again.
>
> But when there are no more entries in the queue, I see that a few object ids
> still have a lingering reference count in plasma w.r.t. get_buffers, and
> there was no "implicit" release for that get call like the previous ones.
>
> Is this expected?
> Is there any way I can handle this and make an explicit release for such
> object ids as well?
>
> Thanks
>


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread siddharth teotia
Congratulations!

On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson 
wrote:

> Congratulations, both!
>
> Neal
>
> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney  wrote:
>
> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
> > Fan have been invited to be Arrow committers and they have both
> > accepted.
> >
> > Welcome, and thank you for your contributions!
> >
>


-- 
*Best Regards,*
*SIDDHARTH TEOTIA*
*2008C6PS540G*
*BITS PILANI- GOA CAMPUS*

*+91 87911 75932*


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread Neal Richardson
Congratulations, both!

Neal

On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney  wrote:

> On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
> Fan have been invited to be Arrow committers and they have both
> accepted.
>
> Welcome, and thank you for your contributions!
>


[ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread Wes McKinney
On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
Fan have been invited to be Arrow committers and they have both
accepted.

Welcome, and thank you for your contributions!


[jira] [Created] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field

2020-06-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9105:


 Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle 
filter on partition field
 Key: ARROW-9105
 URL: https://issues.apache.org/jira/browse/ARROW-9105
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When splitting a fragment into row group fragments, filtering on the partition 
field raises an error.

Python reproducer:

```
df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols="part", 
engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_partitioned_filter", format="parquet", 
partitioning="hive")
fragment = list(dataset.get_fragments())[0]
```

```
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]:
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-32-...> in <module>
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
```

This is probably a "strange" thing to do, since a fragment from a partitioned
dataset already comes from only a single partition (so it will always satisfy
only a single equality expression). But it's still nice that, as a user, you
don't have to take care to pass only part of the filter down to
{{split_by_row_groups}}.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-06-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9104:
--

 Summary: [C++] Parquet encryption tests should write files to a 
temporary directory instead of the testing submodule's directory
 Key: ARROW-9104
 URL: https://issues.apache.org/jira/browse/ARROW-9104
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
 Fix For: 1.0.0


If the source directory is not writable, the test raises a permission denied error:

[ RUN      ] TestEncryptionConfiguration.UniformEncryption
unknown file: Failure
C++ exception with description "IOError: Failed to open local file
'/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
 Detail: [errno 13] Permission denied" thrown in the test body.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-06-11-0

2020-06-11 Thread Crossbow


Arrow Build Report for Job nightly-2020-06-11-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0

Failed Tasks:
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.8-jpype
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2010-cp37m
- wheel-manylinux2014-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2014-cp36m
- wheel-manylinux2014-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2014-cp38

Pending Tasks:
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2010-cp35m
- wheel-manylinux2014-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2014-cp35m

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-clean
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-

[jira] [Created] (ARROW-9103) [Python] Clarify behaviour of CSV reader for non-UTF8 text data

2020-06-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9103:


 Summary: [Python] Clarify behaviour of CSV reader for non-UTF8 
text data
 Key: ARROW-9103
 URL: https://issues.apache.org/jira/browse/ARROW-9103
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


See 
https://stackoverflow.com/questions/62153229/how-does-pyarrow-read-csv-handle-different-file-encodings/62321673#62321673



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9102) [Packaging] Upload built manylinux docker images

2020-06-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9102:
--

 Summary: [Packaging] Upload built manylinux docker images
 Key: ARROW-9102
 URL: https://issues.apache.org/jira/browse/ARROW-9102
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


Even though the secrets were set on Azure Pipelines, the upload step is failing:
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=13104&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

As a result, the manylinux builds take more than two hours. This is due to
Azure's secret handling: we need to explicitly export the Azure secret
variables as environment variables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9101) [Doc][C++][Python] Document encoding expected by CSV and JSON readers

2020-06-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9101:
-

 Summary: [Doc][C++][Python] Document encoding expected by CSV and 
JSON readers
 Key: ARROW-9101
 URL: https://issues.apache.org/jira/browse/ARROW-9101
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9100) Add ascii_lower kernel

2020-06-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9100:
---

 Summary: Add ascii_lower kernel
 Key: ARROW-9100
 URL: https://issues.apache.org/jira/browse/ARROW-9100
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Maarten Breddels






--
This message was sent by Atlassian Jira
(v8.3.4#803005)