Re: [C++][Discuss] Approaches for SIMD optimizations
Hi Frank,

Are the performance numbers you published for the baseline directly from master? I'd like to look at this over the next few days to see if I can figure out what is going on.

To all: I'd like to make sure we flesh out the things to consider in general, for a path forward. My take is that we should still prefer writing code in this order:

1. Plain-old C++.
2. A SIMD wrapper library. My preference would be for something that is going to be standardized eventually, to limit third-party dependencies; the counterargument is that one of the libraries mentioned above may have much better feature coverage on advanced instruction sets. Please chime in if there are other things to consider. We should also have a rubric for when to make use of the library (i.e., what performance gain we get on a workload).
3. Native CPU intrinsics. We should develop a rubric for when to accept PRs for this, which should include: (1) the performance gain, and (2) the general popularity of the architecture.

For dynamic dispatch: I think we should probably continue down the path of building our own. I looked more at libsimdpp's implementation and it might be something we can use for guidance, but as it stands it doesn't seem to have hooks based on CPU manufacturer, which would be a requirement for BMI2 intrinsics. The alternative would be to ban BMI2 intrinsics from the code (this might not be a bad idea to limit complexity in general).

Thoughts?

Thanks,
Micah

On Wed, Jun 10, 2020 at 8:35 PM Du, Frank wrote:
> Thanks Jed.
>
> I collected some data on my setup: gcc version 7.5.0, Ubuntu 18.04.4 LTS, SSE build (-msse4.2).
>
> [Unroll baseline]
> for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
>   for (int64_t k = 0; k < kRoundFactor; k++) {
>     sum_rounded[k] += values[i + k];
>   }
> }
> SumKernelFloat/32768/0    2.91 us   2.90 us   239992   bytes_per_second=10.5063G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0   1.89 us   1.89 us   374470   bytes_per_second=16.1847G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0     11.6 us   11.6 us    60329   bytes_per_second=2.63274G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0    6.98 us   6.98 us   100293   bytes_per_second=4.3737G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0    3.89 us   3.88 us   180423   bytes_per_second=7.85862G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0    1.86 us   1.85 us   380477   bytes_per_second=16.4536G/s null_percent=0 size=32.768k
>
> [#pragma omp simd reduction(+:sum)]
> #pragma omp simd reduction(+:sum)
> for (int64_t i = 0; i < n; i++)
>   sum += values[i];
> SumKernelFloat/32768/0    2.97 us   2.96 us   235686   bytes_per_second=10.294G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0   2.97 us   2.97 us   236456   bytes_per_second=10.2875G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0     11.7 us   11.7 us    60006   bytes_per_second=2.61643G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0    5.47 us   5.47 us   127999   bytes_per_second=5.58002G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0    2.42 us   2.41 us   290635   bytes_per_second=12.6485G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0    1.82 us   1.82 us   386749   bytes_per_second=16.7733G/s null_percent=0 size=32.768k
>
> [SSE intrinsic]
> SumKernelFloat/32768/0    2.24 us   2.24 us   310914   bytes_per_second=13.6335G/s null_percent=0 size=32.768k
> SumKernelDouble/32768/0   1.43 us   1.43 us   486642   bytes_per_second=21.3266G/s null_percent=0 size=32.768k
> SumKernelInt8/32768/0     6.93 us   6.92 us   100720   bytes_per_second=4.41046G/s null_percent=0 size=32.768k
> SumKernelInt16/32768/0    3.14 us   3.14 us   222803   bytes_per_second=9.72931G/s null_percent=0 size=32.768k
> SumKernelInt32/32768/0    2.11 us   2.11 us   331388   bytes_per_second=14.4907G/s null_percent=0 size=32.768k
> SumKernelInt64/32768/0    1.32 us   1.32 us   532964   bytes_per_second=23.0728G/s null_percent=0 size=32.768k
>
> I tried tweaking kRoundFactor, using some unrolling combined with omp simd, and building with clang-8, but unfortunately I could never get the results up to the intrinsic version. The generated assembly all uses SIMD instructions, with only small differences such as instruction ordering or which xmm registers are used. What the compiler does under the hood is still something of a mystery to me.
>
> Thanks,
> Frank
>
> -----Original Message-----
> From: Jed Brown
> Sent: Thursday, June 11, 2020 1:58 AM
> To: Du, Frank ; dev@arrow.apache.org
> Subject: RE: [C++][Discuss] Approaches for SIMD optimizations
>
> "Du, Frank" writes:
>
> > The PR I committed provides basic support for runtime dispatching. I
> > agree that the compiler should generate good vec
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Dear all,

I want to thank you all for your kind help. It is a great honor to work with you in this great community. I hope we can contribute more and make the community better.

Best,
Liya Fan

On Fri, Jun 12, 2020 at 12:02 PM Ji Liu wrote:

> Thanks everyone for the warm welcome!
> It's a great honor for me to be a committer. Looking forward to
> contributing more to the community.
>
> Thanks,
> Ji Liu
>
>
> paddy horan wrote on Fri, Jun 12, 2020 at 8:52 AM:
>
> > Congrats!
> >
> >
> > From: Micah Kornfield
> > Sent: Thursday, June 11, 2020 12:59:32 PM
> > To: dev
> > Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
> >
> > Congratulations!
> >
> > On Thu, Jun 11, 2020 at 9:32 AM David Li wrote:
> >
> > > Congrats Ji & Liya!
> > >
> > > David
> > >
> > > On 6/11/20, siddharth teotia wrote:
> > > > Congratulations!
> > > >
> > > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > > > wrote:
> > > >
> > > >> Congratulations, both!
> > > >>
> > > >> Neal
> > > >>
> > > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney
> > > wrote:
> > > >>
> > > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and
> > Liya
> > > >> > Fan have been invited to be Arrow committers and they have both
> > > >> > accepted.
> > > >> >
> > > >> > Welcome, and thank you for your contributions!
> > > >> >
> > > >
> > > > --
> > > > *Best Regards,*
> > > > *SIDDHARTH TEOTIA*
> > > > *2008C6PS540G*
> > > > *BITS PILANI- GOA CAMPUS*
> > > >
> > > > *+91 87911 75932*
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Thanks everyone for the warm welcome! It's a great honor for me to be a committer. Looking forward to contributing more to the community.

Thanks,
Ji Liu

paddy horan wrote on Fri, Jun 12, 2020 at 8:52 AM:

> Congrats!
>
>
> From: Micah Kornfield
> Sent: Thursday, June 11, 2020 12:59:32 PM
> To: dev
> Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
>
> Congratulations!
>
> On Thu, Jun 11, 2020 at 9:32 AM David Li wrote:
>
> > Congrats Ji & Liya!
> >
> > David
> >
> > On 6/11/20, siddharth teotia wrote:
> > > Congratulations!
> > >
> > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > > wrote:
> > >
> > >> Congratulations, both!
> > >>
> > >> Neal
> > >>
> > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney
> > wrote:
> > >>
> > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and
> Liya
> > >> > Fan have been invited to be Arrow committers and they have both
> > >> > accepted.
> > >> >
> > >> > Welcome, and thank you for your contributions!
> > >> >
> > >
> > > --
> > > *Best Regards,*
> > > *SIDDHARTH TEOTIA*
> > > *2008C6PS540G*
> > > *BITS PILANI- GOA CAMPUS*
> > >
> > > *+91 87911 75932*
Re: Help with Java PR backlog
I would like to help with the review. I will spend some time on it later today. Best, Liya Fan On Fri, Jun 12, 2020 at 9:56 AM Wes McKinney wrote: > hi folks, > > There's a number of Java PRs that seem like they are close to being in > a merge-ready state, could we try to get the Java backlog mostly > closed out before the next release (in a few weeks)? > > Thanks > Wes >
Help with Java PR backlog
hi folks, There's a number of Java PRs that seem like they are close to being in a merge-ready state, could we try to get the Java backlog mostly closed out before the next release (in a few weeks)? Thanks Wes
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Congrats! From: Micah Kornfield Sent: Thursday, June 11, 2020 12:59:32 PM To: dev Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan Congratulations! On Thu, Jun 11, 2020 at 9:32 AM David Li wrote: > Congrats Ji & Liya! > > David > > On 6/11/20, siddharth teotia wrote: > > Congratulations! > > > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson > > > > wrote: > > > >> Congratulations, both! > >> > >> Neal > >> > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney > wrote: > >> > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya > >> > Fan have been invited to be Arrow committers and they have both > >> > accepted. > >> > > >> > Welcome, and thank you for your contributions! > >> > > >> > > > > > > -- > > *Best Regards,* > > *SIDDHARTH TEOTIA* > > *2008C6PS540G* > > *BITS PILANI- GOA CAMPUS* > > > > *+91 87911 75932* > > >
[jira] [Created] (ARROW-9112) [R] Update autobrew script location
Neal Richardson created ARROW-9112: -- Summary: [R] Update autobrew script location Key: ARROW-9112 URL: https://issues.apache.org/jira/browse/ARROW-9112 Project: Apache Arrow Issue Type: Task Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 Jeroen is moving it to a different location. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9111) csv.read_csv progress bar
Jeff Hammerbacher created ARROW-9111: Summary: csv.read_csv progress bar Key: ARROW-9111 URL: https://issues.apache.org/jira/browse/ARROW-9111 Project: Apache Arrow Issue Type: Improvement Affects Versions: 0.17.1 Reporter: Jeff Hammerbacher When reading a very large csv file, it would be nice to see some diagnostic output from pyarrow. [readr|https://readr.tidyverse.org/reference/read_delim.html] has a `progress` parameter, for example. [tqdm|https://github.com/tqdm/tqdm] is often used in the Python community to provide this functionality. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9110) [C++] Fix CPU cache size detection on macOS
Krisztian Szucs created ARROW-9110: -- Summary: [C++] Fix CPU cache size detection on macOS Key: ARROW-9110 URL: https://issues.apache.org/jira/browse/ARROW-9110 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Running certain benchmarks on macOS never ends because CpuInfo detects the RAM size as the size of L1 cache. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9109) [Python][Packaging] Enable S3 support in manylinux wheels
Antoine Pitrou created ARROW-9109: - Summary: [Python][Packaging] Enable S3 support in manylinux wheels Key: ARROW-9109 URL: https://issues.apache.org/jira/browse/ARROW-9109 Project: Apache Arrow Issue Type: Sub-task Components: Packaging, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Congratulations! On Thu, Jun 11, 2020 at 9:32 AM David Li wrote: > Congrats Ji & Liya! > > David > > On 6/11/20, siddharth teotia wrote: > > Congratulations! > > > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson > > > > wrote: > > > >> Congratulations, both! > >> > >> Neal > >> > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney > wrote: > >> > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya > >> > Fan have been invited to be Arrow committers and they have both > >> > accepted. > >> > > >> > Welcome, and thank you for your contributions! > >> > > >> > > > > > > -- > > *Best Regards,* > > *SIDDHARTH TEOTIA* > > *2008C6PS540G* > > *BITS PILANI- GOA CAMPUS* > > > > *+91 87911 75932* > > >
Re: [DISCUSS] Move JIRA notifications to separate mailing list?
jira@ was just created so I think you can go ahead and request the notification scheme changes. Thanks for doing this On Wed, Jun 10, 2020, 11:14 PM Wes McKinney wrote: > I just requested jira@arrow.a.o be created > > On Wed, Jun 10, 2020, 11:06 PM Neal Richardson < > neal.p.richard...@gmail.com> wrote: > >> I like it. That has some symmetry with how we handle github notifications: >> github@ for everything, commits@ for just commits. >> >> I looked into creating new mailing lists, and it appears that it's not >> available to all PMC members, only the PMC chair: >> https://selfserve.apache.org/mail.html. But once that is done, I'm happy >> to >> handle working with INFRA on updating what goes where. >> >> Neal >> >> >> On Wed, Jun 10, 2020 at 8:16 PM Wes McKinney wrote: >> >> > Here's my proposal: >> > >> > * Create new mailing list j...@arrow.apache.org and move all JIRA >> > activity to that list (what currently goes to issues@) >> > * Send new issues notifications to issues@arrow.a.o. Stop sending >> > these e-mails to dev@ >> > * Encourage dev@ subscribers to subscribe to issues@arrow.a.o >> > >> > Absent dissent I would suggest going ahead and asking INFRA to do >> > this. Note that any PMC member can create the new jira@ mailing list >> > (do this first, don't ask INFRA to do it) >> > >> > On Mon, Jun 8, 2020 at 2:33 PM Wes McKinney >> wrote: >> > > >> > > I'm openly not very sympathetic toward people who don't take time to >> > > set up e-mail filters but I support having two e-mail lists: >> > > >> > > * One having new issues only. I think that active developers need to >> > > see new issues to create awareness of what others are doing in the >> > > project, so I think we should really encourage people to subscribe to >> > > this list (and set up an e-mail filter if they don't want the e-mails >> > > coming into their inbox). 
While I think having less "noise" on dev@ >> is >> > > a good thing (even though it's only "noise" if you don't set up e-mail >> > > filters) I'm concerned that this action will decrease developer >> > > engagement in the project. There are of course other ways [1] to >> > > subscribe to the JIRA activity feed if getting notifications in Slack >> > > or Zulip is your thing. >> > > * One having all JIRA traffic (i.e. what is currently at >> > > https://lists.apache.org/list.html?iss...@arrow.apache.org) >> > > >> > > [1]: https://github.com/ursa-labs/jira-zulip-bridge >> > > >> > > On Mon, Jun 8, 2020 at 1:57 PM Antoine Pitrou >> > wrote: >> > > > >> > > > >> > > > I would welcome a separate list, but only with notifications of new >> > JIRA >> > > > issues. I am not interested in generic JIRA traffic. >> > > > >> > > > Regards >> > > > >> > > > Antoine. >> > > > >> > > > >> > > > On 08/06/2020 at 20:46, Neal Richardson wrote: >> > > > > And if you're like me, and this message got filtered out of your >> > inbox >> > > > > because it is from dev@ and contains "JIRA" in the subject, well, >> > maybe >> > > > > that demonstrates the problem ;) >> > > > > >> > > > > On Mon, Jun 8, 2020 at 11:43 AM Neal Richardson < >> > neal.p.richard...@gmail.com> >> > > > > wrote: >> > > > > >> > > > >> Hi all, >> > > > >> I've noticed that some other Apache projects have a separate >> > mailing list >> > > > >> for JIRA notifications (Spark, for example, has >> > iss...@spark.apache.org). >> > > > >> The result is that the dev@ mailing list is focused on actual >> > discussion >> > > > >> threads (like this!), votes, and other official business. Would >> we >> > be >> > > > >> interested in doing the same? >> > > > >> >> > > > >> In my opinion, the status quo is not great. 
The dev@ archives ( >> > > > > https://lists.apache.org/list.html?dev@arrow.apache.org) aren't >> > that >> > > > > readable/browseable to me, and if I want to see what's going on >> in >> JIRA, I >> > > > > go to JIRA. In fact, the first thing I/we recommend to people >> > signing up >> > > > > for the mailing list is to set up email filters to exclude the >> JIRA >> noise. >> > > > > Having a separate mailing list will make it easier for people to >> > manage >> > > > > their own information streams better. >> > > > > >> > > > >> The counterargument is that moving JIRA traffic to a separate >> > mailing >> > > > >> list, requiring an additional subscribe action, might mean that >> > developers >> > > > >> miss out on things like new issues being created. I'm not >> personally >> > > > >> worried about this because I suspect that many of us already >> aren't >> > using >> > > > >> the mailing list to stay on top of JIRA issues, and that those >> who >> > want the >> > > > >> JIRA stream in their email can easily opt-in (subscribe). But I'm >> > > > >> interested in the community's opinions on this. >> > > > >> >> > > > >> Thoughts? >> > > > >> >> > > > >> Neal >> > > > >> >> > > > > >> > >> >
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Congrats Ji & Liya! David On 6/11/20, siddharth teotia wrote: > Congratulations! > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson > > wrote: > >> Congratulations, both! >> >> Neal >> >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney wrote: >> >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya >> > Fan have been invited to be Arrow committers and they have both >> > accepted. >> > >> > Welcome, and thank you for your contributions! >> > >> > > > -- > *Best Regards,* > *SIDDHARTH TEOTIA* > *2008C6PS540G* > *BITS PILANI- GOA CAMPUS* > > *+91 87911 75932* >
[jira] [Created] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
Francois Saint-Jacques created ARROW-9108: - Summary: [C++][Dataset] Add Parquet Statistics conversion for timestamp columns Key: ARROW-9108 URL: https://issues.apache.org/jira/browse/ARROW-9108 Project: Apache Arrow Issue Type: Sub-task Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9107) [C++][Dataset] Time-based types support
Francois Saint-Jacques created ARROW-9107: - Summary: [C++][Dataset] Time-based types support Key: ARROW-9107 URL: https://issues.apache.org/jira/browse/ARROW-9107 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Francois Saint-Jacques We lack support for date/timestamp partitions and for predicate pushdown rules on them. Timestamp columns are usually the most important predicate in OLAP-style queries; we need to support this transparently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9106) [C++] Add C++ foundation to ease file transcoding
Antoine Pitrou created ARROW-9106: - Summary: [C++] Add C++ foundation to ease file transcoding Key: ARROW-9106 URL: https://issues.apache.org/jira/browse/ARROW-9106 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou In some situations (e.g. reading a Windows-produced CSV file), the user might transcode data before ingesting it into Arrow. Rather than build transcoding in C++ (which would require a library of encodings), we could delegate it to bindings as needed, by providing a generic InputStream facility. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: python plasma client get_buffers behavior
Sorry, typo in the above email. I have to call del python_object to make sure the underlying PlasmaBuffer is released from plasma immediately, so that the release is not delayed until python_object is rebound on the next iteration of the poll loop.

On Thu, Jun 11, 2020 at 8:49 PM saurabh pratap singh <
saurabh.cs...@gmail.com> wrote:

> Figured it out: I have to call del obj to make sure the underlying
> PlasmaBuffer is released.
>
> On Wed, Jun 10, 2020 at 9:17 PM saurabh pratap singh <
> saurabh.cs...@gmail.com> wrote:
>
>> Hi
>>
>> We are using the python plasma client to do a get_buffers for arrow tables
>> created by java in plasma.
>>
>> The python plasma client basically polls on a queue and does a get_buffers
>> on the object ids returned from the queue.
>> What I have observed, in the context of the plasma object table entry for
>> those object ids, is that get_buffers will first increment the ref count
>> by 1 and then there is an implicit release call which decreases the ref
>> count again.
>>
>> But when there are no more entries in the queue, I see that a few object
>> ids still have a lingering reference count in plasma wrt get_buffers, and
>> there was no "implicit" release for that get call like the previous one.
>>
>> Is this expected?
>> Is there any way I can handle this and make an explicit release for such
>> object ids as well?
>>
>> Thanks
Re: python plasma client get_buffers behavior
Figured it out: I have to call del obj to make sure the underlying PlasmaBuffer is released.

On Wed, Jun 10, 2020 at 9:17 PM saurabh pratap singh <
saurabh.cs...@gmail.com> wrote:

> Hi
>
> We are using the python plasma client to do a get_buffers for arrow tables
> created by java in plasma.
>
> The python plasma client basically polls on a queue and does a get_buffers
> on the object ids returned from the queue.
> What I have observed, in the context of the plasma object table entry for
> those object ids, is that get_buffers will first increment the ref count
> by 1 and then there is an implicit release call which decreases the ref
> count again.
>
> But when there are no more entries in the queue, I see that a few object
> ids still have a lingering reference count in plasma wrt get_buffers, and
> there was no "implicit" release for that get call like the previous one.
>
> Is this expected?
> Is there any way I can handle this and make an explicit release for such
> object ids as well?
>
> Thanks
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Congratulations! On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson wrote: > Congratulations, both! > > Neal > > On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney wrote: > > > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya > > Fan have been invited to be Arrow committers and they have both > > accepted. > > > > Welcome, and thank you for your contributions! > > > -- *Best Regards,* *SIDDHARTH TEOTIA* *2008C6PS540G* *BITS PILANI- GOA CAMPUS* *+91 87911 75932*
Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
Congratulations, both! Neal On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney wrote: > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya > Fan have been invited to be Arrow committers and they have both > accepted. > > Welcome, and thank you for your contributions! >
[ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya Fan have been invited to be Arrow committers and they have both accepted. Welcome, and thank you for your contributions!
[jira] [Created] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field
Joris Van den Bossche created ARROW-9105: Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field Key: ARROW-9105 URL: https://issues.apache.org/jira/browse/ARROW-9105 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

When splitting a fragment into row group fragments, filtering on the partition field raises an error. Python reproducer:

```
import pandas as pd

df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols="part", engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
fragment = list(dataset.get_fragments())[0]
```

```
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]:
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
```

This is probably a "strange" thing to do, since a fragment from a partitioned dataset already comes from a single partition (so will always only satisfy a single equality expression). But it's still nice that as a user you don't have to care about passing only part of the filter down to {{split_by_row_group}}.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory
Krisztian Szucs created ARROW-9104: -- Summary: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory Key: ARROW-9104 URL: https://issues.apache.org/jira/browse/ARROW-9104 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Krisztian Szucs Fix For: 1.0.0

If the source directory is not writable, the test raises a permission denied error:

[ RUN      ] TestEncryptionConfiguration.UniformEncryption
unknown file: Failure
C++ exception with description "IOError: Failed to open local file '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'. Detail: [errno 13] Permission denied" thrown in the test body.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2020-06-11-0
Arrow Build Report for Job nightly-2020-06-11-0

All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0

Failed Tasks:
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-test-conda-python-3.8-jpype
- wheel-manylinux2010-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2010-cp37m
- wheel-manylinux2014-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2014-cp36m
- wheel-manylinux2014-cp38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2014-cp38

Pending Tasks:
- wheel-manylinux2010-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2010-cp35m
- wheel-manylinux2014-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-wheel-manylinux2014-cp35m

Succeeded Tasks:
- centos-6-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-clean
- conda-linux-gcc-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-11-0-
[jira] [Created] (ARROW-9103) [Python] Clarify behaviour of CSV reader for non-UTF8 text data
Joris Van den Bossche created ARROW-9103: Summary: [Python] Clarify behaviour of CSV reader for non-UTF8 text data Key: ARROW-9103 URL: https://issues.apache.org/jira/browse/ARROW-9103 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche See https://stackoverflow.com/questions/62153229/how-does-pyarrow-read-csv-handle-different-file-encodings/62321673#62321673 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9102) [Packaging] Upload built manylinux docker images
Krisztian Szucs created ARROW-9102: -- Summary: [Packaging] Upload built manylinux docker images Key: ARROW-9102 URL: https://issues.apache.org/jira/browse/ARROW-9102 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 Even though the secrets were set on Azure Pipelines, the upload step is failing: https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=13104&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181 so the manylinux builds take more than two hours. This is due to Azure's secret handling: we need to explicitly export the Azure secret variables as environment variables. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9101) [Doc][C++][Python] Document encoding expected by CSV and JSON readers
Antoine Pitrou created ARROW-9101: - Summary: [Doc][C++][Python] Document encoding expected by CSV and JSON readers Key: ARROW-9101 URL: https://issues.apache.org/jira/browse/ARROW-9101 Project: Apache Arrow Issue Type: Task Components: C++, Documentation, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9100) Add ascii_lower kernel
Maarten Breddels created ARROW-9100: --- Summary: Add ascii_lower kernel Key: ARROW-9100 URL: https://issues.apache.org/jira/browse/ARROW-9100 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Maarten Breddels -- This message was sent by Atlassian Jira (v8.3.4#803005)