Re: Feather v2 random access

2020-06-24 Thread Yue Ni
Hi François, Thanks so much for the very detailed explanation, and that makes sense to me. I will check out the links for more information. @Wes, ARROW-8250 is very useful to me as well and I will keep an eye on it. Thanks. On Wed, Jun 24, 2020 at 11:08 PM Wes McKinney wrote: > See also this J

Re: [DISCUSS][C++] Performance work and compiler standardization for linux

2020-06-24 Thread Wes McKinney
On Wed, Jun 24, 2020 at 9:48 PM Micah Kornfield wrote: > > In that case I would propose the following: > 1. Standardize on clang for generating performance numbers for > performance-related PRs > 2. Adjust our binary artifact builds to use clang where feasible (I think > this should wait until after

Re: [DISCUSS] Addition of a feature enum

2020-06-24 Thread Micah Kornfield
I've updated the PR. More feedback welcome, I'd like to start a vote by end-of-week if possible. On Wed, Jun 24, 2020 at 12:48 PM Micah Kornfield wrote: > I agree flight might need to encode this data slightly differently for > negotiation purposes. I will update the enum to use power of 2 val

Re: [DISCUSS][C++] Performance work and compiler standardization for linux

2020-06-24 Thread Micah Kornfield
In that case I would propose the following: 1. Standardize on clang for generating performance numbers for performance-related PRs 2. Adjust our binary artifact builds to use clang where feasible (I think this should wait until after our next release). 3. Add to the contributors guide summarizing the

[DISCUSS] Incrementing Arrow MetadataVersion from V4 to V5 for 1.0.0 release

2020-06-24 Thread Wes McKinney
hi folks, This has come up in some other contexts, but I believe it would be a good idea to increment the version number in Schema.fbs starting with 1.0.0 to separate the pre-1.0 and post-1.0 worlds: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22 Given that we are contemplating

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Wes McKinney
I drafted the specification changes that would be associated with the union changes (https://github.com/apache/arrow/pull/7535). I'll start a separate discussion about incrementing the MetadataVersion since that must be discussed independently. Please take a look. On Wed, Jun 24, 2020 at 3:50 PM We

[DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-24 Thread Wes McKinney
hi folks, (cross-posting to dev@arrow and dev@parquet since there are stakeholders in both places) It seems there are still problems at least with the C++ implementation of LZ4 compression in Parquet files: https://issues.apache.org/jira/browse/PARQUET-1241 https://issues.apache.org/jira/browse/P

Re: Arrow sync call June 24 at 12:00 US/Eastern, 16:00 UTC

2020-06-24 Thread Andy Grove
The call clashed with the Spark AI Summit keynote as well, so that may have been a contributing factor. On Wed, Jun 24, 2020, 10:11 AM Neal Richardson wrote: > Attendees: > Prudhvi Porandla > Neal Richardson > > Discussion: > * Everyone must be so focused on getting things done for the 1.0 relea

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Wes McKinney
I should also add that we could (with some effort) use the MetadataVersion V4/V5 indicator to offer backward compatibility for old serialized union data. In any case, if there is consensus about this, we would need to have a vote and get busy with implementing and testing the changes. I could assis

Re: [DISCUSS] Addition of a feature enum

2020-06-24 Thread Micah Kornfield
I agree Flight might need to encode this data slightly differently for negotiation purposes. I will update the enum to use power-of-2 values so this isn't precluded, but I think for parsing in the schema, it is clearer to model this as a list of enums. Any other thoughts? Thanks, Micah On Tue
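
The point about power-of-2 enum values can be illustrated with a minimal sketch. The enum and its member names below are hypothetical and are not taken from the PR or from Schema.fbs; the sketch only shows why power-of-2 values leave both options open: the same set of features can be modelled as a list of enums or packed into a single integer bitmask for negotiation.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class Feature(IntFlag):
    """Hypothetical feature flags; the names are illustrative only."""
    FEATURE_A = 1  # 2**0
    FEATURE_B = 2  # 2**1
    FEATURE_C = 4  # 2**2


def to_bitmask(features):
    """Collapse a list of features into one integer bitmask."""
    return int(reduce(or_, features, Feature(0)))


def from_bitmask(mask):
    """Expand a bitmask back into the list of individual features it contains."""
    return [f for f in Feature if f.value & mask]


# The list form (as it might be modelled in a schema) and the bitmask form
# (as it might be exchanged during negotiation) carry the same information.
listed = [Feature.FEATURE_A, Feature.FEATURE_C]
mask = to_bitmask(listed)
```

Because every value occupies its own bit, converting between the two representations is lossless in both directions.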

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Wes McKinney
On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques wrote: > > OTOH, > > how do we handle NullType -> UnionType cast conversion? Do we > require some convention like the first child's ArrayData null bitmap > to be set and all tags set to 0? Sure, that sounds like a reasonable implementation s

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Francois Saint-Jacques
OTOH, how do we handle NullType -> UnionType cast conversion? Do we require some convention like the first child's ArrayData null bitmap to be set and all tags set to 0? François On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou wrote: > > > Le 24/06/2020 à 18:34, Wes McKinney a écrit : > > On We

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-24 Thread Francois Saint-Jacques
+1 (binding)

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-24 Thread Antoine Pitrou
+1 (binding) Le 23/06/2020 à 20:35, Wes McKinney a écrit : > Hi, > > As discussed on the mailing list [1] I would like to add a "bit width" > field to our Decimal metadata to allow for supporting different > Decimal physical sizes other than 128-bit (where 32- and 64-bit > representations are re

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Antoine Pitrou
Le 24/06/2020 à 18:34, Wes McKinney a écrit : > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou wrote: >> >> >> Le 24/06/2020 à 16:57, Wes McKinney a écrit : >>> hi folks, >>> >>> As discussed on the recent GitHub PR [1], as a means of reconciling >>> the long-standing cross-implementation incom

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Wes McKinney
On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou wrote: > > > Le 24/06/2020 à 16:57, Wes McKinney a écrit : > > hi folks, > > > > As discussed on the recent GitHub PR [1], as a means of reconciling > > the long-standing cross-implementation incompatibilities with Union > > types, it's been proposed

Re: Arrow sync call June 24 at 12:00 US/Eastern, 16:00 UTC

2020-06-24 Thread Neal Richardson
Attendees: Prudhvi Porandla, Neal Richardson. Discussion: * Everyone must be so focused on getting things done for the 1.0 release that they didn't have time to join the call :shrug: On Wed, Jun 24, 2020 at 9:01 AM Neal Richardson wrote: > Hi all, > Last minute reminder that our biweekly call is

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Antoine Pitrou
Le 24/06/2020 à 16:57, Wes McKinney a écrit : > hi folks, > > As discussed on the recent GitHub PR [1], as a means of reconciling > the long-standing cross-implementation incompatibilities with Union > types, it's been proposed to remove the top-level validity bitmap from > the Union data layout

Arrow sync call June 24 at 12:00 US/Eastern, 16:00 UTC

2020-06-24 Thread Neal Richardson
Hi all, Last-minute reminder that our biweekly call is starting now at https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will be sent out to the mailing list afterward. Neal

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-24 Thread Wes McKinney
+1 (binding) On Wed, Jun 24, 2020 at 2:03 AM Micah Kornfield wrote: > > +1 (binding) > > On Tue, Jun 23, 2020 at 11:35 AM Wes McKinney wrote: > > > Hi, > > > > As discussed on the mailing list [1] I would like to add a "bit width" > > field to our Decimal metadata to allow for supporting differe

Re: Renaming master branch, removing blacklist/whitelist

2020-06-24 Thread Jacques Nadeau
Hi Suvayu, thanks for sharing your experiences. Clearly we have work to do. With regard to specific name changes, I agree with Wes. If something is negative to a non-trivial portion of the population, why not use something that avoids that issue where possible? On Fri, Jun 19, 2020, 7:44 PM Suvayu Ali

Re: [DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Jacques Nadeau
Per my comments on the PR, I also think this is preferred. I believe we will avoid the potential for validity inconsistency and simplify construction of union data in most cases. On Wed, Jun 24, 2020, 7:58 AM Wes McKinney wrote: > hi folks, > > As discussed on the recent GitHub PR [1], as a mean

Re: Feather v2 random access

2020-06-24 Thread Wes McKinney
See also this JIRA (https://issues.apache.org/jira/browse/ARROW-8250) regarding adding random access read APIs for IPC files (and thus Feather). I hope to see this implemented someday. On Wed, Jun 24, 2020 at 10:03 AM Francois Saint-Jacques wrote: > > I forgot to mention that you can see how this

Re: Feather v2 random access

2020-06-24 Thread Francois Saint-Jacques
I forgot to mention that you can see how this is glued in `feather::reader::Read` [1]. This makes it obvious that nothing is cached and everything is loaded in memory. François [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/feather.cc#L715-L723 On Wed, Jun 24, 2020 at 10:53 A

[DISCUSS] Removing top-level validity bitmap from Union type

2020-06-24 Thread Wes McKinney
hi folks, As discussed on the recent GitHub PR [1], as a means of reconciling the long-standing cross-implementation incompatibilities with Union types, it's been proposed to remove the top-level validity bitmap from the Union data layout and let validity be determined exclusively by the child arr

Re: Feather v2 random access

2020-06-24 Thread Francois Saint-Jacques
Hello Yue, FeatherV2 is just a facade for the Arrow IPC file format. You can find the implementation here [1]. I will try to answer your question with inline comments. On a high level, the file format writes a schema and then multiple "chunks" called RecordBatch. Your lowest level of granularity

[NIGHTLY] Arrow Build Report for Job nightly-2020-06-24-0

2020-06-24 Thread Crossbow
Arrow Build Report for Job nightly-2020-06-24-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-24-0 Failed Tasks: - centos-7-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-24-0-travis-centos-7-aarch64 - centos-8-am

Re: [VOTE] Add Decimal::bitWidth field to Schema.fbs for forward compatibility

2020-06-24 Thread Micah Kornfield
+1 (binding) On Tue, Jun 23, 2020 at 11:35 AM Wes McKinney wrote: > Hi, > > As discussed on the mailing list [1] I would like to add a "bit width" > field to our Decimal metadata to allow for supporting different > Decimal physical sizes other than 128-bit (where 32- and 64-bit > representations