Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-17 Thread Wes McKinney
Sounds good. In general I would say that this is a good opportunity to make improvements around random data generation. For example, I don't think we have an API for generating a RecordBatch given a schema and some options (e.g. probability of nulls, distribution of list sizes), but
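
To make the idea of option-driven random data generation concrete, here is a minimal pure-Python sketch. It is not an Arrow API: the function name random_list_column and its parameters (null_probability, max_list_size, seed) are hypothetical, and it produces plain Python lists rather than a RecordBatch. It assumes a single column of optional lists of optional integers and a uniform distribution of list sizes.

```python
import random


def random_list_column(num_rows, null_probability=0.1, max_list_size=4, seed=0):
    """Hypothetical helper: build a column of optional lists of optional ints.

    Each row is None with probability `null_probability`; otherwise it is a
    list whose length is drawn uniformly from 0..max_list_size, and each
    element is None with the same probability.
    """
    rng = random.Random(seed)  # a fixed seed keeps failing cases reproducible
    column = []
    for _ in range(num_rows):
        if rng.random() < null_probability:
            column.append(None)  # null list
            continue
        size = rng.randint(0, max_list_size)  # assumed uniform size distribution
        column.append([
            None if rng.random() < null_probability else rng.randint(-100, 100)
            for _ in range(size)
        ])
    return column


column = random_list_column(1000, null_probability=0.2, max_list_size=3, seed=42)
```

Seeding the generator is the main design choice here: a round-trip test that fails on random input is only useful if the same input can be regenerated.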

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-16 Thread Micah Kornfield
Hi Wes, Thanks, that seems like a good characterization. I opened up some JIRA subtasks on ARROW-1644 which go into a little more detail on tasks that can probably be worked on in parallel (I've only assigned ones to myself that I'm actively working on, happy to discuss/collaborate on the

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-15 Thread Wes McKinney
hi Micah, Sounds good. It seems like there are a few projects where people might be able to work without stepping on each other's toes:
A. Array reassembly from raw repetition/definition levels (I would guess this would be your focus)
B. Schema and data generation for round-trip correctness and
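
As a rough illustration of what array reassembly from repetition/definition levels involves, here is a minimal Python sketch. It is not the Arrow C++ reader. It assumes a single column with the standard three-level Parquet list layout (an optional list of optional int32 elements), so the maximum definition level is 3 and the maximum repetition level is 1. The function name assemble_lists is hypothetical, and the input is assumed to be well formed.

```python
def assemble_lists(rep_levels, def_levels, values):
    """Rebuild rows of an optional list<optional int> column from levels.

    Assumed definition levels: 0 = null list, 1 = empty list,
    2 = null element, 3 = present element.
    Assumed repetition levels: 0 = first entry of a new row,
    1 = continuation of the current list.
    """
    rows = []
    value_iter = iter(values)  # only present elements carry a stored value
    for rep, dlevel in zip(rep_levels, def_levels):
        if rep == 0:
            if dlevel == 0:
                rows.append(None)  # the whole list is null
                continue
            rows.append([])  # start a new, non-null list
            if dlevel == 1:
                continue  # the list is empty, nothing to append
        # dlevel is 2 or 3 here: one element slot in the current list
        rows[-1].append(next(value_iter) if dlevel == 3 else None)
    return rows


rows = assemble_lists(
    rep_levels=[0, 1, 1, 0, 0, 0],
    def_levels=[3, 2, 3, 0, 1, 3],
    values=[1, 2, 3],
)
# rows == [[1, None, 2], None, [], [3]]
```

Deeper nesting and structs add more level values and more bookkeeping per level, which is where the corner cases discussed in this thread come from.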

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-14 Thread Micah Kornfield
Hi Wes, Yes, I'm making progress and at this point I anticipate being able to finish it off by next release, possibly without support for round tripping fixed size lists. I've been spending some time thinking about different approaches and have started coding some of the building blocks, which I

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-14 Thread Wes McKinney
hi Micah, I'm glad that we have the write side of nested completed for 0.17.0. As far as completing the read side and then implementing sufficient testing to exercise corner cases in end-to-end reads/writes, do you anticipate being able to work on this in the next 4-6 weeks (obviously the state

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-29 Thread Micah Kornfield
Hi Igor, It looks like a good start. Thank you! I left feedback on the PR. I should have some time over the next few weeks or so to handle the combinations of lists/structs that aren't currently supported in the core reader. As I get further into the code I'll chime in here if there are other

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-29 Thread Igor Calabria
Hi Micah, Finally got around to doing some work on the reader's side. Like you suggested, I started with https://issues.apache.org/jira/browse/ARROW-7960. I have never programmed C++ professionally, so I opened the PR https://github.com/apache/arrow/pull/6758 as soon as possible to collect feedback

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-14 Thread Micah Kornfield
>
> I'd be OK with having the flag so long as the new code is the default.
> Otherwise we'll never find out about the corner cases.
Completely agree, my PR makes the new code the default.
On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney wrote:
> On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield
>

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-13 Thread Wes McKinney
On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield wrote:
>
> Maarten, I don't expect regressions for flat cases (I'm going to try to run
> benchmarks comparison tonight).
>
> In terms of the flag, I'm more concerned about some corner case I didn't
> think of in testing or a workload that for some

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Micah Kornfield
Maarten, I don't expect regressions for flat cases (I'm going to try to run benchmarks comparison tonight). In terms of the flag, I'm more concerned about some corner case I didn't think of in testing or a workload that for some reason is better with the prior code. If either of these arise I

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Wes McKinney
Maarten -- AFAIK Micah's work only affects nested / non-flat column paths, so flat data should not be impacted. Since we have a partial implementation of writes for nested data (lists-of-lists and structs-of-structs, but no mix of the two) that was the performance difference I was referencing. On

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Maarten Ballintijn
Hi Micah, How does the performance change for “flat” schemas? (particularly in the case of a large number of columns) Thanks, Maarten
> On Mar 11, 2020, at 11:53 PM, Micah Kornfield wrote:
>
> Another status update. I've integrated the level generation code with the
> parquet writing code

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Wes McKinney
hi Micah, Great to hear about the progress, I'll help with code review. FWIW, if the new code passes the existing unit tests I would be in favor of deleting the old code so that we're fully invested in making the new code suitably fast. Jump in with two feet, so to speak. Thanks Wes On Wed,

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-11 Thread Micah Kornfield
Another status update. I've integrated the level generation code with the parquet writing code [1]. After that PR is merged I'll add bindings in Python to control versions of the level generation algorithm and plan on moving on to the read side. Thanks, Micah [1]
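
For readers unfamiliar with the term, "level generation" on the write side means turning nested values into the flat repetition/definition levels that Parquet stores. The following Python sketch shows the idea for one simple case and is not the code from the PR. It assumes a single column with the standard three-level Parquet list layout (an optional list of optional int32 elements), giving a maximum definition level of 3 and a maximum repetition level of 1. The function name generate_levels is hypothetical.

```python
def generate_levels(rows):
    """Flatten an optional list<optional int> column into Parquet-style levels.

    Assumed definition levels: 0 = null list, 1 = empty list,
    2 = null element, 3 = present element.
    Assumed repetition levels: 0 = first entry of a new row,
    1 = continuation of the current list.
    """
    rep_levels, def_levels, values = [], [], []
    for row in rows:
        if row is None:
            rep_levels.append(0)
            def_levels.append(0)  # null list: one entry, no value stored
        elif not row:
            rep_levels.append(0)
            def_levels.append(1)  # empty list: one entry, no value stored
        else:
            for i, item in enumerate(row):
                rep_levels.append(0 if i == 0 else 1)
                if item is None:
                    def_levels.append(2)  # null element, no value stored
                else:
                    def_levels.append(3)
                    values.append(item)  # only present elements are stored
    return rep_levels, def_levels, values


rep, dfn, vals = generate_levels([[1, None, 2], None, [], [3]])
# rep  == [0, 1, 1, 0, 0, 0]
# dfn  == [3, 2, 3, 0, 1, 3]
# vals == [1, 2, 3]
```

Note that null and empty lists still produce one level entry each, which is what lets a reader distinguish them later even though no value is written.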

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-03 Thread Micah Kornfield
Hi Igor, If you have the time https://issues.apache.org/jira/browse/ARROW-7960 might be a good task to pick up for this. I think it should be a relatively small amount of code, so it is probably a good contribution to the project. Once that is wrapped up we can see where we both are. Cheers, Micah

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-03 Thread Igor Calabria
Hi Micah, I actually got involved with another personal project and had to postpone my contribution to arrow a bit. The good news is that I'm almost done with it, so I could help you with the read side very soon. Any ideas how we could coordinate this? On Wed, Feb 26, 2020 at 21:06, Wes

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-26 Thread Wes McKinney
hi Micah -- great news on the level generation PR. I'll try to carve out some time for reviewing over the coming week.
On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield wrote:
>
> Hi Igor,
> I was wondering if you have made any progress on this?
>
> I posted a new PR [1] which I believe handles

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-26 Thread Micah Kornfield
Hi Igor, I was wondering if you have made any progress on this? I posted a new PR [1] which I believe handles the difficult algorithmic part of writing. There will be some follow-ups but I think this PR might take a while to review, so I was thinking of starting to take a look at the read side

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-04 Thread Micah Kornfield
>
> Glad to hear about the progress. As I mentioned on #2, what do you
> think about setting up a feature branch for you to merge PRs into?
> Then the branch can be iterated on and we can merge it back when it's
> feature complete and does not have perf regressions for the flat
> read/write path.

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-04 Thread Wes McKinney
hi Micah
On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield wrote:
>
> Just to give an update. I've been a little bit delayed, but my progress is
> as follows:
> 1. Had 1 PR merged that will exercise basic end-to-end tests.
> 2. Have another PR open that allows a configuration option in C++ to
>

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-03 Thread Micah Kornfield
Hi Igor, Thanks for the offer. There is an old PR [1] that contains the generic logic for reading and writing, which last time I looked seemed like a reasonable start for the read portion (I think the write portion can potentially be made more memory efficient (at the expense of some

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-03 Thread Igor Calabria
Hi, I would love to help with this issue. I'm aware that this is a huge task for a first contribution to arrow, but I feel that I could help with the read path. Reading parquet seems like an extremely complex task since both hive[0] and spark[1] tried to implement a "vectorized" version and they

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-02-02 Thread Micah Kornfield
Just to give an update. I've been a little bit delayed, but my progress is as follows:
1. Had 1 PR merged that will exercise basic end-to-end tests.
2. Have another PR open that allows a configuration option in C++ to determine which algorithm version to use for reading/writing, the existing

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-01-09 Thread Micah Kornfield
Hi Wes, I'm still interested in doing the work, but I don't want to hold anybody up if they have bandwidth. In order to actually make progress on this, my plan will be to: 1. Help with the current Java review backlog through early next week or so (this has been taking the majority of my time allocated

Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-01-09 Thread Wes McKinney
hi folks, I think we have reached a point where the incomplete C++ Parquet nested data assembly/disassembly is harming the value of several other parts of the project, for example the Datasets API. As another example, it's possible to ingest nested data from JSON but not write it to Parquet in