Hi Igor, It looks like a good start. Thank you! I left feedback on the PR. I should have some time over the next few weeks or so to handle the combinations lists/structs that aren't currently supported in the core reader. As I get further into the code I'll chime in here if there are other areas to collaborate on.
Thanks, Micah On Sun, Mar 29, 2020 at 4:27 PM Igor Calabria <igor.calab...@gmail.com> wrote: > Hi Micah, > > Finally got around to doing some work on the reader's side. Like you > suggested, I started with https://issues.apache.org/jira/browse/ARROW-7960. > > I never programmed C++ professionally so I opened the PR > https://github.com/apache/arrow/pull/6758 as soon as possible to collect > feedback while I continue the work. Next step is actually fixing the > reader, since adding a explicit translation to maps broke it. It should be > simple enough to just reuse StructReader. > > Em dom., 15 de mar. de 2020 às 01:17, Micah Kornfield < > emkornfi...@gmail.com> escreveu: > >> > >> > I'd be OK with having the flag so long as the new code is the default. >> > Otherwise we'll never find out about the corner cases. >> >> Completely agree, my PR makes the new code the default. >> >> On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney <wesmck...@gmail.com> wrote: >> >> > On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <emkornfi...@gmail.com >> > >> > wrote: >> > > >> > > Maarten, I don't expect regressions for flat cases (I'm going to try >> to >> > run >> > > benchmarks comparison tonight). >> > > >> > > In terms of the flag, I'm more concerned about some corner case I >> didn't >> > > think of in testing or a workload that for some reason is better with >> the >> > > prior code. If either of these arise I would like to give users a way >> to >> > > mitigate issues so a patch release isn't an imperative. >> > >> > I'd be OK with having the flag so long as the new code is the default. >> > Otherwise we'll never find out about the corner cases. >> > >> > > I'm happy to cleanup the existing code as soon as the next release >> > happens >> > > though. >> > > >> > > Thoughts? >> > > >> > > -Micah >> > > >> > > On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <wesmck...@gmail.com> >> > wrote: >> > > >> > > > Maarten -- AFAIK Micah's work only affects nested / non-flat column >> > > > paths, so flat data should not be impacted. Since we have a partial >> > > > implementation of writes for nested data (lists-of-lists and >> > > > structs-of-structs, but no mix of the two) that was the performance >> > > > difference I was referencing. >> > > > >> > > > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn < >> > maart...@xs4all.nl> >> > > > wrote: >> > > > > >> > > > > Hi Micah, >> > > > > >> > > > > How does the performance change for “flat” schemas? >> > > > > (particularly in the case of a large number of columns) >> > > > > >> > > > > Thanks, >> > > > > Maarten >> > > > > >> > > > > >> > > > > >> > > > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield < >> > emkornfi...@gmail.com> >> > > > wrote: >> > > > > > >> > > > > > Another status update. I've integrated the level generation >> code >> > with >> > > > the >> > > > > > parquet writing code [1]. >> > > > > > >> > > > > > After that PR is merged I'll add bindings in Python to control >> > > > versions of >> > > > > > the level generation algorithm and plan on moving on to the read >> > side. >> > > > > > >> > > > > > Thanks, >> > > > > > Micah >> > > > > > >> > > > > > [1] https://github.com/apache/arrow/pull/6586 >> > > > > > >> > > > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield < >> > emkornfi...@gmail.com> >> > > > > > wrote: >> > > > > > >> > > > > >> Hi Igor, >> > > > > >> If you have the time >> > https://issues.apache.org/jira/browse/ARROW-7960 >> > > > might >> > > > > >> be a good task to pick up for this I think it should be a >> > relatively >> > > > small >> > > > > >> amount of code, so it is probably a good contribution to the >> > > > project. Once >> > > > > >> that is wrapped up we can see were we both are. >> > > > > >> >> > > > > >> Cheers, >> > > > > >> Micah >> > > > > >> >> > > > > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria < >> > igor.calab...@gmail.com >> > > > > >> > > > > >> wrote: >> > > > > >> >> > > > > >>> Hi Micah, I actually got involved with another personal >> project >> > and >> > > > had >> > > > > >>> to postpone my contribution to arrow a bit. The good news is >> > that I'm >> > > > > >>> almost done with it, so I could help you with the read side >> very >> > > > soon. Any >> > > > > >>> ideas how we could coordinate this? >> > > > > >>> >> > > > > >>> Em qua., 26 de fev. de 2020 às 21:06, Wes McKinney < >> > > > wesmck...@gmail.com> >> > > > > >>> escreveu: >> > > > > >>> >> > > > > >>>> hi Micah -- great news on the level generation PR. I'll try >> to >> > carve >> > > > > >>>> out some time for reviewing over the coming week. >> > > > > >>>> >> > > > > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield < >> > > > emkornfi...@gmail.com> >> > > > > >>>> wrote: >> > > > > >>>>> >> > > > > >>>>> Hi Igor, >> > > > > >>>>> I was wondering if you have made any progress on this? >> > > > > >>>>> >> > > > > >>>>> I posted a new PR [1] which I believe handles the difficult >> > > > algorithmic >> > > > > >>>>> part of writing. There will be some follow-ups but I think >> > this PR >> > > > > >>>> might >> > > > > >>>>> take a while to review, so I was thinking of starting to >> take a >> > > > look >> > > > > >>>> at the >> > > > > >>>>> read side if you haven't started yet, and circle back to the >> > final >> > > > > >>>>> integration for the write side once the PR is checked in. >> > > > > >>>>> >> > > > > >>>>> Thanks, >> > > > > >>>>> Micah >> > > > > >>>>> >> > > > > >>>>> [1] https://github.com/apache/arrow/pull/6490 >> > > > > >>>>> >> > > > > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria < >> > > > igor.calab...@gmail.com> >> > > > > >>>>> wrote: >> > > > > >>>>> >> > > > > >>>>>> Hi, I would love to help with this issue. I'm aware that >> this >> > is a >> > > > > >>>> huge >> > > > > >>>>>> task for a first contribution to arrow, but I feel that I >> > could >> > > > help >> > > > > >>>> with >> > > > > >>>>>> the read path. >> > > > > >>>>>> Reading parquet seems like a extremely complex task since >> both >> > > > > >>>> hive[0] and >> > > > > >>>>>> spark[1] tried to implement a "vectorized" version and they >> > all >> > > > > >>>> stopped >> > > > > >>>>>> short of supporting complex types. >> > > > > >>>>>> I wanted to at least give it a try and find out where the >> > > > challenge >> > > > > >>>> lies. >> > > > > >>>>>> >> > > > > >>>>>> Since you guys are much more familiar with the current code >> > base, >> > > > I >> > > > > >>>> could >> > > > > >>>>>> use some starting tips so I don't fall in common pitfalls >> and >> > > > > >>>> whatnot. >> > > > > >>>>>> >> > > > > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576 >> > > > > >>>>>> [1] >> > > > > >>>>>> >> > > > > >>>>>> >> > > > > >>>> >> > > > >> > >> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45 >> > > > > >>>>>> >> > > > > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> >> > wrote: >> > > > > >>>>>>> Just to give an update. I've been a little bit delayed, >> but >> > my >> > > > > >>>> progress >> > > > > >>>>>> is> >> > > > > >>>>>>> as follows:> >> > > > > >>>>>>> 1. Had 1 PR merged that will exercise basic end-to-end >> > tests.> >> > > > > >>>>>>> 2. Have another PR open that allows a configuration >> option >> > in >> > > > C++ >> > > > > >>>> to> >> > > > > >>>>>>> determine which algorithm version to use for >> > reading/writing, the >> > > > > >>>>>> existing> >> > > > > >>>>>>> version and the new version supported complex-nested >> > arrays. I >> > > > > >>>> think a> >> > > > > >>>>>>> large amount of code will be reused/delegated to but I >> will >> > err >> > > > on >> > > > > >>>> the >> > > > > >>>>>> side> >> > > > > >>>>>>> of not touching the existing code/algorithms so that any >> > errors >> > > > in >> > > > > >>>> the> >> > > > > >>>>>>> implementation or performance regressions can hopefully >> be >> > > > > >>>> mitigated at> >> > > > > >>>>>>> runtime. I expect in later releases (once the code has >> > "baked") >> > > > > >>>> will> >> > > > > >>>>>>> become a no-op.> >> > > > > >>>>>>> 3. Started coding the write path.> >> > > > > >>>>>>> >> > > > > >>>>>>> Which leaves:> >> > > > > >>>>>>> 1. Finishing the write path (I estimate 2-3 weeks) to be >> > code >> > > > > >>>> complete> >> > > > > >>>>>>> 2. Implementing the read path.> >> > > > > >>>>>>> >> > > > > >>>>>>> Again, I'm happy to collaborate if people have bandwidth >> and >> > want >> > > > > >>>> to> >> > > > > >>>>>>> contribute.> >> > > > > >>>>>>> >> > > > > >>>>>>> Thanks,> >> > > > > >>>>>>> Micah> >> > > > > >>>>>>> >> > > > > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield < >> > em...@gmail.com >> > > > >> >> > > > > >>>>>>> wrote:> >> > > > > >>>>>>> >> > > > > >>>>>>>> Hi Wes,> >> > > > > >>>>>>>> I'm still interested in doing the work. But don't to >> hold >> > > > > >>>> anybody up >> > > > > >>>>>> if> >> > > > > >>>>>>>> they have bandwidth.> >> > > > > >>>>>>>>> >> > > > > >>>>>>>> In order to actually make progress on this, my plan will >> be >> > to:> >> > > > > >>>>>>>> 1. Help with the current Java review backlog through >> early >> > next >> > > > > >>>> week >> > > > > >>>>>> or> >> > > > > >>>>>>>> so (this has been taking the majority of my time >> allocated >> > for >> > > > > >>>> Arrow> >> > > > > >>>>>>>> contributions for the last 6 months or so).> >> > > > > >>>>>>>> 2. Shift all my attention to trying to get this done >> (this >> > > > > >>>> means no> >> > > > > >>>>>>>> reviews other then closing out existing ones that I've >> > started >> > > > > >>>> until it >> > > > > >>>>>> is> >> > > > > >>>>>>>> done). Hopefully, other Java committers can help shrink >> the >> > > > > >>>> backlog> >> > > > > >>>>>>>> further (Jacques thanks for you recent efforts here).> >> > > > > >>>>>>>>> >> > > > > >>>>>>>> Thanks,> >> > > > > >>>>>>>> Micah> >> > > > > >>>>>>>>> >> > > > > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney < >> > we...@gmail.com> >> > > > > >>>> wrote:> >> > > > > >>>>>>>>> >> > > > > >>>>>>>>> hi folks,> >> > > > > >>>>>>>>>> >> > > > > >>>>>>>>> I think we have reached a point where the incomplete C++ >> > > > > >>>> Parquet> >> > > > > >>>>>>>>> nested data assembly/disassembly is harming the value of >> > > > > >>>> several> >> > > > > >>>>>>>>> others parts of the project, for example the Datasets >> API. >> > As >> > > > > >>>> another> >> > > > > >>>>>>>>> example, it's possible to ingest nested data from JSON >> but >> > not >> > > > > >>>> write> >> > > > > >>>>>>>>> it to Parquet in general.> >> > > > > >>>>>>>>>> >> > > > > >>>>>>>>> Implementing the nested data read and write path >> > completely is >> > > > > >>>> a> >> > > > > >>>>>>>>> difficult project requiring at least several weeks of >> > dedicated >> > > > > >>>> work,> >> > > > > >>>>>>>>> so it's not so surprising that it hasn't been >> accomplished >> > yet. >> > > > > >>>> I >> > > > > >>>>>> know> >> > > > > >>>>>>>>> that several people have expressed interest in working >> on >> > it, >> > > > > >>>> but I> >> > > > > >>>>>>>>> would like to see if anyone would be able to volunteer a >> > > > > >>>> commitment >> > > > > >>>>>> of> >> > > > > >>>>>>>>> time and guess on a rough timeline when this work could >> be >> > > > > >>>> done. It> >> > > > > >>>>>>>>> seems to me if this slips beyond 2020 it will >> significant >> > > > > >>>> diminish >> > > > > >>>>>> the> >> > > > > >>>>>>>>> value being created by other parts of the project.> >> > > > > >>>>>>>>>> >> > > > > >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm >> one >> > > > > >>>> candidate> >> > > > > >>>>>>>>> person to take on this project (and I can dedicate the >> > time, >> > > > > >>>> but it> >> > > > > >>>>>>>>> would come at the expense of other projects where I can >> > also >> > > > be> >> > > > > >>>>>>>>> useful). But Micah and others expressed interest in >> > working on >> > > > > >>>> it, so> >> > > > > >>>>>>>>> I wanted to have a discussion about it to see what >> others >> > > > > >>>> think.> >> > > > > >>>>>>>>>> >> > > > > >>>>>>>>> Thanks> >> > > > > >>>>>>>>> Wes> >> > > > > >>>>>>>>>> >> > > > > >>>>>>>>> >> > > > > >>>>>>> >> > > > > >>>>>> >> > > > > >>>> >> > > > > >>> >> > > > > >> > > > >> > >> >