Maarten -- AFAIK Micah's work only affects nested / non-flat column paths, so flat data should not be impacted. Since we have a partial implementation of writes for nested data (lists-of-lists and structs-of-structs, but no mix of the two), that partial path was the performance difference I was referencing.
On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Hi Micah,
>
> How does the performance change for "flat" schemas?
> (particularly in the case of a large number of columns)
>
> Thanks,
> Maarten
>
> On Mar 11, 2020, at 11:53 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Another status update. I've integrated the level generation code with the
> > parquet writing code [1].
> >
> > After that PR is merged I'll add bindings in Python to control versions of
> > the level generation algorithm and plan on moving on to the read side.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/pull/6586
> >
> > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > Hi Igor,
> > > If you have the time, https://issues.apache.org/jira/browse/ARROW-7960 might
> > > be a good task to pick up for this. I think it should be a relatively small
> > > amount of code, so it is probably a good contribution to the project. Once
> > > that is wrapped up we can see where we both are.
> > >
> > > Cheers,
> > > Micah
> > >
> > > On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com> wrote:
> > > >
> > > > Hi Micah, I actually got involved with another personal project and had
> > > > to postpone my contribution to arrow a bit. The good news is that I'm
> > > > almost done with it, so I could help you with the read side very soon. Any
> > > > ideas how we could coordinate this?
> > > >
> > > > On Wed, Feb 26, 2020 at 9:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > hi Micah -- great news on the level generation PR. I'll try to carve
> > > > > out some time for reviewing over the coming week.
> > > > >
> > > > > On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Igor,
> > > > > > I was wondering if you have made any progress on this?
> > > > > >
> > > > > > I posted a new PR [1] which I believe handles the difficult algorithmic
> > > > > > part of writing. There will be some follow-ups, but I think this PR might
> > > > > > take a while to review, so I was thinking of starting to take a look at the
> > > > > > read side if you haven't started yet, and circling back to the final
> > > > > > integration for the write side once the PR is checked in.
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1] https://github.com/apache/arrow/pull/6490
> > > > > >
> > > > > > On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi, I would love to help with this issue. I'm aware that this is a huge
> > > > > > > task for a first contribution to arrow, but I feel that I could help with
> > > > > > > the read path.
> > > > > > > Reading parquet seems like an extremely complex task, since both hive [0]
> > > > > > > and spark [1] tried to implement a "vectorized" version and both stopped
> > > > > > > short of supporting complex types.
> > > > > > > I wanted to at least give it a try and find out where the challenge lies.
> > > > > > >
> > > > > > > Since you guys are much more familiar with the current code base, I could
> > > > > > > use some starting tips so I don't fall into common pitfalls and whatnot.
> > > > > > >
> > > > > > > [0] https://issues.apache.org/jira/browse/HIVE-18576
> > > > > > > [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > > > > > >
> > > > > > > On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Just to give an update. I've been a little bit delayed, but my progress is
> > > > > > > > as follows:
> > > > > > > > 1. Had 1 PR merged that will exercise basic end-to-end tests.
> > > > > > > > 2. Have another PR open that adds a configuration option in C++ to
> > > > > > > > determine which algorithm version to use for reading/writing: the existing
> > > > > > > > version, or the new version supporting complex nested arrays. I think a
> > > > > > > > large amount of code will be reused/delegated to, but I will err on the
> > > > > > > > side of not touching the existing code/algorithms so that any errors in
> > > > > > > > the implementation or performance regressions can hopefully be mitigated
> > > > > > > > at runtime. I expect in later releases (once the code has "baked") this
> > > > > > > > option will become a no-op.
> > > > > > > > 3. Started coding the write path.
> > > > > > > >
> > > > > > > > Which leaves:
> > > > > > > > 1. Finishing the write path (I estimate 2-3 weeks) to be code complete.
> > > > > > > > 2. Implementing the read path.
> > > > > > > >
> > > > > > > > Again, I'm happy to collaborate if people have bandwidth and want to
> > > > > > > > contribute.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Micah
> > > > > > > >
> > > > > > > > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Wes,
> > > > > > > > > I'm still interested in doing the work, but don't want to hold anybody
> > > > > > > > > up if they have bandwidth.
> > > > > > > > >
> > > > > > > > > In order to actually make progress on this, my plan will be to:
> > > > > > > > > 1. Help with the current Java review backlog through early next week or
> > > > > > > > > so (this has been taking the majority of my time allocated for Arrow
> > > > > > > > > contributions for the last 6 months or so).
> > > > > > > > > 2. Shift all my attention to trying to get this done (this means no
> > > > > > > > > reviews other than closing out existing ones that I've started until it
> > > > > > > > > is done). Hopefully, other Java committers can help shrink the backlog
> > > > > > > > > further (Jacques, thanks for your recent efforts here).
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Micah
> > > > > > > > >
> > > > > > > > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > hi folks,
> > > > > > > > > >
> > > > > > > > > > I think we have reached a point where the incomplete C++ Parquet
> > > > > > > > > > nested data assembly/disassembly is harming the value of several
> > > > > > > > > > other parts of the project, for example the Datasets API. As another
> > > > > > > > > > example, it's possible to ingest nested data from JSON but not write
> > > > > > > > > > it to Parquet in general.
> > > > > > > > > >
> > > > > > > > > > Implementing the nested data read and write path completely is a
> > > > > > > > > > difficult project requiring at least several weeks of dedicated work,
> > > > > > > > > > so it's not so surprising that it hasn't been accomplished yet. I know
> > > > > > > > > > that several people have expressed interest in working on it, but I
> > > > > > > > > > would like to see if anyone would be able to volunteer a commitment of
> > > > > > > > > > time and guess on a rough timeline when this work could be done. It
> > > > > > > > > > seems to me if this slips beyond 2020 it will significantly diminish
> > > > > > > > > > the value being created by other parts of the project.
> > > > > > > > > >
> > > > > > > > > > Since I'm pretty familiar with all the Parquet code, I'm one candidate
> > > > > > > > > > person to take on this project (and I can dedicate the time, but it
> > > > > > > > > > would come at the expense of other projects where I can also be
> > > > > > > > > > useful). But Micah and others expressed interest in working on it, so
> > > > > > > > > > I wanted to have a discussion about it to see what others think.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Wes
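[Editor's illustration] The "level generation" discussed throughout this thread is the Dremel-style computation of repetition/definition levels that Parquet uses to encode nested, nullable data as flat columns. Below is a toy sketch of that idea for a single `optional list<optional int>` column; it is not Arrow's actual implementation, and the function name and level encoding are illustrative only:

```python
# Toy sketch (not Arrow's C++ code) of Dremel-style level generation
# for one optional list<optional int32> column.
# Definition levels: 0 = list is null, 1 = list is empty,
#                    2 = element is null, 3 = element is present.
# Repetition levels: 0 = first element of a row, 1 = continuation.
def gen_levels(rows):
    reps, defs, values = [], [], []
    for row in rows:
        if row is None:                 # null list
            reps.append(0); defs.append(0)
        elif len(row) == 0:             # empty list
            reps.append(0); defs.append(1)
        else:
            for i, v in enumerate(row):
                reps.append(0 if i == 0 else 1)
                if v is None:           # null element
                    defs.append(2)
                else:                   # present element
                    defs.append(3)
                    values.append(v)
    return reps, defs, values

r, d, v = gen_levels([[1, None, 3], None, []])
# r == [0, 1, 1, 0, 0], d == [3, 2, 3, 0, 1], v == [1, 3]
```

The hard part alluded to in the thread is doing this efficiently (batched, without per-row branching) and in reverse on the read path, where levels must be reassembled into arbitrarily nested list/struct arrays.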