Maarten -- AFAIK Micah's work only affects nested / non-flat column
paths, so flat data should not be impacted. Since we have a partial
implementation of writes for nested data (lists-of-lists and
structs-of-structs, but no mix of the two) that was the performance
difference I was referencing.
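
To illustrate why flat columns are unaffected: Parquet's Dremel-style record shredding only needs nontrivial definition/repetition levels for nullable or repeated (nested) paths; a flat required column needs no level arrays at all. Below is a toy sketch of level generation (not Arrow's actual C++ implementation) for a nullable list of required ints, where def 0 = null list, 1 = empty list, 2 = value present, and rep 0 starts a new record:

```python
def shred_list_column(rows):
    """Shred rows (each None or a list of ints) into
    (values, def_levels, rep_levels) for a nullable list of
    required ints: def 0 = null list, 1 = empty list, 2 = value
    present; rep 0 = first entry of a record, 1 = continuation."""
    values, defs, reps = [], [], []
    for row in rows:
        if row is None:
            defs.append(0)           # list itself is null
            reps.append(0)
        elif len(row) == 0:
            defs.append(1)           # list present but empty
            reps.append(0)
        else:
            for i, v in enumerate(row):
                values.append(v)
                defs.append(2)       # value is present
                reps.append(0 if i == 0 else 1)
    return values, defs, reps

print(shred_list_column([[1, 2], None, [], [3]]))
# ([1, 2, 3], [2, 2, 0, 1, 2], [0, 1, 0, 0, 0])
```

The nested/non-flat complexity mentioned above comes from generalizing this to arbitrary combinations of lists and structs, where each level of nesting adds to the maximum definition and repetition levels.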

On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>
> Hi Micah,
>
> How does the performance change for “flat” schemas?
> (particularly in the case of a large number of columns)
>
> Thanks,
> Maarten
>
>
>
> > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Another status update.  I've integrated the level generation code with the
> > parquet writing code [1].
> >
> > After that PR is merged I'll add bindings in Python to control versions of
> > the level generation algorithm and plan on moving on to the read side.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/pull/6586
> >
> > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Hi Igor,
> >> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
> >> might be a good task to pick up for this.  I think it should be a
> >> relatively small amount of code, so it is probably a good contribution to
> >> the project.  Once that is wrapped up we can see where we both are.
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com>
> >> wrote:
> >>
> >>> Hi Micah, I actually got involved with another personal project and had
> >>> to postpone my contribution to arrow a bit. The good news is that I'm
> >>> almost done with it, so I could help you with the read side very soon. Any
> >>> ideas how we could coordinate this?
> >>>
> >>> On Wed, Feb 26, 2020 at 9:06 PM Wes McKinney <wesmck...@gmail.com>
> >>> wrote:
> >>>
> >>>> hi Micah -- great news on the level generation PR. I'll try to carve
> >>>> out some time for reviewing over the coming week.
> >>>>
> >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hi Igor,
> >>>>> I was wondering if you have made any progress on this?
> >>>>>
> >>>>> I posted a new PR [1] which I believe handles the difficult algorithmic
> >>>>> part of writing.  There will be some follow-ups, but I think this PR
> >>>>> might take a while to review, so I was thinking of starting to take a
> >>>>> look at the read side if you haven't started yet, and circling back to
> >>>>> the final integration for the write side once the PR is checked in.
> >>>>>
> >>>>> Thanks,
> >>>>> Micah
> >>>>>
> >>>>> [1] https://github.com/apache/arrow/pull/6490
> >>>>>
> >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi, I would love to help with this issue. I'm aware that this is a
> >>>>>> huge task for a first contribution to arrow, but I feel that I could
> >>>>>> help with the read path.
> >>>>>> Reading Parquet seems like an extremely complex task: both Hive [0]
> >>>>>> and Spark [1] tried to implement a "vectorized" version, and both
> >>>>>> stopped short of supporting complex types.
> >>>>>> I wanted to at least give it a try and find out where the challenge
> >>>>>> lies.
> >>>>>>
> >>>>>> Since you guys are much more familiar with the current code base, I
> >>>>>> could use some starting tips so I don't fall into common pitfalls and
> >>>>>> whatnot.
> >>>>>>
> >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
> >>>>>> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> >>>>>>
> >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> >>>>>>> Just to give an update.  I've been a little bit delayed, but my
> >>>>>>> progress is as follows:
> >>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end tests.
> >>>>>>> 2.  Have another PR open that allows a configuration option in C++ to
> >>>>>>> determine which algorithm version to use for reading/writing: the
> >>>>>>> existing version and the new version supporting complex nested
> >>>>>>> arrays.  I think a large amount of code will be reused/delegated to,
> >>>>>>> but I will err on the side of not touching the existing
> >>>>>>> code/algorithms so that any errors in the implementation or
> >>>>>>> performance regressions can hopefully be mitigated at runtime.  I
> >>>>>>> expect in later releases (once the code has "baked") it will become a
> >>>>>>> no-op.
> >>>>>>> 3.  Started coding the write path.
> >>>>>>>
> >>>>>>> Which leaves:
> >>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks) to be code
> >>>>>>> complete.
> >>>>>>> 2.  Implementing the read path.
> >>>>>>>
> >>>>>>> Again, I'm happy to collaborate if people have bandwidth and want to
> >>>>>>> contribute.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Micah
> >>>>>>>
> >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Wes,
> >>>>>>>> I'm still interested in doing the work, but I don't want to hold
> >>>>>>>> anybody up if they have bandwidth.
> >>>>>>>>
> >>>>>>>> In order to actually make progress on this, my plan will be to:
> >>>>>>>> 1.  Help with the current Java review backlog through early next
> >>>>>>>> week or so (this has been taking the majority of my time allocated
> >>>>>>>> for Arrow contributions for the last 6 months or so).
> >>>>>>>> 2.  Shift all my attention to trying to get this done (this means no
> >>>>>>>> reviews other than closing out existing ones that I've started until
> >>>>>>>> it is done).  Hopefully, other Java committers can help shrink the
> >>>>>>>> backlog further (Jacques, thanks for your recent efforts here).
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Micah
> >>>>>>>>
> >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> hi folks,
> >>>>>>>>>
> >>>>>>>>> I think we have reached a point where the incomplete C++ Parquet
> >>>>>>>>> nested data assembly/disassembly is harming the value of several
> >>>>>>>>> other parts of the project, for example the Datasets API. As
> >>>>>>>>> another example, it's possible to ingest nested data from JSON but
> >>>>>>>>> not write it to Parquet in general.
> >>>>>>>>>
> >>>>>>>>> Implementing the nested data read and write path completely is a
> >>>>>>>>> difficult project requiring at least several weeks of dedicated
> >>>>>>>>> work, so it's not so surprising that it hasn't been accomplished
> >>>>>>>>> yet. I know that several people have expressed interest in working
> >>>>>>>>> on it, but I would like to see if anyone would be able to volunteer
> >>>>>>>>> a commitment of time and guess at a rough timeline for when this
> >>>>>>>>> work could be done. It seems to me that if this slips beyond 2020
> >>>>>>>>> it will significantly diminish the value being created by other
> >>>>>>>>> parts of the project.
> >>>>>>>>>
> >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm one
> >>>>>>>>> candidate person to take on this project (and I can dedicate the
> >>>>>>>>> time, but it would come at the expense of other projects where I
> >>>>>>>>> can also be useful). But Micah and others expressed interest in
> >>>>>>>>> working on it, so I wanted to have a discussion about it to see
> >>>>>>>>> what others think.
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Wes
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
>
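
For readers following the read-side discussion in the thread above: record assembly is the inverse of shredding, reconstructing nested values from the definition/repetition level streams. A toy sketch (again, not the Arrow C++ implementation) using the standard encoding for a nullable list of required ints, where def 0 = null list, 1 = empty list, 2 = value present, and rep 0 starts a new record:

```python
def assemble_list_column(values, def_levels, rep_levels):
    """Rebuild rows (each None or a list of ints) from shredded
    columns for a nullable list of required ints: def 0 = null
    list, 1 = empty list, 2 = value present; rep 0 starts a new
    record, rep 1 continues the current list."""
    rows, it = [], iter(values)
    for d, r in zip(def_levels, rep_levels):
        if r == 0:                      # repetition level 0 opens a new record
            rows.append(None if d == 0 else [])
        if d == 2:                      # max definition level: a real value
            rows[-1].append(next(it))
    return rows

print(assemble_list_column([1, 2, 3], [2, 2, 0, 1, 2], [0, 1, 0, 0, 0]))
# [[1, 2], None, [], [3]]
```

The "vectorized" readers mentioned in the thread process the level streams in batches rather than value-at-a-time as above, which is where much of the complexity for arbitrarily nested schemas comes from.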
