Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Micah Kornfield Sun, 29 Mar 2020 21:49:09 -0700

Hi Igor,
It looks like a good start.  Thank you! I left feedback on the PR.  I
should have some time over the next few weeks or so to handle the
combinations lists/structs that aren't currently supported in the core
reader.  As I get further into the code I'll chime in here if there are
other areas to collaborate on.


Thanks,
Micah

On Sun, Mar 29, 2020 at 4:27 PM Igor Calabria <[email protected]>
wrote:

> Hi Micah,
>
> Finally got around to doing some work on the reader's side. Like you
> suggested, I started with https://issues.apache.org/jira/browse/ARROW-7960.
>
> I never programmed C++ professionally so I opened the PR
> https://github.com/apache/arrow/pull/6758 as soon as possible to collect
> feedback while I continue the work. Next step is actually fixing the
> reader, since adding a explicit translation to maps broke it. It should be
> simple enough to just reuse StructReader.
>
> Em dom., 15 de mar. de 2020 às 01:17, Micah Kornfield <
> [email protected]> escreveu:
>
>> >
>> > I'd be OK with having the flag so long as the new code is the default.
>> > Otherwise we'll never find out about the corner cases.
>>
>> Completely agree, my PR makes the new code the default.
>>
>> On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney <[email protected]> wrote:
>>
>> > On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <[email protected]
>> >
>> > wrote:
>> > >
>> > > Maarten, I don't expect regressions for flat cases (I'm going to try
>> to
>> > run
>> > > benchmarks comparison tonight).
>> > >
>> > > In terms of the flag, I'm more concerned about some corner case I
>> didn't
>> > > think of in testing or a workload that for some reason is better with
>> the
>> > > prior code. If either of these arise I would like to give users a way
>> to
>> > > mitigate issues so a patch release isn't an imperative.
>> >
>> > I'd be OK with having the flag so long as the new code is the default.
>> > Otherwise we'll never find out about the corner cases.
>> >
>> > > I'm happy to cleanup the existing code as soon as the next release
>> > happens
>> > > though.
>> > >
>> > > Thoughts?
>> > >
>> > > -Micah
>> > >
>> > > On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <[email protected]>
>> > wrote:
>> > >
>> > > > Maarten -- AFAIK Micah's work only affects nested / non-flat column
>> > > > paths, so flat data should not be impacted. Since we have a partial
>> > > > implementation of writes for nested data (lists-of-lists and
>> > > > structs-of-structs, but no mix of the two) that was the performance
>> > > > difference I was referencing.
>> > > >
>> > > > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <
>> > [email protected]>
>> > > > wrote:
>> > > > >
>> > > > > Hi Micah,
>> > > > >
>> > > > > How does the performance change for “flat” schemas?
>> > > > > (particularly in the case of a large number of columns)
>> > > > >
>> > > > > Thanks,
>> > > > > Maarten
>> > > > >
>> > > > >
>> > > > >
>> > > > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <
>> > [email protected]>
>> > > > wrote:
>> > > > > >
>> > > > > > Another status update.  I've integrated the level generation
>> code
>> > with
>> > > > the
>> > > > > > parquet writing code [1].
>> > > > > >
>> > > > > > After that PR is merged I'll add bindings in Python to control
>> > > > versions of
>> > > > > > the level generation algorithm and plan on moving on to the read
>> > side.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Micah
>> > > > > >
>> > > > > > [1] https://github.com/apache/arrow/pull/6586
>> > > > > >
>> > > > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <
>> > [email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Hi Igor,
>> > > > > >> If you have the time
>> > https://issues.apache.org/jira/browse/ARROW-7960
>> > > > might
>> > > > > >> be a good task to pick up for this I think it should be a
>> > relatively
>> > > > small
>> > > > > >> amount of code, so it is probably a good contribution to the
>> > > > project.  Once
>> > > > > >> that is wrapped up we can see were we both are.
>> > > > > >>
>> > > > > >> Cheers,
>> > > > > >> Micah
>> > > > > >>
>> > > > > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <
>> > [email protected]
>> > > > >
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >>> Hi Micah, I actually got involved with another personal
>> project
>> > and
>> > > > had
>> > > > > >>> to postpone my contribution to arrow a bit. The good news is
>> > that I'm
>> > > > > >>> almost done with it, so I could help you with the read side
>> very
>> > > > soon. Any
>> > > > > >>> ideas how we could coordinate this?
>> > > > > >>>
>> > > > > >>> Em qua., 26 de fev. de 2020 às 21:06, Wes McKinney <
>> > > > [email protected]>
>> > > > > >>> escreveu:
>> > > > > >>>
>> > > > > >>>> hi Micah -- great news on the level generation PR. I'll try
>> to
>> > carve
>> > > > > >>>> out some time for reviewing over the coming week.
>> > > > > >>>>
>> > > > > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <
>> > > > [email protected]>
>> > > > > >>>> wrote:
>> > > > > >>>>>
>> > > > > >>>>> Hi Igor,
>> > > > > >>>>> I was wondering if you have made any progress on this?
>> > > > > >>>>>
>> > > > > >>>>> I posted a new PR [1] which I believe handles the difficult
>> > > > algorithmic
>> > > > > >>>>> part of writing.  There will be some follow-ups but I think
>> > this PR
>> > > > > >>>> might
>> > > > > >>>>> take a while to review, so I was thinking of starting to
>> take a
>> > > > look
>> > > > > >>>> at the
>> > > > > >>>>> read side if you haven't started yet, and circle back to the
>> > final
>> > > > > >>>>> integration for the write side once the PR is checked in.
>> > > > > >>>>>
>> > > > > >>>>> Thanks,
>> > > > > >>>>> Micah
>> > > > > >>>>>
>> > > > > >>>>> [1] https://github.com/apache/arrow/pull/6490
>> > > > > >>>>>
>> > > > > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <
>> > > > [email protected]>
>> > > > > >>>>> wrote:
>> > > > > >>>>>
>> > > > > >>>>>> Hi, I would love to help with this issue. I'm aware that
>> this
>> > is a
>> > > > > >>>> huge
>> > > > > >>>>>> task for a first contribution to arrow, but I feel that I
>> > could
>> > > > help
>> > > > > >>>> with
>> > > > > >>>>>> the read path.
>> > > > > >>>>>> Reading parquet seems like a extremely complex task since
>> both
>> > > > > >>>> hive[0] and
>> > > > > >>>>>> spark[1] tried to implement a "vectorized" version and they
>> > all
>> > > > > >>>> stopped
>> > > > > >>>>>> short of supporting complex types.
>> > > > > >>>>>> I wanted to at least give it a try and find out where the
>> > > > challenge
>> > > > > >>>> lies.
>> > > > > >>>>>>
>> > > > > >>>>>> Since you guys are much more familiar with the current code
>> > base,
>> > > > I
>> > > > > >>>> could
>> > > > > >>>>>> use some starting tips so I don't fall in common pitfalls
>> and
>> > > > > >>>> whatnot.
>> > > > > >>>>>>
>> > > > > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
>> > > > > >>>>>> [1]
>> > > > > >>>>>>
>> > > > > >>>>>>
>> > > > > >>>>
>> > > >
>> >
>> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
>> > > > > >>>>>>
>> > > > > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <[email protected]>
>> > wrote:
>> > > > > >>>>>>> Just to give an update.  I've been a little bit delayed,
>> but
>> > my
>> > > > > >>>> progress
>> > > > > >>>>>> is>
>> > > > > >>>>>>> as follows:>
>> > > > > >>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end
>> > tests.>
>> > > > > >>>>>>> 2.  Have another PR open that allows a configuration
>> option
>> > in
>> > > > C++
>> > > > > >>>> to>
>> > > > > >>>>>>> determine which algorithm version to use for
>> > reading/writing, the
>> > > > > >>>>>> existing>
>> > > > > >>>>>>> version and the new version supported complex-nested
>> > arrays.  I
>> > > > > >>>> think a>
>> > > > > >>>>>>> large amount of code will be reused/delegated to but I
>> will
>> > err
>> > > > on
>> > > > > >>>> the
>> > > > > >>>>>> side>
>> > > > > >>>>>>> of not touching the existing code/algorithms so that any
>> > errors
>> > > > in
>> > > > > >>>> the>
>> > > > > >>>>>>> implementation  or performance regressions can hopefully
>> be
>> > > > > >>>> mitigated at>
>> > > > > >>>>>>> runtime.  I expect in later releases (once the code has
>> > "baked")
>> > > > > >>>> will>
>> > > > > >>>>>>> become a no-op.>
>> > > > > >>>>>>> 3.  Started coding the write path.>
>> > > > > >>>>>>>
>> > > > > >>>>>>> Which leaves:>
>> > > > > >>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks) to be
>> > code
>> > > > > >>>> complete>
>> > > > > >>>>>>> 2.  Implementing the read path.>
>> > > > > >>>>>>>
>> > > > > >>>>>>> Again, I'm happy to collaborate if people have bandwidth
>> and
>> > want
>> > > > > >>>> to>
>> > > > > >>>>>>> contribute.>
>> > > > > >>>>>>>
>> > > > > >>>>>>> Thanks,>
>> > > > > >>>>>>> Micah>
>> > > > > >>>>>>>
>> > > > > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <
>> > [email protected]
>> > > > >>
>> > > > > >>>>>>> wrote:>
>> > > > > >>>>>>>
>> > > > > >>>>>>>> Hi Wes,>
>> > > > > >>>>>>>> I'm still interested in doing the work.  But don't to
>> hold
>> > > > > >>>> anybody up
>> > > > > >>>>>> if>
>> > > > > >>>>>>>> they have bandwidth.>
>> > > > > >>>>>>>>>
>> > > > > >>>>>>>> In order to actually make progress on this, my plan will
>> be
>> > to:>
>> > > > > >>>>>>>> 1.  Help with the current Java review backlog through
>> early
>> > next
>> > > > > >>>> week
>> > > > > >>>>>> or>
>> > > > > >>>>>>>> so (this has been taking the majority of my time
>> allocated
>> > for
>> > > > > >>>> Arrow>
>> > > > > >>>>>>>> contributions for the last 6 months or so).>
>> > > > > >>>>>>>> 2.  Shift all my attention to trying to get this done
>> (this
>> > > > > >>>> means no>
>> > > > > >>>>>>>> reviews other then closing out existing ones that I've
>> > started
>> > > > > >>>> until it
>> > > > > >>>>>> is>
>> > > > > >>>>>>>> done).  Hopefully, other Java committers can help shrink
>> the
>> > > > > >>>> backlog>
>> > > > > >>>>>>>> further (Jacques thanks for you recent efforts here).>
>> > > > > >>>>>>>>>
>> > > > > >>>>>>>> Thanks,>
>> > > > > >>>>>>>> Micah>
>> > > > > >>>>>>>>>
>> > > > > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <
>> > [email protected]>
>> > > > > >>>> wrote:>
>> > > > > >>>>>>>>>
>> > > > > >>>>>>>>> hi folks,>
>> > > > > >>>>>>>>>>
>> > > > > >>>>>>>>> I think we have reached a point where the incomplete C++
>> > > > > >>>> Parquet>
>> > > > > >>>>>>>>> nested data assembly/disassembly is harming the value of
>> > > > > >>>> several>
>> > > > > >>>>>>>>> others parts of the project, for example the Datasets
>> API.
>> > As
>> > > > > >>>> another>
>> > > > > >>>>>>>>> example, it's possible to ingest nested data from JSON
>> but
>> > not
>> > > > > >>>> write>
>> > > > > >>>>>>>>> it to Parquet in general.>
>> > > > > >>>>>>>>>>
>> > > > > >>>>>>>>> Implementing the nested data read and write path
>> > completely is
>> > > > > >>>> a>
>> > > > > >>>>>>>>> difficult project requiring at least several weeks of
>> > dedicated
>> > > > > >>>> work,>
>> > > > > >>>>>>>>> so it's not so surprising that it hasn't been
>> accomplished
>> > yet.
>> > > > > >>>> I
>> > > > > >>>>>> know>
>> > > > > >>>>>>>>> that several people have expressed interest in working
>> on
>> > it,
>> > > > > >>>> but I>
>> > > > > >>>>>>>>> would like to see if anyone would be able to volunteer a
>> > > > > >>>> commitment
>> > > > > >>>>>> of>
>> > > > > >>>>>>>>> time and guess on a rough timeline when this work could
>> be
>> > > > > >>>> done. It>
>> > > > > >>>>>>>>> seems to me if this slips beyond 2020 it will
>> significant
>> > > > > >>>> diminish
>> > > > > >>>>>> the>
>> > > > > >>>>>>>>> value being created by other parts of the project.>
>> > > > > >>>>>>>>>>
>> > > > > >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm
>> one
>> > > > > >>>> candidate>
>> > > > > >>>>>>>>> person to take on this project (and I can dedicate the
>> > time,
>> > > > > >>>> but it>
>> > > > > >>>>>>>>> would come at the expense of other projects where I can
>> > also
>> > > > be>
>> > > > > >>>>>>>>> useful). But Micah and others expressed interest in
>> > working on
>> > > > > >>>> it, so>
>> > > > > >>>>>>>>> I wanted to have a discussion about it to see what
>> others
>> > > > > >>>> think.>
>> > > > > >>>>>>>>>>
>> > > > > >>>>>>>>> Thanks>
>> > > > > >>>>>>>>> Wes>
>> > > > > >>>>>>>>>>
>> > > > > >>>>>>>>>
>> > > > > >>>>>>>
>> > > > > >>>>>>
>> > > > > >>>>
>> > > > > >>>
>> > > > >
>> > > >
>> >
>>
>

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Reply via email to