Hi Igor,
Thanks for the offer.

There is an old PR [1] that contains the generic logic for reading and
writing, which last time I looked seemed like a reasonable start for the
read portion (I think the write portion can potentially be made more
memory efficient, at the expense of some recomputation).  The difficulty
is the genericness of the algorithm, which I don't think currently takes
advantage of the optimizations made in the Parquet reading code [2].
The task is roughly to adapt the two.  Given that previous attempts at
this have caused performance regressions for common cases, I thought it
best to make the least invasive changes to the existing code and enable
the new features via a flag; please follow the discussion there [3].
Finally, please see the discussion of some trickier points on the JIRA
[4].
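To make the level handling concrete, here is a minimal, self-contained
sketch of the Dremel-style record assembly the read path has to perform
(this is my illustration, not the code in [1] or [2]).  It reassembles a
single optional list<optional int32> column; the level semantics are
specific to that assumed schema (max definition level 3, max repetition
level 1):

#include <cstdint>
#include <vector>

// A minimal sketch (not the actual Arrow/Parquet code) of record
// assembly for one optional list<optional int32> column.  Assumed level
// semantics for this schema:
//   def == 0 -> the list itself is null
//   def == 1 -> the list is present but empty
//   def == 2 -> an element slot exists but the element is null
//   def == 3 -> the element is present (consumes one leaf value)
//   rep == 0 -> starts a new record; rep == 1 -> continues the current list
struct ListColumn {
  std::vector<int32_t> offsets{0};  // Arrow-style list boundaries
  std::vector<bool> list_valid;     // validity of each list
  std::vector<int32_t> values;      // flattened element values
  std::vector<bool> value_valid;    // validity of each element
};

ListColumn AssembleLists(const std::vector<int16_t>& def_levels,
                         const std::vector<int16_t>& rep_levels,
                         const std::vector<int32_t>& leaf_values) {
  ListColumn out;
  size_t value_idx = 0;
  for (size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {  // start a new (possibly null/empty) list
      out.offsets.push_back(out.offsets.back());
      out.list_valid.push_back(def_levels[i] > 0);
    }
    if (def_levels[i] >= 2) {  // an element slot exists in this list
      ++out.offsets.back();
      out.value_valid.push_back(def_levels[i] == 3);
      // A null element occupies a slot, but no leaf value was stored.
      out.values.push_back(def_levels[i] == 3 ? leaf_values[value_idx++] : 0);
    }
  }
  return out;
}

// E.g. rows [[1, null, 3], null, []] arrive as def = {3, 2, 3, 0, 1},
// rep = {0, 1, 1, 0, 0}, leaf_values = {1, 3}, and reassemble to
// offsets {0, 3, 3, 3} with list validity {true, false, true}.

The generic algorithm has to do this for arbitrary nesting depth, which
is where reconciling it with the optimized paths in [2] gets tricky.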

If this seems too involved, there is also supporting work that would be
helpful: checking in additional "golden" Parquet files containing nested
data, and adding disabled unit tests that read them.
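For example, a disabled test against a golden file might look roughly
like the following sketch (Arrow's C++ tests use googletest, whose
DISABLED_ prefix skips a test until it is renamed; the test name and the
golden-file path here are my own placeholders):

#include <memory>

#include <gtest/gtest.h>

#include "arrow/io/file.h"
#include "arrow/memory_pool.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"

// Disabled until the nested read path lands.  The golden-file path is
// an assumption for illustration.
TEST(TestNestedGoldenFiles, DISABLED_ReadNestedLists) {
  auto maybe_file =
      arrow::io::ReadableFile::Open("data/nested_lists.snappy.parquet");
  ASSERT_TRUE(maybe_file.ok());

  // Wrap the Parquet file in an Arrow-aware reader.
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ASSERT_TRUE(parquet::arrow::OpenFile(*maybe_file,
                                       arrow::default_memory_pool(), &reader)
                  .ok());

  // Reading the whole table is what currently fails for nested columns;
  // once it succeeds, also check the result's structural integrity.
  std::shared_ptr<arrow::Table> table;
  ASSERT_TRUE(reader->ReadTable(&table).ok());
  ASSERT_TRUE(table->Validate().ok());
}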

If you do have time to try the read path, please try to create smaller,
standalone JIRAs (as subtasks) and PRs for them.

If you need more details, specific questions would be helpful.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4066
[2]
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc
[3] https://github.com/apache/arrow/pull/6337
[4]
https://issues.apache.org/jira/browse/ARROW-1644?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%20nested%22

On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com>
wrote:

> Hi, I would love to help with this issue. I'm aware that this is a huge
> task for a first contribution to arrow, but I feel that I could help with
> the read path.
> Reading parquet seems like an extremely complex task, since both Hive [0]
> and Spark [1] tried to implement a "vectorized" version and both stopped
> short of supporting complex types.
> I wanted to at least give it a try and find out where the challenge lies.
>
> Since you guys are much more familiar with the current code base, I could
> use some starting tips so I don't fall into common pitfalls and whatnot.
>
> [0] https://issues.apache.org/jira/browse/HIVE-18576
> [1]
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
>
> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> > Just to give an update.  I've been a little bit delayed, but my
> > progress is as follows:
> > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
> > 2.  Have another PR open that allows a configuration option in C++ to
> > determine which algorithm version to use for reading/writing: the
> > existing version and the new version supporting complex nested arrays.
> > I think a large amount of code will be reused/delegated to, but I will
> > err on the side of not touching the existing code/algorithms so that
> > any errors in the implementation or performance regressions can
> > hopefully be mitigated at runtime.  I expect in later releases (once
> > the code has "baked") this will become a no-op.
> > 3.  Started coding the write path.
> >
> > Which leaves:
> > 1.  Finishing the write path (I estimate 2-3 weeks) to be code complete
> > 2.  Implementing the read path.
> >
> > Again, I'm happy to collaborate if people have bandwidth and want to
> > contribute.
> >
> > Thanks,
> > Micah
> >
> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > Hi Wes,
> > > I'm still interested in doing the work.  But I don't want to hold
> > > anybody up if they have bandwidth.
> > >
> > > In order to actually make progress on this, my plan will be to:
> > > 1.  Help with the current Java review backlog through early next
> > > week or so (this has been taking the majority of my time allocated
> > > for Arrow contributions for the last 6 months or so).
> > > 2.  Shift all my attention to trying to get this done (this means no
> > > reviews other than closing out existing ones that I've started until
> > > it is done).  Hopefully, other Java committers can help shrink the
> > > backlog further (Jacques, thanks for your recent efforts here).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > >> hi folks,
> > >>
> > >> I think we have reached a point where the incomplete C++ Parquet
> > >> nested data assembly/disassembly is harming the value of several
> > >> other parts of the project, for example the Datasets API. As another
> > >> example, it's possible to ingest nested data from JSON but not write
> > >> it to Parquet in general.
> > >>
> > >> Implementing the nested data read and write path completely is a
> > >> difficult project requiring at least several weeks of dedicated
> > >> work, so it's not so surprising that it hasn't been accomplished
> > >> yet. I know that several people have expressed interest in working
> > >> on it, but I would like to see if anyone would be able to volunteer
> > >> a commitment of time and guess at a rough timeline for when this
> > >> work could be done. It seems to me if this slips beyond 2020 it will
> > >> significantly diminish the value being created by other parts of the
> > >> project.
> > >>
> > >> Since I'm pretty familiar with all the Parquet code I'm one
> > >> candidate person to take on this project (and I can dedicate the
> > >> time, but it would come at the expense of other projects where I can
> > >> also be useful). But Micah and others expressed interest in working
> > >> on it, so I wanted to have a discussion about it to see what others
> > >> think.
> > >>
> > >> Thanks
> > >> Wes
> > >>
> > >
> >
>
