Hi Micah,

How does the performance change for “flat” schemas?
(particularly in the case of a large number of columns)
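
[Editor's note: for background on why flat schemas are the cheap case here — a non-nested optional column only needs 0/1 definition levels and no repetition levels, so level generation is a single pass per column. A toy illustration of that encoding, not the actual Arrow C++ implementation:]

```python
def encode_flat(values):
    """Dremel-style levels for a flat optional column: the definition
    level is 1 when a value is present and 0 when it is null; no
    repetition levels are needed because there is no nesting."""
    def_levels = [0 if v is None else 1 for v in values]
    non_null = [v for v in values if v is not None]
    return def_levels, non_null

print(encode_flat([7, None, 9]))  # ([1, 0, 1], [7, 9])
```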

Thanks,
Maarten



> On Mar 11, 2020, at 11:53 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> 
> Another status update.  I've integrated the level generation code with the
> parquet writing code [1].
> 
> After that PR is merged I'll add bindings in Python to control versions of
> the level generation algorithm and plan on moving on to the read side.
> 
> Thanks,
> Micah
> 
> [1] https://github.com/apache/arrow/pull/6586
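
[Editor's note: the "level generation" discussed throughout this thread is the Dremel-style repetition/definition level encoding Parquet uses to flatten nested data into columns. A toy sketch for an optional list of required ints — illustrative only, not the algorithm in the PR above:]

```python
def encode_list_levels(lists):
    """Toy Dremel encoding for an optional list<required int> column:
    max definition level is 2 (list present + element present),
    max repetition level is 1 (one repeated level)."""
    rep_levels, def_levels, values = [], [], []
    for lst in lists:
        if lst is None:      # null list: nothing below it is defined
            rep_levels.append(0)
            def_levels.append(0)
        elif not lst:        # empty list: list defined, no elements
            rep_levels.append(0)
            def_levels.append(1)
        else:
            for i, v in enumerate(lst):
                # rep level 0 starts a new record; 1 continues the list
                rep_levels.append(0 if i == 0 else 1)
                def_levels.append(2)
                values.append(v)
    return rep_levels, def_levels, values

print(encode_list_levels([[1, 2], [], None, [3]]))
# ([0, 1, 0, 0, 0], [2, 2, 1, 0, 2], [1, 2, 3])
```

Reassembling arrays from these level streams is the mirror-image problem on the read side.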
> 
> On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> 
>> Hi Igor,
>> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
>> might be a good task to pick up. I think it should be a relatively small
>> amount of code, so it is probably a good contribution to the project.
>> Once that is wrapped up we can see where we both are.
>> 
>> Cheers,
>> Micah
>> 
>> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com>
>> wrote:
>> 
>>> Hi Micah, I actually got involved with another personal project and had
>>> to postpone my contribution to arrow a bit. The good news is that I'm
>>> almost done with it, so I could help you with the read side very soon. Any
>>> ideas how we could coordinate this?
>>> 
>>> On Wed, Feb 26, 2020 at 9:06 PM, Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>> 
>>>> hi Micah -- great news on the level generation PR. I'll try to carve
>>>> out some time for reviewing over the coming week.
>>>> 
>>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Hi Igor,
>>>>> I was wondering if you have made any progress on this?
>>>>> 
>>>>> I posted a new PR [1] which I believe handles the difficult algorithmic
>>>>> part of writing.  There will be some follow-ups, but I think this PR
>>>>> might take a while to review, so I was thinking of starting to take a
>>>>> look at the read side if you haven't started yet, and circling back to
>>>>> the final integration for the write side once the PR is checked in.
>>>>> 
>>>>> Thanks,
>>>>> Micah
>>>>> 
>>>>> [1] https://github.com/apache/arrow/pull/6490
>>>>> 
>>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi, I would love to help with this issue. I'm aware that this is a
>>>>>> huge task for a first contribution to arrow, but I feel that I could
>>>>>> help with the read path.
>>>>>> Reading parquet seems like an extremely complex task, since both
>>>>>> hive[0] and spark[1] tried to implement a "vectorized" version and
>>>>>> both stopped short of supporting complex types.
>>>>>> I wanted to at least give it a try and find out where the challenge
>>>>>> lies.
>>>>>> 
>>>>>> Since you guys are much more familiar with the current code base, I
>>>>>> could use some starting tips so I don't fall into common pitfalls and
>>>>>> whatnot.
>>>>>> 
>>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
>>>>>> [1]
>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
>>>>>> 
>>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
>>>>>>> Just to give an update.  I've been a little bit delayed, but my
>>>>>>> progress is as follows:
>>>>>>> 1.  Had 1 PR merged that will exercise basic end-to-end tests.
>>>>>>> 2.  Have another PR open that allows a configuration option in C++
>>>>>>> to determine which algorithm version to use for reading/writing:
>>>>>>> the existing version and the new version supporting complex nested
>>>>>>> arrays.  I think a large amount of code will be reused/delegated
>>>>>>> to, but I will err on the side of not touching the existing
>>>>>>> code/algorithms so that any errors in the implementation or
>>>>>>> performance regressions can hopefully be mitigated at runtime.  I
>>>>>>> expect that in later releases (once the code has "baked") it will
>>>>>>> become a no-op.
>>>>>>> 3.  Started coding the write path.
>>>>>>> 
>>>>>>> Which leaves:
>>>>>>> 1.  Finishing the write path (I estimate 2-3 weeks to be code
>>>>>>> complete).
>>>>>>> 2.  Implementing the read path.
>>>>>>> 
>>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
>>>>>>> to contribute.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>> 
>>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Wes,
>>>>>>>> I'm still interested in doing the work.  But I don't want to hold
>>>>>>>> anybody up if they have bandwidth.
>>>>>>>> 
>>>>>>>> In order to actually make progress on this, my plan will be to:
>>>>>>>> 1.  Help with the current Java review backlog through early next
>>>>>>>> week or so (this has been taking the majority of my time allocated
>>>>>>>> for Arrow contributions for the last 6 months or so).
>>>>>>>> 2.  Shift all my attention to trying to get this done (this means
>>>>>>>> no reviews other than closing out existing ones that I've started
>>>>>>>> until it is done).  Hopefully, other Java committers can help
>>>>>>>> shrink the backlog further (Jacques, thanks for your recent
>>>>>>>> efforts here).
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>> 
>>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> hi folks,
>>>>>>>>> 
>>>>>>>>> I think we have reached a point where the incomplete C++ Parquet
>>>>>>>>> nested data assembly/disassembly is harming the value of several
>>>>>>>>> other parts of the project, for example the Datasets API.  As
>>>>>>>>> another example, it's possible to ingest nested data from JSON
>>>>>>>>> but not write it to Parquet in general.
>>>>>>>>> 
>>>>>>>>> Implementing the nested data read and write path completely is a
>>>>>>>>> difficult project requiring at least several weeks of dedicated
>>>>>>>>> work, so it's not so surprising that it hasn't been accomplished
>>>>>>>>> yet.  I know that several people have expressed interest in
>>>>>>>>> working on it, but I would like to see if anyone would be able to
>>>>>>>>> volunteer a commitment of time and guess at a rough timeline for
>>>>>>>>> when this work could be done.  It seems to me that if this slips
>>>>>>>>> beyond 2020 it will significantly diminish the value being
>>>>>>>>> created by other parts of the project.
>>>>>>>>> 
>>>>>>>>> Since I'm pretty familiar with all the Parquet code, I'm one
>>>>>>>>> candidate person to take on this project (and I can dedicate the
>>>>>>>>> time, but it would come at the expense of other projects where I
>>>>>>>>> can also be useful).  But Micah and others expressed interest in
>>>>>>>>> working on it, so I wanted to have a discussion about it to see
>>>>>>>>> what others think.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Wes
>>>>>>>>> 
>>>> 
>>> 
