hi Micah,

I'm glad that we have the write side of nested completed for 0.17.0.

As far as completing the read side and then implementing sufficient
testing to exercise corner cases in end-to-end reads/writes, do you
anticipate being able to work on this in the next 4-6 weeks (obviously
the state of the world has affected everyone's availability /
bandwidth)? I ask because someone from my team (or I) may be
able to get involved and help this move along. It'd be great to have
this 100% completed and checked off our list for the next release
(i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration
tests get completed also).

thanks
Wes

On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
> I'd like to avoid a separate branch if possible.  I'm willing to close the
> open PR until I'm sure it is needed, but I'm hoping that keeping PRs as small
> and focused as possible, with performance testing along the way, will be a
> better reviewer and developer experience here.
>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>
> Hopefully Igor can help out; otherwise I'll take up the read path after I
> finish the write path.
>
> -Micah
>
> On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Micah
>>
>> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Just to give an update.  I've been a little bit delayed, but my progress is
>> > as follows:
>> > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
>> > 2.  Have another PR open that allows a configuration option in C++ to
>> > determine which algorithm version to use for reading/writing: the existing
>> > version or the new version that supports complex nested arrays.  I think a
>> > large amount of code will be reused/delegated to, but I will err on the side
>> > of not touching the existing code/algorithms so that any errors in the
>> > implementation or performance regressions can hopefully be mitigated at
>> > runtime.  I expect that in later releases (once the code has "baked") the
>> > option will become a no-op.
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
>> > 3.  Started coding the write path.
>> >
>> > Which leaves:
>> > 1.  Finishing the write path (I estimate 2-3 weeks) to be code complete
>> > 2.  Implementing the read path.
>>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>>
>> > Again, I'm happy to collaborate if people have bandwidth and want to
>> > contribute.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > > Hi Wes,
>> > > I'm still interested in doing the work, but I don't want to hold anybody
>> > > up if they have bandwidth.
>> > >
>> > > In order to actually make progress on this, my plan will be to:
>> > > 1.  Help with the current Java review backlog through early next week or
>> > > so (this has been taking the majority of my time allocated for Arrow
>> > > contributions for the last 6 months or so).
>> > > 2.  Shift all my attention to trying to get this done (this means no
>> > > reviews other than closing out existing ones that I've started until it
>> > > is done).  Hopefully, other Java committers can help shrink the backlog
>> > > further (Jacques, thanks for your recent efforts here).
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> > >
>> > >> hi folks,
>> > >>
>> > >> I think we have reached a point where the incomplete C++ Parquet
>> > >> nested data assembly/disassembly is harming the value of several
>> > >> other parts of the project, for example the Datasets API. As another
>> > >> example, it's possible to ingest nested data from JSON but not write
>> > >> it to Parquet in general.
>> > >>
>> > >> Implementing the nested data read and write path completely is a
>> > >> difficult project requiring at least several weeks of dedicated work,
>> > >> so it's not so surprising that it hasn't been accomplished yet. I know
>> > >> that several people have expressed interest in working on it, but I
>> > >> would like to see if anyone would be able to volunteer a commitment of
>> > >> time and guess on a rough timeline when this work could be done. It
>> > >> seems to me if this slips beyond 2020 it will significantly diminish the
>> > >> value being created by other parts of the project.
>> > >>
>> > >> Since I'm pretty familiar with all the Parquet code I'm one candidate
>> > >> person to take on this project (and I can dedicate the time, but it
>> > >> would come at the expense of other projects where I can also be
>> > >> useful). But Micah and others expressed interest in working on it, so
>> > >> I wanted to have a discussion about it to see what others think.
>> > >>
>> > >> Thanks
>> > >> Wes
>> > >>
>> > >
