Hi Igor,
If you have the time, https://issues.apache.org/jira/browse/ARROW-7960 might be a good task to pick up for this. I think it should be a relatively small amount of code, so it is probably a good contribution to the project. Once that is wrapped up we can see where we both are.
Cheers,
Micah

On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com> wrote:

> Hi Micah, I actually got involved with another personal project and had to postpone my contribution to arrow a bit. The good news is that I'm almost done with it, so I could help you with the read side very soon. Any ideas how we could coordinate this?
>
> On Wed, Feb 26, 2020 at 21:06, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Micah -- great news on the level generation PR. I'll try to carve out some time for reviewing over the coming week.
> >
> > On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Igor,
> > > I was wondering if you have made any progress on this?
> > >
> > > I posted a new PR [1] which I believe handles the difficult algorithmic part of writing. There will be some follow-ups but I think this PR might take a while to review, so I was thinking of starting to take a look at the read side if you haven't started yet, and circle back to the final integration for the write side once the PR is checked in.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/apache/arrow/pull/6490
> > >
> > > On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com> wrote:
> > >
> > > > Hi, I would love to help with this issue. I'm aware that this is a huge task for a first contribution to arrow, but I feel that I could help with the read path.
> > > > Reading parquet seems like an extremely complex task since both hive[0] and spark[1] tried to implement a "vectorized" version and both stopped short of supporting complex types.
> > > > I wanted to at least give it a try and find out where the challenge lies.
> > > >
> > > > Since you guys are much more familiar with the current code base, I could use some starting tips so I don't fall into common pitfalls and whatnot.
> > > >
> > > > [0] https://issues.apache.org/jira/browse/HIVE-18576
> > > > [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > > >
> > > > On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> > > >
> > > > > Just to give an update. I've been a little bit delayed, but my progress is as follows:
> > > > > 1. Had 1 PR merged that will exercise basic end-to-end tests.
> > > > > 2. Have another PR open that allows a configuration option in C++ to determine which algorithm version to use for reading/writing: the existing version or the new version, which supports complex-nested arrays. I think a large amount of code will be reused/delegated to, but I will err on the side of not touching the existing code/algorithms so that any errors in the implementation or performance regressions can hopefully be mitigated at runtime. I expect that in later releases (once the code has "baked") this will become a no-op.
> > > > > 3. Started coding the write path.
> > > > >
> > > > > Which leaves:
> > > > > 1. Finishing the write path (I estimate 2-3 weeks) to be code complete.
> > > > > 2. Implementing the read path.
> > > > >
> > > > > Again, I'm happy to collaborate if people have bandwidth and want to contribute.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
> > > > >
> > > > > > Hi Wes,
> > > > > > I'm still interested in doing the work. But I don't want to hold anybody up if they have bandwidth.
> > > > > >
> > > > > > In order to actually make progress on this, my plan will be to:
> > > > > > 1.
> > > > > > Help with the current Java review backlog through early next week or so (this has been taking the majority of my time allocated for Arrow contributions for the last 6 months or so).
> > > > > > 2. Shift all my attention to trying to get this done (this means no reviews other than closing out existing ones that I've started until it is done). Hopefully, other Java committers can help shrink the backlog further (Jacques, thanks for your recent efforts here).
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> > > > > >
> > > > > > > hi folks,
> > > > > > >
> > > > > > > I think we have reached a point where the incomplete C++ Parquet nested data assembly/disassembly is harming the value of several other parts of the project, for example the Datasets API. As another example, it's possible to ingest nested data from JSON but not write it to Parquet in general.
> > > > > > >
> > > > > > > Implementing the nested data read and write path completely is a difficult project requiring at least several weeks of dedicated work, so it's not so surprising that it hasn't been accomplished yet. I know that several people have expressed interest in working on it, but I would like to see if anyone would be able to volunteer a commitment of time and guess on a rough timeline when this work could be done.
> > > > > > > It seems to me if this slips beyond 2020 it will significantly diminish the value being created by other parts of the project.
> > > > > > >
> > > > > > > Since I'm pretty familiar with all the Parquet code I'm one candidate person to take on this project (and I can dedicate the time, but it would come at the expense of other projects where I can also be useful). But Micah and others expressed interest in working on it, so I wanted to have a discussion about it to see what others think.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Wes