Hi Anoop,
I think a contribution would be welcome.  There was a recent discussion
thread on what would be expected from new "readers" for Arrow data in Java
[1].  I think it's worth reading through, but my recollections of the
highlights are:
1.  A short design sketch in the JIRA that will track the work.
2.  Off-heap data-structures as much as possible
3.  An interface that allows predicate push-down, column projection, and
specifying the batch sizes of reads.  I think there is probably some
interplay here between row-group size and batch size.  It might be worth
thinking about this up front and mentioning it in the design.
4.  Performant (since we are going from columnar->columnar, it should be
faster than parquet-mr and on par with or better than Spark's
implementation, which I believe also goes from columnar to columnar).
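To make point 3 concrete, here is a rough sketch of the shape such a reader
interface might take, plus a toy in-memory implementation to show the batch
iteration.  All of these names (ParquetArrowReader, ToyReader, etc.) are
hypothetical; nothing like this exists in Arrow Java today, and a real
implementation would hand back off-heap Arrow vectors rather than int arrays:

```java
import java.util.Arrays;
import java.util.List;

public class ReaderSketch {
    // Hypothetical reader interface covering the three requirements above.
    interface ParquetArrowReader {
        void project(List<String> columns);  // column projection
        void setBatchSize(int rows);         // batch sizing (vs. row-group size)
        int[] nextBatch();                   // stand-in for an Arrow batch; null when done
    }

    // Toy implementation backed by an int array standing in for one column chunk.
    static class ToyReader implements ParquetArrowReader {
        private final int[] data;
        private int pos = 0;
        private int batchSize = 1024;

        ToyReader(int[] data) { this.data = data; }

        public void project(List<String> columns) { /* no-op in the toy */ }
        public void setBatchSize(int rows) { this.batchSize = rows; }

        public int[] nextBatch() {
            if (pos >= data.length) return null;
            int n = Math.min(batchSize, data.length - pos);
            int[] batch = Arrays.copyOfRange(data, pos, pos + n);
            pos += n;
            return batch;
        }
    }

    public static void main(String[] args) {
        ParquetArrowReader reader = new ToyReader(new int[]{1, 2, 3, 4, 5});
        reader.setBatchSize(2);
        int batches = 0;
        for (int[] b = reader.nextBatch(); b != null; b = reader.nextBatch()) {
            batches++;
        }
        System.out.println("batches=" + batches);  // 5 rows at batch size 2 -> 3 batches
    }
}
```

A predicate-push-down method would slot in alongside project(), letting the
reader skip whole row groups based on column-chunk statistics before decoding.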

Answers to specific questions below.

Thanks,
Micah

To help me get started, are there any pointers on how the C++ or Rust
> implementations currently read Parquet into Arrow?

I'm not sure about the Rust code, but the C++ code is located at [2]; it
has been undergoing some recent refactoring (and I think Wes might have 1
or 2 changes still to make).  It doesn't yet fully support nested data
types (e.g. structs).

Are they reading Parquet row-by-row and building Arrow batches or are there
> better ways of implementing this?

I believe the implementations should be reading a row group at a time,
column by column.  Spark potentially has an implementation that already
does this.
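The loop structure for that pattern can be sketched as below.  The nested
arrays stand in for a Parquet file (file[rowGroup][column][row]); a real
reader would decode each column chunk into an Arrow vector, ideally off-heap,
rather than cloning int arrays:

```java
public class ColumnarReadSketch {
    public static void main(String[] args) {
        // file[rowGroup][column][row]: two row groups, two int columns.
        int[][][] file = {
            {{1, 2, 3}, {10, 20, 30}},  // row group 0: 3 rows
            {{4, 5},    {40, 50}},      // row group 1: 2 rows
        };

        int rowGroups = 0;
        int rows = 0;
        for (int[][] rowGroup : file) {
            // One Arrow batch per row group: each column chunk is copied
            // contiguously, instead of assembling rows one at a time
            // across columns.
            int[][] batch = new int[rowGroup.length][];
            for (int col = 0; col < rowGroup.length; col++) {
                batch[col] = rowGroup[col].clone();
            }
            rowGroups++;
            rows += batch[0].length;
        }
        System.out.println("rowGroups=" + rowGroups + " rows=" + rows);
    }
}
```

The key point is the loop order: row group outermost, then column, so each
column chunk is read in one contiguous pass instead of row by row.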


[1]
https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow

On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <anoop.k.john...@gmail.com>
wrote:

> Thanks for the response Micah. I could implement this and contribute to
> Arrow Java. To help me get started, are there any pointers on how the C++
> or Rust implementations currently read Parquet into Arrow? Are they reading
> Parquet row-by-row and building Arrow batches or are there better ways of
> implementing this?
>
> On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Anoop,
>> There isn't currently anything in the Arrow Java library that does this.
>> It is something that I think we want to add at some point.   Dremio [1]
>> has
>> some Parquet related code, but I haven't looked at it to understand how
>> easy it is to use as a standalone library and whether it supports
>> predicate
>> push-down/column selection.
>>
>> Thanks,
>> Micah
>>
>> [1]
>>
>> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
>>
>> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <anoop.k.john...@gmail.com>
>> wrote:
>>
>> > Arrow Newbie here.  What is the recommended way to convert Parquet data
>> > into Arrow, preferably doing predicate/column pushdown?
>> >
>> > One can implement this as custom code using the Parquet API, and
>> re-encode
>> > it in Arrow using the Arrow APIs, but is this supported by Arrow out of
>> the
>> > box?
>> >
>> > Thanks,
>> > Anoop
>> >
>>
>
