Hi:

I'm working on the Rust part and expect to finish it soon. I'm also
interested in the Java version because we are trying to embed Arrow in
Spark to implement vectorized processing. Maybe we can work together.

Micah Kornfield <emkornfi...@gmail.com> wrote on Mon, Aug 5, 2019 at 1:50 PM:

> Hi Anoop,
> I think a contribution would be welcome.  There was a recent discussion
> thread on what would be expected from new "readers" for Arrow data in Java
> [1].  I think it's worth reading through, but my recollection of the
> highlights is:
> 1.  A short design sketch in the JIRA that will track the work.
> 2.  Off-heap data-structures as much as possible
> 3.  An interface that allows predicate push-down, column projection, and
> specifying the batch size of reads.  There is probably some interplay
> between Parquet row-group size and batch size; it might be worth thinking
> about this up front and mentioning it in the design.
> 4.  Performant (since we are going from columnar to columnar, it should be
> faster than parquet-mr and on par with or better than Spark's
> implementation, which I believe also goes from columnar to columnar).
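As a rough illustration of point 3, a reader interface might thread projection, a predicate, and batch size through a single options object. This is only a sketch; every name here (`ScanOptions`, `ParquetArrowReader`) is hypothetical and not part of any existing Arrow or Parquet Java API:

```java
import java.util.List;

// Hypothetical options object: which columns to project, an opaque
// predicate for push-down, and the target Arrow batch size in rows.
// Batch size likely interacts with Parquet row-group size: a batch
// cannot usefully span row groups without extra buffering.
final class ScanOptions {
    final List<String> projectedColumns; // null would mean "all columns"
    final String predicate;              // placeholder for a real filter expression
    final int batchSizeRows;

    ScanOptions(List<String> projectedColumns, String predicate, int batchSizeRows) {
        this.projectedColumns = projectedColumns;
        this.predicate = predicate;
        this.batchSizeRows = batchSizeRows;
    }
}

// Hypothetical reader: an iterator of Arrow record batches. In a real
// implementation, next() would return a VectorSchemaRoot backed by
// off-heap vectors (point 2 above) rather than a plain Object.
interface ParquetArrowReader extends AutoCloseable {
    boolean hasNext();
    Object next();
}
```

Making the batch size explicit in the options, rather than implied by row-group size, is one way to surface the interplay mentioned in point 3 at the API boundary.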
>
> Answers to specific questions below.
>
> Thanks,
> Micah
>
> To help me get started, are there any pointers on how the C++ or Rust
> > implementations currently read Parquet into Arrow?
>
> I'm not sure about the Rust code, but the C++ code is located at [2]; it
> has been undergoing some recent refactoring (and I think Wes might have one
> or two changes still to make).  It doesn't yet fully support nested data
> types (e.g. structs).
>
> Are they reading Parquet row-by-row and building Arrow batches or are there
> > better ways of implementing this?
>
> I believe the implementations should be reading a row-group at a time
> column by column.  Spark potentially has an implementation that already
> does this.
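The row-group-at-a-time, column-by-column loop can be sketched in plain Java. `ColumnChunk`, `RowGroup`, and `ColumnarReader` below are hypothetical stand-ins with no Parquet or Arrow dependency; in real code the decoded arrays would be Parquet column chunks filling Arrow vectors:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal model of a Parquet file: a list of row groups,
// each holding one chunk of values per column.
final class ColumnChunk {
    final int[] values; // stand-in for encoded column data
    ColumnChunk(int[] values) { this.values = values; }
}

final class RowGroup {
    final List<ColumnChunk> columns;
    RowGroup(List<ColumnChunk> columns) { this.columns = columns; }
}

final class ColumnarReader {
    // Decode one row group column by column, never materializing rows.
    // The projection list enables column selection: chunks for columns
    // not in the projection are skipped entirely.
    static List<int[]> readRowGroup(RowGroup rg, List<Integer> projection) {
        List<int[]> batch = new ArrayList<>();
        for (int col : projection) {
            batch.add(rg.columns.get(col).values.clone());
        }
        return batch;
    }
}
```

The key point is that data moves column-chunk by column-chunk, which is why a columnar-to-columnar reader can avoid the per-row assembly cost that a row-by-row approach would pay.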
>
>
> [1]
>
> https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> [2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
>
> On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <anoop.k.john...@gmail.com>
> wrote:
>
> > Thanks for the response, Micah. I could implement this and contribute to
> > Arrow Java. To help me get started, are there any pointers on how the C++
> > or Rust implementations currently read Parquet into Arrow? Are they
> > reading Parquet row-by-row and building Arrow batches, or are there
> > better ways of implementing this?
> >
> > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Hi Anoop,
> >> There isn't currently anything in the Arrow Java library that does this.
> >> It is something that I think we want to add at some point.  Dremio [1]
> >> has some Parquet-related code, but I haven't looked at it to understand
> >> how easy it is to use as a standalone library and whether it supports
> >> predicate push-down/column selection.
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1]
> >>
> >>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> >>
> >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <anoop.k.john...@gmail.com>
> >> wrote:
> >>
> >> > Arrow newbie here.  What is the recommended way to convert Parquet
> >> > data into Arrow, preferably doing predicate/column pushdown?
> >> >
> >> > One can implement this as custom code using the Parquet API and
> >> > re-encode it in Arrow using the Arrow APIs, but is this supported by
> >> > Arrow out of the box?
> >> >
> >> > Thanks,
> >> > Anoop
> >> >
> >>
> >
>
