Bumping this. We may have an upcoming use case for this as well. Does anyone know if someone is actively working on this? I also heard that Dremio has internally implemented a performant Parquet-to-Arrow reader. Is there any plan to open-source it? That could save us a lot of work.
Thanks,
Chao

On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> Hi:
>
> I'm working on the Rust part and expect to finish this soon. I'm also
> interested in the Java version because we are trying to embed Arrow in
> Spark to implement vectorized processing. Maybe we can work together.
>
> Micah Kornfield <emkornfi...@gmail.com> wrote on Mon, Aug 5, 2019 at 1:50 PM:
>
> > Hi Anoop,
> > I think a contribution would be welcome. There was a recent discussion
> > thread on what would be expected from new "readers" for Arrow data in
> > Java [1]. I think it's worth reading through, but my recollection of
> > the highlights is:
> > 1. A short design sketch in the JIRA that will track the work.
> > 2. Off-heap data structures as much as possible.
> > 3. An interface that allows predicate push-down, column projection, and
> > specifying the batch sizes of reads. There is probably some interplay
> > here between row-group size and batch size. It might be worth thinking
> > about this up front and mentioning it in the design.
> > 4. Performance (since we are going from columnar to columnar, it should
> > be faster than Parquet-MR and on par with or better than Spark's
> > implementation, which I believe also goes from columnar to columnar).
> >
> > Answers to specific questions below.
> >
> > Thanks,
> > Micah
> >
> > > To help me get started, are there any pointers on how the C++ or Rust
> > > implementations currently read Parquet into Arrow?
> >
> > I'm not sure about the Rust code, but the C++ code is located at [2]. It
> > has been undergoing some recent refactoring (and I think Wes might have
> > one or two changes still to make). It doesn't yet fully support nested
> > data types (e.g. structs).
> >
> > > Are they reading Parquet row-by-row and building Arrow batches, or are
> > > there better ways of implementing this?
> >
> > I believe the implementations should be reading a row group at a time,
> > column by column.
> > Spark potentially has an implementation that already does this.
> >
> > [1] https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > [2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> >
> > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> >
> > > Thanks for the response, Micah. I could implement this and contribute
> > > it to Arrow Java. To help me get started, are there any pointers on how
> > > the C++ or Rust implementations currently read Parquet into Arrow? Are
> > > they reading Parquet row-by-row and building Arrow batches, or are
> > > there better ways of implementing this?
> > >
> > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > >> Hi Anoop,
> > >> There isn't currently anything in the Arrow Java library that does
> > >> this. It is something that I think we want to add at some point.
> > >> Dremio [1] has some Parquet-related code, but I haven't looked at it
> > >> to understand how easy it is to use as a standalone library and
> > >> whether it supports predicate push-down/column selection.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > >>
> > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> > >>
> > >> > Arrow newbie here. What is the recommended way to convert Parquet
> > >> > data into Arrow, preferably doing predicate/column pushdown?
> > >> >
> > >> > One can implement this as custom code using the Parquet API and
> > >> > re-encode it in Arrow using the Arrow APIs, but is this supported
> > >> > by Arrow out of the box?
> > >> >
> > >> > Thanks,
> > >> > Anoop