Bumping this. We may have an upcoming use case for this as well. Does anyone know if someone is actively working on this? I also heard that Dremio has internally implemented a performant Parquet-to-Arrow reader. Is there any plan to open-source it? That could save us a lot of work.
Thanks,
Chao

On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
> Hi:
>
> I'm working on the Rust part and expect to finish this soon. I'm also
> interested in the Java version because we are trying to embed Arrow in
> Spark to implement vectorized processing. Maybe we can work together.
>
> Micah Kornfield <emkornfi...@gmail.com> wrote on Mon, Aug 5, 2019 at 1:50 PM:
>
> > Hi Anoop,
> > I think a contribution would be welcome. There was a recent discussion
> > thread on what would be expected from new "readers" for Arrow data in
> > Java [1]. I think it's worth reading through, but my recollection of
> > the highlights is:
> > 1. A short design sketch in the JIRA that will track the work.
> > 2. Off-heap data structures as much as possible.
> > 3. An interface that allows predicate push-down, column projection, and
> > specifying the batch sizes of reads. There is probably some interplay
> > here between row-group size and batch size. It might be worth thinking
> > about this up front and mentioning it in the design.
> > 4. Performance (since we are going from columnar to columnar, it should
> > be faster than Parquet-MR and on par with or better than Spark's
> > implementation, which I believe also goes from columnar to columnar).
> >
> > Answers to specific questions below.
> >
> > Thanks,
> > Micah
> >
> > > To help me get started, are there any pointers on how the C++ or Rust
> > > implementations currently read Parquet into Arrow?
> >
> > I'm not sure about the Rust code, but the C++ code is located at [2]. It
> > has been undergoing some recent refactoring (and I think Wes might have
> > one or two changes still to make). It doesn't yet fully support nested
> > data types (e.g. structs).
> >
> > > Are they reading Parquet row-by-row and building Arrow batches, or are
> > > there better ways of implementing this?
> >
> > I believe the implementations should be reading a row group at a time,
> > column by column.
> > Spark potentially has an implementation that already does this.
> >
> > [1] https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > [2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> >
> > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> >
> > > Thanks for the response, Micah. I could implement this and contribute
> > > it to Arrow Java. To help me get started, are there any pointers on how
> > > the C++ or Rust implementations currently read Parquet into Arrow? Are
> > > they reading Parquet row-by-row and building Arrow batches, or are
> > > there better ways of implementing this?
> > >
> > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > >> Hi Anoop,
> > >> There isn't currently anything in the Arrow Java library that does
> > >> this. It is something that I think we want to add at some point.
> > >> Dremio [1] has some Parquet-related code, but I haven't looked at it
> > >> to understand how easy it is to use as a standalone library and
> > >> whether it supports predicate push-down/column selection.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > >>
> > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> > >>
> > >> > Arrow newbie here. What is the recommended way to convert Parquet
> > >> > data into Arrow, preferably doing predicate/column pushdown?
> > >> >
> > >> > One can implement this as custom code using the Parquet API and
> > >> > re-encode it in Arrow using the Arrow APIs, but is this supported
> > >> > by Arrow out of the box?
> > >> >
> > >> > Thanks,
> > >> > Anoop