Hi Rusty,

Note that we discussed Iceberg a while ago [1]. I don't think we've discussed
Hudi in any depth.

As I see it, we are waiting on three things:

1. Someone willing to move the Iceberg / Hudi integration forward.
2. The Iceberg and Hudi projects need native libraries that we can use. The
base implementations are all Java, which isn't practical to integrate with
our C++ implementation (and the Python/R/Ruby bindings). But I think these
formats are complex enough that it's best to develop the core
implementation within the respective community, rather than within the
Arrow repo. There was a discussion about starting a C++/Rust implementation
for Iceberg [2], but I haven't seen any progress so far. I haven't been
watching Hudi.
3. We need a model for extending Arrow C++ datasets in separate packages,
or else we'll contribute to the package size problem you mentioned in your
other thread [3]. (A rough sketch of what I mean follows this list.)
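
To make (3) a bit more concrete: from Python, a separate package can
already expose a table format by resolving the set of live data files
itself and handing PyArrow a dataset built from just those fragments;
what's missing is an equivalent extension point for the C++ datasets.
A minimal sketch using the existing PyArrow dataset API (the paths and
the live_files list are hypothetical):

    import pyarrow.dataset as ds
    from pyarrow import fs

    # Hypothetical: the table format's own metadata tells us which
    # Parquet files are currently live in the table.
    live_files = [
        "/data/my_table/part-00000.parquet",
        "/data/my_table/part-00001.parquet",
    ]

    filesystem = fs.LocalFileSystem()
    parquet_format = ds.ParquetFileFormat()
    fragments = [
        parquet_format.make_fragment(path, filesystem=filesystem)
        for path in live_files
    ]

    # In practice the schema would come from the table format's metadata
    # rather than from the first file.
    dataset = ds.FileSystemDataset(
        fragments, fragments[0].physical_schema, parquet_format, filesystem
    )
    table = dataset.to_table()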

As a personal project, I've been working on integrating the Delta Lake Rust
implementation [4] with PyArrow. The community in that repo is pretty
invested in Arrow, and others there are working on integrations with the
Rust query engines (such as Polars and DataFusion). Early next year I hope
to extend that work to C++ and R, hopefully paving a path to solving issue
(3) for the other table formats.
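
For reference, reading from Python looks roughly like the sketch below,
assuming the deltalake package published from [4] (the table URI and
column names here are made up):

    from deltalake import DeltaTable

    dt = DeltaTable("s3://my-bucket/path/to/table")  # hypothetical URI

    # Exposes the current table snapshot as a PyArrow dataset, so the
    # usual Arrow scanning machinery (filters, projection) applies.
    dataset = dt.to_pyarrow_dataset()
    table = dataset.to_table(columns=["id", "value"])  # hypothetical columns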

Best,

Will Jones

[1] https://issues.apache.org/jira/browse/ARROW-15135
[2] https://lists.apache.org/thread/lf8gw4yk9c6l580o6k7mobg2y91rpjvp
[3] https://lists.apache.org/thread/mdr05pjzlq01dwwcwz21sz6ol3dkkylz
[4] https://github.com/delta-io/delta-rs

On Mon, Oct 3, 2022 at 5:25 AM Rusty Conover <ru...@conover.me.invalid>
wrote:

> Hi Arrow Team,
>
> Arrow is fantastic for manipulating the Parquet file format.
>
> There is an increasing desire to be able to update, delete, and insert
> the rows stored in Parquet files without rewriting the Parquet files in
> their entirety.  It is not uncommon to have gigabytes or petabytes of
> data stored in Parquet files, so having to rewrite all of it for an update
> is non-trivial.
>
> The following projects promote that they can bring update/delete/insert
> support to Parquet:
>
> * Apache Hudi - https://hudi.apache.org/
> * Apache Iceberg - https://iceberg.apache.org/
>
> These projects combine a Parquet file with one or more "update" files
> stored using ORC or Avro. Clients that want to read the rows combine the
> data stored in the Parquet file with the "update" files to determine which
> rows exist.  Occasionally, the formats "compact" the accumulated updates
> and write a new "optimized" Parquet file.
>
> Both projects require Apache Spark in order to write data.
>
> I'd like to be able to use these formats in any language that Arrow
> supports, and I'd like to avoid the complexity of operating a Spark
> cluster.
>
> Since Arrow has support for tabular datasets and supports Parquet, is there
> anything on the roadmap for Arrow to support these formats?
>
> These formats will most likely become increasingly popular in various
> industries.
>
> Rusty
>
