Hi Adrien,

We at LinkedIn went through a similar thought process. Given that our Hive deployment is not that large, we are considering deprecating Hive and asking our users to move to Spark [since Spark supports HiveQL].
I'm guessing we'd have to invest in Spark's catalog, but we haven't investigated this yet.

Best.

On Wed, Jul 24, 2019 at 1:53 PM Adrien Guillo <adrien.gui...@airbnb.com.invalid> wrote:

> Hi Iceberg folks,
>
> In the last few months, we (the data infrastructure team at Airbnb) have
> been closely following the project. We are currently evaluating potential
> strategies to migrate our data warehouse to Iceberg. However, we have a
> very large Hive deployment, which means we can't really do so without
> support for Iceberg tables in Hive.
>
> We have been thinking about implementation strategies. Here are some
> thoughts that we would like to share with you:
>
> – Implementing a new `RawStore`
>
> This is something that has been mentioned several times on the mailing
> list, and it seems to indicate that adding support for Iceberg tables in
> Hive could be achieved without client-side modifications. Does that mean
> that the Metastore is the only process manipulating Iceberg metadata
> (snapshots, manifests)? Does that mean, for instance, that the
> `listPartition*` calls to the Metastore return the DataFiles associated
> with each partition? Per our understanding, supporting Iceberg tables in
> Hive with this strategy will most likely require updating the `RawStore`
> interface AND will require at least some client-side changes. In addition,
> with this strategy the Metastore bears new responsibilities, which
> contradicts one of the Iceberg design goals: offloading more work to jobs
> and removing the Metastore as a bottleneck. In the Iceberg world, not much
> is needed from the Metastore: it just keeps track of the metadata location
> and provides a mechanism for atomically updating this location (basically,
> what is done in the `HiveTableOperations` class). We would like to design
> a solution that relies as little as possible on the Metastore, so that in
> the future we have the option to replace our fleet of Metastores with a
> simpler system.
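The commit mechanism described above — the Metastore only holds the current metadata file location and swaps it atomically — can be sketched as a toy model. All class and method names below are illustrative, not the actual Iceberg or Hive APIs; the real `HiveTableOperations` uses a Metastore lock plus a check of the base metadata location.

```java
/**
 * Toy model of the Metastore's only Iceberg duty: atomically swapping
 * the pointer to the table's current metadata file. A commit succeeds
 * only if the committer's base location is still the current one,
 * so a racing writer with a stale base must refresh and retry.
 */
class MetadataPointer {
    private String location;

    MetadataPointer(String initial) { location = initial; }

    synchronized String current() { return location; }

    /** Compare-and-swap on the metadata location. */
    synchronized boolean commit(String expectedBase, String newLocation) {
        if (!location.equals(expectedBase)) {
            return false;  // someone else committed in between
        }
        location = newLocation;
        return true;
    }
}

public class CommitDemo {
    public static void main(String[] args) {
        MetadataPointer ptr = new MetadataPointer("s3://wh/tbl/metadata/v1.json");

        // Writer A commits against the base it read.
        boolean a = ptr.commit("s3://wh/tbl/metadata/v1.json",
                               "s3://wh/tbl/metadata/v2.json");
        // Writer B raced A with the same stale base and must retry.
        boolean b = ptr.commit("s3://wh/tbl/metadata/v1.json",
                               "s3://wh/tbl/metadata/v2-conflict.json");

        System.out.println(a + " " + b + " " + ptr.current());
        // -> true false s3://wh/tbl/metadata/v2.json
    }
}
```

Note how little the "Metastore" side has to do here — no knowledge of snapshots or manifests — which is exactly why it could later be replaced with any system offering an atomic swap.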
> – Implementing a new `HiveStorageHandler`
>
> We are working on implementing custom `InputFormat` and `OutputFormat`
> classes for Iceberg (more on that in the next paragraph), and they would
> fit in nicely with the `HiveStorageHandler` and
> `HiveStoragePredicateHandler` interfaces. However, the `HiveMetaHook`
> interface does not seem rich enough to accommodate all the workflows: for
> instance, no hooks run on `ALTER ...` or `INSERT ...` commands.
>
> – Proof of concept
>
> We set out to write a proof of concept that would allow us to learn and
> experiment. We based our work on the 2.3 branch. Here's the state of the
> project and the paths we explored:
>
> DDL commands
> We support some commands such as `CREATE TABLE ...`, `DESC ...`, and
> `SHOW PARTITIONS`. They are all implemented in the client and mostly rely
> on the `HiveCatalog` class to do the work.
>
> Read path
> We are in the process of implementing a custom `FileInputFormat` that
> receives an Iceberg table identifier and a serialized `ExprNodeDesc`
> expression as input. This is similar in a lot of ways to what you can
> find in the `PigParquetReader` class in the `iceberg-pig` package or in
> the `HiveHBaseTableInputFormat` class in Hive.
>
> Write path
> We have made less progress on that front, but we see a path forward by
> implementing a custom `OutputFormat` that would keep track of the files
> being written and gather statistics. Then, each task can dump this
> information to HDFS. From there, the final Hive `MoveTask` can merge
> those "pre-manifest" files to create a new snapshot and commit the new
> version of the table.
>
> We hope that our observations will start a healthy conversation about
> supporting Iceberg tables in Hive :)
>
> Cheers,
> Adrien
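The write path in the quoted proof of concept — each task records the files it wrote plus statistics, and a final step merges those per-task "pre-manifest" files into one committed snapshot — can be sketched as follows. All type names here are hypothetical stand-ins, not real Iceberg classes; in a real implementation the merge step would live in (or be triggered by) Hive's `MoveTask`.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the proposed write path: every task emits a "pre-manifest"
 * listing the data files it produced, and a final merge step folds them
 * into a single snapshot that can then be committed.
 */
public class PreManifestDemo {
    record DataFile(String path, long recordCount) {}
    record PreManifest(String taskId, List<DataFile> files) {}
    record Snapshot(long snapshotId, List<DataFile> files, long totalRecords) {}

    /** Stand-in for the final MoveTask-like step that merges per-task
     *  pre-manifests and aggregates their statistics. */
    static Snapshot merge(long snapshotId, List<PreManifest> preManifests) {
        List<DataFile> all = new ArrayList<>();
        long records = 0;
        for (PreManifest pm : preManifests) {
            for (DataFile f : pm.files()) {
                all.add(f);
                records += f.recordCount();
            }
        }
        return new Snapshot(snapshotId, all, records);
    }

    public static void main(String[] args) {
        // Two tasks, each having dumped one pre-manifest (e.g. to HDFS).
        List<PreManifest> tasks = List.of(
            new PreManifest("task-0", List.of(new DataFile("part-0.parquet", 100))),
            new PreManifest("task-1", List.of(new DataFile("part-1.parquet", 250))));

        Snapshot s = merge(1L, tasks);
        System.out.println(s.files().size() + " files, " + s.totalRecords() + " records");
        // -> 2 files, 350 records
    }
}
```

The appeal of this shape is that all the expensive work (writing files, gathering stats, building manifests) happens in the tasks, and the final commit only needs the atomic metadata swap — consistent with the design goal of keeping the Metastore out of the data path.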