Hi Adrien,

We at LinkedIn went through a similar thought process. Given that our Hive deployment is not that large, we are considering deprecating Hive and asking our users to move to Spark [since Spark supports HiveQL].
I'm guessing we'd have to invest in Spark's catalog, but we haven't investigated this yet.

Best.

On Wed, Jul 24, 2019 at 1:53 PM Adrien Guillo <adrien.gui...@airbnb.com.invalid> wrote:

> Hi Iceberg folks,
>
> In the last few months, we (the data infrastructure team at Airbnb) have
> been closely following the project. We are currently evaluating potential
> strategies to migrate our data warehouse to Iceberg. However, we have a
> very large Hive deployment, which means we can't really do so without
> support for Iceberg tables in Hive.
>
> We have been thinking about implementation strategies. Here are some
> thoughts that we would like to share with you:
>
> – Implementing a new `RawStore`
>
> This is something that has been mentioned several times on the mailing
> list, and it seems to indicate that adding support for Iceberg tables in
> Hive could be achieved without client-side modifications. Does that mean
> that the Metastore is the only process manipulating Iceberg metadata
> (snapshots, manifests)? Does that mean, for instance, that the
> `listPartition*` calls to the Metastore return the DataFiles associated
> with each partition? Per our understanding, supporting Iceberg tables in
> Hive with this strategy will most likely require updating the `RawStore`
> interface AND will require at least some client-side changes. In addition,
> with this strategy the Metastore bears new responsibilities, which
> contradicts one of the Iceberg design goals: offloading more work to jobs
> and removing the Metastore as a bottleneck. In the Iceberg world, not much
> is needed from the Metastore: it just keeps track of the metadata location
> and provides a mechanism for atomically updating this location (basically,
> what is done in the `HiveTableOperations` class). We would like to design
> a solution that relies as little as possible on the Metastore, so that in
> the future we have the option to replace our fleet of Metastores with a
> simpler system.
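The commit mechanism described above — the Metastore only holds the current metadata file location and swaps it atomically — can be sketched as a toy model. All class and method names below are illustrative, not the actual Iceberg or Hive APIs; the real `HiveTableOperations` uses a Metastore lock plus a check of the base metadata location.

```java
/**
 * Toy model of the Metastore's only Iceberg duty: atomically swapping
 * the pointer to the table's current metadata file. A commit succeeds
 * only if the committer's base location is still the current one,
 * so a racing writer with a stale base must refresh and retry.
 */
class MetadataPointer {
    private String location;

    MetadataPointer(String initial) { location = initial; }

    synchronized String current() { return location; }

    /** Compare-and-swap on the metadata location. */
    synchronized boolean commit(String expectedBase, String newLocation) {
        if (!location.equals(expectedBase)) {
            return false;  // someone else committed in between
        }
        location = newLocation;
        return true;
    }
}

public class CommitDemo {
    public static void main(String[] args) {
        MetadataPointer ptr = new MetadataPointer("s3://wh/tbl/metadata/v1.json");

        // Writer A commits against the base it read.
        boolean a = ptr.commit("s3://wh/tbl/metadata/v1.json",
                               "s3://wh/tbl/metadata/v2.json");
        // Writer B raced A with the same stale base and must retry.
        boolean b = ptr.commit("s3://wh/tbl/metadata/v1.json",
                               "s3://wh/tbl/metadata/v2-conflict.json");

        System.out.println(a + " " + b + " " + ptr.current());
        // -> true false s3://wh/tbl/metadata/v2.json
    }
}
```

Note how little the "Metastore" side has to do here — no knowledge of snapshots or manifests — which is exactly why it could later be replaced with any system offering an atomic swap.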
> – Implementing a new `HiveStorageHandler`
>
> We are working on implementing custom `InputFormat` and `OutputFormat`
> classes for Iceberg (more on that in the next paragraph), and they would
> fit in nicely with the `HiveStorageHandler` and
> `HiveStoragePredicateHandler` interfaces. However, the `HiveMetaHook`
> interface does not seem rich enough to accommodate all the workflows: for
> instance, no hooks run on `ALTER ...` or `INSERT ...` commands.
>
> – Proof of concept
>
> We set out to write a proof of concept that would allow us to learn and
> experiment. We based our work on the 2.3 branch. Here's the state of the
> project and the paths we explored:
>
> DDL commands
> We support some commands such as `CREATE TABLE ...`, `DESC ...`, and
> `SHOW PARTITIONS`. They are all implemented in the client and mostly rely
> on the `HiveCatalog` class to do the work.
>
> Read path
> We are in the process of implementing a custom `FileInputFormat` that
> receives an Iceberg table identifier and a serialized `ExprNodeDesc`
> expression as input. This is similar in a lot of ways to what you can
> find in the `PigParquetReader` class in the `iceberg-pig` package or in
> the `HiveHBaseTableInputFormat` class in Hive.
>
> Write path
> We have made less progress on that front, but we see a path forward by
> implementing a custom `OutputFormat` that would keep track of the files
> being written and gather statistics. Then, each task can dump this
> information to HDFS. From there, the final Hive `MoveTask` can merge
> those "pre-manifest" files to create a new snapshot and commit the new
> version of the table.
>
> We hope that our observations will start a healthy conversation about
> supporting Iceberg tables in Hive :)
>
> Cheers,
> Adrien
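The write path in the quoted proof of concept — each task records the files it wrote plus statistics, and a final step merges those per-task "pre-manifest" files into one committed snapshot — can be sketched as follows. All type names here are hypothetical stand-ins, not real Iceberg classes; in a real implementation the merge step would live in (or be triggered by) Hive's `MoveTask`.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the proposed write path: every task emits a "pre-manifest"
 * listing the data files it produced, and a final merge step folds them
 * into a single snapshot that can then be committed.
 */
public class PreManifestDemo {
    record DataFile(String path, long recordCount) {}
    record PreManifest(String taskId, List<DataFile> files) {}
    record Snapshot(long snapshotId, List<DataFile> files, long totalRecords) {}

    /** Stand-in for the final MoveTask-like step that merges per-task
     *  pre-manifests and aggregates their statistics. */
    static Snapshot merge(long snapshotId, List<PreManifest> preManifests) {
        List<DataFile> all = new ArrayList<>();
        long records = 0;
        for (PreManifest pm : preManifests) {
            for (DataFile f : pm.files()) {
                all.add(f);
                records += f.recordCount();
            }
        }
        return new Snapshot(snapshotId, all, records);
    }

    public static void main(String[] args) {
        // Two tasks, each having dumped one pre-manifest (e.g. to HDFS).
        List<PreManifest> tasks = List.of(
            new PreManifest("task-0", List.of(new DataFile("part-0.parquet", 100))),
            new PreManifest("task-1", List.of(new DataFile("part-1.parquet", 250))));

        Snapshot s = merge(1L, tasks);
        System.out.println(s.files().size() + " files, " + s.totalRecords() + " records");
        // -> 2 files, 350 records
    }
}
```

The appeal of this shape is that all the expensive work (writing files, gathering stats, building manifests) happens in the tasks, and the final commit only needs the atomic metadata swap — consistent with the design goal of keeping the Metastore out of the data path.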