> With an Iceberg raw store, I suspect that you might not need a storage handler and could go straight to an input/output format. You would probably need an input and output format for each of the storage formats: Iceberg{Orc,Parquet,Avro}{Input,Output}Format.
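[For reference, Iceberg records each data file's format in the table metadata, so planning a scan over a single table can return ORC, Parquet, and Avro files side by side. A minimal sketch against the public Iceberg scan API; the catalog and table names are made up:]

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.io.CloseableIterable;

    class IcebergScanSketch {
      // Scan planning surfaces the format of every data file, so a reader has
      // to be prepared to open any of them.
      void planSplits(Catalog catalog) throws java.io.IOException {
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();
            switch (file.format()) {          // one table can mix all three
              case ORC:     /* open an ORC reader for file.path() */     break;
              case PARQUET: /* open a Parquet reader for file.path() */  break;
              case AVRO:    /* open an Avro reader for file.path() */    break;
              default: throw new UnsupportedOperationException(file.format().name());
            }
          }
        }
      }
    }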
I don't think that would work because Iceberg tables can use different storage formats within the same table or partition. I think an InputFormat would need to support all file formats.

On Wed, Aug 7, 2019 at 10:49 PM Owen O'Malley <owen.omal...@gmail.com> wrote:

> On Jul 24, 2019, at 22:52, Adrien Guillo <adrien.gui...@airbnb.com.invalid> wrote:
> >
> > Hi Iceberg folks,
> >
> > In the last few months, we (the data infrastructure team at Airbnb) have been closely following the project. We are currently evaluating potential strategies to migrate our data warehouse to Iceberg. However, we have a very large Hive deployment, which means we can’t really do so without support for Iceberg tables in Hive.
> >
> > We have been thinking about implementation strategies. Here are some thoughts that we would like to share with you:
> >
> > – Implementing a new `RawStore`
>
> My thought would be to use the embedded metastore with an Iceberg RawStore. That enables the client to do the work rather than pushing it to an external metastore.
>
> I expect that some new users would be able to just use Iceberg as their only metastore, but that others would want to have some databases in Hive layout and others in Iceberg. We could use a delegating RawStore for that.
>
> > This is something that has been mentioned several times on the mailing list and seems to indicate that adding support for Iceberg tables in Hive could be achieved without client-side modifications. Does that mean that the Metastore is the only process manipulating Iceberg metadata (snapshots, manifests)? Does that mean that, for instance, the `listPartition*` calls to the Metastore return the DataFiles associated with each partition? Per our understanding, supporting Iceberg tables in Hive with this strategy will most likely require updating the RawStore interface AND at least some client-side changes. In addition, with this strategy the Metastore bears new responsibilities, which contradicts one of the Iceberg design goals: offloading more work to jobs and removing the metastore as a bottleneck. In the Iceberg world, not much is needed from the Metastore: it just keeps track of the metadata location and provides a mechanism for atomically updating that location (basically, what is done in the `HiveTableOperations` class). We would like to design a solution that relies as little as possible on the Metastore so that in the future we have the option to replace our fleet of Metastores with a simpler system.
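[A conceptual sketch of that commit primitive, assuming the Hive Metastore client API and the `metadata_location` table property that Iceberg uses. The real `HiveTableOperations` also acquires a metastore lock and handles retries, which are omitted here:]

    import java.util.Objects;

    import org.apache.hadoop.hive.metastore.IMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    class MetadataPointerSwapSketch {
      // The metastore only stores a pointer to the current metadata file and
      // swaps it if it still matches the version the writer started from.
      private static final String METADATA_LOCATION = "metadata_location";

      void commit(IMetaStoreClient client, String db, String tbl,
                  String expectedLocation, String newLocation) throws Exception {
        Table hmsTable = client.getTable(db, tbl);
        String current = hmsTable.getParameters().get(METADATA_LOCATION);
        if (!Objects.equals(current, expectedLocation)) {
          // someone else committed first; the caller must refresh and retry
          throw new IllegalStateException("Metadata location changed: " + current);
        }
        hmsTable.getParameters().put(METADATA_LOCATION, newLocation);
        client.alter_table(db, tbl, hmsTable);  // persists the new pointer
      }
    }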
> > – Implementing a new `HiveStorageHandler`
>
> With an Iceberg raw store, I suspect that you might not need a storage handler and could go straight to an input/output format. You would probably need an input and output format for each of the storage formats: Iceberg{Orc,Parquet,Avro}{Input,Output}Format.
>
> .. Owen
>
> > We are working on implementing custom `InputFormat` and `OutputFormat` classes for Iceberg (more on that in the next paragraph), and they would fit in nicely with the `HiveStorageHandler` and `HiveStoragePredicateHandler` interfaces. However, the `HiveMetaHook` interface does not seem rich enough to accommodate all the workflows; for instance, no hooks run on `ALTER ...` or `INSERT ...` commands.
> >
> > – Proof of concept
> >
> > We set out to write a proof of concept that would allow us to learn and experiment. We based our work on the 2.3 branch. Here’s the state of the project and the paths we explored:
> >
> > DDL commands
> >
> > We support some commands such as `CREATE TABLE ...`, `DESC ...`, and `SHOW PARTITIONS`. They are all implemented in the client and mostly rely on the `HiveCatalog` class to do the work.
> >
> > Read path
> >
> > We are in the process of implementing a custom `FileInputFormat` that receives an Iceberg table identifier and a serialized expression `ExprNodeDesc` as input. This is similar in a lot of ways to what you can find in the `PigParquetReader` class in the `iceberg-pig` package or in the `HiveHBaseTableInputFormat` class in Hive.
> >
> > Write path
> >
> > We have made less progress on that front, but we see a path forward by implementing a custom `OutputFormat` that would keep track of the files that are being written and gather statistics. Then, each task can dump this information on HDFS. From there, the final Hive `MoveTask` can merge those “pre-manifest” files to create a new snapshot and commit the new version of a table.
> >
> > We hope that our observations will start a healthy conversation about supporting Iceberg tables in Hive :)
> >
> > Cheers,
> > Adrien

--
Ryan Blue
Software Engineer
Netflix
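[As a closing illustration of the write path described above: if each task records the files it wrote in a “pre-manifest”, the final commit could look roughly like the following, using Iceberg's append API. `TaskFileInfo` and the surrounding class are hypothetical holders, not part of either codebase; a partitioned table is assumed:]

    import java.util.List;

    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;

    class PreManifestCommitSketch {
      // Hypothetical per-file info each task dumps into its "pre-manifest".
      static class TaskFileInfo {
        String path;            // e.g. a part file written by one task
        FileFormat format;      // ORC, PARQUET, or AVRO
        long sizeInBytes;
        long recordCount;
        String partitionPath;   // e.g. "ds=2019-08-07"
      }

      void commitPreManifests(Table table, List<TaskFileInfo> files) {
        AppendFiles append = table.newAppend();
        for (TaskFileInfo f : files) {
          DataFile dataFile = DataFiles.builder(table.spec())
              .withPath(f.path)
              .withFormat(f.format)
              .withFileSizeInBytes(f.sizeInBytes)
              .withRecordCount(f.recordCount)
              .withPartitionPath(f.partitionPath)
              .build();
          append.appendFile(dataFile);
        }
        append.commit();  // one atomic snapshot for everything the job wrote
      }
    }

The point of collapsing the per-task files into a single commit() is that readers either see the whole job's output as one new snapshot or none of it.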