I think that this direction sounds reasonable. It makes sense to start
building the integration in Hive because it will be easier to iterate
there. Iceberg is quite different in some areas and I think that would
probably mean that Hive needs to change to provide a really great
experience. That was the case with Spark, too.

We will need to keep providing the Hive modules in Iceberg for quite some
time, but as Hive releases newer versions we can eventually remove them.
We did the same thing for Parquet support.

Thanks for leading the way on this, Peter!

On Wed, Mar 3, 2021 at 2:12 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Iceberg and Hive Teams,
>
> As some of you already know, we are working on making Iceberg available as
> a first-class storage layer for Hive.
>
> Folks on the Iceberg side did a good job of utilizing the existing Hive
> SerDe API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to their
> efforts we have read support for queries over Iceberg-backed Hive tables
> with predicate pushdown and column pruning. In the last few months we added
> basic write and DDL support, so now one can create an Iceberg-backed Hive
> table and insert data into it with Hive queries. The code for these
> features is in the iceberg repo and available through the released
> iceberg-mr-runtime.jar for everyone to try out.
>
> There are some important features where the current Hive query execution
> model and SerDe API are not enough to achieve the things we need. Just to
> name a few:
>
>    - CREATE TABLE AS ... - Here we need to create an Iceberg table first,
>    then write the data. Hive currently writes the data to a temporary dir and
>    uses MoveTask to move it to the final place.
>    - INSERT OVERWRITE ... - We need information about the jobs/tasks on the
>    HS2 side to commit the changes to an Iceberg table. This information is
>    not available at the moment in the DefaultHiveMetaHook.commitInsert method.
>    - We would like to extend the Hive query language with Iceberg-specific
>    bits, like time travel, Iceberg-specific partitioning, etc.
>
>
> We fully expect to find even more roadblocks as we progress with our
> roadmap. We might be able to work around the limitations with some hacky
> solutions, but those do not pave the way for a stable long-term integration.
> A good solution to this problem would be to extend the SerDe API and
> enhance the query execution logic based on the new SerDe API. This will be
> an iterative process where the API will be constantly evolving until we
> reach the "final" stable stage.
>
> To streamline the process above, we propose to create an
> iceberg-handler module in Hive and use the existing
> iceberg-mr/iceberg-hive3 Iceberg modules as a baseline for it. We can
> extend and use the new SerDe API in this new iceberg-handler module and
> iterate faster. When there is a Hive release we can decide our next steps
> based on the actual landscape, and in the meantime we can port the changes
> that do not require the new APIs between the two repos.
>
> I would like to hear both teams' opinions on the proposed solution.
>
> Thanks,
> Peter
>


-- 
Ryan Blue
Software Engineer
Netflix
