Hello Team,

I'm not sure how far out you want to scope this, but I think we have enough
sub-projects as it is within the Hive core project. Building the entire
project already takes a considerable amount of time.

Would it be possible to roll this out like Jackson or DataNucleus?

https://github.com/apache/hive
https://github.com/apache/hive-iceberg

Thanks.

On Wed, Mar 3, 2021 at 12:30 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I think that this direction sounds reasonable. It makes sense to start
> building the integration in Hive because it will be easier to iterate
> there. Iceberg is quite different in some areas and I think that would
> probably mean that Hive needs to change to provide a really great
> experience. That was the case with Spark, too.
>
> We will need to continue to provide the Hive module in Iceberg for quite
> some time, but as Hive releases newer versions we can eventually remove the
> modules. We did the same thing for Parquet support.
>
> Thanks for leading the way on this, Peter!
>
> On Wed, Mar 3, 2021 at 2:12 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
> > Hi Iceberg and Hive Teams,
> >
> > As some of you already know, we are working on making Iceberg available
> > as a first-class storage layer for Hive.
> >
> > Folks on the Iceberg side did a good job of utilizing the existing Hive
> > SerDe API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to their
> > efforts we have read support for queries over Iceberg-backed Hive tables,
> > with predicate pushdown and column pruning. In the last few months we
> > added basic write and DDL support, so now one can create an Iceberg-backed
> > Hive table and insert data into it with Hive queries. The code for these
> > features is in the Iceberg repo and available through the released
> > iceberg-mr-runtime.jar for everyone to try out.
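> >
> > To illustrate, the current functionality can be exercised with HiveQL
> > along these lines (a sketch only; the storage handler class name and the
> > exact setup may differ between versions, and iceberg-mr-runtime.jar must
> > be on the Hive classpath):
> >
> >     CREATE TABLE customers (id bigint, name string)
> >     STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
> >
> >     INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
> >
> >     -- Reads benefit from predicate pushdown and column pruning
> >     SELECT name FROM customers WHERE id = 1;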
> >
> > There are some important features where the current Hive query execution
> > model and SerDe API are not sufficient to achieve what we need. Just to
> > name a few:
> >
> >    - CREATE TABLE AS ... - Here we need to create an Iceberg table
> >    first, then write the data. Hive currently writes the data to a
> >    temporary dir and uses MoveTask to move it to its final place.
> >    - INSERT OVERWRITE ... - We need information about the jobs/tasks on
> >    the HS2 side to commit the changes to an Iceberg table. This is not
> >    available ATM in the DefaultHiveMetaHook.commitInsert method.
> >    - We would like to extend the Hive query language with
> >    Iceberg-specific bits, like time travel, Iceberg-specific
> >    partitioning, etc.
> >
> >
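> > For example, these are the kinds of statements that run into the
> > limitations above (table and column names are illustrative, and the
> > storage handler class name is the one from the current iceberg-mr
> > module):
> >
> >     -- CTAS: the Iceberg table must exist before the data is written,
> >     -- which conflicts with Hive's write-to-temp-dir-then-MoveTask flow
> >     CREATE TABLE sales_copy
> >     STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
> >     AS SELECT * FROM sales;
> >
> >     -- Overwrite: job/task info needed for the Iceberg commit is not
> >     -- available in DefaultHiveMetaHook.commitInsert
> >     INSERT OVERWRITE TABLE sales_copy SELECT * FROM sales WHERE yr = 2020;
> >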
> > We fully expect to find even more roadblocks as we progress with our
> > roadmap. We might be able to work around the limitations with some hacky
> > solutions, but those do not pave the road for a long-term, stable
> > integration. The right solution for this problem is to extend the SerDe
> > API and enhance the query execution logic based on the new SerDe API.
> > This will be an iterative process where the API will be constantly
> > evolving until we reach the "final" stable stage.
> >
> > To make the process above streamlined, we propose to create an
> > iceberg-handler module in Hive and use the existing
> > iceberg-mr/iceberg-hive3 Iceberg modules as a baseline for it. We can
> > extend and use the new SerDe API in this new iceberg-handler module and
> > iterate faster. When there is a Hive release we can decide our next
> > steps based on the actual landscape, and in the meantime we can port
> > the changes that do not require the new APIs between the two repos.
> >
> > I would like to hear both teams' opinions on the proposed solution.
> >
> > Thanks,
> > Peter
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>