Re: Hive Iceberg integration

Ryan Blue Wed, 03 Mar 2021 09:58:51 -0800

David, we already have Hive support in Iceberg, so there is no need to
create a separate project. I think the problem is that we can't make
changes to Hive that are needed for that support. We're reaching the limits
of what can be done in an external project, so we can either add/update
interfaces in Hive and then wait for a Hive release, or we can move the
support into Hive. Moving support into Hive allows us to make more rapid
progress on the integration because we don't need to wait for Hive releases.


On Wed, Mar 3, 2021 at 9:49 AM David <dam6...@gmail.com> wrote:

> Hello Team,
>
> I'm not sure how far out you want to scope this, but I think we have
> enough sub-projects as it is within the Hive core project.  To build the
> entire project takes a considerable amount of time.
>
> Would it be possible to roll this out like Jackson or DataNucleus?
>
> https://github.com/apache/hive
> https://github.com/apache/hive-iceberg
>
> Thanks.
>
> On Wed, Mar 3, 2021 at 12:30 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I think that this direction sounds reasonable. It makes sense to start
>> building the integration in Hive because it will be easier to iterate
>> there. Iceberg is quite different in some areas and I think that would
>> probably mean that Hive needs to change to provide a really great
>> experience. That was the case with Spark, too.
>>
>> We will need to continue to provide the Hive module in Iceberg for quite
>> some time, but as Hive releases newer versions we can eventually remove
>> the modules. We did the same thing for Parquet support.
>>
>> Thanks for leading the way on this, Peter!
>>
>> On Wed, Mar 3, 2021 at 2:12 AM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>> > Hi Iceberg and Hive Teams,
>> >
>> > As some of you already know, we are working on making Iceberg available
>> > as a first-class storage layer for Hive.
>> >
>> > Folks on the Iceberg side did a good job utilizing the existing Hive
>> > SerDe API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to
>> > their efforts we have read support for queries over Iceberg-backed Hive
>> > tables with predicate pushdown and column pruning. In the last few
>> > months we added basic write and DDL support, so now one can create an
>> > Iceberg-backed Hive table and insert data into it with Hive queries.
>> > The code for these features is in the Iceberg repo and available
>> > through the released iceberg-mr-runtime.jar for everyone to try out.
>> >
>> > There are some important features where the current Hive query execution
>> > model and SerDe API are not enough to achieve the things we need. Just to
>> > name a few:
>> >
>> >    - CREATE TABLE AS ... - Here we need to create an Iceberg table
>> >    first, then write the data. Hive currently writes the data to a
>> >    temporary dir and uses MoveTask to move it to the final place.
>> >    - INSERT OVERWRITE ... - We need information about the jobs/tasks on
>> >    the HS2 side to commit the changes to an Iceberg table. This
>> >    information is not currently available in the
>> >    DefaultHiveMetaHook.commitInsert method.
>> >    - We would like to extend the Hive query language with
>> >    Iceberg-specific bits, like time travel, Iceberg-specific
>> >    partitioning, etc.
>> >
>> >
>> > We fully expect to find even more roadblocks as we progress with our
>> > roadmap. We might be able to work around the limitations with some hacky
>> > solutions, but those do not pave the road for a long-term, stable
>> > integration. A good solution to this problem would be to extend the
>> > SerDe API and enhance the query execution logic based on the new SerDe
>> > API. This will be an iterative process where the API will be constantly
>> > evolving until we reach the "final" stable stage.
>> >
>> > To streamline the process above, we propose to create an
>> > iceberg-handler module in Hive and use the existing
>> > iceberg-mr/iceberg-hive3 Iceberg modules as a baseline for it. We can
>> > extend and use the new SerDe API in this new iceberg-handler module and
>> > iterate faster. When there is a Hive release we can decide our next
>> > steps based on the landscape at that point, and in the meantime we can
>> > port the changes that do not require the new APIs between the two repos.
>> >
>> > I would like to hear both teams' opinions on the proposed solution.
>> >
>> > Thanks,
>> > Peter
>> >
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
