Hello Team,

I'm not sure how far out you want to scope this, but I think we already have enough sub-projects within the Hive core project, and building the entire project takes a considerable amount of time.
Would it be possible to roll this out like Jackson or DataNucleus?
https://github.com/apache/hive
https://github.com/apache/hive-iceberg

Thanks.

On Wed, Mar 3, 2021 at 12:30 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I think that this direction sounds reasonable. It makes sense to start
> building the integration in Hive because it will be easier to iterate
> there. Iceberg is quite different in some areas, and I think that would
> probably mean that Hive needs to change to provide a really great
> experience. That was the case with Spark, too.
>
> We will need to continue to provide the Hive module in Iceberg for quite
> some time, but as Hive releases newer versions we can eventually remove
> the modules. We did the same thing for Parquet support.
>
> Thanks for leading the way on this, Peter!
>
> On Wed, Mar 3, 2021 at 2:12 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
> > Hi Iceberg and Hive Teams,
> >
> > As some of you already know, we are working on making Iceberg available
> > as a first-class storage layer for Hive.
> >
> > Folks on the Iceberg side did a good job of utilizing the existing Hive
> > SerDe API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to
> > their efforts, we have read support for queries over Iceberg-backed
> > Hive tables, with predicate pushdown and column pruning. In the last
> > few months we added basic write and DDL support, so now one can create
> > an Iceberg-backed Hive table and insert data into it with Hive queries.
> > The code for these features is in the Iceberg repo and available
> > through the released iceberg-mr-runtime.jar for everyone to try out.
> >
> > There are some important features where the current Hive query
> > execution model and SerDe API are not enough to achieve what we need.
> > Just to name a few:
> >
> > - CREATE TABLE AS ... - Here we need to create an Iceberg table first,
> >   then write the data. Hive currently writes the data to a temporary
> >   dir and uses MoveTask to move it to the final place.
> > - INSERT OVERWRITE ... - We need information about the jobs/tasks on
> >   the HS2 side to commit the changes to an Iceberg table. These are
> >   not available ATM in the DefaultHiveMetaHook.commitInsert method.
> > - We would like to extend the Hive query language with Iceberg-specific
> >   bits, like time travel, Iceberg-specific partitioning, etc.
> >
> > We fully expect to find even more roadblocks as we progress with our
> > roadmap. We might be able to work around the limitations with some
> > hacky solutions, but those do not pave the road for a long-term,
> > stable integration. A good solution to this problem would be to extend
> > the SerDe API and enhance the query execution logic based on the new
> > SerDe API. This will be an iterative process where the API is
> > constantly evolving until we reach the "final" stable stage.
> >
> > To streamline the process above, we propose to create an
> > iceberg-handler module in Hive and use the existing
> > iceberg-mr/iceberg-hive3 Iceberg modules as a baseline for it. We can
> > extend and use the new SerDe API in this new iceberg-handler module
> > and iterate faster. When there is a Hive release, we can decide our
> > next steps based on the actual landscape, and in the meantime we can
> > port changes that do not require the new APIs between the two repos.
> >
> > I would like to hear both teams' opinions on the proposed solution.
> >
> > Thanks,
> > Peter
>
> --
> Ryan Blue
> Software Engineer
> Netflix
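As a concrete illustration of the DDL and write support discussed in the thread, creating an Iceberg-backed table in Hive looks roughly like the sketch below. The storage handler class is the one shipped in the iceberg-mr module; exact syntax and required table properties may vary by Hive/Iceberg version, so treat this as an assumption-laden example rather than authoritative DDL:

```sql
-- Sketch: create an Iceberg-backed Hive table via the Iceberg storage
-- handler from iceberg-mr-runtime.jar (class name assumed from the
-- iceberg-mr module; verify against your Iceberg version).
CREATE TABLE customers (
  id BIGINT,
  name STRING
)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

-- With basic write support in place, plain Hive inserts work as well:
INSERT INTO customers VALUES (1, 'Alice');
```

This is exactly the path that runs into the roadblocks Peter lists: CREATE TABLE AS and INSERT OVERWRITE need hooks beyond what the current SerDe API exposes.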