Hive Iceberg integration

2021-03-03 Thread Peter Vary
Hi Iceberg and Hive Teams, As some of you already know, we are working on making Iceberg available as a first-class storage layer for Hive. Folks on the Iceberg side did a good job of utilizing the existing Hive SerDe API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to their efforts w

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-03-03 Thread Mayur Srivastava
Thanks for finding out, Peter. Should we proceed with this PR and later add support for vectorized reads in a separate PR? There are also some other limitations in the current PR (listed in the PR) which could be addressed in subsequent PRs. Thanks, Mayur From: Peter Vary Sent: Tuesday, March

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-03-03 Thread Mayur Srivastava
>> Should we proceed with this pr and later add support for vectorized reads in a separate pr? I meant supporting deletes in the vectorized reader. Thanks, Mayur From: Mayur Srivastava Sent: Wednesday, March 3, 2021 6:41 AM To: dev@iceberg.apache.org Cc: Ryan Blue Subject: RE: Reading data fro

Re: Basic iceberg metrics viz tool

2021-03-03 Thread Tianyi Wang
I did something similar to visualize the snapshots and files. But instead of using the static website, I was using the Java API to get the metadata from HDFS and send it back to the frontend. Something like this: https://observablehq.com/@capkurmagati/iceberg-metadata-visualization My actual implem

Re: Hive query with join of Iceberg table and Hive table

2021-03-03 Thread Edgar Rodriguez
On Wed, Mar 3, 2021 at 1:48 AM Peter Vary wrote: > Quick question @Edgar: Am I right that the table is created by Spark? I think if it is created from Hive and we inserted the data from Hive, then we should have the basic stats already collected and we should not need the estimation (we mig

Re: Hive Iceberg integration

2021-03-03 Thread Ryan Blue
I think that this direction sounds reasonable. It makes sense to start building the integration in Hive because it will be easier to iterate there. Iceberg is quite different in some areas and I think that would probably mean that Hive needs to change to provide a really great experience. That was

Re: Reading data from Iceberg table into Apache Arrow in Java

2021-03-03 Thread Ryan Blue
Yes, I think we should move forward with reads that don't need to merge deletes and have a check that there are no deletes to merge. That will work in many cases and we can add read support for v2 later. On Wed, Mar 3, 2021 at 3:42 AM Mayur Srivastava < mayur.srivast...@twosigma.com> wrote: > >>

Re: Hive Iceberg integration

2021-03-03 Thread David
Hello Team, I'm not sure how far out you want to scope this, but I think we have enough sub-projects as it is within the Hive core project. Building the entire project takes a considerable amount of time. Would it be possible to roll this out like Jackson or DataNucleus? https://github.com/apa

Re: Hive query with join of Iceberg table and Hive table

2021-03-03 Thread Ryan Blue
I agree with the concern about caching splits, but doesn't the API cause us to collect all of the splits into memory anyway? I thought there was no way to return splits as an `Iterator` that lazily loads them. If that's the case, then we primarily need to worry about cleanup and how long they are k

Re: Hive Iceberg integration

2021-03-03 Thread Ryan Blue
David, we already have Hive support in Iceberg, so there is no need to create a separate project. I think the problem is that we can't make changes to Hive that are needed for that support. We're reaching the limits of what can be done in an external project, so we can either add/update interfaces

Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-03 Thread Ryan Blue
Thanks for putting this together, Guy! I just did a pass over the doc and it looks like a really reasonable proposal for being able to inject custom file filter implementations. One of the main things we need to think about is how to store and track the index data. There's a comment in the doc abo

Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-03 Thread OpenInx
It will be 1:00 AM (China Standard Time) on 18 March, and it works for our Asia people. I'd love to attend this discussion. Thanks. On Thu, Mar 4, 2021 at 9:50 AM Ryan Blue wrote: > Thanks for putting this together, Guy! I just did a pass over the doc and it looks like a really reasonable

Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-03 Thread Ryan Blue
Great, thank you for planning to join! I definitely want to get your input on this as well. On Wed, Mar 3, 2021 at 6:06 PM OpenInx wrote: > It will be 1:00 AM (China Standard Time) on 18 March, and it works for > our Asia people. I'd love to attend this discussion, Thanks. > > On Thu, Mar 4,

Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-03 Thread Miao Wang
It works for me. On a quick thought, there may be a few concerns about consolidated-fashion storage: 1) Maintaining the consolidated storage may be a bit more complex; 2) It may make collecting the index while writing a data file (i.e., online index building) more complex (e.g., we need to consid