Thanks for working on this, Anton and Miguel! Would anyone be interested in scheduling a hangout to talk about next steps and tentative design choices?
The doc is a great start and does a good job laying out the trade-offs between different approaches. I appreciate the idea to get a discussion started and not to pick one particular approach, but I think that it does make a few choices clear:

*1. Iceberg should support lazy (read-side) merging using diff files*

The eager approach doesn't require much beyond Iceberg's existing support. Adding diff files is the next step for engines that need to implement lazy merging for merge/upsert/delete. I support adding these structures to the spec (as a new format version).

*2. Iceberg diff files should use synthetic keys*

A lot of the discussion on the doc is about whether natural keys are practical and what assumptions we can make about them. In my opinion, Iceberg tables will absolutely need natural keys for reasonable use cases. Those natural keys will need to be unique, and Iceberg will need to rely on engines to enforce that uniqueness.

But there is a difference between table behavior and implementation. We can use synthetic keys to implement the requirements of natural keys. Each row should be identified by its file and its position in that file. When deleting by a natural key, we just need to find out what the synthetic key is and encode that in the delete diff.

With the physical representation using synthetic keys, we should also define how to communicate a natural key constraint for a table. That way, writers can fail if a write would violate the key constraints of a table.

*3. Synthetic keys should be based on file name and position*

I think identifying the file in a synthetic key makes a lot of sense. This would allow for delta file reuse as individual files are rewritten by a "major" compaction, and it provides nice flexibility that fits with the format. We will need to think through all the impacts, like how file relocation works (e.g., a move between regions) and the requirements for rewrites (a rewrite must apply the delta).
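To make points 2 and 3 concrete, here is a minimal sketch (in Python, and not Iceberg's actual format or API) of lazy merge-on-read with a sparse delete diff keyed by the synthetic key (data file name, row position). The names `DeleteDiff` and `read_with_deletes`, and the file name used, are hypothetical illustrations:

```python
from dataclasses import dataclass, field

@dataclass
class DeleteDiff:
    """A sparse delete diff: for each data file, the set of deleted row positions."""
    deleted: dict = field(default_factory=dict)  # file name -> set of positions

    def delete(self, file_name: str, position: int) -> None:
        # Record a delete by synthetic key (file name, position).
        self.deleted.setdefault(file_name, set()).add(position)

def read_with_deletes(file_name, rows, diff):
    """Merge-on-read: yield only rows whose (file, position) key is not deleted."""
    dead = diff.deleted.get(file_name, set())
    for pos, row in enumerate(rows):
        if pos not in dead:
            yield row

# Deleting by a natural key means first resolving it to its synthetic key:
rows = [{"id": 1}, {"id": 2}, {"id": 3}]
diff = DeleteDiff()
# Find the position of the row whose natural key is id == 2,
# then encode that synthetic key in the delete diff.
pos = next(p for p, r in enumerate(rows) if r["id"] == 2)
diff.delete("data-00001.parquet", pos)

print(list(read_with_deletes("data-00001.parquet", rows, diff)))
# [{'id': 1}, {'id': 3}]
```

Because the diff references rows only by file name and position, a compaction that rewrites a data file invalidates only the diff entries for that file, which is the reuse property point 3 describes.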
*Open questions*

There are also quite a few remaining questions for a design:
- Should Iceberg use insert diff files? (My initial answer is no.)
- Should Iceberg require diff compaction? Iceberg could require one delete diff per partition, for example. (My answer: no.)
- Should data files store synthetic key position? If so, why?
- Should there be a dense format for deletes, or just a sparse format?
- What is the scope of a delete diff? At a minimum, a partition. But does it make sense to build ways to restrict scope further?

On Fri, May 10, 2019 at 11:27 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> We did take a look at Hudi. The overall design seems to be pretty
> complicated and, unfortunately, I didn't have time to explore every detail.
>
> Here is my understanding (correct me if I am wrong):
>
> - Hudi has RECORD_KEY, which is expected to be unique.
> - Hudi has PRECOMBINED_KEY, which is used to pick only one row in the
> incoming batch if there are multiple rows with the same key. As I
> understand, this isn't used on reads. It is used on writes to deduplicate
> rows with identical keys within one incoming batch. For example, if we are
> inserting 10 records and two rows have the same key, PRECOMBINED_KEY will
> be used to pick only one row.
> - Once Hudi ensures the uniqueness of RECORD_KEY within the incoming
> batch, it loads the Bloom filter index from all existing Parquet files in
> the involved partitions (meaning, the partitions touched by the input batch)
> and tags each record as either an update or an insert by mapping the
> incoming keys to existing files for updates. At this point, it seems to
> rely on a join.
>
> Is my understanding correct? If so, do we want to consider joins on write?
> We mentioned this technique as one way to ensure the uniqueness of natural
> keys, but we were concerned about the performance. Also, does Hudi support
> record-level updates?
>
> Thanks,
> Anton
>
> On 10 May 2019, at 18:22, Erik Wright <erik.wri...@shopify.com.INVALID> wrote:
>
> Thanks for putting this forward.
>
> Another term for the "lazy" approach would be "merge on read".
>
> My team has built something internally that uses merge-on-read but
> publishes an "eager" materialization to Presto. Roughly, we maintain a
> table metadata file that looks a bit like Iceberg's and tracks the "live"
> version of each partition as it is updated over time. We are looking into
> a solution that will allow us to push the merge-on-read all the way to
> Presto (and other consumers), and adding merge-on-read to Iceberg is one
> of the approaches we are considering.
>
> It's worth noting that Hudi does have support for upserts/deletes as well,
> so that's another model to consider.
>
> On Fri, May 10, 2019 at 8:30 AM Miguel Miranda <miguelnmira...@apple.com.invalid> wrote:
>
>> Hi,
>>
>> As Anton said, we purposely avoided making a "decision" on which approach
>> should be implemented in order to allow for a meaningful discussion with
>> the community.
>>
>> The document starts with an eager approach as it is straightforward and
>> easy to understand: the steps resemble the usual file-level
>> operations/manipulations frequently used by engineers when implementing
>> update/delete/upsert behaviour themselves, hopefully creating a conceptual
>> bridge to the more involved designs. Right now, Iceberg has almost
>> everything needed to implement the "eager" approach; we simply need to
>> adjust the retry mechanism. For example, I have implemented a prototype of
>> the eager solution with Spark and Iceberg.
>>
>> We looked into many existing solutions for inspiration, but when there
>> isn't a paper or code in the public domain it becomes hard to assess the
>> underlying design, although some of it can be inferred from the API or
>> documentation.
>>
>> Best,
>> Miguel
>>
>> On 10 May 2019, at 11:57, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>
>> Thanks for the feedback, Jacques!
>>
>> You are correct, we kept the question of the best approach open :) The
>> idea was to have a discussion in the community. Hopefully, we can reach a
>> consensus.
>>
>> While the proposed "lazy" approaches certainly offer significant
>> benefits, they require more changes in Iceberg as well as in readers/query
>> engines (depending on how we want to merge base and diff files). For us, it
>> is important to understand whether the Iceberg community would even
>> consider such changes.
>>
>> Hive ACID 3 is one of the projects we looked at. In fact, we spoke to
>> Owen, the original creator of updates/deletes/upserts in Hive. I believe
>> the "lazy" approaches are close to what Hive 3 does, but with their own
>> distinctions that Iceberg allows us to have. It would be great to have
>> Owen's feedback.
>>
>> We don't know the internals of Delta, as updates/deletes/upserts are not
>> open source. My personal guess: yes, it might be similar to the "eager"
>> approach in our doc.
>>
>> Jacques, could you share some insights into how you implement the merge
>> of diffs? Is it done by readers?
>>
>> Thanks,
>> Anton
>>
>> On 10 May 2019, at 06:24, Jacques Nadeau <jacq...@dremio.com> wrote:
>>
>> This is a nice doc and it covers many different options. Upon first skim,
>> I don't see a strong argument for a particular approach.
>>
>> In our own development, we've been leaning heavily towards what you
>> describe in the document as "lazy with SRI". I believe this is consistent
>> with what the Hive community did on top of ORC. It's interesting because
>> my (maybe incorrect) understanding of the Databricks Delta approach is
>> that they chose what you title "eager" for their upserts. They may also
>> have a lazy approach for other types of mutations, but I don't think they do.
>>
>> Thanks again for putting this together!
>> Jacques
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>
>>> Hi folks,
>>>
>>> Miguel (cc) and I have spent some time thinking about how to perform
>>> updates/deletes/upserts on top of Iceberg tables. This functionality is
>>> essential for many modern use cases. We've summarized our ideas in a doc
>>> [1], which, hopefully, will trigger a discussion in the community. The
>>> document presents different conceptual approaches alongside their
>>> trade-offs. We will be glad to consider any other ideas as well.
>>>
>>> Thanks,
>>> Anton
>>>
>>> [1] - https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/

--
Ryan Blue
Software Engineer
Netflix