Hi,

As Anton said, we purposely avoided making a "decision" on which approach should be implemented in order to allow for a meaningful discussion with the community.
The document starts with the eager approach because it is straightforward and easy to understand: its steps resemble the usual file-level operations that engineers frequently use when implementing update/delete/upsert behaviour themselves, which hopefully creates a conceptual bridge to the more involved designs. Right now, Iceberg has almost everything needed to implement the "eager" approach; we simply need to adjust the retry mechanism. In fact, I have implemented a prototype of the eager solution with Spark and Iceberg.

We looked into many existing solutions for inspiration, but when there isn't a paper or code in the public domain it becomes hard to assess the underlying design, although some of it can be inferred from the API or documentation.

Best,
Miguel

> On 10 May 2019, at 11:57, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
> Thanks for the feedback, Jacques!
>
> You are correct, we kept the question of the best approach open :) The
> idea was to have a discussion in the community. Hopefully, we can reach a
> consensus.
>
> While the proposed “lazy” approaches certainly offer significant benefits,
> they require more changes in Iceberg as well as in readers/query engines
> (depending on how we want to merge base and diff files). For us, it is
> important to understand whether the Iceberg community would even consider
> such changes.
>
> Hive ACID 3 is one of the projects we looked at. In fact, we spoke to Owen, the
> original creator of updates/deletes/upserts in Hive. I believe the “lazy”
> approaches are close to what Hive 3 does but with their own distinctions that
> Iceberg allows us to have. It would be great to have Owen’s feedback.
>
> We don’t know the internals of Delta as updates/deletes/upserts are not open
> source. My personal guess is that, yes, it might be similar to the “eager”
> approach in our doc.
>
> Jacques, could you share some insights into how you implement the merge of
> diffs? Is it done by readers?
>
> Thanks,
> Anton
>
>> On 10 May 2019, at 06:24, Jacques Nadeau <jacq...@dremio.com> wrote:
>>
>> This is a nice doc and it covers many different options. Upon first skim, I
>> don't see a strong argument for a particular approach.
>>
>> In our own development, we've been leaning heavily towards what you describe
>> in the document as "lazy with SRI". I believe this is consistent with what
>> the Hive community did on top of ORC. It's interesting because my (maybe
>> incorrect) understanding of the Databricks Delta approach is they chose what
>> you title "eager" in their approach to upserts. They may also have a lazy
>> approach for other types of mutations but I don't think they do.
>>
>> Thanks again for putting this together!
>> Jacques
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi
>> <aokolnyc...@apple.com.invalid> wrote:
>> Hi folks,
>>
>> Miguel (cc) and I have spent some time thinking about how to perform
>> updates/deletes/upserts on top of Iceberg tables. This functionality is
>> essential for many modern use cases. We've summarized our ideas in a doc
>> [1], which, hopefully, will trigger a discussion in the community. The
>> document presents different conceptual approaches alongside their
>> trade-offs. We will be glad to consider any other ideas as well.
>>
>> Thanks,
>> Anton
>>
>> [1] -
>> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/