Hi,

As Anton said, we purposely avoided making a "decision" on which approach 
should be implemented in order to allow for a meaningful discussion with the 
community.

The document starts with an eager approach as it is straightforward and easy to 
understand: steps resemble the usual file level operations/manipulations 
frequently used by engineers when implementing Update/Delete/Upsert behaviour 
themselves, hopefully creating a conceptual bridge to the more involved 
designs. Right now, Iceberg has almost everything needed to implement the 
"eager" approach; we only need to adjust the retry mechanism. In fact, I have 
already implemented a prototype of the eager solution with Spark and Iceberg.
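
To make the file-level steps concrete, here is a minimal sketch of an eager 
delete in Python. It is not Iceberg or Spark code and not the prototype 
mentioned above: the table is modelled as a plain dict from file name to rows, 
and the names eager_delete, Table and the ".rewritten" suffix are hypothetical, 
chosen only for illustration. The retry mechanism is left out.

```python
from typing import Callable, Dict, List

Row = dict
# Hypothetical stand-in for a table snapshot: file name -> rows in that file.
Table = Dict[str, List[Row]]


def eager_delete(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Build a new snapshot, rewriting every file that holds matching rows."""
    new_table: Table = {}
    for name, rows in table.items():
        kept = [r for r in rows if not predicate(r)]
        if len(kept) == len(rows):
            # No row matched: the file is carried over unchanged.
            new_table[name] = rows
        elif kept:
            # Some rows matched: the file is replaced by a rewritten copy.
            new_table[name + ".rewritten"] = kept
        # All rows matched: the file is dropped from the new snapshot.
    return new_table


snapshot = {
    "f1": [{"id": 1}, {"id": 2}],
    "f2": [{"id": 3}],
    "f3": [{"id": 4}],
}
result = eager_delete(snapshot, lambda r: r["id"] in (2, 4))
```

Here f1 is rewritten without the deleted row, f2 is reused as-is, and f3 
disappears. The cost of the eager approach shows up in the rewrite of f1: 
one deleted row forces a copy of the whole file.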

We looked into many existing solutions for inspiration, but when there isn't a 
paper or code in the public domain it becomes hard to assess the underlying 
design, although some of it can be inferred from the API or documentation.

Best,
Miguel

> On 10 May 2019, at 11:57, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> 
> Thanks for the feedback, Jacques!
> 
> You are correct, we deliberately left the question of the best approach 
> open :) The idea was to have a discussion in the community. Hopefully, we 
> can reach a consensus.
> 
> While the proposed “lazy” approaches certainly offer significant benefits, 
> they require more changes in Iceberg as well as in readers/query engines 
> (depending on how we want to merge base and diff files). For us, it is 
> important to understand whether the Iceberg community would even consider 
> such changes. 
> 
> Hive ACID 3 is one of the projects we looked at. In fact, we spoke to Owen, the 
> original creator of updates/deletes/upserts in Hive. I believe the “lazy” 
> approaches are close to what Hive 3 does but with their own distinctions that 
> Iceberg allows us to have. It would be great to have Owen’s feedback.
> 
> We don’t know the internals of Delta as updates/deletes/upserts are not open 
> source. My personal guess is that, yes, it might be similar to the “eager” 
> approach in our doc.
> 
> Jacques, could you share some insights into how you implement the merge of 
> diffs? Is it done by readers?
> 
> Thanks,
> Anton
> 
>> On 10 May 2019, at 06:24, Jacques Nadeau <jacq...@dremio.com 
>> <mailto:jacq...@dremio.com>> wrote:
>> 
>> This is a nice doc and it covers many different options. Upon first skim, I 
>> don't see a strong argument for any particular approach.
>> 
>> In our own development, we've been leaning heavily towards what you describe 
>> in the document as "lazy with SRI". I believe this is consistent with what 
>> the Hive community did on top of Orc. It's interesting because my (maybe 
>> incorrect) understanding of the Databricks Delta approach is they chose what 
>> you title "eager" in their approach to upserts. They may also have a lazy 
>> approach for other types of mutations but I don't think they do.
>> 
>> Thanks again for putting this together!
>> Jacques
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>> 
>> 
>> On Wed, May 8, 2019 at 3:42 AM Anton Okolnychyi 
>> <aokolnyc...@apple.com.invalid <mailto:aokolnyc...@apple.com.invalid>> wrote:
>> Hi folks,
>> 
>> Miguel (cc) and I have spent some time thinking about how to perform 
>> updates/deletes/upserts on top of Iceberg tables. This functionality is 
>> essential for many modern use cases. We've summarized our ideas in a doc 
>> [1], which, hopefully, will trigger a discussion in the community. The 
>> document presents different conceptual approaches alongside their 
>> trade-offs. We will be glad to consider any other ideas as well.
>> 
>> Thanks,
>> Anton
>> 
>> [1] - 
>> https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/
>>  
>> <https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/>
>> 
>> 
> 
