fbocse commented on issue #1496:
URL: https://github.com/apache/iceberg/issues/1496#issuecomment-708650031


   @pvary thank you for following up on this - I think that the solution you 
are considering reproduces the behaviour we are seeing for the server-side 
implementation of `create(overwrite=true)` used by Iceberg to override the 
`version-hint.txt` file. I don't think that in our particular case this is 
necessarily going to be a benefit at the "expense" of a couple of more hdfs 
requests.
   
   However your thoughts got me thinking too, especially with the new fallback 
behaviour in place. I wouldn't necessarily advertise using more APIs than 
necessary to implement optimistic locking because we're also exposing more 
edge-cases, I was thinking about this one in particular, accounting for your 
proposal:
   
   -1. writer reads version-hint.txt (with content 17)
   0. writer create(overwrite=false) v18.metadata.json
   1. writer create(overwrite=false) file version-hint-18.txt (with content 18)
   2. writer deletes version-hint.txt (with content 17)
   3. writer moves version-hint-18.txt to version-hint.txt  (with content 18)
   
   If between 2 and 3 a NEW writer attempts to commit it will fail to load 
`version-hint.txt` and it will fallback to directory listing instead - then it 
finds v18.metadata.json as the greatest version and continues to write 
v19.metadata.json and to commit its new version to the version-hint.txt file.
   If it so happens that the NEW writer also manages to commit its own 2 and 3 
steps (unlikely but not impossible, think GC pause?) before the initial writer 
does we might end up with version 19 being overridden by version 18. This is a 
classic dead-lock.
   This basically "locks" the table for any subsequent writers - cause new 
writers will find version-hint.txt and load value 18, increment that value to 
resolve the version they should write a new metadata file for but fail to do so 
since there already is a v19.metadata.json file.
   
   I think that this behaviour is present today should the client exit 
unexpectedly right before replacing version-hint.txt. 
   But to that edge-case this implementation adds an edge case of a different 
category, it relates to the distributed nature of writers attempting to replace 
the same resource.
   
   Also looking at your suggestion I think that if any writer exits 
unexpectedly before step 3 I assume version-hint.txt file is never going to be 
materialized and all subsequent writers will fallback to directory listing for 
version resolution, right? I assume that resolving the version-hint.txt file 
doesn't involve also to write the version to the file as well.
   
   My suggestion is we'd also extract the current implementation of 
`HadoopTableOperations:writeVersionHint` and 
`HadoopTableOperations:readVersionHint` to something we can override by using 
the APIs that provide the right atomic and consistency guarantees - say dir 
listing provides read-after-write guarantees and we also get atomic 
create(overwrite=false) guarantees we can implement version resolution on top 
of those - each new version is a new version file, every writer attempts to 
delete older versions than say last 10 so we keep dir listing in constant time


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to