[ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669149#comment-13669149 ]
Steve Loughran commented on HDFS-4849: -------------------------------------- bq. concurrency logic is always an issue in distributed storage. Which is why I tend to defer to Lamport in such situations, here [Lamport78], and the notion of a transient {{happens-before}} relation. This relation is antireflexive {{happens-before(A,A) is false for all A}}, and antisymmetric {{happens-before(A,B) <=> !happens-before(B,A) for all A, B}}. Currently HDFS provides happens-before guarantees, which is a guarantee which applications rely on. bq. when B exists, which should replace B with contents of A. Suppose then that at the same time client2 deletes file B. Since there is no guarantee which operation is executed first you can either end up with A renamed to B, if the delete goes first, or with no files if the rename prevails followed by the deletion of B. This is similar to your case. When client1 retries from its perspective delete was not completed, so it deletes again. And it is not different from the case when client1 is slow and executes delete after rename. Cos, your proposed scenario has an initial state of both paths existing and two clients executing un-coordinated requests against the filesystem. {code} 0: exists(A), exists(B) 1. client1: mv(A,B) 2. client2: rm(B) {code} There are two orderings here, which each lead to two different outcomes {code} order1: happens-before(cient1, client2) client1: mv(A,B) - failure client2: rm(B) -success result: exists(A), !exists(B) {code} the rename fails, the data is where it was. {code} order2: happens-before( client2, client1) client2: rm(B) -success client1: mv(A,B) - success result: !exists(A), exists(B) {code} The rename has succeeded. In both cases, the NN has imposes a strict ordering the outcomes -and both clients know the outcome. They may not know what happened after, but they both can infer what happened before. Now, let's add the ability for operations that fail over the network to be retried, and, as we cannot determine if the failure was before or after the operation was executed, re-execution. This is what you appear to have proposed {code} order1: happens-before( client1, client2); initial rm(B) lost after execution and retried. client1: mv(A,B) - failure client2: rm(B) -success/fail => result lost client2: retry: rm(B) -success result: exists(A), !exists(B) {code} The outcome is as before: the happens-before ordering is the same. I think this is what you are considering when we talk about idempotent deletes. {code} order2a: happens-before( client2, client1); initial rm(B) operation fails before execution retry happens before any client1 calls are executed. client2: rm(B) -fail- client2: retry rm(B) - success client1: mv(A,B) - success result: !exists(A), exists(B) {code} Again, the retry has not affected the outcome, because the first delete operation never took place. because the {{happens-before}} ordering is the same, client 2 actions executed before client 1's. I believe these are the scenarios which you are considering when you say that making {{delete()}} idempotent is harmless. The problem we have is the third ordering {code} order2b: client2 rm(B) operation is first and succeeds, but result lost retry after client1 operations client2: rm(B) -succes, => result lost client1: mv(A,B) - success => true client2: retry rm(B) success result: !exists(A), !exists(B) {code} Can you see what has happened here? The ordering {{happens-before(2,1)}} is no longer antisymmetric; we have {{happens-before(2,1) && happens-before(1,2)}}. The outcome is that the rename succeeded, {{!exists(A)}} now holds, and then the delete succeeded: {{!exists(B)}}. A new final state of the operation sequence has been reached, resulting in "data lost" . *This is the issue.* bq. I see only one issue with delete that prevents it from being idempotent - it's the return value, which must be true only if the deleted object existed and was actually deleted. This cannot be guaranteed through retries. The semantics of delete should be that "object does not exist after delete completes". This seems idempotent to me. The return value should be treated as success or failure. Same as in mkdir. The issue is not the final state of the object, but the fact that the filesystem state after the first invocation is visible to other clients -and that they perform work which depends on it. The real change of the semantics that you are proposing is {{an rm() call may be repeated 1+ times until the client finally receives a response -irrespective of what other operations take place on the same path}}. It's not what the client sees, it is the external visibility of the retried actions. bq. My point is that if you need to coordinate clients you should do it with some external tools, like ZK. You know how MapReduce co-ordinates commits between speculating tasks, don't you? It relies on {{rename()}} being atomic and failing if the destination path exists. It does this because of the guarantees that the FS makes of any pair of atomic operations being serialized into a strict order -either {{(1,2)}} or {{(2,1)}}. The proposal breaks this fundamental guarantee -a guarantee which allows client operations to use the filesystem itself to co-ordinate actions that consist of single side-effecting operations. Making delete idempotent to aid on MR job cleanup is unimportant if it can stop the output from being committed reliably. I am reasonably confident that the blobstores break these rules as rename is not-atomic; [HADOOP-9577] is proof of this, [HADOOP-9565] the beginnings of a workaround, one which will require new committers for MR jobs. Because today *the MR engine relies on atomic FS options to co-ordinate loosely coupled nodes during is execution*. This is why - and I'm stating this a warning before you invest significant time in implementing retries - myself, and presumably others, will veto any change to the semantics of {{delete()}} that breaks any part of the {{happens-before()}} relation, i.e. allows for the remote path to be deleted more than once in a way that would be visible to other clients. Further reading [Lamport78: Time, Clocks and the Ordering of Events in a Distributed System|http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf] > Idempotent create, append and delete operations. > ------------------------------------------------ > > Key: HDFS-4849 > URL: https://issues.apache.org/jira/browse/HDFS-4849 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Affects Versions: 2.0.4-alpha > Reporter: Konstantin Shvachko > Assignee: Konstantin Shvachko > > create, append and delete operations can be made idempotent. This will reduce > chances for a job or other app failures when NN fails over. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira