[ 
https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669149#comment-13669149
 ] 

Steve Loughran commented on HDFS-4849:
--------------------------------------

bq. concurrency logic is always an issue in distributed storage.

Which is why I tend to defer to Lamport in such situations, here [Lamport78], 
and the notion of a transitive {{happens-before}} relation. This relation is 
irreflexive ({{happens-before(A,A)}} is false for all A) and asymmetric 
({{happens-before(A,B) => !happens-before(B,A)}} for all A, B). Currently HDFS 
provides happens-before guarantees, which applications rely on.
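Those properties can be spot-checked on a concrete relation -a minimal sketch in Python, where the event names and the {{hb}} set are arbitrary illustrations, not anything from HDFS:

```python
# A toy happens-before relation over three events: e1 -> e2 -> e3.
# The event names and the relation itself are illustrative only.
hb = {("e1", "e2"), ("e2", "e3"), ("e1", "e3")}
events = {"e1", "e2", "e3"}

# irreflexive: no event happens before itself
assert all((e, e) not in hb for e in events)

# asymmetric: happens-before(A,B) implies !happens-before(B,A)
assert all((b, a) not in hb for (a, b) in hb)

# transitive: happens-before(A,B) and happens-before(B,C)
# together imply happens-before(A,C)
assert all((a, c) in hb for (a, b) in hb for (b2, c) in hb if b == b2)
```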

bq.  when B exists, which should replace B with contents of A. Suppose then 
that at the same time client2 deletes file B. Since there is no guarantee which 
operation is executed first you can either end up with A renamed to B, if the 
delete goes first, or with no files if the rename prevails followed by the 
deletion of B. This is similar to your case. When client1 retries from its 
perspective delete was not completed, so it deletes again. And it is not 
different from the case  when client1 is slow and executes delete after rename.

Cos, your proposed scenario has an initial state of both paths existing and two 
clients executing un-coordinated requests against the filesystem.
{code}
0: exists(A), exists(B)
1: client1: mv(A,B)
2: client2: rm(B)
{code}

There are two possible orderings here, each leading to a different outcome:
{code}
order1: happens-before(client1, client2)

client1: mv(A,B) - failure
client2: rm(B)   - success

result: exists(A), !exists(B)
{code}

The rename fails; the data is where it was.
{code}
order2: happens-before(client2, client1)

client2: rm(B)   - success
client1: mv(A,B) - success

result: !exists(A), exists(B)
{code}

The rename has succeeded. 

In both cases, the NN imposes a strict ordering on the operations -and both 
clients know the outcome. They may not know what happened after their own 
operation, but they can both infer what happened before it.
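The two orderings above can be modelled in a few lines. This is a sketch only: the {{Fs}} class, {{mv}} and {{rm}} are my illustrations, not the NN API, with {{mv}} following HDFS rename semantics of failing when the destination exists:

```python
# Toy model of the NN's view; Fs/mv/rm are illustrative, not the HDFS API.
class Fs:
    def __init__(self, paths):
        self.paths = set(paths)

    def mv(self, src, dst):
        # HDFS-style rename: atomic, fails if dst exists or src is missing
        if src not in self.paths or dst in self.paths:
            return False
        self.paths.remove(src)
        self.paths.add(dst)
        return True

    def rm(self, path):
        # delete: True only if the path existed and was removed
        if path not in self.paths:
            return False
        self.paths.remove(path)
        return True

# order1: happens-before(client1, client2)
fs1 = Fs({"A", "B"})
assert fs1.mv("A", "B") is False  # destination exists: rename fails
assert fs1.rm("B") is True
assert fs1.paths == {"A"}         # exists(A), !exists(B)

# order2: happens-before(client2, client1)
fs2 = Fs({"A", "B"})
assert fs2.rm("B") is True
assert fs2.mv("A", "B") is True
assert fs2.paths == {"B"}         # !exists(A), exists(B)
```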

Now, let's add the ability for operations that fail over the network to be 
retried and, as we cannot determine whether the failure occurred before or 
after the operation was executed, re-executed. This is what you appear to have 
proposed:

{code}
order1: happens-before(client1, client2);
        initial rm(B) lost after execution and retried.

client1: mv(A,B) - failure
client2: rm(B)   - success/fail => result lost
client2: retry: rm(B) - success

result: exists(A), !exists(B)
{code}

The outcome is as before: the happens-before ordering is the same. I think this 
is what you are considering when we talk about idempotent deletes.
{code}
order2a: happens-before(client2, client1);
         initial rm(B) operation fails before execution;
         retry happens before any client1 calls are executed.

client2: rm(B) - failure
client2: retry: rm(B) - success
client1: mv(A,B) - success

result: !exists(A), exists(B)
{code}

Again, the retry has not affected the outcome, because the first delete 
operation never took place and the {{happens-before}} ordering is the same: 
client2's actions executed before client1's.

I believe these are the scenarios which you are considering when you say that 
making {{delete()}} idempotent is harmless. The problem we have is the third 
ordering

{code}
order2b: client2's rm(B) operation is first and succeeds, but the result is lost;
         retry happens after client1's operations.

client2: rm(B) - success => result lost
client1: mv(A,B) - success => true
client2: retry: rm(B) - success

result: !exists(A), !exists(B)
{code}

Can you see what has happened here? The ordering {{happens-before(2,1)}} is no 
longer asymmetric; we have {{happens-before(2,1) && happens-before(1,2)}}. 
The outcome is that the rename succeeded, so {{!exists(A)}} now holds, and then 
the retried delete succeeded: {{!exists(B)}}. A new final state of the 
operation sequence has been reached, resulting in data loss. *This is the 
issue.*
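Played out against a toy model (illustrative code with HDFS-style semantics, not HDFS itself), the lost-response retry produces a third outcome that neither serialized ordering can reach:

```python
# Toy filesystem; mv follows HDFS rename semantics (fails if dst exists),
# rm returns True only if the path existed. Illustrative only.
class Fs:
    def __init__(self, paths):
        self.paths = set(paths)

    def mv(self, src, dst):
        if src not in self.paths or dst in self.paths:
            return False
        self.paths.remove(src)
        self.paths.add(dst)
        return True

    def rm(self, path):
        if path not in self.paths:
            return False
        self.paths.remove(path)
        return True

fs = Fs({"A", "B"})
assert fs.rm("B") is True       # succeeds, but the response never reaches client2
assert fs.mv("A", "B") is True  # client1's rename lands in the gap
assert fs.rm("B") is True       # client2's blind retry deletes what was A
assert fs.paths == set()        # !exists(A), !exists(B): the data is gone
```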

bq. I see only one issue with delete that prevents it from being idempotent - 
it's the return value, which must be true only if the deleted object existed 
and was actually deleted. This cannot be guaranteed through retries. The 
semantics of delete should be that "object does not exist after delete 
completes". This seems idempotent to me. The return value should be treated as 
success or failure. Same as in mkdir.

The issue is not the final state of the object, but the fact that the 
filesystem state after the first invocation is visible to other clients -and 
that they perform work which depends on it. The real semantic change you are 
proposing is {{an rm() call may be repeated 1+ times until the client finally 
receives a response -irrespective of what other operations take place on the 
same path}}. It is not what the client sees that matters, it is the external 
visibility of the retried actions.

bq. My point is that if you need to coordinate clients you should do it with 
some external tools, like ZK.

You know how MapReduce co-ordinates commits between speculating tasks, don't 
you? It relies on {{rename()}} being atomic and failing if the destination path 
exists. It can do this because the FS guarantees that any pair of atomic 
operations is serialized into a strict order -either {{(1,2)}} or {{(2,1)}}.
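As a sketch of that co-ordination pattern (illustrative code and paths; not the actual MapReduce committer), two speculative attempts race to rename their output into place, and the atomic fail-if-destination-exists semantics guarantee exactly one winner:

```python
# Sketch: speculative-task commit via atomic rename-if-absent.
# The Fs class and the paths are illustrative, not real committer code.
class Fs:
    def __init__(self, paths):
        self.paths = set(paths)

    def mv(self, src, dst):
        # atomic rename that fails if the destination already exists
        if src not in self.paths or dst in self.paths:
            return False
        self.paths.remove(src)
        self.paths.add(dst)
        return True

fs = Fs({"attempt_0/part", "attempt_1/part"})
won_0 = fs.mv("attempt_0/part", "final/part")
won_1 = fs.mv("attempt_1/part", "final/part")
assert won_0 is True
assert won_1 is False            # the second speculative attempt loses
assert "final/part" in fs.paths  # exactly one output was committed
```

If a retried delete can silently remove {{final/part}} after the winning rename, this whole protocol falls apart -which is the point of the order2b scenario above.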

The proposal breaks this fundamental guarantee -a guarantee which allows client 
applications to use the filesystem itself to co-ordinate actions that consist 
of single side-effecting operations. Making delete idempotent to aid MR job 
cleanup is unimportant if it can stop the output from being committed reliably.

I am reasonably confident that the blobstores break these rules, as their 
rename is not atomic; [HADOOP-9577] is proof of this, and [HADOOP-9565] the 
beginnings of a workaround, one which will require new committers for MR jobs. 
Because today *the MR engine relies on atomic FS operations to co-ordinate 
loosely coupled nodes during its execution*.

This is why - and I'm stating this as a warning before you invest significant 
time in implementing retries - myself, and presumably others, will veto any 
change to the semantics of {{delete()}} that breaks any part of the 
{{happens-before}} relation, i.e. allows the remote path to be deleted more 
than once in a way that would be visible to other clients.

Further reading [Lamport78: Time, Clocks and the Ordering of Events in a 
Distributed 
System|http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf]
                
> Idempotent create, append and delete operations.
> ------------------------------------------------
>
>                 Key: HDFS-4849
>                 URL: https://issues.apache.org/jira/browse/HDFS-4849
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>
> create, append and delete operations can be made idempotent. This will reduce 
> chances for a job or other app failures when NN fails over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
