[ https://issues.apache.org/jira/browse/HDFS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796233#comment-13796233 ]

Chris Nauroth commented on HDFS-5014:
-------------------------------------

Hi, [~vinayrpet].  I'm coming back to this one after a while.  The latest patch 
still has some concurrency problems:
* {{updateActorStatesFromHeartbeat}}: It's possible for state to change after 
releasing the read lock, but before the if statement executes.  The method 
would then execute logic assuming stale values of {{bpServiceToActive}} and 
{{lastActiveClaimTxId}}.  (See the sketch after this list.)
* {{processCommandFromActor}}: Even though the read lock is not held during 
{{processCommandFromStandby}}, it's still possible to have the same problem 
that you saw in your cluster, but on the active instead of the standby.  If the 
active requests re-registration of datanodes, and then immediately goes into a 
bad state or a network partition prevents communication, then datanodes will be 
stuck inside the re-register polling loop while holding the read lock.  This 
will prevent the other NameNode from taking over as active, because recording 
that transition requires the write lock.
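
To make the first race concrete, here is a minimal, hypothetical sketch (not 
the actual patch; the class, field, and method names are simplified stand-ins 
for the real ones) of a check-then-act pattern where the snapshot is taken 
under the read lock but the decision is made after releasing it:
{code}
// Hypothetical, simplified sketch -- not the actual BPOfferService code.
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ActorStateSketch {
  private Object active;              // stand-in for bpServiceToActive
  private long lastClaimTxId;         // stand-in for lastActiveClaimTxId
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void updateFromHeartbeat(Object actor, long claimTxId) {
    Object observedActive;
    long observedTxId;
    lock.readLock().lock();
    try {
      observedActive = active;        // snapshot taken under the read lock
      observedTxId = lastClaimTxId;
    } finally {
      lock.readLock().unlock();
    }
    // RACE: between releasing the read lock and the if statement below, the
    // other actor's thread may acquire the write lock and change both fields.
    if (observedActive != actor && claimTxId > observedTxId) {
      lock.writeLock().lock();
      try {
        // The decision was made on stale snapshots, so a failover recorded by
        // the other thread in the meantime can be silently overwritten here
        // unless both fields are re-checked after taking the write lock.
        active = actor;
        lastClaimTxId = claimTxId;
      } finally {
        lock.writeLock().unlock();
      }
    }
  }
}
{code}
Any fix along these lines would need to re-validate {{bpServiceToActive}} and 
{{lastActiveClaimTxId}} after acquiring the write lock, not before.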

I'm starting to think that we can't fix this bug just by tuning locks in 
{{BPOfferService}}.  Instead, I think we need to work out a way for the 
re-register polling loops to yield the lock after repeated failures, to give 
the other {{BPServiceActor}} a chance.  If a {{BPServiceActor}} yields like 
this, then it must also have a way to trigger the other {{BPServiceActor}} to 
repeat its heartbeat *before executing any additional commands*.  It's vital to 
re-check the current state of the other NameNode before proceeding to handle 
its commands.
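
Roughly, here is a hypothetical sketch of that yield-and-retrigger idea.  None 
of these names ({{reRegisterLoop}}, {{requestHeartbeatBeforeNextCommand}}, 
{{MAX_ATTEMPTS_BEFORE_YIELD}}) exist in the current code base; this only shows 
the shape of the approach, not a concrete implementation:
{code}
// Hypothetical sketch only -- these names are made up for illustration.
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReRegisterSketch {
  private static final int MAX_ATTEMPTS_BEFORE_YIELD = 3;  // made-up threshold
  private static final long BACKOFF_MS = 1000L;
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Minimal stand-in for BPServiceActor, just enough to make this compile.
  interface ActorSketch {
    boolean shouldRun();
    void register() throws IOException;
    void requestHeartbeatBeforeNextCommand();
  }

  void reRegisterLoop(ActorSketch self, ActorSketch other)
      throws InterruptedException {
    int failures = 0;
    lock.readLock().lock();
    boolean held = true;
    try {
      while (self.shouldRun()) {
        try {
          self.register();             // may hang or throw if this NN is bad
          return;                      // success; lock released in finally
        } catch (IOException e) {
          failures++;
        }
        if (failures >= MAX_ATTEMPTS_BEFORE_YIELD) {
          // Yield: drop the lock so the other BPServiceActor can acquire the
          // write lock and record a failover instead of being blocked here.
          lock.readLock().unlock();
          held = false;
          // Before this actor executes any further queued commands, force the
          // other actor to heartbeat so the current active state is re-checked.
          other.requestHeartbeatBeforeNextCommand();
          Thread.sleep(BACKOFF_MS);
          lock.readLock().lock();
          held = true;
          failures = 0;
        } else {
          Thread.sleep(BACKOFF_MS);    // retrying while still holding the lock
        }
      }
    } finally {
      if (held) {
        lock.readLock().unlock();
      }
    }
  }
}
{code}
The key property is that the lock is never held across an unbounded number of 
failed RPC attempts, and the other actor refreshes its view of which NN is 
active before any queued commands are executed.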


> BPOfferService#processCommandFromActor() synchronization on namenode RPC call 
> delays IBR to Active NN, if Standby NN is unstable
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5014
>                 URL: https://issues.apache.org/jira/browse/HDFS-5014
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Vinay
>            Assignee: Vinay
>         Attachments: HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch, 
> HDFS-5014.patch, HDFS-5014.patch
>
>
> In one of our clusters, the following happened, which failed HDFS writes.
> 1. The Standby NN was unstable and continuously restarting due to some 
> errors, but the Active NN was stable.
> 2. An MR job was writing files.
> 3. At some point the SNN went down again while the datanode was processing 
> the REGISTER command for the SNN.
> 4. Datanodes started retrying to connect to the SNN to register, at the 
> following code in BPServiceActor#retrieveNamespaceInfo(), which is called 
> under synchronization.
> {code}
>       try {
>         nsInfo = bpNamenode.versionRequest();
>         LOG.debug(this + " received versionRequest response: " + nsInfo);
>         break;
> {code}
> Unfortunately, this happened in all datanodes at the same point.
> 5. For the next 7-8 minutes the standby was down, no blocks were reported to 
> the active NN during this time, and writes failed.
> So the culprit is that {{BPOfferService#processCommandFromActor()}} is 
> completely synchronized, which is not required.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
