[ 
https://issues.apache.org/jira/browse/HDFS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825586#comment-13825586
 ] 

Chris Nauroth commented on HDFS-5014:
-------------------------------------

[~umamaheswararao], thank you for joining the review.

bq. I am not sure what is your idea here.

My last idea was to keep holding the lock during each register attempt, but 
release it once an attempt times out.  In other words, don't hold the lock 
during the {{Thread#sleep}} time of {{BPServiceActor#register}}.
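
To make that idea concrete, here is a minimal sketch, not the actual {{BPServiceActor}} code. The class {{RegisterRetrySketch}}, its method names, and the use of a {{ReentrantLock}} are all assumptions for illustration; the real code synchronizes differently.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; not part of Hadoop or the HDFS-5014 patch.
class RegisterRetrySketch {
    private final ReentrantLock lock = new ReentrantLock();

    // Tries 'attempt' up to maxAttempts times. The lock is held only while an
    // attempt runs and is released before sleeping between attempts.
    // Returns the 1-based attempt number that succeeded, or -1 on failure.
    int registerWithRetry(BooleanSupplier attempt, int maxAttempts, long sleepMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            boolean ok;
            lock.lock();
            try {
                ok = attempt.getAsBoolean(); // stands in for the register RPC
            } finally {
                lock.unlock();               // always released before sleeping
            }
            if (ok) {
                return i;
            }
            try {
                Thread.sleep(sleepMillis);   // other threads can take the lock here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    // Exposed only so callers can observe the locking behaviour.
    boolean isLockHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }
}
```

The point of the shape is only that the sleep happens outside the critical section, so a slow or dead NN does not block other work for the whole retry period.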

bq. how about allowing registration commands without the lock, while all 
other commands go under the lock.

Great idea!  That's a much simpler version of what I was trying to achieve.

bq. I think it should work fine with BPOfferService#registrationSucceeded(..) 
synchronized.

Yes, I agree that {{BPOfferService#registrationSucceeded}} now needs to be 
synchronized.  [~vinayrpet], thanks for covering this in the most recent patch.
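
For readers following along, a rough sketch of the shape being discussed, under stated assumptions: {{CommandDispatchSketch}} is a hypothetical stand-in for {{BPOfferService}}, and its fields and helper names are invented for illustration. It is not the code in the patch.

```java
// Hypothetical stand-in for BPOfferService; not the actual Hadoop class or patch.
class CommandDispatchSketch {
    enum Action { REGISTER, OTHER }

    // Recorded only so the locking behaviour can be observed from outside.
    boolean lockHeldDuringRegister;
    boolean lockHeldDuringOther;
    private String lastRegistered;

    // Deliberately not synchronized as a whole: only non-register commands
    // take the object lock.
    void processCommand(String nn, Action action) {
        if (action == Action.REGISTER) {
            reRegister(nn); // may block on an unstable NN, so no lock is held
            return;
        }
        synchronized (this) {
            lockHeldDuringOther = Thread.holdsLock(this);
            // ... handle the non-register command here ...
        }
    }

    private void reRegister(String nn) {
        lockHeldDuringRegister = Thread.holdsLock(this);
        // ... the blocking RPC and its retries would go here ...
        registrationSucceeded(nn);
    }

    // Synchronized because it updates shared state and is now reached
    // without the caller holding the lock.
    synchronized void registrationSucceeded(String nn) {
        lastRegistered = nn;
    }

    synchronized String lastRegistered() {
        return lastRegistered;
    }
}
```

With this shape, a register command stuck retrying against one NN does not prevent commands from the other NN from being processed.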

The new patch looks good.  Just one small thing: the log messages used to say 
whether it was processing a {{DNA_REGISTER}} request from the active or the 
standby.  With the patch, we lose that information, because the log message is 
the same regardless of which NN sent the command.  Can we restore active vs. 
standby in the log message?  That's potentially useful information for 
troubleshooting.
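
Something along these lines would be enough. This is only a sketch: {{RegisterLogSketch}}, the boolean flag, and the message wording are hypothetical and are not the text of the existing log statements.

```java
// Hypothetical helper; the real code would log via its existing LOG object.
class RegisterLogSketch {
    // Builds a log line that names which NN sent the DNA_REGISTER command.
    static String describeRegisterCommand(boolean fromActive) {
        String source = fromActive ? "active" : "standby";
        return "DatanodeCommand action from " + source + " NN: DNA_REGISTER";
    }
}
```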

> BPOfferService#processCommandFromActor() synchronization on namenode RPC call 
> delays IBR to Active NN, if Stanby NN is unstable
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5014
>                 URL: https://issues.apache.org/jira/browse/HDFS-5014
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Vinay
>            Assignee: Vinay
>         Attachments: HDFS-5014-v2.patch, HDFS-5014.patch, HDFS-5014.patch, 
> HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch, 
> HDFS-5014.patch
>
>
> In one of our clusters, the following happened, which caused HDFS writes to fail.
> 1. The Standby NN was unstable and continuously restarting due to some errors, 
> but the Active NN was stable.
> 2. An MR job was writing files.
> 3. At some point the SNN went down again while the datanode was processing the 
> REGISTER command for the SNN. 
> 4. Datanodes started retrying to connect to the SNN to register at the following 
> code in BPServiceActor#retrieveNamespaceInfo(), which is called under 
> synchronization.
> {code}      try {
>         nsInfo = bpNamenode.versionRequest();
>         LOG.debug(this + " received versionRequest response: " + nsInfo);
>         break;{code}
> Unfortunately, this happened on all datanodes at the same time.
> 5. For the next 7-8 min the standby was down, no blocks were reported to the 
> active NN during that time, and writes failed.
> So the culprit is that {{BPOfferService#processCommandFromActor()}} is completely 
> synchronized, which is not required.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
