[ 
https://issues.apache.org/jira/browse/HDFS-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359685#comment-15359685
 ] 

James Clampffer commented on HDFS-10441:
----------------------------------------

Thanks for the very detailed feedback, Bob.  You caught a lot of issues that 
slipped my mind, plus a few I hadn't really thought about.

As far as the bigger issues go (the other points are also important to me, I 
just want to comment on these before the next patch):
bq. Do we pass information in the data pointer for the NN failover event? If 
so, document it in events.h
I think I accidentally left this out.  It needs to be fixed, or I'll take it 
out of the header and file another jira for it.

bq. rpc_connection.cc: HandleRpcResponse should push req back to the head of 
the queue; alternately, don't dequeue it if we got a standby exception.
I need to take a closer look at the best way to do this.  Will fix.
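One way to do this (a minimal sketch, not the actual rpc_connection.cc code; the type names and queue shape here are assumptions) is to only dequeue the request once we know the response wasn't a standby exception, so it stays at the head of the queue and is retried first after failover:

```cpp
#include <deque>
#include <memory>
#include <string>

// Hypothetical stand-ins for the real rpc_connection.cc types;
// names are illustrative only, not the actual libhdfs++ API.
struct Request {
  int call_id;
  std::string method;
};

enum class Status { kOk, kStandbyException };

// Sketch of the proposed fix: leave the request at the head of the
// pending queue when the namenode answered with a standby exception,
// so it is resent to the new active NN; dequeue it on success.
Status HandleRpcResponse(std::deque<std::shared_ptr<Request>> &pending,
                         Status response_status) {
  if (pending.empty()) return response_status;
  if (response_status == Status::kStandbyException) {
    // Don't consume the request; it will be retried after failover.
    return response_status;
  }
  pending.pop_front();  // Normal path: the request is complete.
  return response_status;
}
```

This has the same effect as pushing the request back to the head after dequeuing it, but avoids the window where the request belongs to no queue at all.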

bq. If HandleRpcResponse gets a kStandbyException, will CommsError be called 
twice (once in HandleRpcResponse and again in OnRecvComplete)?
Yes, and I haven't figured out why yet, I should be able to track it down 
quickly now that I'm more familiar with the RPC retry code.  It doesn't seem to 
break things but it sure makes reading the logs more confusing than they need 
to be.

bq. rpc_engine.cc: let's use both namenodes if servers.size() >= 2 rather than 
just bailing out.
I'm torn on this one.  As far as I know you'd need to do a bit of hacking to 
get more than two nodes working in HA, so it seems best to cover the common 
case here.  How about allowing more than two nodes but issuing a warning that 
the cluster probably isn't going to act as you'd expect based on the config 
file and that we discarded the extra nodes?
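Something like the following sketch (function name, logging call, and the vector-of-endpoints shape are all assumptions, not the real rpc_engine.cc code) would keep the first two namenodes and warn about the rest:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustrative sketch only: accept the configured namenode list, keep
// the first two, and warn that the rest are discarded.
std::vector<std::string> SelectHaNamenodes(std::vector<std::string> servers) {
  const size_t kMaxHaNodes = 2;  // Common HA deployments use exactly two NNs.
  if (servers.size() > kMaxHaNodes) {
    std::cerr << "Warning: " << servers.size()
              << " namenodes configured; HA failover only uses the first "
              << kMaxHaNodes << ", discarding the rest.\n";
    servers.resize(kMaxHaNodes);
  }
  return servers;
}
```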

bq. rpc_engine.h: IsCurrentActive/IsCurrentStandby are dangerous as designed: 
they're asking for race conditions as we acquire the lock, check, release the 
lock, then take action. Just before we take action, someone else could change 
the value
You're totally right here.  I didn't want to implement a mechanism that lets 
other things lock the state of that structure for an arbitrary amount of 
time (that seemed like it would be asking for a deadlock).  I think the best 
fix here is to make them private to the HANamenodeTracker.
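To illustrate the race and the fix (all names here are hypothetical, not the actual tracker API): exposing a public IsCurrentActive() invites check-then-act bugs, because the active NN can flip between the check and the action.  Keeping the check private and exposing one operation that checks and snapshots under the same lock closes that window:

```cpp
#include <mutex>
#include <string>

// Racy pattern callers would be tempted to write with a public check:
//   if (tracker.IsCurrentActive()) { send(tracker.Current()); }
// The active NN can change between the two calls.  Sketch of the safe
// alternative: decide and act while holding a single lock.
class NamenodeTracker {
 public:
  // Returns true and writes the active endpoint into *out, atomically
  // with respect to failover updates.
  bool WithActiveEndpoint(std::string *out) {
    std::lock_guard<std::mutex> guard(lock_);
    if (!current_is_active_) return false;
    *out = current_endpoint_;  // Snapshot taken under the same lock.
    return true;
  }

  void SetActive(const std::string &endpoint) {
    std::lock_guard<std::mutex> guard(lock_);
    current_endpoint_ = endpoint;
    current_is_active_ = true;
  }

 private:
  std::mutex lock_;
  std::string current_endpoint_;
  bool current_is_active_ = false;
};
```

No caller can hold the tracker's state for an arbitrary amount of time, so the deadlock concern above doesn't apply.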

bq. rpc_engine.cc: Remove RpcEngine::Start instead of deprecating it
Yep, I left that in to see if someone could think of a use for a hook that 
gets called once the IoService is known to be running.  If you can't think 
of one, I'll be happy to get rid of some dead code.

bq. Don't forget to file bugs to handle more than two namenodes.
Sure.  I was thinking of holding off until that becomes widely supported, but 
it won't hurt to plan ahead.

As a general warning, I don't plan to have much internet access for the next 
week.  Once I do I'll address the rest of the comments and get a new patch up 
that includes the rest of your feedback.




> libhdfs++: HA namenode support
> ------------------------------
>
>                 Key: HDFS-10441
>                 URL: https://issues.apache.org/jira/browse/HDFS-10441
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-10441.HDFS-8707.000.patch, 
> HDFS-10441.HDFS-8707.002.patch, HDFS-10441.HDFS-8707.003.patch, 
> HDFS-10441.HDFS-8707.004.patch, HDFS-10441.HDFS-8707.005.patch, 
> HDFS-10441.HDFS-8707.006.patch, HDFS-10441.HDFS-8707.007.patch, 
> HDFS-10441.HDFS-8707.008.patch, HDFS-10441.HDFS-8707.009.patch, 
> HDFS-8707.HDFS-10441.001.patch
>
>
> If a cluster is HA enabled then do proper failover.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
