[ 
https://issues.apache.org/jira/browse/HDFS-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868973#comment-16868973
 ] 

Íñigo Goiri commented on HDFS-14588:
------------------------------------

I'm guessing the solution is to throw the standby exception and that's it?
I would expect this to happen already.
Can you put a unit test showing this behavior?

In the last couple of months we had an issue with active/standby and WebHDFS;
it might be worth mentioning.
The client connects to the NN asking to write a file, say (reading should be 
pretty straightforward).
The NN replies with the address of a DN plus a parameter called 
"namenoderpcaddress" (this is the tricky one).
When the DN receives the write request, it creates a regular RPC client 
(DFSClient to be specific), which connects to the NN again and does the write.
The issue we had in the past was that the namenoderpcaddress was the address 
of the active NN.
When that NN failed over to some other NN, the DN couldn't find the NN to 
complete, etc.
Bottom line: for active/standby, the namenoderpcaddress can be a source of 
issues.
Not sure it is the same problem, but it's worth bringing up.
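
To make the redirect step concrete, here is a minimal sketch of how a client
could inspect the redirect URL it gets back and see which NN address is pinned
in it. The host names, ports, and path are made up for illustration, and the
helper class is hypothetical (it is not part of Hadoop); only the parameter
name "namenoderpcaddress" comes from the description above.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectParams {
    /** Splits the query string of a URL into decoded key/value pairs. */
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return params; // URL has no query string
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical redirect target handed back by the NN (assumed shape).
        String location = "http://dn1.example.com:9864/webhdfs/v1/tmp/f.txt"
                + "?op=CREATE&namenoderpcaddress=nn1.example.com:8020";
        // The DN will build its RPC client against whatever this names.
        System.out.println(queryParams(location).get("namenoderpcaddress"));
    }
}
```

If the value printed is a single NN's host:port rather than something that
survives a failover, the DN-side client is tied to that one NN, which is the
failure mode described above.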

> Client retries Standby NN continuously even if Active NN is available 
> (WebHDFS)
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-14588
>                 URL: https://issues.apache.org/jira/browse/HDFS-14588
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: CR Hota
>            Priority: Major
>
> This is a behavior we have observed in our HA setup of HDFS.
>  # Active NN is up and serving traffic.
>  # Stand By NN is restarted for maintenance.
>  # After step 2, all new clients (webhdfs only) that connect to the Stand By 
> keep seeing Retriable Exception, because the Stand By NN has not yet started 
> (the RPC server is yet to come up while the FS image is loading) but the HTTP 
> server is started and ready to accept traffic. This keeps happening until the 
> RPC server is up and the SNN knows that it's truly standby. The behavior can 
> last for the whole start-up time, which is high (many minutes) for big 
> clusters.
> The above behavior is causing low availability of HDFS when HDFS is actually 
> still available.
> Ideally webhdfs should throw standby exception (if HA is enabled) and let 
> clients connect to active following that. If active is also not available 
> clients will bounce and automatically connect to the right active.
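
For reference, the client-side behavior the description asks for can be
sketched roughly as follows. This is a self-contained simulation under stated
assumptions, not the WebHDFS client code: StandbySignal is a hypothetical
stand-in for the standby exception, and the NN names are made up.

```java
import java.util.List;
import java.util.function.Function;

public class FailoverSketch {
    /** Hypothetical stand-in for a "this NN is standby" response. */
    static class StandbySignal extends RuntimeException {}

    /**
     * Tries each configured NN in order and moves on as soon as one reports
     * that it is standby, instead of retrying that same NN.
     */
    static <T> T callWithFailover(List<String> namenodes,
                                  Function<String, T> call) {
        for (String nn : namenodes) {
            try {
                return call.apply(nn);
            } catch (StandbySignal e) {
                // Standby answered: fail over to the next NN in the list.
            }
        }
        throw new IllegalStateException(
                "no active NameNode among " + namenodes);
    }

    public static void main(String[] args) {
        // Simulated cluster: nn1 is the restarting standby, nn2 is active.
        String result = callWithFailover(List.of("nn1", "nn2"), nn -> {
            if (nn.equals("nn1")) {
                throw new StandbySignal();
            }
            return "served by " + nn;
        });
        System.out.println(result);
    }
}
```

The point of the sketch is only the control flow: a standby answer triggers an
immediate move to the other NN, whereas a retriable answer keeps the client on
the same NN until it finishes starting.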



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
