[ https://issues.apache.org/jira/browse/KNOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Sharp updated KNOX-1093:
--------------------------------
    Attachment: KNOX-1093.patch

> KNOX Not Handling safemode state of one of the NameNode In HA state 
> --------------------------------------------------------------------
>
>                 Key: KNOX-1093
>                 URL: https://issues.apache.org/jira/browse/KNOX-1093
>             Project: Apache Knox
>          Issue Type: Bug
>          Components: Server
>    Affects Versions: 0.10.0
>            Reporter: Rajesh Chandramohan
>            Assignee: Matthew Sharp
>            Priority: Major
>             Fix For: 1.2.0
>
>         Attachments: KNOX-1093.patch
>
>
> In WebHdfsHaDispatch.java, when a SafeModeException occurs, the dispatch
> calls the retryRequest() method, which in turn calls executeRequest() just
> as a failover request does. However, the NameNode used by the thread does
> not change on any iteration until maxRetryAttempts=300 is reached, with
> retrySleep=1000 (1 sec), i.e. for up to 5 minutes.
> After a maximum of 5 minutes, client retries should pick the right NameNode,
> at least on the next attempt.
> But in this case, if we need to copy a set of files within a stipulated
> time, X% of the connections land on this NameNode and fail. Can we handle
> that better?
> {code:java}
>       try {
>          inboundResponse = executeOutboundRequest(outboundRequest);
>          writeOutboundResponse(outboundRequest, inboundRequest, outboundResponse, inboundResponse);
>       } catch (StandbyException e) {
>          LOG.errorReceivedFromStandbyNode(e);
>          failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       } catch (SafeModeException e) {
>          LOG.errorReceivedFromSafeModeNode(e);
>          retryRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       } catch (IOException e) {
>          LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
>          failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
>       }
>    }
> {code}
> The SafeModeException handling in the Knox HA dispatch code needs to change:
> flag the NameNode that is stuck in safe mode, maintain a "don't try" queue,
> and redirect all further connections only to the healthy active NameNode.
> This way we can handle the X% of failures. What do we think?
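
For illustration only, the "don't try" queue proposed above could look roughly like the sketch below. This is not the attached patch and not existing Knox code: the class SafeModeTracker, its methods, and the cooldown parameter are all hypothetical names chosen for this example. The assumption is that the dispatch knows the list of configured NameNode URLs and can ask such a helper which one to use before each attempt.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Knox): remembers NameNodes that reported
 * safe mode and skips them until a cooldown has elapsed.
 */
class SafeModeTracker {
    // NameNode URL -> time (epoch millis) until which it should be skipped.
    private final Map<URI, Long> skipUntilMillis = new ConcurrentHashMap<>();
    private final long cooldownMillis;

    SafeModeTracker(long cooldownMillis) {
        this.cooldownMillis = cooldownMillis;
    }

    /** Record that this NameNode answered with a safe mode error at nowMillis. */
    void markSafeMode(URI nameNode, long nowMillis) {
        skipUntilMillis.put(nameNode, nowMillis + cooldownMillis);
    }

    /** True while the NameNode is still inside its cooldown window. */
    boolean isSkipped(URI nameNode, long nowMillis) {
        Long until = skipUntilMillis.get(nameNode);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            // Cooldown expired: drop the entry so the node is tried again.
            skipUntilMillis.remove(nameNode, until);
            return false;
        }
        return true;
    }

    /** First configured NameNode that is not currently flagged, if any. */
    Optional<URI> pick(List<URI> candidates, long nowMillis) {
        for (URI candidate : candidates) {
            if (!isSkipped(candidate, nowMillis)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Under these assumptions, the SafeModeException branch would call markSafeMode() for the current NameNode and then ask pick() for the next target instead of sleeping and retrying the same node. The cooldown lets a NameNode that has left safe mode come back into rotation without a restart. If pick() returns empty (every NameNode is flagged), the caller could fall back to the existing retry-with-sleep behaviour.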



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
