[ https://issues.apache.org/jira/browse/KNOX-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Sharp updated KNOX-1093:
--------------------------------
    Attachment:     (was: KNOX-1093.patch)

> KNOX Not Handling safemode state of one of the NameNode In HA state
> --------------------------------------------------------------------
>
>                 Key: KNOX-1093
>                 URL: https://issues.apache.org/jira/browse/KNOX-1093
>             Project: Apache Knox
>          Issue Type: Bug
>          Components: Server
>    Affects Versions: 0.10.0
>            Reporter: Rajesh Chandramohan
>            Assignee: Matthew Sharp
>            Priority: Major
>             Fix For: 1.2.0
>
>
> In WebHdfsHaDispatch.java, when a SafeModeException occurs the dispatch calls
> the retryRequest() method, which in turn calls executeRequest() just as a
> failover request does. However, the NameNode info does not change for the
> thread across any of its iterations, up to maxRetryAttempts=300 with
> retrySleep=1000 (1 second), i.e. up to 300 x 1 s = 5 minutes in total.
> After that maximum of 5 minutes, a client retry should at least pick the
> right NameNode on its next attempt.
> But if we need to copy a set of files within a stipulated time, X% of the
> connections land on this NameNode and fail.
> Can we handle that better?
> {code:java}
> try {
>     // Normal path: forward the request and relay the response.
>     inboundResponse = executeOutboundRequest(outboundRequest);
>     writeOutboundResponse(outboundRequest, inboundRequest,
>         outboundResponse, inboundResponse);
> } catch (StandbyException e) {
>     // Standby NameNode: fail over.
>     LOG.errorReceivedFromStandbyNode(e);
>     failoverRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
> } catch (SafeModeException e) {
>     // Safe mode: retry; as reported above, the target NameNode does not change.
>     LOG.errorReceivedFromSafeModeNode(e);
>     retryRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
> } catch (IOException e) {
>     // Connection error: fail over.
>     LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
>     failoverRequest(outboundRequest, inboundRequest, outboundResponse,
>         inboundResponse, e);
> }
> {code}
> The SafeModeException handling in the Knox HA dispatch code needs to change:
> flag the NameNode that is stuck in safe mode, maintain a "don't try" queue,
> and redirect all further connections only to the healthy active NameNode.
> That way we can handle the X% of failures. What do we think?
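
The "don't try" queue proposed in the description could be sketched roughly as below. This is only an illustration under stated assumptions, not Knox code: the class SafeModeAwareSelector and its methods markSafeMode() and nextHealthy() are hypothetical names, and the time-based expiry of a flagged NameNode is an added assumption (the description does not say how a node would leave the queue). The caller passes the current time in explicitly so the behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Knox): remembers which NameNodes were
 * reported in safe mode and skips them until a quarantine period has passed.
 */
class SafeModeAwareSelector {
    private final List<String> nameNodeUrls;
    private final long quarantineMillis;
    // NameNode URL -> timestamp (ms) until which the node must not be tried.
    private final Map<String, Long> doNotTryUntil = new ConcurrentHashMap<>();

    SafeModeAwareSelector(List<String> nameNodeUrls, long quarantineMillis) {
        this.nameNodeUrls = new ArrayList<>(nameNodeUrls);
        this.quarantineMillis = quarantineMillis;
    }

    /** Flag a NameNode after a request to it failed with a safe-mode error. */
    void markSafeMode(String url, long nowMillis) {
        doNotTryUntil.put(url, nowMillis + quarantineMillis);
    }

    /** Return the first configured NameNode that is not currently flagged. */
    Optional<String> nextHealthy(long nowMillis) {
        for (String url : nameNodeUrls) {
            Long until = doNotTryUntil.get(url);
            if (until == null) {
                return Optional.of(url);
            }
            if (nowMillis >= until) {
                // Quarantine expired: clear the flag unless it was just renewed.
                doNotTryUntil.remove(url, until);
                return Optional.of(url);
            }
        }
        // Every NameNode is flagged; the caller decides whether to wait or fail.
        return Optional.empty();
    }
}
```

With such a helper, the SafeModeException branch would call markSafeMode() for the current NameNode and then send the retry to whatever nextHealthy() returns, instead of sleeping and retrying the same node. How that maps onto the dispatch's existing failover state is left open here.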