Rajesh Chandramohan created KNOX-1093:
-----------------------------------------

             Summary: KNOX not handling safemode state of one of the NameNodes in HA state
                 Key: KNOX-1093
                 URL: https://issues.apache.org/jira/browse/KNOX-1093
             Project: Apache Knox
          Issue Type: Bug
          Components: Server
    Affects Versions: 0.10.0
            Reporter: Rajesh Chandramohan



Per your code in WebHdfsHaDispatch.java, when a SafeModeException occurs it 
calls the retryRequest() method, which in turn calls executeRequest(), just as 
a failover request does. However, the NameNode info does not change for the 
thread across any of its iterations, up to maxRetryAttempts=300 with 
retrySleep=1000 (1 sec). 
After a maximum of 5 minutes, client retries should pick the right NameNode, at 
least on the next attempt.
But in this case, if we need to copy a set of files within a stipulated time, 
X% of connections land on this NameNode and fail. Can we handle that better?
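
The 5-minute figure follows from the two settings quoted above. A minimal sketch of the arithmetic, assuming each retry sleeps for retrySleep milliseconds and ignoring the time taken by the requests themselves (RetryBudget and worstCase are hypothetical names, not Knox code):

```java
import java.time.Duration;

// Hypothetical helper, not part of Knox: estimates how long one thread can
// stay pinned to the same NameNode while it keeps retrying.
final class RetryBudget {

    // Worst case = number of attempts * sleep between attempts.
    // Assumption: request time itself is ignored.
    static Duration worstCase(int maxRetryAttempts, long retrySleepMillis) {
        return Duration.ofMillis(Math.multiplyExact((long) maxRetryAttempts, retrySleepMillis));
    }

    public static void main(String[] args) {
        // Values from the description: maxRetryAttempts=300, retrySleep=1000
        Duration d = worstCase(300, 1000L);
        System.out.println(d.toMinutes() + " minutes"); // prints "5 minutes"
    }
}
```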

{code:java}
      try {
         inboundResponse = executeOutboundRequest(outboundRequest);
         writeOutboundResponse(outboundRequest, inboundRequest, outboundResponse, inboundResponse);
      } catch (StandbyException e) {
         LOG.errorReceivedFromStandbyNode(e);
         failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
      } catch (SafeModeException e) {
         LOG.errorReceivedFromSafeModeNode(e);
         retryRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
      } catch (IOException e) {
         LOG.errorConnectingToServer(outboundRequest.getURI().toString(), e);
         failoverRequest(outboundRequest, inboundRequest, outboundResponse, inboundResponse, e);
      }
   }
{code}


We need to change the SafeModeException handling in the KNOX HA dispatch code to 
flag the NameNode that is stuck in safemode, maintain a do-not-try queue, and 
redirect all further connections only to the healthy active NameNode. This way 
we can handle the X% of failures. What do you think?
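
A minimal sketch of the do-not-try idea, to make the proposal concrete. Everything here is hypothetical (SafeModeQuarantine, markSafeMode, pick, and the example URLs are illustration only, not existing Knox API), and the quarantine period is an assumed parameter: a NameNode that reported safemode is skipped until the period expires, after which it is probed again.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Knox code: remembers which NameNode URLs recently
// reported safemode and skips them when choosing a target for new requests.
final class SafeModeQuarantine {
    // URL -> timestamp (ms) until which the URL should not be tried.
    private final Map<String, Long> quarantinedUntil = new ConcurrentHashMap<>();
    private final long quarantineMillis;

    SafeModeQuarantine(long quarantineMillis) {
        this.quarantineMillis = quarantineMillis;
    }

    // Call when a request to this URL failed with a safemode error.
    void markSafeMode(String url, long nowMillis) {
        quarantinedUntil.put(url, nowMillis + quarantineMillis);
    }

    boolean isQuarantined(String url, long nowMillis) {
        Long until = quarantinedUntil.get(url);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            quarantinedUntil.remove(url, until); // expired: allow a fresh probe
            return false;
        }
        return true;
    }

    // First configured URL that is not quarantined; empty if all are.
    Optional<String> pick(List<String> urls, long nowMillis) {
        return urls.stream().filter(u -> !isQuarantined(u, nowMillis)).findFirst();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://nn1.example.com/webhdfs",
                                          "http://nn2.example.com/webhdfs");
        SafeModeQuarantine q = new SafeModeQuarantine(60_000L);
        q.markSafeMode(urls.get(0), 0L);                          // nn1 reported safemode
        System.out.println(q.pick(urls, 1_000L).orElse("none"));  // nn2 while nn1 is flagged
        System.out.println(q.pick(urls, 61_000L).orElse("none")); // nn1 again after expiry
    }
}
```

Timestamps are passed in explicitly so the behaviour is easy to test. In the dispatch, the SafeModeException branch would call something like markSafeMode() and then choose the next target via pick() instead of retrying the same URL.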



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
