ramkrishna.s.vasudevan created AMBARI-25620:
-----------------------------------------------

             Summary: Ambari Server STOP command on a node might fail because 
of a decommissioned NM's behaviour
                 Key: AMBARI-25620
                 URL: https://issues.apache.org/jira/browse/AMBARI-25620
             Project: Ambari
          Issue Type: Bug
            Reporter: ramkrishna.s.vasudevan


As part of our use case, before we STOP any set of components on a node, we 
check whether the node hosts a NodeManager (NM) and a DataNode (DN). If it 
does, we first decommission them and then issue STOP to all the components on 
the node.
* A DataNode, once decommissioned, is not stopped; its process stays alive 
until we stop it explicitly. The ResourceManager (RM), by contrast, stops the 
NM on decommission (this is known behaviour). What then happens is that the 
RecoveryManager in Ambari keeps restarting the NM, because its DESIRED state 
(on the agent side) is still STARTED. As a result the NM state keeps flipping 
between STARTED and INSTALLED on the agent side, and every time this happens 
the component status is reported to the server.
* On receiving this update, the server sets the component STATE to STARTED or 
INSTALLED, as the case may be.
* Now, coming back to the actual STOP command request that we issued: by 
design, once all the component updates have been sent, the Ambari server 
processes them in a batch and tries to perform the in-memory STATE transitions 
on the server side (not in the cache, but in the FSM, i.e. the state machine). 
Here the server is handling an INSTALL/STOP event for the NM and expects the NM 
to be in the INSTALLED state, but for the reason described above it finds the 
STARTED state instead. The server therefore aborts the entire STOP command, 
assuming there is some problem in what it sees.
* The server-side code that handles this:
{code:java}
//Multimap is analog of Map<Object, List<Object>> but allows to avoid nested loop
ListMultimap<String, ServiceComponentHostEvent> eventMap = formEventMap(stage, commandsToStart);
Map<ExecutionCommand, String> commandsToAbort = new HashMap<>();
if (!eventMap.isEmpty()) {
  LOG.debug("==> processing {} serviceComponentHostEvents...", eventMap.size());
  Cluster cluster = clusters.getCluster(stage.getClusterName());
  if (cluster != null) {
    Map<ServiceComponentHostEvent, String> failedEvents = cluster.processServiceComponentHostEvents(eventMap);

    if (failedEvents.size() > 0) {
      LOG.error("==> {} events failed.", failedEvents.size());
    }

    for (Iterator<ExecutionCommand> iterator = commandsToUpdate.iterator(); iterator.hasNext(); ) {
      ExecutionCommand cmd = iterator.next();
      for (ServiceComponentHostEvent event : failedEvents.keySet()) {
        if (StringUtils.equals(event.getHostName(), cmd.getHostname()) &&
          StringUtils.equals(event.getServiceComponentName(), cmd.getRole())) {
          iterator.remove();
          commandsToAbort.put(cmd, failedEvents.get(event));
          break;
        }
      }
    }
{code}
* See processServiceComponentHostEvents() for how the transition is performed 
and where the invalid transition occurs. The log message looks like this:

{code:java}
org.apache.ambari.server.state.fsm.InvalidStateTransitionException: Invalid event: HOST_SVCCOMP_INSTALL at STARTED
{code}
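To make the failure mode concrete, here is a minimal, self-contained sketch of a state machine that rejects an event its current state does not accept. It is a toy model, not Ambari's FSM: the class ToyComponentFsm, its transition table, and the HOST_SVCCOMP_START event are assumptions made for illustration. Only the HOST_SVCCOMP_INSTALL event name, the STARTED and INSTALLED states, and the message format are taken from the log line above.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT Ambari's state machine. The transition table is assumed.
class ToyComponentFsm {
    enum State { INSTALLED, STARTED }
    enum Event { HOST_SVCCOMP_INSTALL, HOST_SVCCOMP_START }

    // Hypothetical stand-in for Ambari's InvalidStateTransitionException.
    static class InvalidTransitionException extends RuntimeException {
        InvalidTransitionException(Event event, State current) {
            super("Invalid event: " + event + " at " + current);
        }
    }

    // Assumed table: which events each state accepts in this toy model.
    private static final Map<State, Set<Event>> ACCEPTED = new EnumMap<>(State.class);
    static {
        ACCEPTED.put(State.INSTALLED,
                EnumSet.of(Event.HOST_SVCCOMP_INSTALL, Event.HOST_SVCCOMP_START));
        // INSTALL is deliberately not accepted while STARTED.
        ACCEPTED.put(State.STARTED, EnumSet.of(Event.HOST_SVCCOMP_START));
    }

    // Returns the next state, or throws if the event is not accepted.
    static State apply(State current, Event event) {
        if (!ACCEPTED.get(current).contains(event)) {
            throw new InvalidTransitionException(event, current);
        }
        return event == Event.HOST_SVCCOMP_START ? State.STARTED : State.INSTALLED;
    }

    public static void main(String[] args) {
        // The agent restarted the NM, so the server-side state is STARTED
        // by the time the INSTALL event from the STOP request is processed.
        try {
            apply(State.STARTED, Event.HOST_SVCCOMP_INSTALL);
        } catch (InvalidTransitionException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the same event succeeds when the state is INSTALLED and fails when a restart has moved the state to STARTED, which is the race described above.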
Since the entire set of STOP commands is considered a FAILURE, we issue an 
ABORT command, and hence all the STOP commands issued to the agent are aborted.
This leaves the DN in the STARTED state, and hence the subsequent DELETE HOST 
command keeps failing. 
The idea is to ensure that, for an NM that has been decommissioned and whose 
current state is STARTED when a HOST_SVCCOMP_INSTALL event is processed, the 
event is not treated as a failure condition, so that the commands are not 
aborted. 
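
One possible shape for this check is sketched below. It is not a patch against Ambari: the class DecommissionedNmCheck, the FailedEvent holder, and the component name string "NODEMANAGER" are hypothetical names introduced for illustration, and how the decommissioned flag would be obtained on the server is left open. The sketch only shows the filtering idea: failed events that match the tolerated condition are dropped before deciding which commands to abort.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these types exist in Ambari.
class DecommissionedNmCheck {

    // Hypothetical flattened view of one failed event and its component.
    static final class FailedEvent {
        final String componentName;
        final String eventType;
        final String currentState;
        final boolean decommissioned;

        FailedEvent(String componentName, String eventType,
                    String currentState, boolean decommissioned) {
            this.componentName = componentName;
            this.eventType = eventType;
            this.currentState = currentState;
            this.decommissioned = decommissioned;
        }
    }

    // True when the failure is the known decommissioned-NM case:
    // a HOST_SVCCOMP_INSTALL event arriving while the NM is STARTED.
    static boolean isTolerated(FailedEvent e) {
        return "NODEMANAGER".equals(e.componentName) // assumed component name
                && e.decommissioned
                && "STARTED".equals(e.currentState)
                && "HOST_SVCCOMP_INSTALL".equals(e.eventType);
    }

    // Keeps only the failures that should still cause an abort.
    static List<FailedEvent> eventsToAbort(List<FailedEvent> failedEvents) {
        List<FailedEvent> result = new ArrayList<>();
        for (FailedEvent e : failedEvents) {
            if (!isTolerated(e)) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<FailedEvent> failed = new ArrayList<>();
        failed.add(new FailedEvent("NODEMANAGER", "HOST_SVCCOMP_INSTALL", "STARTED", true));
        failed.add(new FailedEvent("DATANODE", "HOST_SVCCOMP_INSTALL", "STARTED", true));
        System.out.println("events still causing an abort: " + eventsToAbort(failed).size());
    }
}
```

Keeping the condition this narrow (component, decommission status, state, and event type must all match) means any other invalid transition would still abort the commands as it does today.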



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
