[
https://issues.apache.org/jira/browse/AMBARI-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddharth Wagle updated AMBARI-2041:
------------------------------------
Attachment: AMBARI-2041.patch
[~sumitmohanty] Fix, without unit test. Incremental patch to follow.
> If a host that has a service client installed and the host is down, service
> start will fail
> -------------------------------------------------------------------------------------------
>
> Key: AMBARI-2041
> URL: https://issues.apache.org/jira/browse/AMBARI-2041
> Project: Ambari
> Issue Type: Bug
> Components: controller
> Affects Versions: 1.3.0
> Reporter: Siddharth Wagle
> Assignee: Siddharth Wagle
> Fix For: 1.3.0
>
> Attachments: AMBARI-2041.patch
>
>
> In condor, service start may include client install on some hosts. If the
> host where a client is being installed is down (heartbeat lost), then service
> start fails. This is because the success factor for clients (tested with
> MAPREDUCE_CLIENT) is 1, so a single failure fails the stage. During service
> start there are three stages, one each for installs, starts, and checks. When
> the install stage fails, the later stages are aborted.
> A few observations:
> - The client goes to the INSTALL_FAILED state, so a second attempt skips
>   installing the client and thereby succeeds in starting the service. (This
>   is a bug, as we should retry installing a component that is in the
>   INSTALL_FAILED state. However, at this point we are saved by this bug.)
> - A service check can be scheduled on a host that is in the
>   UNHEALTHY/UNKNOWN state and can fail.
> - Now the service cannot be stopped because:
>   - The stop command sees the INSTALL_FAILED state and schedules an INSTALL
>     task for the client, which fails.
>   - The STOP commands for other components are in a later stage and are
>     aborted.
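
To make the failure mode in the description concrete, here is a minimal sketch of how a per-stage success factor can abort the remaining stages. It is not Ambari code and not what the attached patch does. The names Stage, stage_succeeded and run_stages, and the host names, are hypothetical and only model the behaviour described above: a stage passes when the fraction of successful tasks reaches its success factor, and every stage after a failed one is aborted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """Hypothetical model of one stage of a request (not an Ambari class)."""
    name: str
    success_factor: float          # required fraction of successful tasks
    task_results: Dict[str, bool]  # host name -> whether the task succeeded


def stage_succeeded(stage: Stage) -> bool:
    """A stage passes when its success ratio reaches the success factor."""
    if not stage.task_results:
        return True
    succeeded = sum(1 for ok in stage.task_results.values() if ok)
    return succeeded / len(stage.task_results) >= stage.success_factor


def run_stages(stages: List[Stage]) -> Dict[str, str]:
    """Evaluate stages in order; abort everything after the first failure."""
    outcome: Dict[str, str] = {}
    failed = False
    for stage in stages:
        if failed:
            outcome[stage.name] = "ABORTED"
        elif stage_succeeded(stage):
            outcome[stage.name] = "COMPLETED"
        else:
            outcome[stage.name] = "FAILED"
            failed = True
    return outcome


# Service start: the client install on host2 fails because its heartbeat is lost.
start_request = [
    Stage("install", 1.0, {"host1": True, "host2": False}),
    Stage("start", 1.0, {"host1": True}),
    Stage("check", 1.0, {"host1": True}),
]
start_outcome = run_stages(start_request)
print(start_outcome)
# {'install': 'FAILED', 'start': 'ABORTED', 'check': 'ABORTED'}

# Assumption for illustration only: a success factor below 1 would tolerate
# the single failed client install and let the stage complete.
relaxed_outcome = run_stages(
    [Stage("install", 0.5, {"host1": True, "host2": False})]
)
print(relaxed_outcome)
# {'install': 'COMPLETED'}
```

The same model covers the stop case from the last observation: if the stop request begins with an INSTALL stage for the client on the dead host, that stage fails and the later STOP stage is reported as ABORTED.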