-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
-----------------------------------------------------------

Review request for Ambari, Jonathan Hurley and Nate Cole.


Bugs: AMBARI-15446
    https://issues.apache.org/jira/browse/AMBARI-15446


Repository: ambari


Description
-------

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, Ambari should automatically retry it for up to x minutes. This is useful when a host goes down while Ambari is running a task on it.

ambari.properties will have 1 new parameter. E.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)

If Ambari Server is restarted, it should be able to recover.

Today, Action Scheduler increases the attempt_count whenever a task is retried, but doing so requires resetting the start_time to -1. Because of this, we cannot rely on the start_time property to know when to time out after several retries.

For the implementation, will add another thread to Ambari that monitors failed tasks, only during an active RU/EU, and changes their status back to PENDING so that Action Scheduler can reschedule them. Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no other stages are allowed to proceed.

In order to know when a task was first started, will add a new column to the host_role_command table called original_start_time.

For the agents, we need to ensure that they always write out a response. On its first heartbeat, an agent should send the status of its last command so that we know it failed and Ambari can retry.

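As an illustration only, the retry-window decision described above could be sketched as follows. This is not the code in the patch: the class name, the Status enum (a stand-in for Ambari's task states) and both method signatures are hypothetical, and the sketch assumes original_start_time is stored in epoch milliseconds, with -1 meaning "never started".

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names are taken from the actual patch. */
final class RetryWindowSketch {

    /** Stand-in for the task states mentioned in the description. */
    enum Status { PENDING, IN_PROGRESS, HOLDING_FAILED, HOLDING_TIMEDOUT, COMPLETED }

    private RetryWindowSketch() {
    }

    /**
     * Decides whether a held task is still inside its retry window.
     *
     * @param status              current task status
     * @param originalStartTimeMs first start of the task (assumed epoch millis, -1 if never started)
     * @param nowMs               current time in epoch millis
     * @param maxRetryTimeoutMins value of stack-upgrade.max_retry_timeout_mins; <= 0 is treated
     *                            here as "property not present", which disables auto-retry
     */
    static boolean shouldRetry(Status status, long originalStartTimeMs,
                               long nowMs, int maxRetryTimeoutMins) {
        if (maxRetryTimeoutMins <= 0) {
            return false; // feature is off unless the property is set
        }
        if (status != Status.HOLDING_FAILED && status != Status.HOLDING_TIMEDOUT) {
            return false; // only held failures are candidates for retry
        }
        if (originalStartTimeMs < 0) {
            return false; // no original start recorded, so the window cannot be measured
        }
        long elapsedMs = nowMs - originalStartTimeMs;
        // The window is measured from the first start, not from the start_time
        // that is reset to -1 on every retry.
        return elapsedMs < TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
    }

    /** Returns PENDING when the task should be rescheduled, otherwise the unchanged status. */
    static Status nextStatus(Status status, long originalStartTimeMs,
                             long nowMs, int maxRetryTimeoutMins) {
        return shouldRetry(status, originalStartTimeMs, nowMs, maxRetryTimeoutMins)
                ? Status.PENDING
                : status;
    }
}
```

A monitor thread could call nextStatus(...) for each held task of the active upgrade and persist the result. Because the decision depends only on the persisted original_start_time and the current clock, it would give the same answer after an Ambari Server restart.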

Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java 3a80803
  ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/checks/PreviousUpgradeCompleted.java 3a4467f
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/ClusterVersionDAO.java 1bcca60
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/ClusterVersionEntity.java f1867b4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/Cluster.java ed3c772
  ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java 1c7ff61
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf

Diff: https://reviews.apache.org/r/44926/diff/


Testing
-------

Verified on a live cluster.
TODO: Still need to make more changes to the implementation, add the config, switch to a Guava service, add the new column, and add unit tests.


Thanks,

Alejandro Fernandez