Hi, I wish your experience with Rolling Upgrades had been better. I'll do my best to explain the solution to each of those items. As a developer, I like to hear this feedback so we can make the product better.
* Cluster is locked down while in the middle of an upgrade: Operations like changing configs, adding hosts, adding services, etc. are disallowed by default. This is meant to prevent the user from drastically changing the stack configs and ending up in a worse state. Cluster operators can still change configs by navigating to http://server:8080/#/experimental and enabling "opsDuringRollingUpgrade". I completely agree that we need to be more flexible in this area, since configs are likely to break and savvy users should still be allowed to change them.
* Configs are only changed in major stack versions: In HDP 2.2.*->2.2.*, we don't expect any config changes, so the Upgrade Pack doesn't orchestrate any, whereas a 2.2.*->2.3.* upgrade has many config changes. At times, this will break, and we typically find out about it through testing and reports from users with custom configs. Tools like SmartSense can also help to point out incorrect configs. In the future, we may relax this so that even minor versions are allowed to change configs.
* Unable to finalize since hosts are not on the new version: We've talked about a way to "force finalize" the versions. Today, Ambari is very strict about requiring all hosts to be updated. As a workaround, we have a Python script called "RU Magician" that will allow you to fix things and force any version to CURRENT; check out https://github.com/apache/ambari/tree/branch-2.1.2/contrib/ru_magician. You ran the correct SQL statements, so kudos to you for that.
* Components that don't advertise a version: Some components, like ZKFC, AMS, MySQL, and Kerberos Client, don't need to advertise a version. In the case of ZKFC, this is because it uses the same binary as NameNode. Perhaps an earlier version of Ambari caused it to stay stuck on 2.2.6 in the DB. If you feel more comfortable, you can change ZKFC's version to 'UNKNOWN'.
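If you do decide to overwrite the ZKFC version, a minimal sketch of that change might look like the following. It runs against an in-memory SQLite stand-in rather than the real Ambari database. The table layout is a simplified assumption based only on the columns used in your query, and the helper names (`make_demo_db`, `stale_components`, `mark_version_unknown`) are hypothetical, not part of Ambari.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in with only the columns used in this thread.

    Assumption: this is NOT the full Ambari schema, just enough to run the queries.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hosts (host_id INTEGER PRIMARY KEY, host_name TEXT);
        CREATE TABLE hostcomponentstate (
            host_id INTEGER, service_name TEXT, component_name TEXT, version TEXT);
        INSERT INTO hosts VALUES (1, 'node1'), (2, 'node2');
        INSERT INTO hostcomponentstate VALUES
            (1, 'HDFS', 'ZKFC', '2.2.6.0-2800'),
            (2, 'HDFS', 'ZKFC', '2.2.6.0-2800'),
            (1, 'HDFS', 'NAMENODE', '2.2.9.0-3393');
        """
    )
    return conn


def stale_components(conn, current_version):
    """Same check as the query in the original message: rows that are neither
    on the current version nor marked 'UNKNOWN'."""
    return conn.execute(
        """
        SELECT h.host_name, hcs.service_name, hcs.component_name, hcs.version
        FROM hostcomponentstate hcs JOIN hosts h ON hcs.host_id = h.host_id
        WHERE hcs.version NOT IN (?, 'UNKNOWN')
        ORDER BY h.host_name
        """,
        (current_version,),
    ).fetchall()


def mark_version_unknown(conn, component_name):
    """Set the recorded version of one component to 'UNKNOWN' on every host.

    Returns the number of rows changed.
    """
    with conn:  # commits on success, rolls back if the UPDATE raises
        cur = conn.execute(
            "UPDATE hostcomponentstate SET version = 'UNKNOWN' "
            "WHERE component_name = ?",
            (component_name,),
        )
    return cur.rowcount
```

Running `stale_components` before and after `mark_version_unknown(conn, 'ZKFC')` shows whether any rows are still reporting an old version. Against a real Ambari database the same UPDATE would be worth trying on a backup first.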

My suggestion is to create Jiras on Apache for the following:
* Allow force finalizing a version during Stack Upgrade
* Allow changing configs during the middle of a Stack Upgrade; this will need to prompt the user with a disclaimer/warning

Thanks,
Alejandro

On 11/22/15, 11:32 PM, "Andrew Robertson" <[email protected]> wrote:

I performed a rolling upgrade of HDP from 2.2.8 to 2.2.9 today using Ambari 2.1.2.1 & ran into several issues.

My YARN resource manager failed to start due to a "Service ResourceManager failed in state INITED; cause: java.lang.IllegalArgumentException: Illegal capacity of -1.0 for node-label=default in queue=root, valid capacity should in range of [0, 100].". (It was working fine with 2.2.8; this may be something new in 2.2.9).

As Ambari usage feedback - this was impossible to fix in Ambari while the upgrade was going on, and it added a ton of (down)time to the upgrade. This error caused a number of service checks to time out after a long wait (many checks took 5-15 min to fail). I didn't see any way to fix the error. The only options I had during the upgrade were "Downgrade", which I didn't want to do (it was a test cluster after all, and I wanted to get through it so I could fix it), and "Ignore", which allowed it to continue but caused each step to take 300+ seconds.

Ambari seemed to lock the configs, so I couldn't make changes to fix the issue while the upgrade was going on. Likewise, I couldn't manually restart the service myself or abort the service checks. Even at the "Verify operation" and the "finalize" checkpoints, where I could "pause" the upgrade, the configs were still locked and I had no ability to start/stop services.

At the end, Ambari started giving other errors about being unable to finalize the upgrade. I ended up rebooting the cluster & Ambari - this got it back to a state where I could edit the configs again to fix the YARN RM config.
The fix to the RM not starting ended up being the same as AMBARI-11358, which appears to only have been fixed in the HDP 2.3 upgrade.

Separately, Ambari had the 2.2.9 version waiting to be finalized, but I couldn't find any way to do this in the UI after the restart. So I went into the database and ran the following:

    UPDATE host_version SET state = 'INSTALLED' WHERE state = 'CURRENT';
    UPDATE host_version SET state = 'CURRENT' WHERE repo_version_id = <id for 2.2.9.0 version> and state = 'UPGRADED';
    UPDATE cluster_version SET state = 'INSTALLED' WHERE state = 'CURRENT';
    UPDATE cluster_version SET state = 'CURRENT' WHERE repo_version_id = <id for 2.2.9.0 version> and state = 'UPGRADED';
    UPDATE hostcomponentstate set upgrade_state = 'NONE';

This seems to have fixed that.

Possibly unrelated - I did find there are 2 services that show up with an even older version when checking the Ambari database:

    ambari=> SELECT h.host_name, hcs.service_name, hcs.component_name, hcs.version
             FROM hostcomponentstate hcs JOIN hosts h ON hcs.host_id = h.host_id
             where hcs.version NOT IN ('2.2.9.0-3393', 'UNKNOWN');
     host_name | service_name | component_name |   version
    -----------+--------------+----------------+--------------
     node2     | HDFS         | ZKFC           | 2.2.6.0-2800
     node1     | HDFS         | ZKFC           | 2.2.6.0-2800

(But I had upgraded from 2.2.8; 2.2.6 was the version before that.)

Any suggestions on how to fix this? I think Ambari may just be confused, but I'm not sure how to verify this and/or fix Ambari (other than overwriting this field in the database?). I've verified that the yum versions are right for the package and that the right processes are actually running on the machine.

Thank you!
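For anyone repeating the manual finalize described in the message above, here is a hedged sketch of the same five statements wrapped in a single transaction, so that a failure partway through leaves nothing half-updated. It uses an in-memory SQLite stand-in with a simplified, assumed table layout (only the columns the statements touch). The function names, the repo version ids 1 and 2, and the `'IN_PROGRESS'` starting value are placeholders for illustration, not values taken from a real cluster.

```python
import sqlite3


def make_upgrade_db():
    """In-memory stand-in: two hosts, old repo version id 1, new repo version id 2.

    Assumption: simplified tables, not the full Ambari schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE host_version (host_id INTEGER, repo_version_id INTEGER, state TEXT);
        CREATE TABLE cluster_version (repo_version_id INTEGER, state TEXT);
        CREATE TABLE hostcomponentstate (host_id INTEGER, upgrade_state TEXT);
        INSERT INTO host_version VALUES
            (1, 1, 'CURRENT'), (1, 2, 'UPGRADED'),
            (2, 1, 'CURRENT'), (2, 2, 'UPGRADED');
        INSERT INTO cluster_version VALUES (1, 'CURRENT'), (2, 'UPGRADED');
        INSERT INTO hostcomponentstate VALUES (1, 'IN_PROGRESS'), (2, 'IN_PROGRESS');
        """
    )
    return conn


def force_current(conn, repo_version_id):
    """Apply the thread's UPDATE statements atomically, in the same order."""
    with conn:  # one transaction: commit on success, roll back on any error
        for table in ("host_version", "cluster_version"):
            # Demote whatever is CURRENT first, then promote the upgraded version.
            conn.execute(
                f"UPDATE {table} SET state = 'INSTALLED' WHERE state = 'CURRENT'"
            )
            conn.execute(
                f"UPDATE {table} SET state = 'CURRENT' "
                "WHERE repo_version_id = ? AND state = 'UPGRADED'",
                (repo_version_id,),
            )
        conn.execute("UPDATE hostcomponentstate SET upgrade_state = 'NONE'")
```

The order of the two updates per table matters: demoting the old CURRENT rows first means the newly promoted rows are never caught by the demotion.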
