I've created AMBARI-14031 & AMBARI-14032 for these issues. Thanks also for the pointer to the #experimental / opsDuringRollingUpgrade workaround.
One more question - is there a process for uninstalling old / unused versions of HDP? For example, now that I've upgraded from 2.2.8 -> 2.2.9, is there a way to remove 2.2.6?

On Mon, Nov 23, 2015 at 11:35 AM, Alejandro Fernandez <[email protected]> wrote:

> Hi,
>
> I wish your experience with Rolling Upgrades would have been better.
> I'll do my best to explain the solution to each one of those items. As a
> developer, I like to hear this feedback so we can make the product better.
>
> * Cluster is locked down while in the middle of upgrade:
> Operations like changing configs, adding hosts, adding services, etc. are
> disallowed by default.
> This is meant to prevent the user from drastically changing the stack
> configs and ending up in a worse state.
> Cluster operators can still change configs by navigating to
> http://server:8080/#/experimental and enabling "opsDuringRollingUpgrade".
> I completely agree that we need to be more flexible in this area since
> configs are likely to break, and the savvy users should still be allowed to
> change them.
>
> * Configs are only changed in major stack versions:
> In HDP 2.2.*->2.2.*, we don't expect any config changes, so the Upgrade Pack
> doesn't orchestrate any, whereas a 2.2.*->2.3.* has many config changes.
> At times, this will break, and we typically find out about it during testing
> and reports from users with custom configs.
> Tools like SmartSense can also help to point out incorrect configs. In the
> future, we may relax this so that even minor versions are allowed to change
> configs.
>
> * Unable to finalize since hosts are not on the new version:
> We've talked about a way to "force finalize" the versions. Today, Ambari is
> very strict about requiring all hosts to be updated.
> As a workaround, we have a python script called "RU Magician" that will
> allow you to fix things, and force any version to CURRENT; checkout
> https://github.com/apache/ambari/tree/branch-2.1.2/contrib/ru_magician
> You ran the correct SQL statements, so kudos to you for that.
>
> * Components that don't advertise a version:
> Some components like ZKFC, AMS, MySQL, Kerberos Client, don't need to
> advertise a version.
> In the case of ZKFC, it is because it uses the same binary as that of
> NameNode. So perhaps an earlier version of Ambari caused it to stay stuck on
> 2.2.6 in the DB.
> If you feel more comfortable, you can change ZKFC's version to 'UNKNOWN'.
>
> My suggestion is to create Jiras on Apache for the following:
>
> Allow force finalizing a version during Stack Upgrade
> Allow changing configs during the middle of a Stack Upgrade, will need to
> prompt user with a disclaimer/warning
>
> Thanks,
> Alejandro
>
> On 11/22/15, 11:32 PM, "Andrew Robertson" <[email protected]>
> wrote:
>
> I performed a rolling upgrade of HDP from 2.2.8 to 2.2.9 today using
> Ambari 2.1.2.1 & ran into several issues.
>
> My YARN resource manager failed to start due to a "Service
> ResourceManager failed in state INITED; cause:
> java.lang.IllegalArgumentException: Illegal capacity of -1.0 for
> node-label=default in queue=root, valid capacity should in range of
> [0, 100].". (It was working fine with 2.2.8; this may be something new
> in 2.2.9).
>
> As Ambari usage feedback - this was impossible to fix in Ambari while
> the upgrade was going on, and it added a ton of (down)time to the
> upgrade. This error caused a number of service checks to time out
> after a long wait (many checks took 5-15 min to fail). I didn't see
> any way to fix the error (the only options I had during the upgrade
> were "Downgrade" - which I didn't want to do (It was a test cluster
> after all, I wanted to get through it so I could fix it); and "Ignore"
> which allowed it to continue, but caused each step to take 300+
> seconds. Ambari seemed to lock the configs so I couldn't make changes
> to fix the issue while the upgrade was going on. Likewise, I couldn't
> manually restart the service myself or abort the service checks. Even
> at the "Verify operation" and the "finalize" checkpoints, where I
> could "pause" the upgrade - the configs were still locked and I had no
> ability to start/stop services.
>
> At the end, Ambari started giving other errors about being unable to
> finalize the upgrade. I ended up rebooting the cluster & ambari - this
> got it back to a state where I could edit the configs again to fix the
> YARN RM config. The fix to the RM not starting ended up being the
> same as AMBARI-11358, which appears to only have been fixed in the
> HDP2.3 upgrade.
>
> Separately, Ambari had the 2.2.9 version waiting to be finalized but I
> couldn't find any way to do this in the UI after the restart. So I
> went into the database and ran the following:
> UPDATE host_version SET state = 'INSTALLED' WHERE state = 'CURRENT';
> UPDATE host_version SET state = 'CURRENT' WHERE repo_version_id = <id for 2.2.9.0 version> and state = 'UPGRADED';
> UPDATE cluster_version SET state = 'INSTALLED' WHERE state = 'CURRENT';
> UPDATE cluster_version SET state = 'CURRENT' WHERE repo_version_id = <id for 2.2.9.0 version> and state = 'UPGRADED';
> UPDATE hostcomponentstate set upgrade_state = 'NONE';
> This seems to have fixed that.
>
> Possibly unrelated - I did find there are 2 services that show up with
> an even older old version when checking the ambari database:
>
> ambari=> SELECT h.host_name, hcs.service_name, hcs.component_name,
> hcs.version FROM hostcomponentstate hcs JOIN hosts h ON hcs.host_id =
> h.host_id where hcs.version NOT IN ('2.2.9.0-3393', 'UNKNOWN');
>  host_name | service_name | component_name |   version
> -----------+--------------+----------------+--------------
>  node2     | HDFS         | ZKFC           | 2.2.6.0-2800
>  node1     | HDFS         | ZKFC           | 2.2.6.0-2800
>
> (But I had upgraded from 2.2.8; 2.2.6 was the version before that).
>
> Any suggestions on how to fix this? I think Ambari may just be
> confused, but I'm not sure how to verify this and/or fix Ambari (other
> than overwrite this field in the database?). I've verified the yum
> versions are right for the package and the right processes are
> actually running on the machine.
>
> Thank you!
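
For anyone who needs to repeat the manual finalize quoted above, here is a minimal sketch of running those five UPDATE statements as a single transaction, so that a failure partway through cannot leave the cluster with no CURRENT version. It uses Python's sqlite3 with an in-memory database purely so that it runs anywhere; the database in this thread appears to be PostgreSQL (judging by the ambari=> prompt), which would need its own driver and placeholder style. The table and column names come from the statements above, but the reduced schema, the force_finalize and make_demo_db names, and the repo_version_id values 1 and 2 are assumptions for illustration only.

```python
import sqlite3


def force_finalize(conn, repo_version_id):
    """Run the five UPDATEs from the thread atomically.

    repo_version_id is the id of the version to promote; it has to be
    looked up in the real database (it is not given in the thread).
    """
    with conn:  # commits on success, rolls back if any statement raises
        for table in ("host_version", "cluster_version"):
            # Demote the old CURRENT version first...
            conn.execute(
                f"UPDATE {table} SET state = 'INSTALLED' WHERE state = 'CURRENT'"
            )
            # ...then promote the upgraded one.
            conn.execute(
                f"UPDATE {table} SET state = 'CURRENT' "
                "WHERE repo_version_id = ? AND state = 'UPGRADED'",
                (repo_version_id,),
            )
        conn.execute("UPDATE hostcomponentstate SET upgrade_state = 'NONE'")


def make_demo_db():
    """Build a tiny stand-in schema (hypothetical, far smaller than Ambari's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE host_version (host_id INTEGER, repo_version_id INTEGER, state TEXT);
        CREATE TABLE cluster_version (repo_version_id INTEGER, state TEXT);
        CREATE TABLE hostcomponentstate (component_name TEXT, upgrade_state TEXT);
        INSERT INTO host_version VALUES (1, 1, 'CURRENT'), (1, 2, 'UPGRADED');
        INSERT INTO cluster_version VALUES (1, 'CURRENT'), (2, 'UPGRADED');
        INSERT INTO hostcomponentstate VALUES ('ZKFC', 'IN_PROGRESS');
        """
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    force_finalize(demo, 2)  # 2 is the made-up id of the new version
    for row in demo.execute(
        "SELECT repo_version_id, state FROM cluster_version ORDER BY repo_version_id"
    ):
        print(row)
```

The ordering inside the loop mirrors the original statements: the old CURRENT row is demoted before the UPGRADED row is promoted, so the second UPDATE never touches the row that was just demoted.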
