Weird but very bad problem with my test cluster 2-3 weeks after upgrading to 
Luminous.
All 7 running VMs are corrupted and unbootable: 6 Windows and 1 CentOS 7. The 
Windows error is "unmountable boot volume"; CentOS 7 will only boot to 
emergency mode.
The 3 VMs that were off during the event work as expected: 2 Windows and 1 Ubuntu.

History:
7-node cluster: 5 OSD, 3 MON (1 node is both MON and OSD), plus 2 KVM nodes.

The system was originally running Jewel on old tower servers. Migrated to all 
rackmount servers, then upgraded to Kraken, which added the MGR daemons.

On the 13th or 14th of October I upgraded to Luminous. The upgrade went 
smoothly: ceph versions showed all nodes running 12.2.1 and health was 
HEALTH_OK. I even checked out the Ceph Dashboard.
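For reference, the checks were along the lines of:

$ ceph versions   # every daemon reported 12.2.1 (luminous)
$ ceph -s         # overall cluster status; HEALTH_OK at the time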

Then around the 20th I created a master for cloning, spun off a clone, mucked 
around with it, flattened it so it was standalone, and shut it and the master 
off.
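For reference, the clone was made with the usual RBD snapshot/protect/clone/flatten 
sequence, roughly like this (the pool and image names here are just placeholders):

$ rbd snap create rbd/master@base        # snapshot the master image
$ rbd snap protect rbd/master@base       # protect the snapshot so it can be cloned
$ rbd clone rbd/master@base rbd/clone1   # spin off the clone
$ rbd flatten rbd/clone1                 # copy all data so the clone is standalone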

Problem:
On November 1st I started the clone and got the following error.

"failed to start domain internal error: qemu unexpectedly closed the monitor 
vice virtio-balloon"



To resolve, I restarted the MONs one at a time:

Restarted the 1st MON and tried to restart the clone. Same error.

Restarted the 2nd MON. All 7 running VMs shut off!

Restarted the 3rd MON. The clone now runs, but trying to start any of the 7 
VMs that had been running gives "Unmountable Boot Volume".
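For the record, the restarts were just the systemd units, run on one MON node 
at a time (the hostname below is a placeholder):

$ systemctl restart ceph-mon@mon1   # repeat on each MON node in turn
$ ceph -s                           # check the mons are back in quorum before the next one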



I pulled the logs on all nodes and am going through them. So far I have found 
this:

"terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
terminate called recursively
2017-11-01 19:41:48.814+0000: shutting down, reason=crashed"
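That snippet appears to come from the libvirt qemu domain log, with the abort 
presumably thrown by librbd inside the qemu process. On a default libvirt 
setup the affected guests can be found with something like:

$ grep -l 'ceph::buffer::end_of_buffer' /var/log/libvirt/qemu/*.log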

Possible monmap corruption?
Any insight would be greatly appreciated.
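In case it helps anyone compare, the current monmap can be dumped and 
inspected with:

$ ceph mon getmap -o /tmp/monmap    # fetch the monmap the cluster is using
$ monmaptool --print /tmp/monmap    # show epoch, fsid and the mon addresses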


Hints?
After the Luminous upgrade, ceph osd tree had nothing in the class column. 
After restarting the MONs, the MON-OSD node had "hdd" on each OSD. After 
restarting the entire cluster, all OSD servers had "hdd" in the class column. 
Not sure why this would not have happened right after the upgrade.
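The class column here is the device class feature introduced in Luminous, 
visible with:

$ ceph osd tree             # CLASS column; empty right after the upgrade
$ ceph osd crush class ls   # lists device classes in use; "hdd" only appeared after the restarts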

Also after the restart the MGR daemons failed to start: "key for mgr.HOST 
exists but cap mds does not match". Solved per https://www.seekhole.io/?p=12:
$ ceph auth caps mgr.HOST mon 'allow profile mgr' mds 'allow *' osd 'allow *'
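To confirm the new caps took effect, something like:

$ ceph auth get mgr.HOST    # should now show mon 'allow profile mgr', mds 'allow *', osd 'allow *'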
Again, not sure why this would not have manifested itself at the upgrade when 
all servers were restarted.

-Jim