First, I’d like to thank Ilya and Ahmad for their feedback. Here is the procedure I followed to fix this issue:
1. For safety, I updated the following global settings:

     expunge.delay => 86400             (original value: 60 sec)
     expunge.interval => 86400          (original value: 60 sec)
     storage.cleanup.enabled => false   (original value: true)

2. Shut down the cloudstack-management service.

3. Do a full backup of the cloud database using mysqldump.

4. As we are using NetApp for primary storage, I took a manual snapshot of the volume.

5. Start the cloudstack-management service so the new global settings take effect.

6. Shut down the VM instances from the web UI.

7. Do the database fixups:

7a. Find the volume names for the instance you are fixing. In this example, the
instance has two volumes, named ROOT-98 and DATA-98.

7b. Get the current info for these volumes from the volumes table. Here are the
current entries (before fixing):

mysql> select id, name, instance_id, uuid, path, pool_id, state, removed from volumes where name='ROOT-98' or name='DATA-98';
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| id  | name    | instance_id | uuid                                 | path                                 | pool_id | state   | removed             |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| 126 | ROOT-98 |          98 | ebc10ccc-9f58-4b2a-8748-f52caacb587c | 25400c2c-0f39-475f-9f9c-50fdd05afab3 |       1 | Ready   | NULL                |
| 127 | DATA-98 |          98 | f8794a2c-6cd0-4e26-a3c7-fdb7ec465ba3 | b54b2f04-dfec-4623-90fa-41c726067e7f |       1 | Ready   | NULL                |
| 322 | ROOT-98 |        NULL | 0f183764-2349-42c9-9fdd-944b892173ab | NULL                                 |       8 | Destroy | 2016-05-03 19:06:39 |
| 323 | ROOT-98 |        NULL | f2753635-4616-48c8-94bc-97d2a09b72a3 | NULL                                 |       8 | Destroy | 2016-05-04 11:01:19 |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+

7c. Find the UUIDs for volumes ROOT-98 and DATA-98 on the new hypervisor pool,
using the xe CLI tool on one of the hypervisors in the pool:

# xe vdi-list name-label=ROOT-98 read-only=false | grep "^uuid"
uuid ( RO)    : 27be8a27-e26a-457b-9140-6181a1bc6bd2
# xe vdi-list name-label=DATA-98 read-only=false | grep "^uuid"
uuid ( RO)    : 1c5d388a-fc36-4e0c-94dd-64e450eef7ab

7d. Now run SQL UPDATE statements to modify the volume rows: id=323 (the root
disk on the new pool), id=126 (the stale root disk on the old pool), and id=127
(the data disk). A sketch of these statements follows.
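A minimal sketch of the UPDATE statements, reconstructed from the before and
after rows shown in this example (adjust the ids and UUIDs for your own case,
and wrap the changes in a transaction so a mistake can be rolled back):

START TRANSACTION;

-- retire the stale root-disk row on the old pool
UPDATE volumes SET instance_id=NULL, removed=NOW() WHERE id=126;

-- point the data disk at its VDI on the new pool (pool_id=8)
UPDATE volumes SET path='1c5d388a-fc36-4e0c-94dd-64e450eef7ab', pool_id=8 WHERE id=127;

-- revive the root-disk row on the new pool and attach it to instance 98
UPDATE volumes SET instance_id=98, path='27be8a27-e26a-457b-9140-6181a1bc6bd2', state='Ready', removed=NULL WHERE id=323;

COMMIT;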
After the updates, the entries look like the following:

mysql> select id, name, instance_id, uuid, path, pool_id, state, removed from volumes where name='ROOT-98' or name='DATA-98';
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| id  | name    | instance_id | uuid                                 | path                                 | pool_id | state   | removed             |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| 126 | ROOT-98 |        NULL | ebc10ccc-9f58-4b2a-8748-f52caacb587c | 25400c2c-0f39-475f-9f9c-50fdd05afab3 |       1 | Ready   | 2016-05-05 18:53:36 |
| 127 | DATA-98 |          98 | f8794a2c-6cd0-4e26-a3c7-fdb7ec465ba3 | 1c5d388a-fc36-4e0c-94dd-64e450eef7ab |       8 | Ready   | NULL                |
| 322 | ROOT-98 |        NULL | 0f183764-2349-42c9-9fdd-944b892173ab | NULL                                 |       8 | Destroy | 2016-05-03 19:06:39 |
| 323 | ROOT-98 |          98 | f2753635-4616-48c8-94bc-97d2a09b72a3 | 27be8a27-e26a-457b-9140-6181a1bc6bd2 |       8 | Ready   | NULL                |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
4 rows in set (0.00 sec)

Note: the entry with id=126 has columns instance_id and removed updated; the
entry with id=127 has columns path and pool_id updated; the entry with id=323
has columns instance_id, path, state, and removed updated.

7e. Start the VM instance. Just to be really sure, I also stopped and started
the instance once more through the web UI, to make sure that the instance can
be rebooted normally. (A quick database sanity check is sketched below.)

Repeat steps 6 and 7a-7e for each VM instance that needs the fix.
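Before moving on to the next instance, a quick sanity check (a sketch against
the same volumes table queried above) is to confirm that the only live rows for
the instance are the two volumes on the new pool:

select id, name, instance_id, path, pool_id, state
from volumes
where instance_id=98 and removed is NULL;

For instance 98 this should return exactly two rows (ids 127 and 323 here),
both with pool_id=8 and state Ready; any live row still pointing at pool_id=1
means one of the updates was missed.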
That’s all I had to do to recover all four VM instances. I still don’t know
what the root cause of this problem is, or how I can avoid it in the future.

Have a good day, all.

Yiping

On 5/4/16, 11:25 PM, "Yiping Zhang" <yzh...@marketo.com> wrote:

>Thanks, it’s a good idea to back up those “removed” disks first before
>attempting DB surgery!
>
>On 5/4/16, 9:57 PM, "ilya" <ilya.mailing.li...@gmail.com> wrote:
>
>>never mind - on the "removed" disks - it deletes well.
>>
>>On 5/4/16 9:55 PM, ilya wrote:
>>> I'm pretty certain cloudstack does not have purging on data disks, as I
>>> had to write my own :)
>>>
>>> On 5/4/16 9:51 PM, Ahmad Emneina wrote:
>>>> I'm not sure if the expunge interval/delay plays a part... but you might
>>>> want to set storage.cleanup.enabled to false. That might prevent your
>>>> disks from being purged. You might also look to export those volumes, or
>>>> copy them to a safe location, out of band.
>>>>
>>>> On Wed, May 4, 2016 at 8:49 PM, Yiping Zhang <yzh...@marketo.com> wrote:
>>>>
>>>>> Before I try the direct DB modifications, I would first:
>>>>>
>>>>> * shut down the VM instances
>>>>> * stop the cloudstack-management service
>>>>> * do a DB backup with mysqldump
>>>>>
>>>>> What I worry about the most is that the volumes on the new cluster’s
>>>>> primary storage device are marked as “removed”, so if I shut down the
>>>>> instances, cloudstack may kick off a storage cleanup job to remove them
>>>>> from the new cluster’s primary storage before I can get the fixes in.
>>>>>
>>>>> Is there a way to temporarily disable storage cleanups?
>>>>>
>>>>> Yiping
>>>>>
>>>>> On 5/4/16, 3:22 PM, "Yiping Zhang" <yzh...@marketo.com> wrote:
>>>>>
>>>>>> Hi, all:
>>>>>>
>>>>>> I am in a situation where I need some help:
>>>>>>
>>>>>> I did a live migration, with storage migration required, for a production
>>>>>> VM instance from one cluster to another. The first migration attempt
>>>>>> failed after some time, but the second attempt succeeded. During all this
>>>>>> time the VM instance remained accessible (and it is still up and running).
>>>>>> However, when I use my API script to query volumes, it still reports that
>>>>>> the volume is on the old cluster’s primary storage. If I shut down this
>>>>>> VM, I am afraid that it won’t start again, as it would try to use
>>>>>> non-existent volumes.
>>>>>>
>>>>>> Checking the database, sure enough, it still has old info about these
>>>>>> volumes:
>>>>>>
>>>>>> mysql> select id, name from storage_pool where id=1 or id=8;
>>>>>> +----+------------------+
>>>>>> | id | name             |
>>>>>> +----+------------------+
>>>>>> |  1 | abprod-primary1  |
>>>>>> |  8 | abprod-p1c2-pri1 |
>>>>>> +----+------------------+
>>>>>> 2 rows in set (0.01 sec)
>>>>>>
>>>>>> Here the old cluster’s primary storage has id=1, and the new cluster’s
>>>>>> primary storage has id=8.
>>>>>>
>>>>>> Here are the entries with wrong info in the volumes table:
>>>>>>
>>>>>> mysql> select id, name, uuid, path, pool_id, removed from volumes where
>>>>>> name='ROOT-97' or name='DATA-97';
>>>>>> +-----+---------+--------------------------------------+--------------------------------------+---------+---------------------+
>>>>>> | id  | name    | uuid                                 | path                                 | pool_id | removed             |
>>>>>> +-----+---------+--------------------------------------+--------------------------------------+---------+---------------------+
>>>>>> | 124 | ROOT-97 | 224bf673-fda8-4ccc-9c30-fd1068aee005 | 5d1ab4ef-2629-4384-a56a-e2dc1055d032 |       1 | NULL                |
>>>>>> | 125 | DATA-97 | d385d635-9230-4130-8d1f-702dbcf0f22c | 6b75496d-5907-46c3-8836-5618f11dac8e |       1 | NULL                |
>>>>>> | 316 | ROOT-97 | 691b5c12-7ec4-408d-b66f-1ff041f149c1 | NULL                                 |       8 | 2016-05-03 06:10:40 |
>>>>>> | 317 | ROOT-97 | 8ba29fcf-a81a-4ca0-9540-0287230f10c7 | NULL                                 |       8 | 2016-05-03 06:10:45 |
>>>>>> +-----+---------+--------------------------------------+--------------------------------------+---------+---------------------+
>>>>>> 4 rows in set (0.01 sec)
>>>>>>
>>>>>> On the xenserver of the old cluster, the volumes do not exist:
>>>>>>
>>>>>> [root@abmpc-hv01 ~]# xe vdi-list name-label='ROOT-97'
>>>>>> [root@abmpc-hv01 ~]# xe vdi-list name-label='DATA-97'
>>>>>> [root@abmpc-hv01 ~]#
>>>>>>
>>>>>> But the volumes are on the new cluster’s primary storage:
>>>>>>
>>>>>> [root@abmpc-hv04 ~]# xe vdi-list name-label=ROOT-97
>>>>>> uuid ( RO)                : a253b217-8cdc-4d4a-a111-e5b6ad48a1d5
>>>>>>           name-label ( RW): ROOT-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 34359738368
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): true
>>>>>>
>>>>>> uuid ( RO)                : c46b7a61-9e82-4ea1-88ca-692cd4a9204b
>>>>>>           name-label ( RW): ROOT-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 34359738368
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): false
>>>>>>
>>>>>> [root@abmpc-hv04 ~]# xe vdi-list name-label=DATA-97
>>>>>> uuid ( RO)                : bc868e3d-b3c0-4c6a-a6fc-910bc4dd1722
>>>>>>           name-label ( RW): DATA-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 107374182400
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): false
>>>>>>
>>>>>> uuid ( RO)                : a8c187cc-2ba0-4928-8acf-2afc012c036c
>>>>>>           name-label ( RW): DATA-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 107374182400
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): true
>>>>>>
>>>>>> The following is how I plan to fix the corrupted DB entries. Note that I
>>>>>> am using the uuid of the VDI with read/write access as the new path value:
>>>>>>
>>>>>> 1) For the ROOT-97 volume:
>>>>>>
>>>>>> update volumes set removed=NOW() where id=124;
>>>>>> update volumes set removed=NULL where id=317;
>>>>>> update volumes set path='c46b7a61-9e82-4ea1-88ca-692cd4a9204b' where id=317;
>>>>>>
>>>>>> 2) For the DATA-97 volume:
>>>>>>
>>>>>> update volumes set pool_id=8 where id=125;
>>>>>> update volumes set path='bc868e3d-b3c0-4c6a-a6fc-910bc4dd1722' where id=125;
>>>>>>
>>>>>> Would this work?
>>>>>>
>>>>>> Thanks for all the help anyone can provide. I have a total of 4 VM
>>>>>> instances with 8 volumes that need to be fixed this way.
>>>>>>
>>>>>> Yiping