First, I’d like to thank Ilya and Ahmad for their feedback. Here is the procedure I followed to fix this issue:
1. For safety, I updated the following global settings:

     expunge.delay => 86400             (original value: 60 sec)
     expunge.interval => 86400          (original value: 60 sec)
     storage.cleanup.enabled => false   (original value: true)

2. Shut down the cloudstack-management service.

3. Do a full backup of the cloud database using mysqldump.

4. As we are using NetApp for primary storage, I took a manual snapshot of the volume.

5. Start the cloudstack-management service so the new global settings take effect.

6. Shut down the VM instances from the web UI.

7. Do the database fixups:

7a. Find the volume names for the instance you are fixing. In this example, the
instance has two volumes, named ROOT-98 and DATA-98.

7b. Get the current info for these volumes from the volumes table. Here are the
current entries (before fixing):

mysql> select id, name, instance_id, uuid, path, pool_id, state, removed from volumes where name='ROOT-98' or name='DATA-98';
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| id  | name    | instance_id | uuid                                 | path                                 | pool_id | state   | removed             |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| 126 | ROOT-98 |          98 | ebc10ccc-9f58-4b2a-8748-f52caacb587c | 25400c2c-0f39-475f-9f9c-50fdd05afab3 |       1 | Ready   | NULL                |
| 127 | DATA-98 |          98 | f8794a2c-6cd0-4e26-a3c7-fdb7ec465ba3 | b54b2f04-dfec-4623-90fa-41c726067e7f |       1 | Ready   | NULL                |
| 322 | ROOT-98 |        NULL | 0f183764-2349-42c9-9fdd-944b892173ab | NULL                                 |       8 | Destroy | 2016-05-03 19:06:39 |
| 323 | ROOT-98 |        NULL | f2753635-4616-48c8-94bc-97d2a09b72a3 | NULL                                 |       8 | Destroy | 2016-05-04 11:01:19 |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+

7c. Find the UUIDs for volumes ROOT-98 and DATA-98 on the new hypervisor pool,
using the xe CLI tool on one of the hypervisors in the pool:

# xe vdi-list name-label=ROOT-98 read-only=false | grep "^uuid"
uuid ( RO)    : 27be8a27-e26a-457b-9140-6181a1bc6bd2
# xe vdi-list name-label=DATA-98 read-only=false | grep "^uuid"
uuid ( RO)    : 1c5d388a-fc36-4e0c-94dd-64e450eef7ab

7d. Now run SQL UPDATE statements to modify the volume rows: id=323 (the root
disk on the new pool), id=126 (the stale root disk on the old pool), and id=127
(the data disk). A sketch of these statements follows.
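A minimal sketch of the UPDATE statements, reconstructed from the before and
after rows shown in this example (adjust the ids and UUIDs for your own case,
and wrap the changes in a transaction so a mistake can be rolled back):

START TRANSACTION;

-- retire the stale root-disk row on the old pool
UPDATE volumes SET instance_id=NULL, removed=NOW() WHERE id=126;

-- point the data disk at its VDI on the new pool (pool_id=8)
UPDATE volumes SET path='1c5d388a-fc36-4e0c-94dd-64e450eef7ab', pool_id=8 WHERE id=127;

-- revive the root-disk row on the new pool and attach it to instance 98
UPDATE volumes SET instance_id=98, path='27be8a27-e26a-457b-9140-6181a1bc6bd2', state='Ready', removed=NULL WHERE id=323;

COMMIT;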
After the updates, the entries look like the following:

mysql> select id, name, instance_id, uuid, path, pool_id, state, removed from volumes where name='ROOT-98' or name='DATA-98';
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| id  | name    | instance_id | uuid                                 | path                                 | pool_id | state   | removed             |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
| 126 | ROOT-98 |        NULL | ebc10ccc-9f58-4b2a-8748-f52caacb587c | 25400c2c-0f39-475f-9f9c-50fdd05afab3 |       1 | Ready   | 2016-05-05 18:53:36 |
| 127 | DATA-98 |          98 | f8794a2c-6cd0-4e26-a3c7-fdb7ec465ba3 | 1c5d388a-fc36-4e0c-94dd-64e450eef7ab |       8 | Ready   | NULL                |
| 322 | ROOT-98 |        NULL | 0f183764-2349-42c9-9fdd-944b892173ab | NULL                                 |       8 | Destroy | 2016-05-03 19:06:39 |
| 323 | ROOT-98 |          98 | f2753635-4616-48c8-94bc-97d2a09b72a3 | 27be8a27-e26a-457b-9140-6181a1bc6bd2 |       8 | Ready   | NULL                |
+-----+---------+-------------+--------------------------------------+--------------------------------------+---------+---------+---------------------+
4 rows in set (0.00 sec)

Note: the entry with id=126 has columns instance_id and removed updated; the
entry with id=127 has columns path and pool_id updated; the entry with id=323
has columns instance_id, path, state, and removed updated.

7e. Start the VM instance. Just to be really sure, I also stopped and started
the instance once more through the web UI, to make sure that the instance can
be rebooted normally. (A quick database sanity check is sketched below.)

Repeat steps 6 and 7a-7e for each VM instance that needs the fix.
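Before moving on to the next instance, a quick sanity check (a sketch against
the same volumes table queried above) is to confirm that the only live rows for
the instance are the two volumes on the new pool:

select id, name, instance_id, path, pool_id, state
from volumes
where instance_id=98 and removed is NULL;

For instance 98 this should return exactly two rows (ids 127 and 323 here),
both with pool_id=8 and state Ready; any live row still pointing at pool_id=1
means one of the updates was missed.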
That’s all I had to do to recover all four VM instances. I still don’t know
what the root cause of this problem is, or how I can avoid it in the future.

Have a good day, all.

Yiping

On 5/4/16, 11:25 PM, "Yiping Zhang" <yzh...@marketo.com> wrote:

>Thanks, it’s a good idea to back up those “removed” disks first before
>attempting DB surgery!
>
>On 5/4/16, 9:57 PM, "ilya" <ilya.mailing.li...@gmail.com> wrote:
>
>>never mind - on the "removed" disks - it deletes well.
>>
>>On 5/4/16 9:55 PM, ilya wrote:
>>> I'm pretty certain cloudstack does not have purging on data disks, as I
>>> had to write my own :)
>>>
>>> On 5/4/16 9:51 PM, Ahmad Emneina wrote:
>>>> I'm not sure if the expunge interval/delay plays a part... but you might
>>>> want to set storage.cleanup.enabled to false. That might prevent your
>>>> disks from being purged. You might also look to export those volumes, or
>>>> copy them to a safe location, out of band.
>>>>
>>>> On Wed, May 4, 2016 at 8:49 PM, Yiping Zhang <yzh...@marketo.com> wrote:
>>>>
>>>>> Before I try the direct DB modifications, I would first:
>>>>>
>>>>> * shut down the VM instances
>>>>> * stop the cloudstack-management service
>>>>> * do a DB backup with mysqldump
>>>>>
>>>>> What I worry about the most is that the volumes on the new cluster’s
>>>>> primary storage device are marked as “removed”, so if I shut down the
>>>>> instances, cloudstack may kick off a storage cleanup job to remove them
>>>>> from the new cluster’s primary storage before I can get the fixes in.
>>>>>
>>>>> Is there a way to temporarily disable storage cleanups?
>>>>>
>>>>> Yiping
>>>>>
>>>>> On 5/4/16, 3:22 PM, "Yiping Zhang" <yzh...@marketo.com> wrote:
>>>>>
>>>>>> Hi, all:
>>>>>>
>>>>>> I am in a situation where I need some help:
>>>>>>
>>>>>> I did a live migration, with storage migration required, for a production
>>>>>> VM instance from one cluster to another. The first migration attempt
>>>>>> failed after some time, but the second attempt succeeded. During all this
>>>>>> time the VM instance remained accessible (and it is still up and running).
>>>>>> However, when I use my API script to query volumes, it still reports that
>>>>>> the volume is on the old cluster’s primary storage. If I shut down this
>>>>>> VM, I am afraid that it won’t start again, as it would try to use
>>>>>> non-existent volumes.
>>>>>>
>>>>>> Checking the database, sure enough, it still has old info about these
>>>>>> volumes:
>>>>>>
>>>>>> mysql> select id, name from storage_pool where id=1 or id=8;
>>>>>> +----+------------------+
>>>>>> | id | name             |
>>>>>> +----+------------------+
>>>>>> |  1 | abprod-primary1  |
>>>>>> |  8 | abprod-p1c2-pri1 |
>>>>>> +----+------------------+
>>>>>> 2 rows in set (0.01 sec)
>>>>>>
>>>>>> Here the old cluster’s primary storage has id=1, and the new cluster’s
>>>>>> primary storage has id=8.
>>>>>>
>>>>>> Here are the entries with wrong info in the volumes table:
>>>>>>
>>>>>> mysql> select id, name, uuid, path, pool_id, removed from volumes where
>>>>>> name='ROOT-97' or name='DATA-97';
>>>>>> +-----+---------+--------------------------------------+--------------------------------------+---------+---------------------+
>>>>>> | id  | name    | uuid                                 | path                                 | pool_id | removed             |
>>>>>> +-----+---------+--------------------------------------+--------------------------------------+---------+---------------------+
>>>>>> | 124 | ROOT-97 | 224bf673-fda8-4ccc-9c30-fd1068aee005 | 5d1ab4ef-2629-4384-a56a-e2dc1055d032 |       1 | NULL                |
>>>>>> | 125 | DATA-97 | d385d635-9230-4130-8d1f-702dbcf0f22c | 6b75496d-5907-46c3-8836-5618f11dac8e |       1 | NULL                |
>>>>>> | 316 | ROOT-97 | 691b5c12-7ec4-408d-b66f-1ff041f149c1 | NULL                                 |       8 | 2016-05-03 06:10:40 |
>>>>>> | 317 | ROOT-97 | 8ba29fcf-a81a-4ca0-9540-0287230f10c7 | NULL                                 |       8 | 2016-05-03 06:10:45 |
>>>>>> +-----+---------+--------------------------------------+--------------------------------------+---------+---------------------+
>>>>>> 4 rows in set (0.01 sec)
>>>>>>
>>>>>> On the xenserver of the old cluster, the volumes do not exist:
>>>>>>
>>>>>> [root@abmpc-hv01 ~]# xe vdi-list name-label='ROOT-97'
>>>>>> [root@abmpc-hv01 ~]# xe vdi-list name-label='DATA-97'
>>>>>> [root@abmpc-hv01 ~]#
>>>>>>
>>>>>> But the volumes are on the new cluster’s primary storage:
>>>>>>
>>>>>> [root@abmpc-hv04 ~]# xe vdi-list name-label=ROOT-97
>>>>>> uuid ( RO)                : a253b217-8cdc-4d4a-a111-e5b6ad48a1d5
>>>>>>           name-label ( RW): ROOT-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 34359738368
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): true
>>>>>>
>>>>>> uuid ( RO)                : c46b7a61-9e82-4ea1-88ca-692cd4a9204b
>>>>>>           name-label ( RW): ROOT-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 34359738368
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): false
>>>>>>
>>>>>> [root@abmpc-hv04 ~]# xe vdi-list name-label=DATA-97
>>>>>> uuid ( RO)                : bc868e3d-b3c0-4c6a-a6fc-910bc4dd1722
>>>>>>           name-label ( RW): DATA-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 107374182400
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): false
>>>>>>
>>>>>> uuid ( RO)                : a8c187cc-2ba0-4928-8acf-2afc012c036c
>>>>>>           name-label ( RW): DATA-97
>>>>>>     name-description ( RW):
>>>>>>              sr-uuid ( RO): 6d4bea51-f253-3b43-2f2f-6d7ba3261ed3
>>>>>>         virtual-size ( RO): 107374182400
>>>>>>             sharable ( RO): false
>>>>>>            read-only ( RO): true
>>>>>>
>>>>>> The following is how I plan to fix the corrupted DB entries. Note that I
>>>>>> am using the uuid of the VDI with read/write access as the new path value:
>>>>>>
>>>>>> 1) For the ROOT-97 volume:
>>>>>>
>>>>>> update volumes set removed=NOW() where id=124;
>>>>>> update volumes set removed=NULL where id=317;
>>>>>> update volumes set path='c46b7a61-9e82-4ea1-88ca-692cd4a9204b' where id=317;
>>>>>>
>>>>>> 2) For the DATA-97 volume:
>>>>>>
>>>>>> update volumes set pool_id=8 where id=125;
>>>>>> update volumes set path='bc868e3d-b3c0-4c6a-a6fc-910bc4dd1722' where id=125;
>>>>>>
>>>>>> Would this work?
>>>>>>
>>>>>> Thanks for all the help anyone can provide. I have a total of 4 VM
>>>>>> instances with 8 volumes that need to be fixed this way.
>>>>>>
>>>>>> Yiping