Hello again,

With the help of Dan van der Ster and Mykola from Clyso, we found that the 
issue was caused by a hardware crash. After upgrading Ceph, we encountered an 
unexpected crash that resulted in a reboot. 

After comparing the first blocks of running and failed OSDs, we found that the 
HW crash had corrupted the first 23 bytes of the block devices. 

The first few bytes of the block device of a failed OSD contain:

00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000010: 0000 0000 0000 0a63 3965 6533 6566 362d  .......c9ee3ef6-
00000020: 3733 6437 2d34 3032 392d 3963 6436 2d30  73d7-4029-9cd6-0
00000030: 3836 6363 3935 6432 6632 370a 0201 a901  86cc95d2f27.....
00000040: 0000 c9ee 3ef6 73d7 4029 9cd6 086c c95d  ....>.s.@)...l.]
00000050: 2f27 0000 4002 4707 0000 8ceb 6665 c827  /'..@.G.....fe.'
00000060: 0409 0400 0000 6d61 696e 0d00 0000 0a00  ......main......

and a running OSD contains:

00000000: 626c 7565 7374 6f72 6520 626c 6f63 6b20  bluestore block 
00000010: 6465 7669 6365 0a38 6637 3732 3532 312d  device.8f772521-
00000020: 6535 3663 2d34 6135 622d 6239 3763 2d31  e56c-4a5b-b97c-1
00000030: 6233 3630 6439 6266 6135 340a 0201 a901  b360d9bfa54.....
00000040: 0000 8f77 2521 e56c 4a5b b97c 1b36 0d9b  ...w%!.lJ[.|.6..
00000050: fa54 0000 4002 4707 0000 c8eb 6665 cd4c  .T..@.G.....fe.L
00000060: 6233 0400 0000 6d61 696e 0d00 0000 0a00  b3....main......
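
The visible difference between the two dumps is the text "bluestore block 
device" at offset 0. As a rough sketch, a check for that text could look like 
the following. The function name has_bluestore_label is made up for this 
example, and the device paths only follow the naming used in this thread; 
adjust both to your setup.

```shell
# Hypothetical helper: succeeds if the first 22 bytes of "$1" are the
# ASCII text "bluestore block device" seen in the dump of the running OSD.
has_bluestore_label() {
    hbl_dev=$1
    hbl_magic='bluestore block device'
    # Read exactly 22 bytes. NUL bytes are dropped so that a zeroed
    # header becomes an empty string instead of triggering shell warnings.
    hbl_first=$(head -c 22 "$hbl_dev" | tr -d '\0')
    [ "$hbl_first" = "$hbl_magic" ]
}

# Example: report every device that lacks the text.
# The range 12..21 is an assumption based on the OSD numbers in this mail.
for i in $(seq 12 21); do
    dev=/dev/ceph-block-$i/block-$i
    [ -e "$dev" ] || continue
    has_bluestore_label "$dev" || echo "$dev: label text missing"
done
```

This only reads from the devices, so it should be safe to run before deciding 
which OSDs need attention.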

It turned out that the first 23 bytes of data were corrupted during the HW 
crash. So we copied the first 23 bytes from a running OSD with the following 
command:

dd if=/dev/ceph-block-21/block-21 of=/root/header.21.dat bs=23 count=1

Then, after taking a backup, we copied those exact 23 bytes to every failed OSD 
block device, and the problem was resolved. 

for i in {12..20} ; do
  dd if=/dev/ceph-block-$i/block-$i of=/root/backup.$i.1M bs=1M count=1
  dd if=/root/header.21.dat of=/dev/ceph-block-$i/block-$i bs=23 count=1
done
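
For anyone repeating this, a more defensive variant of the loop body might look 
like the sketch below. It is not what we ran. The function name restore_label 
and its arguments are invented for illustration, and it assumes GNU coreutils 
and diffutils (dd with status=none, cmp with -n). It refuses to write unless 
the target starts with 22 zero bytes, which is the damage pattern in the dump 
of the failed OSD above.

```shell
# Hypothetical helper: copy the 23-byte header from a healthy device onto
# a damaged one, with a pre-check, a backup, and a post-check.
restore_label() {
    rl_src=$1      # healthy device (or image file)
    rl_dst=$2      # damaged device (or image file)
    rl_backup=$3   # file that receives the first 1 MiB of the target

    # Only proceed if the first 22 bytes of the target are all zero.
    if ! cmp -s -n 22 /dev/zero "$rl_dst"; then
        echo "$rl_dst: first 22 bytes are not all zero, skipping" >&2
        return 1
    fi

    # Back up the first 1 MiB of the target, as in the loop above.
    dd if="$rl_dst" of="$rl_backup" bs=1M count=1 status=none || return 1

    # Write the 23 bytes. conv=notrunc keeps dd from truncating the target
    # when it is a regular file (for example a test image).
    dd if="$rl_src" of="$rl_dst" bs=23 count=1 conv=notrunc status=none || return 1

    # Confirm that source and target now start with the same 23 bytes.
    cmp -s -n 23 "$rl_src" "$rl_dst"
}
```

It could be called once per failed OSD, for example 
restore_label /dev/ceph-block-21/block-21 /dev/ceph-block-12/block-12 /root/backup.12.1M
and it is worth trying on copies of the device headers first.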

In the end, it turned out that the lsiutil tool was not compatible with our 
kernel and caused the crash. The following link contains detailed 
information:

https://support.huawei.com/enterprise/en/knowledge/KB1000001578

I want to thank Dan and Mykola from Clyso; I really appreciate their help.

BR,
Huseyin Cotuk
hco...@gmail.com
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
