Hello everyone,
I am currently trying to recover a Ceph cluster from a disaster. I now have 
enough OSDs up (171 up and in, out of 195), and I am left with 2 incomplete PGs.

However, my question now is not about the incomplete PGs; it is about one mon 
service that fails to start because a strange, wrong monmap is being used. 
After injecting a monmap exported from the cluster, the mon comes up and 
enters the synchronizing state, but it has not rejoined the quorum after 
several hours. I originally assumed this was expected, since the whole cluster 
is still busy recovering and backfilling, but it has now been over 24 hours 
with no hint of when the sync will finish, or whether it is still healthy.
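
For reference, the injection followed the standard monmap-inject procedure, 
roughly like this (the mon id is taken from the logs below; paths and the 
init commands are placeholders and will differ by distro/init system):

    service ceph stop mon.NVMBD1CIF290D00                     # stop the failing mon
    ceph mon getmap -o /tmp/monmap                             # export the monmap from the surviving quorum
    ceph-mon -i NVMBD1CIF290D00 --inject-monmap /tmp/monmap    # inject it into this mon's store
    service ceph start mon.NVMBD1CIF290D00                     # start the mon again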

The log shows it is still synchronizing, and I can see the files under 
store.db keep being updated.
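
For completeness, the mon's state can also be checked via the admin socket 
(assuming the default socket path):

    ceph daemon mon.NVMBD1CIF290D00 mon_status
    # or equivalently:
    ceph --admin-daemon /var/run/ceph/ceph-mon.NVMBD1CIF290D00.asok mon_status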


A small piece of the log for reference:
2015-03-12 03:20:15.025048 7f3cb6c48700 10 
mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) service_tick
2015-03-12 03:20:15.025075 7f3cb6c48700  0 
mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) update_stats avail 71% 
total 103080888 used 24281956 avail 73539668
2015-03-12 03:20:30.460672 7f3cb4b43700 10 -- 10.137.36.30:6789/0 >> 
10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 
c=0x34b1760).aborted = 0
2015-03-12 03:20:30.460923 7f3cb4b43700 10 -- 10.137.36.30:6789/0 >> 
10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 
c=0x34b1760).reader got message 1466470577 0x45b3c80 mon_sync(chunk cookie 
37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.460963 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 
10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 
c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.460988 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 
10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 
c=0x34b1760).write_ack 1466470577
2015-03-12 03:20:30.461011 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 
10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 
c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.461030 7f3cb6447700  1 -- 10.137.36.30:6789/0 <== mon.1 
10.137.36.31:6789/0 1466470577 ==== mon_sync(chunk cookie 37950063980 lc 
12343379 bl 791970 bytes last_key logm,full_5120265) v2 ==== 792163+0+0 
(2147002791 0 0) 0x45b3c80 con 0x34b1760
2015-03-12 03:20:30.461048 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 handle_sync mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes 
last_key logm,full_5120265) v2
2015-03-12 03:20:30.461052 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 handle_sync_chunk mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 
bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.463832 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 sync_reset_timeout


I am also wondering whether some OSDs are failing to join the cluster because 
of this. Some OSD processes start without error, but after loading PGs they 
never proceed to boot, and their status remains down and out.
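
To illustrate, this is roughly how such an OSD looks from the outside (the 
osd id below is just an example, and the admin socket "status" command is 
assumed to be available on this version):

    ceph osd tree                 # the affected osd stays down/out
    ceph daemon osd.12 status     # reports a non-active state and its oldest/newest map epochs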

Please advise, thanks



Luke Kao

MYCOM OSI
<http://www.mycom-osi.com>
