Re: [ceph-users] Part 2: ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)
Hi All, Loic,
I have exactly the same error. Am I right in understanding that the problem is still present in 0.80.9?
Thank you.

Sat Jan 17 2015 at 2:21:09 AM, Loic Dachary l...@dachary.org:
> On 14/01/2015 18:33, Udo Lembke wrote:
>> Hi Loic,
>> thanks for the answer. I hope it's not like in http://tracker.ceph.com/issues/8747 where the issue happens with a patched version, if I understand right.
>
> http://tracker.ceph.com/issues/8747 is a duplicate of http://tracker.ceph.com/issues/8011 indeed :-)
>
>> So I must only wait a few months ;-) for a backport...
>> Udo
>>
>> Am 14.01.2015 09:40, schrieb Loic Dachary:
>>> Hi,
>>> This is http://tracker.ceph.com/issues/8011 which is being backported.
>>> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Part 2: ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)
Hi again,
sorry for not threading, but my last email didn't come back to me from the mailing list (it often drops some posts!).
Just after sending the last mail, another SSD failed for the first time - in this case a cheap one, but with the same error:

root@ceph-04:/var/log/ceph# more ceph-osd.62.log
2015-01-13 16:40:55.712967 7fb29cfd3700  0 log [INF] : 17.2 scrub ok
2015-01-13 17:54:35.548361 7fb29dfd5700  0 log [INF] : 17.3 scrub ok
2015-01-13 17:54:38.007014 7fb29dfd5700  0 log [INF] : 17.5 scrub ok
2015-01-13 17:54:41.215558 7fb29d7d4700  0 log [INF] : 17.f scrub ok
2015-01-13 17:54:42.277585 7fb29dfd5700  0 log [INF] : 17.a scrub ok
2015-01-13 17:54:48.961582 7fb29d7d4700  0 log [INF] : 17.6 scrub ok
2015-01-13 20:15:08.749597 7fb292337700  0 -- 192.168.3.14:6824/9185 >> 192.168.3.15:6824/11735 pipe(0x107d9680 sd=307 :6824 s=2 pgs=2 cs=1 l=0 c=0x124a09a0).fault, initiating reconnect
2015-01-13 20:15:08.750803 7fb296dbe700  0 -- 192.168.3.14:0/9185 >> 192.168.3.15:6825/11735 pipe(0xd011180 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x8d19760).fault
2015-01-13 20:15:08.750804 7fb292b3f700  0 -- 192.168.3.14:0/9185 >> 172.20.2.15:6837/11735 pipe(0x1210f900 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0xbeae840).fault
2015-01-13 20:15:08.751056 7fb291d31700  0 -- 192.168.3.14:6824/9185 >> 192.168.3.15:6824/11735 pipe(0x107d9680 sd=29 :6824 s=1 pgs=2 cs=2 l=0 c=0x124a09a0).fault
2015-01-13 20:15:27.035342 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:07.035339)
2015-01-13 20:15:28.036773 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.036769)
2015-01-13 20:15:28.945179 7fb29b7d0700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.945178)
2015-01-13 20:15:29.037016 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:09.037014)
2015-01-13 20:15:30.037204 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.037202)
2015-01-13 20:15:30.645491 7fb29b7d0700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.645483)
2015-01-13 20:15:31.037326 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:11.037323)
2015-01-13 20:15:32.037442 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:12.037439)
2015-01-13 20:15:33.037641 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:13.037637)
2015-01-13 20:15:34.037843 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:14.037839)
2015-01-13 21:39:35.241153 7fb29dfd5700  0 log [INF] : 17.d scrub ok
2015-01-13 21:39:39.293113 7fb29a7ce700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fb29a7ce700 time 2015-01-13 21:39:39.279799
osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x1320) [0x9296b0]
 2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x5f6) [0x92b076]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296) [0x92b876]
 4: (C_Flush::finish(int)+0x86) [0x986226]
 5: (Context::complete(int)+0x9) [0x78f449]
 6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
 7: (()+0x6b50) [0x7fb2b94ceb50]
 8: (clone()+0x6d) [0x7fb2b80dc7bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
  -127 2015-01-10 19:39:41.861724 7fb2b9faa780  5 asok(0x28e4230) register_command perfcounters_dump hook 0x28d4010
  -126 2015-01-10 19:39:41.861749 7fb2b9faa780  5 asok(0x28e4230) register_command 1 hook 0x28d4010
  -125 2015-01-10 19:39:41.861753 7fb2b9faa780  5 asok(0x28e4230) register_command perf dump hook 0x28d4010
  -124 2015-01-10 19:39:41.861756 7fb2b9faa780  5 asok(0x28e4230) register_command perfcounters_schema hook 0x28d4010
  -123 2015-01-10 19:39:41.861759 7fb2b9faa780  5 asok(0x28e4230) register_command 2 hook 0x28d4010
  -122 2015-01-10 19:39:41.861762 7fb2b9faa780
Re: [ceph-users] Part 2: ssd osd fails often with FAILED assert(soid < scrubber.start || soid >= scrubber.end)
Hi,
This is http://tracker.ceph.com/issues/8011 which is being backported.
Cheers

On 13/01/2015 22:00, Udo Lembke wrote:
> Hi again,
> sorry for not threading, but my last email didn't come back to me from the mailing list (it often drops some posts!).
> Just after sending the last mail, another SSD failed for the first time - in this case a cheap one, but with the same error:
> [...]