OSD suicide after being down/in for one day as it needs to search large amount of objects

Guang Yang Mon, 18 Aug 2014 23:31:28 -0700

Hi ceph-devel,
David (cc’ed) reported a bug (http://tracker.ceph.com/issues/9128) that we 
came across in our test cluster during failure testing. To reproduce it, leave 
one OSD daemon down and in for a day while continuing to send write traffic to 
the cluster. When the OSD daemon was started again, it hit the suicide timeout 
and killed itself.


After some analysis (details in the bug), David found that the op thread was 
busy searching for missing objects. As the volume of objects to search 
increases, the thread is expected to stay busy for a correspondingly long time, 
which can exceed the suicide timeout. Please refer to the bug for detailed 
logs.

One simple fix is to let the op thread reset the suicide timeout periodically 
while it is doing long-running work. Another fix might be to cut the work into 
smaller pieces?
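
To make the first idea concrete, here is a rough sketch in Python. It is a toy 
model, not Ceph code: Watchdog, FakeClock, search_missing, chunk_size and 
cost_per_object are all hypothetical names, and the 150-second timeout and 
1 second per object are made-up numbers chosen only for illustration. It shows 
a long scan tripping the timeout when it never checks in, and surviving when it 
resets the timeout after every chunk of work.

```python
class FakeClock:
    """Deterministic clock so the sketch does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class Watchdog:
    """Toy stand-in for a per-thread heartbeat with a suicide timeout."""

    def __init__(self, suicide_timeout, clock):
        self.suicide_timeout = suicide_timeout
        self.clock = clock
        self.last_reset = clock()

    def reset(self):
        # The worker calls this to signal that it is still making progress.
        self.last_reset = self.clock()

    def is_healthy(self):
        return self.clock() - self.last_reset <= self.suicide_timeout


def search_missing(objects, watchdog, clock, chunk_size=None,
                   cost_per_object=1.0):
    """Scan objects; return True if the watchdog stayed healthy throughout.

    If chunk_size is given, the watchdog is reset after every chunk_size
    objects (hypothetical parameter, for illustration only).
    """
    for i, _obj in enumerate(objects, start=1):
        clock.advance(cost_per_object)  # simulated cost of checking one object
        if not watchdog.is_healthy():
            return False  # a real daemon would abort here
        if chunk_size and i % chunk_size == 0:
            watchdog.reset()
    return True


if __name__ == "__main__":
    clock = FakeClock()
    print(search_missing(range(1000), Watchdog(150.0, clock), clock))

    clock = FakeClock()
    print(search_missing(range(1000), Watchdog(150.0, clock), clock,
                         chunk_size=100))
```

The same structure also hints at the second idea: once the scan is already 
organized in chunks, each chunk could be queued as its own unit of work 
instead of resetting the timeout inside one long call.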

Any suggestion is welcome.

Thanks,
Guang
