If it is only this one osd, I'd be inclined to take a hard look at
the underlying hardware and how it behaves/performs compared to the hw
backing identical osds. The less likely possibility is that you have
some sort of "hot spot" causing resource contention for that osd. To
investigate that fu
I updated the firmware and kernel and am running torture tests. So far no
assert, but I still noticed this on the same osd as yesterday:
Oct 01 19:35:13 storage2n2-la ceph-osd-34[11188]: 2019-10-01 19:35:13.721
7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread
0x7f8cd05d7700' had timed out aft
It was indeed hardware. The Dell server reported a disk being reset with power
on. I am checking the usual suspects, i.e. controller firmware, controller
event log (if I can get one), and drive firmware.
I will report more when I get a better idea.
Thank you!
On Tue, Oct 1, 2019 at 2:33 AM Brad Hubbard wrote:
Removed ceph-de...@vger.kernel.org and added d...@ceph.io
On Tue, Oct 1, 2019 at 4:26 PM Alex Litvak wrote:
>
> Hello everyone,
>
> Can you shed some light on the cause of the crash? Could a client
> request actually trigger it?
>
> Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:5
Hello everyone,
Can you shed some light on the cause of the crash? Could a client request
actually trigger it?
Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.867
7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio_submit
retries 16
Sep 30 22:52:58 sto