Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-12-14 Thread Matthew Vernon
On 29/11/17 17:24, Matthew Vernon wrote:
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It w

Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Brad Hubbard
# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | grep ceph-osd

To find the actual thread that is using 100% CPU.

# for x in `seq 1 5`; do gdb -batch -p [PID] -ex "thr appl all bt"; echo; done > /tmp/osd.stack.dump

Then look at the stacks for the thread that was using all the CPU and see what i
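One way to narrow the ps output down to a single thread ID is sketched below. This is only an illustration, not part of the original advice: the helper name busiest_thread and the variable OSD_PID are hypothetical, and it assumes the input rows are "%CPU TID COMMAND", as produced by a per-thread ps listing.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read per-thread rows of the form
# "%CPU TID COMMAND" on stdin and print the TID with the highest %CPU.
busiest_thread() {
    awk '
        BEGIN { max = -1 }
        # Only consider rows whose first field is a number (skips any header row).
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 > max { max = $1 + 0; tid = $2 }
        END { if (tid != "") print tid }
    '
}

# Example use on a live host (OSD_PID is an assumed variable holding the
# PID of the misbehaving ceph-osd process):
#   ps -L -p "$OSD_PID" -o pcpu=,tid=,comm= | busiest_thread
```

The printed TID should match the "LWP" number that gdb shows in each thread header of the backtrace dump, which makes it easier to pick the relevant stack out of /tmp/osd.stack.dump.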

Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Denes Dolhay
Hello,

You might consider checking the iowait (during the problem), and the dmesg (after it recovered). Maybe an issue with the given SATA/SAS/NVMe port?

Regards,
Denes

On 11/29/2017 06:24 PM, Matthew Vernon wrote:
> Hi, We have a 3,060 OSD ceph cluster (running Jewel 10.2.7-0ubuntu0.16.0

Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Jean-Charles Lopez
Hi Matthew, anything special happening on the NIC side that could cause a problem? Packet drops? Incorrect jumbo frame settings causing fragmentation? Have you checked the C-state settings on the box? Have you disabled energy saving settings differently from the other boxes? Any unexpected wait

[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Matthew Vernon
Hi, We have a 3,060 OSD ceph cluster (running Jewel 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that host), and having ops blocking on it for some time. It will then behave for a bit, and then go back t