Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-12-14 Thread Matthew Vernon
On 29/11/17 17:24, Matthew Vernon wrote:
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It

Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Brad Hubbard
# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | grep ceph-osd

To find the actual thread that is using 100% CPU.

# for x in `seq 1 5`; do gdb -batch -p [PID] -ex "thr appl all bt"; echo; done > /tmp/osd.stack.dump

Then look at the stacks for the thread that was using all the CPU and see what
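
As a rough sketch of the first step above, the busiest thread can be picked out of the ps output automatically rather than by eye. The helper name hottest_thread is hypothetical and not part of the original message. It assumes the column order of the ps command quoted above (%cpu first, pid third, tid fourth) and, like the quoted grep, that the relevant lines contain the string ceph-osd.

```shell
#!/bin/sh
# hottest_thread (hypothetical helper): reads the output of
#   ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
# on stdin and prints "PID TID %CPU" for the line mentioning ceph-osd
# that has the highest %CPU value. Prints nothing if no line matches.
hottest_thread() {
    awk '
        BEGIN { max = -1 }
        # Same filter as the grep in the quoted command: any line containing ceph-osd.
        /ceph-osd/ && ($1 + 0) > max {
            max = $1 + 0   # numeric %CPU of the best candidate so far
            cpu = $1; pid = $3; tid = $4
        }
        END { if (tid != "") print pid, tid, cpu }
    '
}

# Possible usage on a live host:
#   ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | hottest_thread
```

The TID printed here should correspond to the LWP number that gdb shows in each "Thread N (... (LWP nnnn))" header of the backtrace dump, which is one way to locate the stack of the spinning thread.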

Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Denes Dolhay
Hello,

You might consider checking the iowait (during the problem), and the dmesg (after it recovered). Maybe an issue with the given sata/sas/nvme port?

Regards,
Denes

On 11/29/2017 06:24 PM, Matthew Vernon wrote:
Hi, We have a 3,060 OSD ceph cluster (running Jewel

Re: [ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Jean-Charles Lopez
Hi Matthew, anything special happening on the NIC side that could cause a problem? Packet drops? Incorrect jumbo frame settings causing fragmentation? Have you checked the C-state settings on the box? Have you disabled energy saving settings differently from the other boxes? Any unexpected

[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

2017-11-29 Thread Matthew Vernon
Hi, We have a 3,060 OSD ceph cluster (running Jewel 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that host), and having ops blocking on it for some time. It will then behave for a bit, and then go back