On 29/11/17 17:24, Matthew Vernon wrote:
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. [...]
# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | grep ceph-osd
To find the actual thread that is using 100% CPU.
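If you want to narrow it down further, something like this (untested; [PID] is a placeholder for the OSD's pid, as in the gdb command below) sorts that process's threads by CPU:

# ps -Lo %cpu,tid,comm -p [PID] --sort=-%cpu | head

The TID at the top is the one to look for in the stack dump.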
# for x in `seq 1 5`; do gdb -batch -p [PID] -ex "thr appl all bt";
echo; done > /tmp/osd.stack.dump
Then look at the stacks for the thread that was using all the CPU and
see what it is doing.
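To pull just that thread's backtrace out of the dump, something along these lines should do ([TID] is a placeholder for the thread id found via ps above; gdb prints it as the LWP in each thread header):

# grep -A20 'LWP [TID]' /tmp/osd.stack.dump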
Hello,
You might consider checking iowait (while the problem is happening) and
dmesg (after it has recovered). Maybe there is an issue with the
SATA/SAS/NVMe port in question?
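For example (the device filters are only illustrative), something like:

# iostat -x 1
# dmesg -T | grep -iE 'ata|scsi|nvme|error'

iostat -x shows iowait and per-device await/utilisation while the OSD is
spinning, and the dmesg grep should surface any link resets or medium
errors on the suspect port afterwards.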
Regards,
Denes
On 11/29/2017 06:24 PM, Matthew Vernon wrote:
Hi,
We have a 3,060 OSD ceph cluster (running Jewel
10.2.7-0ubuntu0.16.04.1) [...]
Hi Matthew,
Anything special happening on the NIC side that could cause a problem? Packet
drops? Incorrect jumbo frame settings causing fragmentation?
Have you checked the C-state settings on the box?
Have you disabled energy-saving settings differently from the other boxes?
Any unexpected wait states?
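Something along these lines (eth0 and the sysfs paths are only examples;
they differ per NIC/CPU) makes it easy to compare against a healthy box:

# ip -s link show eth0
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# cat /sys/module/intel_idle/parameters/max_cstate

The first shows the MTU and RX/TX drop counters; the other two show
whether frequency scaling or deep C-states are configured differently
from the other hosts.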
Hi,
We have a 3,060 OSD ceph cluster (running Jewel
10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
host), and having ops blocking on it for some time. It will then behave
for a bit, and then go back to misbehaving. [...]