[ceph-users] OSD down cause all OSD slow ops

2023-03-30 Thread petersun
We experienced a Ceph failure in which the cluster became unresponsive, with no
IOPS or throughput, due to a problematic OSD process on one node. This resulted
in slow operations and zero IOPS on all other OSDs in the cluster. The incident
timeline is as follows:

- Alert triggered for an OSD problem.
- 6 out of 12 OSDs on the node were down.
- A soft restart was attempted, but the smartmontools process got stuck while shutting down the server.
- A hard restart was attempted, and service resumed as usual.

Our Ceph cluster has 19 nodes and 218 OSDs, and is running version 15.2.17 octopus (stable).

Questions:
1. What is Ceph's failure-detection mechanism? Why couldn't Ceph detect the faulty node and automatically mark its OSDs down so the rest of the cluster could keep serving I/O?
2. Did we miss any patches or bug fixes?
3. Any suggestions for improvements so we can quickly detect and avoid similar issues in the future? (The settings we plan to review first are sketched below.)
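
For reference, here is a minimal sketch of how we plan to review the failure-detection settings with the ceph CLI. As far as we know our cluster is still on the defaults for these, so the option names below are the stock upstream ones rather than anything we have tuned:

# how long missed heartbeats are tolerated before peers report an OSD down
ceph config get osd osd_heartbeat_grace
# how many distinct reporters the monitors require before marking an OSD down
ceph config get mon mon_osd_min_down_reporters
# how long a down OSD stays "in" before it is marked out and recovery starts
ceph config get mon mon_osd_down_out_interval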


[ceph-users] Ceph Failure and OSD Node Stuck Incident

2023-03-30 Thread petersun
We encountered a Ceph failure in which the cluster became unresponsive, with no
IOPS or throughput, after one node failed. Upon investigation, it appears that
the OSD process on one of the Ceph storage nodes was stuck, while the host still
responded to ping. During the failure, Ceph was unable to recognize the
problematic node, which left all other OSDs in the cluster with slow operations
and the cluster with no IOPS at all.

Here's the timeline of the incident:

- At 10:40, an alert was triggered indicating a problem with the OSDs.
- After the alert, Ceph became unresponsive, with no IOPS or throughput.
- At 11:26, an engineer discovered a gradual OSD failure, with 6 out of 12 OSDs on the node down.
- At 11:46, the Ceph engineer was unable to SSH into the faulty node and attempted a soft restart, but the "smartmontools" process got stuck while the server was shutting down. Ping still worked during this time.
- After waiting about one or two minutes, a hard restart of the server was attempted.
- At 11:57, the Ceph node started normally and service resumed as usual, indicating that the issue had been resolved.

Here is some basic information about our services:

- `Mon: 5 daemons, quorum host001, host002, host003, host004, host005 (age 4w)`
- `Mgr: host005 (active, since 4w), standbys: host001, host002, host003, host004`
- `Osd: 218 osds: 218 up (since 22h), 218 in (since 22h)`

We have a cluster with 19 nodes: 15 SSD nodes and 4 HDD nodes, for a total of
218 OSDs. Each SSD node has 11 OSDs on Samsung 870 EVO SSDs, with each drive's
DB/WAL placed on a 1.6T NVMe drive. We are using Ceph version 15.2.17
(8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable).

Here is the health check detail:
[root@node21 ~]#  ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 12 pgs inactive, 12 pgs peering; Degraded data redundancy: 272273/43967625 objects degraded (0.619%), 88 pgs degraded, 5 pgs undersized; 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.
[WRN] OSD_DOWN: 1 osds down
    osd.174 (root=default,host=hkhost031) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 12 pgs inactive, 12 pgs peering
    pg 2.dc is stuck peering for 49m, current state peering, last acting [87,95,172]
    pg 2.e2 is stuck peering for 15m, current state peering, last acting [51,177,97]
    ..
    pg 2.f7e is active+undersized+degraded, acting [10,214]
    pg 2.f84 is active+undersized+degraded, acting [91,52]
[WRN] SLOW_OPS: 18192 slow ops, oldest one blocked for 3730 sec, daemons [osd.0,osd.1,osd.101,osd.103,osd.107,osd.108,osd.109,osd.11,osd.111,osd.112]... have slow ops.

I have the following questions:

1. Why couldn't Ceph detect the faulty node and automatically mark its OSDs down so the rest of the cluster could keep serving I/O? Can anyone provide more troubleshooting guidance for this case? (The commands we plan to start with next time are sketched below.)
2. What is Ceph's failure-detection mechanism, and where can I find related documentation? All of our production cloud machines were affected and suspended; if RBD is this unstable, we cannot continue to use Ceph for our RBD storage.
3. Did we miss any patches or bug fixes?
4. Can anyone suggest improvements so that we can quickly detect and avoid similar issues in the future?
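
For completeness, this is the per-OSD troubleshooting we intend to run from the admin socket the next time this happens; osd.174 is simply the OSD named in the health output above, and the daemon commands have to be run on the host that carries it:

# operations currently blocked on this OSD, including how long they have waited
ceph daemon osd.174 dump_blocked_ops
# all in-flight operations, for additional context
ceph daemon osd.174 dump_ops_in_flight
# which OSDs the monitors currently consider down
ceph osd tree down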


[ceph-users] Re: ceph cluster iops low

2023-01-24 Thread petersun
Hi Mark,
Thanks for your response, it helps!
Our Ceph cluster uses Samsung 870 EVO SSDs, all backed by NVMe drives: 12 SSDs
to 2 NVMe drives per storage node, with each 4TB SSD backed by a 283G NVMe LVM
partition for its DB.
Right now cluster throughput is only about 300 MB/s of writes and around 5K
IOPS. I can see NVMe drive utilization over 95% in ‘iostat’. Will the NVMe
drives quickly become a bottleneck if we push a lot of IO through the cluster?
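
To narrow down whether the shared NVMe DB devices are the limit, these are the checks I am running; osd.0 and /dev/nvme0n1 are placeholders for our actual OSD IDs and device names:

# per-device utilization and await on the shared NVMe DB/WAL device
iostat -x 1 /dev/nvme0n1
# BlueFS counters for one OSD; comparing db_used_bytes with slow_used_bytes
# shows whether RocksDB has spilled over from the NVMe partition onto the SSD
ceph daemon osd.0 perf dump bluefs
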
I have read the article mentioned above about pinning OSDs to CPU cores.
However, I could only find a script called pincpu on GitHub that automates
allocating CPU cores to OSDs, and it does not seem to work for me. Do you have
any tool or official instructions that could guide me through testing this?
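
In case it helps the discussion, this is the manual approach I was planning to try instead of the pincpu script. It is only a sketch: the NUMA option is the stock Ceph one as I understand it, and the core range in the drop-in is an example that would have to match our real topology:

# let Ceph pick NUMA affinity itself, if I understand the option correctly
ceph config set osd osd_numa_auto_affinity true

# or pin the OSD services explicitly with a systemd drop-in on each host,
# e.g. /etc/systemd/system/ceph-osd@.service.d/cpuaffinity.conf containing:
#   [Service]
#   CPUAffinity=0-11
systemctl daemon-reload
systemctl restart ceph-osd@0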


[ceph-users] ceph cluster iops low

2023-01-23 Thread petersun
My Ceph IOPS are very low: 48 SSD OSDs, backed by NVMe drives for DB/WAL,
across four physical servers, yet the whole cluster does only about 20K IOPS in
total. It looks like the IO is being throttled by a bottleneck somewhere. While
I run an fio benchmark with 4K blocks at queue depth 128 (sketched below),
dstat shows a lot of context switches and over 150K interrupts.
I checked the SSDs: each does only about 40 MB/s at around 250 IOs. The network
is 20G in total and not saturated. The CPUs (2x E5 2950v2 per node) are around
50% idle.
Is it normal for the context switches and interrupts to be that high, and how
can I reduce them? Where else could the bottleneck be?
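
For reference, this is roughly the fio job I am running against an RBD image; the pool and image names are placeholders, so treat it as a sketch of the parameters rather than the exact command:

fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=testpool --rbdname=fio-test \
    --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 \
    --runtime=60 --time_based --group_reporting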