Hello all,
wrt: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXNULEBHVUIXQQGYUDJAO2SF/

Yesterday we hit a problem with osd_pglog memory, similar to the thread above.

We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per node.
We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).

The cluster has been running fine, and (as relevant to the post) the memory
usage has been stable at ~100 GB per node. We've been running with the default
pg_log length of 3000. User traffic doesn't seem to have been exceptional lately.
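
(For context, the 3000 is, as far as we understand, the osd_min_pg_log_entries
default, with osd_max_pg_log_entries as the related upper bound. What a running
OSD actually uses can be checked on its node with something like the following,
osd.0 being a placeholder id:

  ceph daemon osd.0 config get osd_min_pg_log_entries
  ceph daemon osd.0 config get osd_max_pg_log_entries
)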

Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory
usage on the OSD nodes started to grow. On each node it grew steadily by about
30 GB/day, until the servers started OOM-killing OSD processes.

After a lot of debugging we found that the pg_logs were huge. The pg_log of each
OSD process had grown to ~22 GB, which we naturally didn't have the memory for,
and at that point the cluster was in an unstable state. This is significantly
more than the ~1.5 GB in the post above. We do have ~20k PGs, which may directly
affect the size.
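
Back of the envelope: ~20k PGs at 8+3 (11 shards each) spread over 56 x 25 OSDs
is roughly 160 PG shards per OSD, so ~22 GB works out to something like 140 MB
of pg_log per PG shard. The per-OSD pg_log memory can be seen in the mempool
stats, e.g. with something like this on an OSD node (osd.0 as a placeholder id):

  # the osd_pglog mempool shows bytes/items held by pg_log entries
  ceph daemon osd.0 dump_mempools | grep -A 3 '"osd_pglog"'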

We've reduced the pg_log length to 500, started trimming it offline where we
can, and otherwise just waited. The pg_log size has dropped to ~1.2 GB on at
least some nodes, but we're still recovering, and we still have a lot of OSDs
down and out.
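
For anyone hitting the same thing, the steps boil down to roughly the following
sketch (option names, paths and ids are illustrative; the offline trim needs the
OSD stopped and is done per PG):

  # lower the pg_log length for all OSDs
  ceph config set osd osd_min_pg_log_entries 500
  ceph config set osd osd_max_pg_log_entries 500

  # offline trim on a stopped OSD, one PG at a time
  systemctl stop ceph-osd@0
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --op trim-pg-log --pgid <pgid from the list above>
  systemctl start ceph-osd@0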

We're unsure whether version 14.2.13 triggered this, or whether the OSD restarts
did (or something unrelated that we don't see).

This mail is mostly to ask whether there are good guesses as to why the pg_log
size per OSD process exploded. Any technical (and moral) support is appreciated.
Also, since we're currently not sure whether 14.2.13 triggered this, this is
also meant to put a data point out there for other debuggers.

Cheers,
Kalle Happonen