Hi Peter,

If you can reproduce and have debug symbols installed, I'd be interested to see the output of this tool:


https://github.com/markhpc/uwpmp/
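

Building it is normally just a clone and a go build, roughly along these lines (from memory, so double-check the README, especially the output binary name and package layout):

git clone https://github.com/markhpc/uwpmp.git
cd uwpmp
go build -o unwindpmp .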


It might need slightly different compile instructions if you have a newer version of go.  I can send you an executable offline if needed.  Since RGW can potentially have a fairly insane number of threads with the default settings, it will gather samples pretty slowly.  Just start out by collecting something like 100 samples:


sudo ./unwindpmp -n 100 -p `pidof radosgw` > foo.txt


Hopefully that will help diagnose where all of the threads are spending their time in the code.  uwpmp also has a much faster libdw backend (-b libdw), but the call graphs aren't always accurate, so I would stick with the default unwind backend for now.
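

If you do want to try the libdw backend later, it's the same invocation apart from the backend flag, i.e. something like:


sudo ./unwindpmp -b libdw -n 100 -p `pidof radosgw` > foo.txt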


Mark


On 6/12/23 12:15, grin wrote:
Hello,

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

There is a single (test) radosgw serving plenty of test traffic. Under heavy load
("heavy" in a relative sense, about 1k req/s) it pretty reliably hangs: low-traffic
threads seem to keep working (like handling occasional PUTs), but GETs are completely
unresponsive and all attention seems to be spent on futexes.

The effect is extremely similar to
https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-around-850-established-connections
 (subject: "Radosgw (civetweb) hangs once around 850 established connections"),
except this is Quincy, so it's beast instead of civetweb. The effect is the same
as described there, except the cluster is way smaller (about 20-40 OSDs).

I observed that when I start radosgw -f with debug 20/20 it almost never hangs,
so my guess is some ugly race condition. However, I am a bit clueless about how to
actually debug it, since debugging makes it go away. Debug 1 (default) with -d
seems to hang after a while, but it's not that simple to induce; I'm still
testing under 4/4.
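
A full thread dump from the hung process should show the same picture without
needing the heavy debug logging; something along these lines, assuming debug
symbols are installed:

gdb -p $(pidof radosgw) -batch -ex 'thread apply all bt' > threads.txt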

Also, I do not see much that can be configured for beast.

To answer the questions from the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart from
the one serving occasional PUTs (rough strace invocation below)
- the TCP port doesn't respond.
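
The futex observation is just from attaching strace to the hung process,
roughly like this (exact flags from memory, so treat it as a sketch):

sudo strace -f -e trace=futex -p $(pidof radosgw)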

IRC didn't react. ;-)

Thanks,
Peter

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
