Denis Krienbühl <de...@href.ch> writes:

> Hi everyone
>
> We have faced some RGW outages recently, with the RGW returning HTTP 503: 
> first for a few requests, then for most, then for all, over the course of 
> 1-2 hours. This seems to have started after we updated from 15.2.4 to 15.2.5.
>
> The line that accompanies these outages in the log is the following:
>
>       s3:list_bucket Scheduling request failed with -2218
There isn't much in terms of code changes in the scheduler from v15.2.4 to
v15.2.5. Does a perf dump on the RGW admin socket (`ceph daemon
<client.rgw-name> perf dump`) show any throttle counts?
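For example, something along these lines; the socket path is an assumption,
so adjust it to wherever your admin sockets live:

    # on the RGW host; the .asok name below is a placeholder
    ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok perf dump \
        | jq '.rgw | {qlen, qactive}'

If `qlen`/`qactive` climb steadily and never come back down, that would point
at requests not being marked complete.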

>
> It first pops up a few times here and there, until it eventually applies to 
> all requests. It seems to indicate that the throttler has reached the limit 
> of open connections.
>
> As we run a pair of HAProxy instances in front of RGW, which limit the number 
> of connections to the two RGW instances to 400, the throttler’s limit should 
> never be reached. We do use RGW metadata sync between the instances, which 
> could account for some extra connections, but when I look at the open TCP 
> connections between the instances I count no more than 20 at any given time.
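For a quick sanity check on the RGW side, something like this counts the
established connections (assuming the default beast port 7480; substitute
yours):

    # -H drops the header so wc -l counts only connections
    ss -Htn state established '( sport = :7480 )' | wc -l

If that stays far below 1024 while the 503s pile up, it would support the
idea that the throttler's counter, rather than the real connection count, is
what is exhausted.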
>
> I also noticed that some requests in the RGW log never seem to complete. 
> That is, I can find a ‘starting new request’ line, but no associated ‘req 
> done’ or ‘beast’ line.
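If you want to quantify that, a rough way is to diff the request ids (caveat:
the req=0x... value is a reused pointer, so this is only approximate):

    # requests that were started but never logged as done
    grep -o 'starting new request req=[0-9a-fx]*' rgw.log | awk '{print $NF}' | sort > started.txt
    grep -o 'req done req=[0-9a-fx]*' rgw.log | awk '{print $NF}' | sort > done.txt
    comm -23 started.txt done.txt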
>
> I don’t think there are any hung connections around, as they are killed by 
> HAProxy after a short timeout.
>
> Looking at the code, it seems as if the throttler in use (SimpleThrottler) 
> eventually reaches its maximum count of 1024 concurrent requests 
> (outstanding_requests) and never recovers. I believe that the 
> request_complete function is not called in all cases, but I am not familiar 
> with the Ceph codebase, so I am not sure.
>
> See 
> https://github.com/ceph/ceph/blob/cc17681b478594aa39dd80437256a54e388432f0/src/rgw/rgw_dmclock_async_scheduler.h#L166-L214
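For what it's worth, I believe 2218 is ERR_RATE_LIMITED in rgw_common.h,
which fits the theory that the throttler is at its limit. If you want to
watch for a leak, a crude loop like this (socket path and port are
assumptions again) compares the scheduler's view with the real socket count:

    # if in-flight requests ratchet up while sockets stay flat,
    # something is dropping requests without completing them
    while sleep 60; do
        date
        ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok perf dump | jq '.rgw.qactive'
        ss -Htn state established '( sport = :7480 )' | wc -l
    done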
>
> Does anyone see the same phenomenon? Could this be a bug in the request 
> handling of RGW, or am I wrong in my assumptions?
>
> For now we’re just restarting our RGWs regularly, which seems to keep the 
> problem at bay.
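That is a reasonable stopgap. For reference, a cron entry along these lines
(assuming systemd-managed daemons; the instance name is a placeholder) keeps
it hands-off:

    # /etc/cron.d/rgw-restart -- hypothetical stopgap, every 6 hours
    0 */6 * * * root systemctl restart ceph-radosgw@rgw.<name>.service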
>
> Thanks for any hints.
>
> Denis

-- 
Abhishek 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
