On Fri, Nov 27, 2015 at 9:52 PM, Daniel Maraio <dmar...@choopa.com> wrote:

> Hello,
>
>   Can you provide some further details? What is the size of your objects,
> and how many objects do you have in your buckets? Are you using bucket index
> sharding? Are you sharding your objects over multiple buckets? Is the
> cluster doing any scrubbing during these periods? It sounds like you may be
> having trouble with your rgw bucket index. In our cluster, much smaller
> than yours mind you, it was necessary to put the rgw bucket index onto its
> own set of OSDs to isolate it from the rest of the cluster IO. We are still
> using single-object bucket indexes but have a plan to move to a sharded
> bucket index eventually.
>
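
(For context, a rough sketch of the two suggestions above, assuming a
Hammer-era cluster; the shard count, section name, rule name, and CRUSH root
below are placeholders rather than values from either cluster:)

    # ceph.conf (radosgw client section) -- index sharding applies to newly
    # created buckets; existing bucket indexes keep their current layout:
    #   [client.radosgw.gateway]
    #   rgw override bucket index max shards = 16
    #
    # Isolating the index pool: create a CRUSH rule whose root contains only
    # the dedicated OSDs, then point .rgw.buckets.index at it.
    # (assumes a CRUSH root named index-root already holds those OSDs)
    ceph osd crush rule create-simple index-only index-root host
    ceph osd pool set .rgw.buckets.index crush_ruleset <rule-id>   # "crush_rule" on newer releases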

Taking those in order (and I apologize if I'm conflating S3 buckets with Ceph
buckets here):

 - Since this is an S3 cluster, object sizes range from a few bytes to tens
of GB.  On average, most objects are around a megabyte or two.
 - We currently have 41.4M objects in the cluster.  Some buckets have a few
objects, some have several million.
 - Yes, we are using bucket index sharding.
 - Objects are sharded with 7+2 erasure coding, and the CRUSH map is set up
such that each PG will contain an OSD from each physical server (apologies if
I'm misunderstanding "bucket" here); a rough sketch of the matching
erasure-code profile is below.
 - Scrubbing runs on the default schedule, so there has been no more or
less scrubbing going on during this incident than before, when things were
working well.  Scrubbing operations kick off periodically throughout the
day and complete in a few minutes' time.

That reminds me -- we also disabled scrubbing for several hours, and we
noticed no decrease in the rate of slow requests.
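
A pool laid out the way the erasure-coding bullet above describes would
typically come from a profile along these lines (a sketch only, not this
cluster's actual configuration; the profile and pool names are placeholders,
and on releases newer than Hammer the option is crush-failure-domain rather
than ruleset-failure-domain):

    # 7 data + 2 coding chunks, one chunk per host, so each PG spans nine hosts
    ceph osd erasure-code-profile set ec-7-2 k=7 m=2 ruleset-failure-domain=host
    ceph osd erasure-code-profile get ec-7-2

    # create the data pool against that profile (pg_num here is only an example)
    ceph osd pool create .rgw.buckets 1024 1024 erasure ec-7-2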



>
>   You should determine which OSDs your bucket indexes are located on and
> see if a pattern emerges with the OSDs that have slow requests during these
> periods. You can use the command 'ceph pg ls-by-pool .rgw.buckets.index'
> to show which PGs/OSDs the bucket index resides on.
>
> - Daniel
>

When I run this (and with a little bash-fu), I see 509 of my 648 OSDs
marked as the acting primary for at least one of the 1024 PGs in that pool.
It'll take some digging to see what relationship exists between these OSDs
and the ones marked as slow.
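
(For anyone repeating this, one way to do that counting, assuming the
plain-text output of 'ceph pg ls-by-pool' with an acting_primary column in
the header; the column layout differs between releases, so the awk below
locates the column instead of hard-coding a field number:)

    ceph pg ls-by-pool .rgw.buckets.index |
      awk 'NR == 1 { for (i = 1; i <= NF; i++)
                       if (tolower($i) == "acting_primary") col = i }
           NR > 1 && col { print $col }' |
      sort -n | uniq -c | sort -rn | head

    # then compare against whatever is reporting slow requests right now
    # (the exact wording of the health output varies by release)
    ceph health detail | grep -i slow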

I appreciate the quick response.

Brian