Hi Paul,

I’ll interleave responses below.

> On Jul 31, 2019, at 2:02 PM, Paul Emmerich <paul.emmer...@croit.io> wrote:
> 
> we are seeing a trend towards rather large RGW S3 buckets lately.
> we've worked on
> several clusters with 100 - 500 million objects in a single bucket, and we've
> been asked about the possibilities of buckets with several billion objects
> more than once.
> 
> From our experience: buckets with tens of millions of objects usually work
> just fine with no big problems. Buckets with hundreds of millions of objects
> require some attention. Buckets with billions of objects? "How about indexless
> buckets?" - "No, we need to list them".
> 
> 
> A few stories and some questions:
> 
> 
> 1. The recommended number of objects per shard is 100k. Why? How was this
> default configuration derived?
> 
> It doesn't really match my experiences. We know a few clusters running with
> larger shards because resharding isn't possible for various reasons at the
> moment. They sometimes work better than buckets with lots of shards.
> 
> So we've been considering at least doubling that 100k target shard size
> for large buckets, which would make the following point far less annoying.

I believe the 100,000-objects-per-shard default came from a bit of experience
and some back-of-the-envelope calculations. Please keep us updated on what you
find with 200,000 objects per shard.
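
For anyone who wants to play with the numbers, here is a minimal
back-of-the-envelope sketch (plain Python, purely illustrative; shards_for()
is not an RGW function) of how the shard count falls out of an object count
and a per-shard target:

    def shards_for(num_objects, objs_per_shard=100_000):
        """Smallest shard count that keeps every shard at or under the target."""
        return max(1, -(-num_objects // objs_per_shard))  # ceiling division

    # 500 million objects at the current 100k default vs. a 200k target:
    print(shards_for(500_000_000))           # 5000 shards
    print(shards_for(500_000_000, 200_000))  # 2500 shards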

> 2. Many shards + ordered object listing = lots of IO
> 
> Unfortunately telling people not to use ordered listings when they don't
> really need them doesn't work, as their software usually just doesn't support
> that :(

We are exploring sharding schemes that maintain ordering, and that would really 
help here.

> A listing request for X objects will retrieve up to X objects from each shard
> for ordering them. That will lead to quite a lot of traffic between the OSDs
> and the radosgw instances, even for relatively innocent simple queries as X
> defaults to 1000 usually.

What you say is correct. And it gets worse: we then have to go through all of
the returned lists, select the, say, 1000 earliest entries to return, and throw
the rest away.
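
To make the cost concrete, here is a simplified model (plain Python, not the
actual radosgw/cls_rgw code) of why an ordered listing over-fetches: each shard
has to hand back up to max_entries candidates so the gateway can merge them and
keep only the earliest ones.

    import heapq

    def ordered_listing(shard_listings, max_entries=1000):
        """shard_listings: one sorted list of object names per shard, each
        already truncated to at most max_entries entries -- that truncation is
        the over-fetch, since every shard returns up to max_entries candidates.
        Returns only the max_entries lexicographically smallest names; everything
        else that was fetched is thrown away."""
        merged = heapq.merge(*shard_listings)  # k-way merge of the sorted inputs
        return [name for name, _ in zip(merged, range(max_entries))]

    # With 4096 shards and max_entries=1000, up to 4096 * 1000 index entries
    # are read from the OSDs, but only 1000 names go back to the S3 client.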

> Simple example: just getting the first page of a bucket listing with 4096
> shards fetches around 1 GB of data from the OSD to return ~300kb or so to the
> S3 client.

Correct.
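
That also matches a back-of-the-envelope check: 4096 shards times 1000 entries
each, at a few hundred bytes per index entry (a rough assumption), is on the
order of 1 GB read from the index pool, while the ~1000 entries actually
returned come to roughly 300 KB.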

> I've got two clusters here that are only used for a relatively low-bandwidth
> backup use case. However, there are a few buckets with hundreds of millions
> of objects that are sometimes listed by the backup system.
> 
> The result is that this cluster has an average read IO of 1-2 GB/s, all going
> to the index pool. Not a big deal since that's coming from SSDs and goes over
> 80 Gbit/s LACP bonds. But it does raise the question of scalability, as the
> user-visible load created by the S3 clients is quite low.
> 
> 
> 
> 3. Deleting large buckets
> 
> Someone accidentally put 450 million small objects into a bucket and only
> noticed when the cluster ran full. The bucket isn't needed, so just delete it
> and case closed?
> 
> Deleting is unfortunately far slower than adding objects; also, radosgw-admin
> leaks memory during deletion: https://tracker.ceph.com/issues/40700
> 
> Increasing --max-concurrent-ios helps with deletion speed (the option does
> affect deletion concurrency, even though the documentation says it only
> applies to other specific commands).
> 
> Since the deletion is going faster than new data is being added to that
> cluster, the "solution" was to run the deletion command in a memory-limited
> cgroup and restart it automatically after it gets killed due to the leak.

That tracker is being investigated.
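
Until the leak is fixed, the restart-on-kill workaround you describe can be
scripted; here is a minimal sketch (Python; the bucket name and
--max-concurrent-ios value are placeholders, and the memory cap is assumed to
be enforced by whatever cgroup or unit the script runs under):

    import subprocess
    import sys
    import time

    BUCKET = "big-bucket"       # placeholder bucket name
    MAX_CONCURRENT_IOS = 128    # placeholder; higher values speed up deletion

    # Re-run the deletion until it exits cleanly.  The memory limit itself is
    # enforced outside this script (by the cgroup it runs in), so an OOM kill
    # simply shows up here as a non-zero exit status and we start again.
    while True:
        result = subprocess.run([
            "radosgw-admin", "bucket", "rm",
            f"--bucket={BUCKET}",
            "--purge-objects",
            f"--max-concurrent-ios={MAX_CONCURRENT_IOS}",
        ])
        if result.returncode == 0:
            print("bucket deletion finished")
            break
        print(f"radosgw-admin exited with {result.returncode}; restarting",
              file=sys.stderr)
        time.sleep(5)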

> What could bucket deletion look like in the future? Would it be possible
> to put all of a bucket's objects into a RADOS namespace and implement some
> kind of efficient namespace deletion at the OSD level, similar to how pool
> deletions are handled at a lower level?

I’ll raise that with other RGW developers. I’m unfamiliar with how RADOS 
namespaces are handled.
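
For comparison while I look into it, this is roughly what a client has to do
today to clear out a namespace with python-rados (pool, namespace, and conffile
below are made-up placeholders); your proposal would push this per-object loop
down to the OSDs:

    import rados

    # Placeholder names throughout; one remove call per object is exactly the
    # client-side cost an OSD-level "delete namespace" operation would avoid.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("default.rgw.buckets.data")  # placeholder pool
        ioctx.set_namespace("bucket-1234-ns")        # hypothetical namespace
        for obj in ioctx.list_objects():
            ioctx.remove_object(obj.key)
        ioctx.close()
    finally:
        cluster.shutdown()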

> 4. Common prefixes could be filtered in the rgw class on the OSD instead
> of in radosgw
> 
> Consider a bucket with 100 folders of 1000 objects each, and only one shard:
> 
> /p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
> 
> 
> Now a user wants to list /, aggregating common prefixes on the delimiter /,
> and wants up to 1000 results. So 100 results will be returned to the client:
> the common prefixes p1 to p100.
> 
> How much data will be transferred between the OSDs and radosgw for this
> request? How many omap entries does the OSD scan?
> 
> radosgw will ask the (single) index object to list the first 1000 objects.
> It'll return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000
> 
> radosgw will discard 999 of these, detect one common prefix, and continue the
> iteration at /p1/\xFF to skip the remaining entries in /p1/, if there are any.
> The OSD will then return everything in /p2/ in the next request, and so on.
> 
> So it'll internally list every single object in that bucket. That will be a
> problem for large buckets, and having lots of shards doesn't help either.
> 
> 
> This shouldn't be too hard to fix: add an "aggregate prefixes" option to the
> RGW class method and duplicate the fast-forward logic from radosgw in cls_rgw.
> It doesn't even need to change the response type or anything; it just needs to
> limit entries in common prefixes to one result.
> Is this a good idea or am I missing something?

On the face of it, this looks good. I’ll raise it with the other RGW developers.
I do know that a related bug was recently addressed in this PR:

        https://github.com/ceph/ceph/pull/28192

But your suggestion seems to go further.
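
To make sure I follow the proposal, here is a rough model (plain Python, not
cls_rgw) of the fast-forward you describe, applied to one shard's sorted index
keys; the idea would be to run this loop OSD-side so at most one entry per
common prefix crosses the wire:

    import bisect

    def list_with_prefixes(sorted_keys, delimiter="/", max_entries=1000):
        """Returns (entries, common_prefixes) for one shard's sorted key list.
        When a key contains the delimiter, the prefix is reported once and the
        scan seeks to just past that prefix (the fast-forward radosgw does
        today) instead of reading every key underneath it."""
        entries, prefixes = [], []
        i = 0
        while i < len(sorted_keys) and len(entries) + len(prefixes) < max_entries:
            key = sorted_keys[i]
            pos = key.find(delimiter)
            if pos == -1:
                entries.append(key)
                i += 1
            else:
                prefix = key[:pos + 1]
                prefixes.append(prefix)
                # Seek past everything under this prefix (models an omap seek).
                i = bisect.bisect_left(sorted_keys, prefix + "\xff", i)
        return entries, prefixes

    # 100 "folders" with 1000 keys each: 100 common prefixes come back without
    # walking the other ~99,900 keys, which is the reduction an OSD-side
    # version would keep on the OSD instead of shipping entries to radosgw.
    keys = sorted(f"p{d}/{n}" for d in range(1, 101) for n in range(1, 1001))
    entries, prefixes = list_with_prefixes(keys)
    print(len(entries), len(prefixes))  # 0 100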

> IO would be reduced by a factor of 100 for that particular pathological case.
> I've unfortunately seen a real-world setup that I think hits a case like that.


Thank you for sharing your experiences and your ideas.

Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
