Hi Paul,
I’ve turned the following idea of yours into a tracker:
https://tracker.ceph.com/issues/41051
> 4. Common prefixes could be filtered in the rgw class on the OSD instead
> of in radosgw
>
> Consider a bucket with 100 folders with 1000 objects in each and only one
> shard
>
> /p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
>
>
> Now a user wants to list /, aggregating common prefixes on the delimiter /,
> and wants up to 1000 results.
> So there'll be 100 results returned to the client: the common prefixes
> p1 to p100.
>
> How much data will be transferred between the OSDs and radosgw for this
> request?
> How many omap entries does the OSD scan?
>
> radosgw will ask the (single) index object to list the first 1000 objects.
> It'll return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000
>
> radosgw will discard 999 of these, detect one common prefix, and continue the
> iteration at /p1/\xFF to skip the remaining entries in /p1/, if there are any.
> The OSD will then return everything in /p2/ in that next request, and so on.
>
> So it'll internally list every single object in that bucket. That will be a
> problem for large buckets, and having lots of shards doesn't help either.
>
>
> This shouldn't be too hard to fix: add an option "aggregate prefixes" to the
> RGW class method and duplicate the fast-forward logic from radosgw in cls_rgw.
> It doesn't even need to change the response type or anything, it just needs to
> limit entries in common prefixes to one result.
> Is this a good idea or am I missing something?
>
> IO would be reduced by a factor of 100 for that particular pathological case.
> I've unfortunately seen a real-world setup that I think hits a case like that.
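
To make the quoted scenario concrete, here is a toy Python model of the two
listing strategies. It is only a sketch under stated assumptions: a sorted
Python list stands in for the index object's omap, and the function names
(build_index, list_in_radosgw, list_in_osd) are hypothetical and are not the
cls_rgw or radosgw API. It counts entries returned to the caller, not disk IO.

```python
from bisect import bisect_right


def build_index(folders=100, per_folder=1000):
    """Sorted keys /p1/1 ... /p100/1000; a stand-in for one index shard."""
    return sorted(f"/p{f}/{o}"
                  for f in range(1, folders + 1)
                  for o in range(1, per_folder + 1))


def collapse(key, prefix, delim):
    """Return (result, next_marker) for one key under `prefix`."""
    cut = key.find(delim, len(prefix))
    if cut >= 0:
        common = key[:cut + 1]
        # Fast-forward marker: sorts after every key below this common prefix.
        return common, common + "\xff"
    return key, key


def list_in_radosgw(index, prefix, delim, max_results, chunk=1000):
    """Aggregation done by the caller: the 'OSD' returns plain chunks."""
    results, sent, marker = [], 0, ""
    while len(results) < max_results:
        start = bisect_right(index, marker)
        batch = index[start:start + chunk]   # one request to the index object
        if not batch:
            break
        sent += len(batch)                   # every entry crosses the wire
        for key in batch:
            if key <= marker or not key.startswith(prefix):
                continue                     # already covered by a common prefix
            result, marker = collapse(key, prefix, delim)
            results.append(result)
            if len(results) >= max_results:
                break
    return results, sent


def list_in_osd(index, prefix, delim, max_results):
    """Aggregation done next to the data: one entry per common prefix."""
    results, marker = [], ""
    while len(results) < max_results:
        pos = bisect_right(index, marker)    # stands in for an omap seek
        if pos == len(index):
            break
        key = index[pos]
        if not key.startswith(prefix):
            marker = key
            continue
        result, marker = collapse(key, prefix, delim)
        results.append(result)
    return results, len(results)             # only the results are returned


if __name__ == "__main__":
    idx = build_index()
    for fn in (list_in_radosgw, list_in_osd):
        found, sent = fn(idx, "/", "/", 1000)
        print(fn.__name__, "results:", len(found), "entries sent:", sent)
```

In this toy model both functions return the same 100 common prefixes; the
difference is only in how many entries are handed back to the caller along the
way. Real savings would depend on how cls_rgw iterates the omap, which this
sketch does not attempt to model.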
Eric
--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA