I originally reported the linked issue. I've seen this problem with negative stats on several S3 setups, but I could never figure out how to reproduce it.
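As a side note on why the count looks the way it does: 18446744073709551603, the rgw.none num_objects value quoted below, is exactly 2^64 - 13, i.e. a counter that was decremented 13 times past zero and wrapped around in an unsigned 64-bit field. A minimal check (plain integer arithmetic, nothing Ceph-specific):

```python
# The rgw.none "num_objects" value from the bucket stats quoted below.
reported = 18446744073709551603

# Reinterpret the unsigned 64-bit value as a signed (two's-complement) integer.
signed = reported - 2**64 if reported >= 2**63 else reported

print(signed)  # -13: the object counter underflowed past zero
```

So the "impossible" count is really just a slightly negative stat stored in an unsigned field, which then looks enormous to anything (like the resharder) that reads it at face value.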
But I haven't seen the resharder act on these stats; that seems like a particularly bad case :(

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Nov 22, 2019 at 5:51 PM David Monschein <monsch...@gmail.com> wrote:
>
> Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.
>
> We are running into what appears to be a serious bug affecting our fairly
> new object storage cluster. While investigating some performance issues --
> abnormally high IOPS and extremely slow bucket stat listings (over 3
> minutes) -- we noticed some dynamic bucket resharding jobs running.
> Strangely enough, they were resharding buckets that had very few objects.
> Even more worrying was the number of new shards Ceph was planning: 65521.
>
> [root@os1 ~]# radosgw-admin reshard list
> [
>     {
>         "time": "2019-11-22 00:12:40.192886Z",
>         "tenant": "",
>         "bucket_name": "redacted",
>         "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
>         "new_instance_id": "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
>         "old_num_shards": 1,
>         "new_num_shards": 65521
>     }
> ]
>
> Upon further inspection we noticed a seemingly impossible number of
> objects (18446744073709551603) in rgw.none for the same bucket:
>
> [root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
> {
>     "bucket": "redacted",
>     "tenant": "",
>     "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
>     "placement_rule": "default-placement",
>     "explicit_placement": {
>         "data_pool": "",
>         "data_extra_pool": "",
>         "index_pool": ""
>     },
>     "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
>     "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
>     "index_type": "Normal",
>     "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
>     "ver": "0#12623",
>     "master_ver": "0#0",
>     "mtime": "2019-11-22 00:18:41.753188Z",
>     "max_marker": "0#",
>     "usage": {
>         "rgw.none": {
>             "size": 0,
>             "size_actual": 0,
>             "size_utilized": 0,
>             "size_kb": 0,
>             "size_kb_actual": 0,
>             "size_kb_utilized": 0,
>             "num_objects": 18446744073709551603
>         },
>         "rgw.main": {
>             "size": 63410030,
>             "size_actual": 63516672,
>             "size_utilized": 63410030,
>             "size_kb": 61924,
>             "size_kb_actual": 62028,
>             "size_kb_utilized": 61924,
>             "num_objects": 27
>         },
>         "rgw.multimeta": {
>             "size": 0,
>             "size_actual": 0,
>             "size_utilized": 0,
>             "size_kb": 0,
>             "size_kb_actual": 0,
>             "size_kb_utilized": 0,
>             "num_objects": 0
>         }
>     },
>     "bucket_quota": {
>         "enabled": false,
>         "check_on_raw": false,
>         "max_size": -1,
>         "max_size_kb": 0,
>         "max_objects": -1
>     }
> }
>
> It would seem that this unreal object count in rgw.none is driving the
> resharding process, making Ceph want to reshard the bucket into 65521
> shards. I am assuming 65521 is the limit.
>
> I have found only a couple of references to this issue, none of which had
> a resolution or much of a conversation around them:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
> https://tracker.ceph.com/issues/37942
>
> For now we are cancelling these resharding jobs, since they seem to be
> causing performance issues with the cluster, but that is an untenable
> workaround. Does anyone know what is causing this, or how to prevent or
> fix it?
>
> Thanks,
> Dave Monschein
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com