Hi all. We are running an object storage cluster on Ceph Nautilus 14.2.4.

We are running into what appears to be a serious bug affecting our fairly
new object storage cluster. While investigating some performance issues
(abnormally high IOPS and extremely slow bucket stat listings, taking over
3 minutes) we noticed some dynamic bucket resharding jobs running.
Strangely, they were resharding buckets that held very few objects.
Even more worrying was the number of new shards Ceph was planning: 65521.

[root@os1 ~]# radosgw-admin reshard list
[
    {
        "time": "2019-11-22 00:12:40.192886Z",
        "tenant": "",
        "bucket_name": "redacted",
        "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
        "new_instance_id": "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
        "old_num_shards": 1,
        "new_num_shards": 65521
    }
]

Upon further inspection we noticed a seemingly impossible number of objects
(18446744073709551603) in rgw.none for the same bucket:
[root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
{
    "bucket": "redacted",
    "tenant": "",
    "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "index_type": "Normal",
    "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
    "ver": "0#12623",
    "master_ver": "0#0",
    "mtime": "2019-11-22 00:18:41.753188Z",
    "max_marker": "0#",
    "usage": {
        "rgw.none": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 18446744073709551603
        },
        "rgw.main": {
            "size": 63410030,
            "size_actual": 63516672,
            "size_utilized": 63410030,
            "size_kb": 61924,
            "size_kb_actual": 62028,
            "size_kb_utilized": 61924,
            "num_objects": 27
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}

It would seem that the impossibly large object count in rgw.none is driving
the resharding process, making Ceph reshard the bucket into 65521 shards. I
am assuming 65521 is the upper limit on the shard count.
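
One possible reading of that number, offered as a guess rather than a
diagnosis: 18446744073709551603 equals 2^64 - 13, which is what an unsigned
64-bit counter would show after being decremented 13 times past zero. A
minimal Python sketch of the arithmetic (the helper name is made up for
illustration and is not part of Ceph):

```python
# Reinterpret the reported rgw.none count as a signed 64-bit integer.
# Assumption: the counter is stored as an unsigned 64-bit value.
UINT64_MODULUS = 2 ** 64
SIGNED_64_LIMIT = 2 ** 63

reported = 18446744073709551603  # "num_objects" under rgw.none


def as_signed_64(value: int) -> int:
    """Return the two's-complement signed reading of an unsigned 64-bit value."""
    return value - UINT64_MODULUS if value >= SIGNED_64_LIMIT else value


print(as_signed_64(reported))  # -13
```

If that reading is right, the real question is why the rgw.none count was
decremented below zero in the first place.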

I have seen only a couple of references to this issue, none of which had a
resolution or much of a conversation around them:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
https://tracker.ceph.com/issues/37942

For now we are cancelling these resharding jobs since they seem to be
causing performance issues with the cluster, but this is not a sustainable
workaround. Does anyone know what is causing this, or how to prevent or
fix it?
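
In the meantime, affected buckets can at least be spotted by scanning the
stats JSON for counts that no real bucket could reach. A rough sketch,
assuming the output of `radosgw-admin bucket stats --bucket=<name>` has the
shape shown above; the function name and the threshold are illustrative
choices, not anything defined by Ceph:

```python
import json

# Assumption: any count above the signed 64-bit maximum is an underflowed
# counter rather than a real object count.
SIGNED_64_MAX = 2 ** 63 - 1


def suspicious_categories(stats: dict) -> dict:
    """Return {category: num_objects} for usage entries that look underflowed."""
    usage = stats.get("usage", {})
    return {
        category: entry["num_objects"]
        for category, entry in usage.items()
        if entry.get("num_objects", 0) > SIGNED_64_MAX
    }


# Trimmed copy of the bucket stats output from this post.
sample = json.loads("""
{
    "bucket": "redacted",
    "usage": {
        "rgw.none": {"size": 0, "num_objects": 18446744073709551603},
        "rgw.main": {"size": 63410030, "num_objects": 27},
        "rgw.multimeta": {"size": 0, "num_objects": 0}
    }
}
""")

print(suspicious_categories(sample))
```

This only detects the symptom. It does not repair the index or stop new
reshard jobs from being queued.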

Thanks,
Dave Monschein
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
