Re: [ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread Paul Emmerich
On Fri, Nov 22, 2019 at 9:09 PM J. Eric Ivancich wrote:

> 2^64 (2 to the 64th power) is 18446744073709551616, which is 13 greater
> than your value of 18446744073709551603. So this likely represents the
> value of -13, but displayed in an unsigned format.

I've seen this with values between -2 and -10; see
https://tracker.ceph.com/issues/37942


Paul



Re: [ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread J. Eric Ivancich
On 11/22/19 11:50 AM, David Monschein wrote:
> Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.
> 
> [...]
> 
> Upon further inspection we noticed a seemingly impossible number of
> objects (18446744073709551603) in rgw.none for the same bucket:
> 
> [...]
> 
> For now we are cancelling these resharding jobs since they seem to be
> causing performance issues with the cluster, but this is an untenable
> solution. Does anyone know what is causing this? Or how to prevent
> it/fix it?


2^64 (2 to the 64th power) is 18446744073709551616, which is 13 greater
than your value of 18446744073709551603. So this likely represents the
value of -13, but displayed in an unsigned format.
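
For illustration, here is a minimal standalone C++ sketch (not Ceph
code) showing how a uint64_t counter that is decremented 13 more times
than it is incremented wraps around to exactly that value:

// Standalone illustration, not Ceph code: decrementing an unsigned
// 64-bit counter below zero wraps modulo 2^64, so a logical -13 prints
// as 2^64 - 13 = 18446744073709551603.
#include <cstdint>
#include <iostream>

int main() {
    uint64_t num_objects = 0;          // stats counter starts at zero
    for (int i = 0; i < 13; ++i) {
        --num_objects;                 // 13 removals never matched by additions
    }
    std::cout << num_objects << "\n";  // prints 18446744073709551603
}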

Obviously it should not calculate a value of -13. I'm guessing it's a
bug: when bucket index entries that are categorized as rgw.none are
found, we're not adding them to the stats, but when they're removed they
are being subtracted from the stats.

Interestingly resharding recalculates these, so you'll likely have a
much smaller value when you're done.

It seems the operations that result in rgw.none bucket index entries are
cancelled operations and removals.
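
To make the suspected asymmetry concrete, here is a minimal hypothetical
sketch (not the actual cls_rgw code; the function names and category
string are made up for illustration) of accounting that skips the
rgw.none category on add but not on remove:

// Hypothetical sketch of asymmetric stats accounting, not actual Ceph code.
// If entries in the "rgw.none" category are skipped when added but not when
// removed, the per-category num_objects goes negative and, being unsigned,
// wraps around.
#include <cstdint>
#include <map>
#include <string>

struct CategoryStats {
    uint64_t num_objects = 0;
};

std::map<std::string, CategoryStats> stats;

void on_index_entry_added(const std::string& category) {
    if (category == "rgw.none") {
        return;  // additions in this category are not counted...
    }
    ++stats[category].num_objects;
}

void on_index_entry_removed(const std::string& category) {
    --stats[category].num_objects;  // ...but removals are always subtracted
}

int main() {
    // A cancelled operation or removal whose entry fell into rgw.none:
    on_index_entry_added("rgw.none");
    on_index_entry_removed("rgw.none");
    // stats["rgw.none"].num_objects is now 18446744073709551615, i.e. -1.
}

Every operation that hits the skipped branch on only one side pushes the
counter one step further below zero.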

We're currently looking at how best to deal with rgw.none stats here:

https://github.com/ceph/ceph/pull/29062

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA



Re: [ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread Paul Emmerich
I originally reported the linked issue. I've seen this problem with
negative stats on several S3 setups, but I could never figure out how
to reproduce it.

But I haven't seen the resharder act on these stats; that seems like a
particularly bad case :(


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 22, 2019 at 5:51 PM David Monschein wrote:
> Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.
>
> [...]


[ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread David Monschein
Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.

We are running into what appears to be a serious bug that is affecting our
fairly new object storage cluster. While investigating some performance
issues -- seeing abnormally high IOPS, extremely slow bucket stat listings
(over 3 minutes) -- we noticed some dynamic bucket resharding jobs running.
Strangely enough they were resharding buckets that had very few objects.
Even more worrying was the number of new shards Ceph was planning: 65521

[root@os1 ~]# radosgw-admin reshard list
[
    {
        "time": "2019-11-22 00:12:40.192886Z",
        "tenant": "",
        "bucket_name": "redacteed",
        "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
        "new_instance_id": "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
        "old_num_shards": 1,
        "new_num_shards": 65521
    }
]

Upon further inspection we noticed a seemingly impossible number of objects
(18446744073709551603) in rgw.none for the same bucket:
[root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
{
    "bucket": "redacted",
    "tenant": "",
    "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "index_type": "Normal",
    "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
    "ver": "0#12623",
    "master_ver": "0#0",
    "mtime": "2019-11-22 00:18:41.753188Z",
    "max_marker": "0#",
    "usage": {
        "rgw.none": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 18446744073709551603
        },
        "rgw.main": {
            "size": 63410030,
            "size_actual": 63516672,
            "size_utilized": 63410030,
            "size_kb": 61924,
            "size_kb_actual": 62028,
            "size_kb_utilized": 61924,
            "num_objects": 27
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}

It would seem that the unreal number of objects in rgw.none is driving the
resharding process, making Ceph reshard the bucket into 65521 shards. I am
assuming 65521 is the limit.
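
To see how a wrapped counter could drive the shard count straight to
65521, here is a rough illustrative calculation in C++ (an
assumption-laden sketch, not the actual Ceph formula: I'm assuming a
per-shard object threshold of 100000, matching the rgw_max_objs_per_shard
default, and treating 65521 as the hard cap suggested by the
"new_num_shards" value above):

// Illustrative sketch only, not the exact Ceph resharding formula.
// With num_objects wrapped to ~1.8e19, any per-shard threshold yields an
// astronomical shard count, so the suggestion saturates at the cap.
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
    const uint64_t max_objs_per_shard = 100000;  // assumed threshold (rgw_max_objs_per_shard default)
    const uint64_t max_shards = 65521;           // assumed hard cap, as seen in "new_num_shards"

    const uint64_t num_objects = 18446744073709551603ULL;  // wrapped rgw.none counter

    uint64_t wanted = num_objects / max_objs_per_shard + 1;  // shards "needed" for this many objects
    uint64_t new_shards = std::min(wanted, max_shards);

    std::cout << new_shards << "\n";             // prints 65521
}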

I have seen only a couple of references to this issue, none of which had a
resolution or much of a conversation around them:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
https://tracker.ceph.com/issues/37942

For now we are cancelling these resharding jobs since they seem to be
causing performance issues with the cluster, but this is an untenable
solution. Does anyone know what is causing this? Or how to prevent it/fix
it?

Thanks,
Dave Monschein