[ceph-users] Unable to delete bucket - endless multipart uploads?

2021-02-23 Thread David Monschein
Hi All,

We've been dealing with what seems to be a pretty annoying bug for a while
now. We are unable to delete a customer's bucket that seems to have an
extremely large number of aborted multipart uploads. I've had $(radosgw-admin
bucket rm --bucket=pusulax --purge-objects) running in a screen session for
almost 3 weeks now and it's still not finished; it's most likely stuck in a
loop or something. The screen session with debug-rgw=10 spams billions of
these messages:

2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/d3/04d33e18-3f13-433c-b924-56602d702d60-31.msg.2~0DTalUjTHsnIiKraN1klwIFO88Vc2E3.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/d7/04d7ad26-c8ec-4a39-9938-329acd6d9da7-102.msg.2~K_gAeTpfEongNvaOMNa0IFwSGPpQ1iA.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/da/04da4147-c949-4c3a-aca6-e63298f5ff62-102.msg.2~-hXBSFcjQKbMkiyEqSgLaXMm75qFzEp.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/db/04dbb0e6-dfb0-42fb-9d0f-49cceb18457f-102.msg.2~B5EhGgBU5U_U7EA5r8IhVpO3Aj2OvKg.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/df/04df39be-06ab-4c72-bc63-3fac1d2700a9-11.msg.2~_8h5fWlkNrIMqcrZgNbAoJfc8BN1Xx-.meta[]
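
For reference, a rough way to count how many of these leftover multipart
entries the bucket index still holds (the grep pattern assumes the usual
bi list JSON output, and the listing itself can take a long time on a
bucket this size):

# count leftover multipart meta entries across the bucket index
radosgw-admin bi list --bucket=pusulax | grep -c '"idx": "_multipart_'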

This is probably the 2nd or 3rd time I've been unable to delete this
bucket. I also tried running $(radosgw-admin bucket check --fix
--check-objects --bucket=pusulax) before kicking off the delete job, but
that didn't work either. Here is the bucket in question; its num_objects
counter never decreases while the delete runs:

[root@os5 ~]# radosgw-admin bucket stats --bucket=pusulax
{
"bucket": "pusulax",
"num_shards": 144,
"tenant": "",
"zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3209338.4",
"marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3292800.7",
"index_type": "Normal",
"owner": "REDACTED",
"ver":
"0#115613,1#115196,2#115884,3#115497,4#114649,5#114150,6#116127,7#114269,8#115220,9#115092,10#114003,11#114538,12#115235,13#113463,14#114928,15#115135,16#115535,17#114867,18#116010,19#115766,20#115274,21#114818,22#114805,23#114853,24#114099,25#114359,26#114966,27#115790,28#114572,29#114826,30#114767,31#115614,32#113995,33#115305,34#114227,35#114342,36#114144,37#114704,38#114088,39#114738,40#114133,41#114520,42#114420,43#114168,44#113820,45#115093,46#114788,47#115522,48#114713,49#115315,50#115055,51#114513,52#114086,53#114401,54#114079,55#113649,56#114089,57#114157,58#114064,59#115224,60#114753,61#114686,62#115169,63#114321,64#114949,65#115075,66#115003,67#114993,68#115320,69#114392,70#114893,71#114219,72#114190,73#114868,74#113432,75#114882,76#115300,77#114755,78#114598,79#114221,80#114895,81#114031,82#114566,83#113849,84#115155,85#113790,86#113334,87#113800,88#114856,89#114841,90#115073,91#113849,92#114554,93#114820,94#114256,95#113840,96#114838,97#113784,98#114876,99#115524,100#115
686,101#112969,102#112156,103#112635,104#112732,105#112933,106#112412,107#113090,108#112239,109#112697,110#113444,111#111730,112#112446,113#114479,114#113318,115#113032,116#112048,117#112404,118#114545,119#112563,120#112341,121#112518,122#111719,123#112273,124#112014,125#112979,126#112209,127#112830,128#113186,129#112944,130#111991,131#112865,132#112688,133#113819,134#112586,135#113275,136#112172,137#113019,138#112872,139#113130,140#112716,141#112091,142#111859,143#112773",
"master_ver":
"0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0,32#0,33#0,34#0,35#0,36#0,37#0,38#0,39#0,40#0,41#0,42#0,43#0,44#0,45#0,46#0,47#0,48#0,49#0,50#0,51#0,52#0,53#0,54#0,55#0,56#0,57#0,58#0,59#0,60#0,61#0,62#0,63#0,64#0,65#0,66#0,67#0,68#0,69#0,70#0,71#0,72#0,73#0,74#0,75#0,76#0,77#0,78#0,79#0,80#0,81#0,82#0,83#0,84#0,85#0,86#0,87#0,88#0,89#0,90#0,91#0,92#0,93#0,94#0,95#0,96#0,97#0,98#0,99#0,100#0,101#0,102#0,103#0,104#0,105#0,106#0,107#0,108#0,109#0,110#0,111#0,112#0,113#0,114#0,115#0,116#0,117#0,118#0,119#0,120#0,121#0,122#0,123#0,124#0,125#0,126#0,127#0,128#0,129#0,130#0,131#0,132#0,133#0,134#0,135#0,136#0,137#0,138#0,139#0,140#0,141#0,142#0,143#0",
"mtime": "2020-06-17 20:27:16.685833Z",
"max_marker":

[ceph-users] Re: [RGW] Space usage vastly overestimated since Octopus upgrade

2020-07-15 Thread David Monschein
Hi Liam, All,

We have also run into this bug:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PCYY2MKRPCPIXZLZV5NNBWVHDXKWXVAG/

Like you, we are running Octopus 15.2.3.

Downgrading the RGWs at this point is not ideal, but if a fix isn't found
soon we might have to.
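
For anyone trying to quantify the discrepancy on their own cluster, a rough
cross-check is to compare what the bucket stats claim against what the data
pool actually reports (a sketch, assuming the default data pool name):

# what RGW believes the bucket uses
radosgw-admin bucket stats --bucket=<bucket> | grep -A5 '"rgw.main"'

# what the data pool reports cluster-wide
rados df | grep default.rgw.buckets.data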

Has a bug report been filed for this yet?

- Dave


[ceph-users] User stats - Object count wrong in Octopus?

2020-07-14 Thread David Monschein
Hi All,

Sorry for the double email; I accidentally sent the previous e-mail with a
stray keyboard shortcut before it was finished :)

I'm investigating what appears to be a bug in RGW stats. This is a brand
new cluster running Octopus 15.2.3.

One of our customers reached out, saying they were hitting their quota (S3
error: 403 (QuotaExceeded)). The user-wide max_objects quota we set is 50
million objects, so hitting it should be impossible; the entire cluster isn't
even close to 50 million objects yet:

[root@os1 ~]# ceph status | grep objects
objects: 7.58M objects, 6.8 TiB

The customer in question has three buckets, and if I query the bucket
stats, the total number of objects for all three comes to roughly 372k:

[root@os1 ~]# radosgw-admin bucket stats --bucket=df-fs1 | grep num_objects
"num_objects": 324880
[root@os1 ~]# radosgw-admin bucket stats --bucket=df-oldrepo | grep
num_objects
"num_objects": 47476
[root@os1 ~]# radosgw-admin bucket stats --bucket=df-test | grep num_objects
"num_objects": 1

But things get interesting when I query the user stats:
[root@os1 ~]# radosgw-admin user stats --uid=user-in-question | grep
num_objects
"num_objects": 52543794

How is Ceph arriving at 52+ million objects?
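
For what it's worth, the only knob I know of here is forcing the user stats
to be recomputed from the bucket indexes; I can't say yet whether it clears
this particular discrepancy:

# recompute the user's stats from the per-bucket stats, then re-check num_objects
radosgw-admin user stats --uid=user-in-question --sync-stats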

Here is the full output from the bucket stats and user stats if it's any
help. Thanks for any assistance.

[root@os1 ~]# radosgw-admin bucket stats --bucket=df-fs1
{
"bucket": "df-fs1",
"num_shards": 491,
"tenant": "",
"zonegroup": "a1b72aa0-eb06-4a96-8af4-db39d4dc3e09",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "d8c6ebd1-2bab-414d-9d6b-73bf9bc8fc5a.9736264.34",
"marker": "d8c6ebd1-2bab-414d-9d6b-73bf9bc8fc5a.9736264.21",
"index_type": "Normal",
"owner": "01c235fd-fc9b-4c54-a2a5-b38054608b9e",
"ver":
"0#57,1#58,2#59,3#56,4#63,5#62,6#52,7#61,8#48,9#58,10#56,11#58,12#64,13#45,14#57,15#66,16#66,17#55,18#52,19#60,20#74,21#59,22#57,23#47,24#60,25#52,26#62,27#58,28#60,29#47,30#57,31#45,32#48,33#56,34#64,35#68,36#47,37#49,38#62,39#48,40#64,41#44,42#60,43#60,44#60,45#59,46#65,47#47,48#57,49#57,50#63,51#68,52#61,53#57,54#46,55#59,56#57,57#41,58#59,59#58,60#57,61#53,62#52,63#45,64#55,65#58,66#54,67#43,68#60,69#51,70#62,71#55,72#52,73#55,74#60,75#77,76#59,77#54,78#57,79#65,80#60,81#64,82#62,83#48,84#57,85#51,86#53,87#61,88#62,89#63,90#59,91#53,92#55,93#59,94#50,95#51,96#59,97#58,98#55,99#44,100#47,101#48,102#61,103#47,104#54,105#57,106#50,107#72,108#66,109#53,110#48,111#49,112#51,113#69,114#61,115#59,116#56,117#52,118#61,119#64,120#60,121#56,122#62,123#46,124#50,125#57,126#59,127#58,128#62,129#62,130#54,131#47,132#79,133#52,134#53,135#54,136#52,137#40,138#45,139#54,140#43,141#55,142#56,143#70,144#52,145#58,146#55,147#46,148#54,149#64,150#52,151#56,152#55,153#70,154#45,155#66,156#65,157#58,1
58#61,159#58,160#57,161#62,162#49,163#76,164#70,165#59,166#68,167#64,168#51,169#68,170#62,171#54,172#51,173#64,174#47,175#59,176#53,177#51,178#44,179#51,180#71,181#60,182#57,183#65,184#68,185#58,186#63,187#51,188#65,189#66,190#47,191#63,192#55,193#56,194#59,195#59,196#53,197#38,198#53,199#47,200#64,201#59,202#56,203#51,204#70,205#50,206#55,207#63,208#42,209#44,210#56,211#50,212#56,213#55,214#56,215#55,216#51,217#57,218#58,219#60,220#47,221#49,222#51,223#42,224#52,225#54,226#56,227#58,228#57,229#53,230#55,231#53,232#47,233#59,234#59,235#58,236#50,237#51,238#54,239#51,240#60,241#53,242#65,243#63,244#60,245#57,246#41,247#64,248#54,249#48,250#49,251#60,252#53,253#53,254#52,255#57,256#62,257#74,258#62,259#57,260#42,261#63,262#52,263#55,264#63,265#47,266#51,267#61,268#49,269#62,270#61,271#63,272#45,273#62,274#53,275#53,276#58,277#58,278#67,279#58,280#59,281#60,282#52,283#43,284#66,285#63,286#50,287#65,288#62,289#55,290#60,291#53,292#50,293#61,294#67,295#69,296#50,297#59,298#50,299#61,300#
55,301#61,302#61,303#59,304#51,305#57,306#50,307#60,308#66,309#64,310#66,311#59,312#56,313#51,314#41,315#53,316#54,317#63,318#59,319#54,320#47,321#62,322#57,323#53,324#55,325#54,326#58,327#59,328#42,329#55,330#57,331#53,332#56,333#63,334#55,335#56,336#56,337#52,338#55,339#48,340#50,341#57,342#58,343#47,344#65,345#44,346#70,347#63,348#48,349#59,350#48,351#58,352#57,353#49,354#55,355#55,356#63,357#63,358#59,359#52,360#55,361#60,362#60,363#63,364#62,365#63,366#58,367#68,368#58,369#61,370#64,371#49,372#60,373#62,374#55,375#55,376#52,377#53,378#65,379#58,380#69,381#54,382#53,383#65,384#68,385#61,386#36,387#60,388#55,389#53,390#58,391#53,392#65,393#49,394#52,395#56,396#62,397#50,398#72,399#56,400#52,401#54,402#54,403#63,404#58,405#62,406#56,407#49,408#54,409#61,410#65,411#48,412#60,413#57,414#49,415#61,416#54,417#62,418#53,419#53,420#52,421#51,422#63,423#59,424#48,425#55,426#66,427#56,428#44,429#54,430#61,431#52,432#43,433#44,434#56,435#51,436#60,437#49,438#57,439#54,440#59,441#56,442#54,

[ceph-users] User stats - Object count wrong in Octopus?

2020-07-14 Thread David Monschein
Hi All,

I'm investigating what appears to be a bug in RGW stats. This is a brand
new cluster running Octopus 15.2.3.

One of our customers reached out, saying they were hitting their quota (S3
error: 403 (QuotaExceeded)). The user-wide max_objects quota we set is 50
million objects, so hitting it should be impossible; the entire cluster isn't
even close to 50 million objects yet:

[root@os1 ~]# ceph status | grep objects
objects: 7.58M objects, 6.8 TiB

The customer in question has three buckets, and if I query the bucket
stats, the total number of objects for all three comes to roughly 372k:

[root@os1 ~]# radosgw-admin bucket stats --bucket=df-fs1 | grep num_objects
"num_objects": 324880
[root@os1 ~]# radosgw-admin bucket stats --bucket=df-oldrepo | grep
num_objects
"num_objects": 47476
[root@os1 ~]# radosgw-admin bucket stats --bucket=df-test | grep num_objects
"num_objects": 1

But things get interesting when I query the user stats:


[ceph-users] Re: Bogus Entries in RGW Usage Log / Large omap object in rgw.log pool

2019-10-29 Thread David Monschein
Florian,

Thank you for your detailed reply. I was right in thinking that the 223k+
usage log entries were causing my large omap object warning. You've also
confirmed my suspicions that osd_deep_scrub_large_omap_object_key_threshold
was changed between Ceph versions. I ended up trimming all of the usage
logs before 2019-10-01. First I exported the log -- it was 140MB!

radosgw-admin usage trim --start-date=2018-10-01 --end-date=2019-10-01

Interestingly enough, trimming all logs with --start-date & --end-date only
took maybe 10 seconds, but when I tried to trim the usage for only a single
user/bucket, it took over 30 minutes. Either way, after trimming the log
down considerably, I manually issued a deep scrub on pgid 5.70, after which
the Ceph health returned to HEALTH_OK.
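
For reference, the whole thing boiled down to roughly this sequence (the
export filename is just an example):

# keep a copy of the usage log before trimming anything
radosgw-admin usage show > usage-log-backup.json

# trim everything older than 2019-10-01
radosgw-admin usage trim --start-date=2018-10-01 --end-date=2019-10-01

# re-run the deep scrub on the PG that held the large omap object
ceph pg deep-scrub 5.70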

I hope this can serve as a guide for anyone else who runs into this problem
:)

Thanks again,
Dave

On Tue, Oct 29, 2019 at 3:22 AM Florian Haas wrote:

> Hi David,
>
> On 28/10/2019 20:44, David Monschein wrote:
> > Hi All,
> >
> > Running an object storage cluster, originally deployed with Nautilus
> > 14.2.1 and now running 14.2.4.
> >
> > Last week I was alerted to a new warning from my object storage cluster:
> >
> > [root@ceph1 ~]# ceph health detail
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool 'default.rgw.log'
> > Search the cluster log for 'Large omap object found' for more
> details.
> >
> > I looked into this and found the object and pool in question
> > (default.rgw.log):
> >
> > [root@ceph1 /var/log/ceph]# grep -R -i 'Large omap object found' .
> > ./ceph.log:2019-10-24 12:21:26.984802 osd.194 (osd.194) 715 : cluster
> > [WRN] Large omap object found. Object: 5:0fbdcb32:usage::usage.17:head
> > Key count: 702330 Size (bytes): 92881228
> >
> > [root@ceph1 ~]# ceph --format=json pg ls-by-pool default.rgw.log | jq
> '.[]' | egrep '(pgid|num_large_omap_objects)' | grep -v
> '"num_large_omap_objects": 0,' | grep -B1 num_large_omap_objects
> > "pgid": "5.70",
> >   "num_large_omap_objects": 1,
> > While I was investigating, I noticed an enormous number of entries in
> > the RGW usage log:
> >
> > [root@ceph ~]# radosgw-admin usage show | grep -c bucket
> > 223326
> > [...]
>
> I recently ran into a similar issue:
>
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AQNGVY7VJ3K6ZGRSTX3E5XIY7DBNPDHW/
>
> You have 702,330 keys on that omap object, so you would have been bitten
> by the default for osd_deep_scrub_large_omap_object_key_threshold having
> been revised down from 2,000,000 to 200,000 in 14.2.3:
>
>
> https://github.com/ceph/ceph/commit/d8180c57ac9083f414a23fd393497b2784377735
> https://tracker.ceph.com/issues/40583
>
> That's why you didn't see this warning before your recent upgrade.
>
> > There are entries for over 223k buckets! This was pretty scary to see,
> > considering we only have maybe 500 legitimate buckets in this fairly new
> > cluster. Almost all of the entries in the usage log are bogus entries
> > from anonymous users. It looks like someone/something was scanning,
> > looking for vulnerabilities, etc. Here are a few example entries; notice that
> > none of the operations were successful:
>
> Caveat: whether or not you really *want* to trim the usage log is up to
> you to decide. If you are suspecting you are dealing with a security
> breach, you should definitely export and preserve the usage log before
> you trim it, or else delay trimming until you have properly investigated
> your problem.
>
> *If* you decide you no longer need those usage log entries, you can use
> "radosgw-admin usage trim" with appropriate --start-date, --end-date,
> and/or --uid options, to clean them up:
>
> https://docs.ceph.com/docs/nautilus/radosgw/admin/#trim-usage
>
> Please let me know if that information is helpful. Thank you!
>
> Cheers,
> Florian
>


[ceph-users] Bogus Entries in RGW Usage Log / Large omap object in rgw.log pool

2019-10-28 Thread David Monschein
Hi All,

Running an object storage cluster, originally deployed with Nautilus 14.2.1
and now running 14.2.4.

Last week I was alerted to a new warning from my object storage cluster:

[root@ceph1 ~]# ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'default.rgw.log'
Search the cluster log for 'Large omap object found' for more details.

I looked into this and found the object and pool in question
(default.rgw.log):

[root@ceph1 /var/log/ceph]# grep -R -i 'Large omap object found' .
./ceph.log:2019-10-24 12:21:26.984802 osd.194 (osd.194) 715 : cluster [WRN]
Large omap object found. Object: 5:0fbdcb32:usage::usage.17:head Key count:
702330 Size (bytes): 92881228
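
Going by the object name in that log line, the object sits in the 'usage'
namespace of the default.rgw.log pool, so the key count can also be checked
directly (a sketch; adjust the namespace if your naming differs):

# count the omap keys on the usage-log object flagged by the deep scrub
rados -p default.rgw.log --namespace usage listomapkeys usage.17 | wc -l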

[root@ceph1 ~]# ceph --format=json pg ls-by-pool default.rgw.log | jq
'.[]' | egrep '(pgid|num_large_omap_objects)' | grep -v
'"num_large_omap_objects": 0,' | grep -B1 num_large_omap_objects
"pgid": "5.70",
  "num_large_omap_objects": 1,

While I was investigating, I noticed an enormous number of entries in the
RGW usage log:

[root@ceph ~]# radosgw-admin usage show | grep -c bucket
223326

There are entries for over 223k buckets! This was pretty scary to see,
considering we only have maybe 500 legitimate buckets in this fairly new
cluster. Almost all of the entries in the usage log are bogus entries from
anonymous users. It looks like someone/something was scanning, looking for
vulnerabilities, etc. Here are a few example entries; notice that none of the
operations were successful:
<-SNIP->
{
    "bucket": "pk1914.php",
    "time": "2019-07-26 21:00:00.00Z",
    "epoch": 1564174800,
    "owner": "anonymous",
    "categories": [
        {
            "category": "post_obj",
            "bytes_sent": 586,
            "bytes_received": 0,
            "ops": 2,
            "successful_ops": 0
        }
    ]
},
{
    "bucket": "plus",
    "time": "2019-07-26 21:00:00.00Z",
    "epoch": 1564174800,
    "owner": "anonymous",
    "categories": [
        {
            "category": "post_obj",
            "bytes_sent": 6314,
            "bytes_received": 0,
            "ops": 22,
            "successful_ops": 0
        }
    ]
},
{
    "bucket": "pma.php",
    "time": "2019-07-26 21:00:00.00Z",
    "epoch": 1564174800,
    "owner": "anonymous",
    "categories": [
        {
            "category": "post_obj",
            "bytes_sent": 580,
            "bytes_received": 0,
            "ops": 2,
            "successful_ops": 0
        }
    ]
<-SNIP->

I suspect that the large omap warning from Ceph is related to the 223k+
entries in the RGW usage log. I have two questions:
1) How can I remove these bogus bucket entries from the RGW usage log? When
I issue $(radosgw-admin usage trim), it only resets the stats to 0, but
does not actually remove the bogus bucket entries.
2) Is it possible to prevent Ceph from logging usage statistics for buckets
that do not exist?
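
On (2), the only related setting I have found is the global one, which is not
really what I want since it disables usage logging entirely rather than
filtering out non-existent buckets (ceph.conf sketch, the section name is a
placeholder):

[client.rgw.<instance>]
# turns off the RGW usage log altogether
rgw enable usage log = false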

Some more output if it's useful:
ceph --format=json pg ls-by-pool default.rgw.log | jq '.[]'

<-SNIP->
{
  "pgid": "5.70",
  "version": "4712'5226850",
  "reported_seq": "5636679",
  "reported_epoch": "4712",
  "state": "active+clean",
  "last_fresh": "2019-10-24 14:40:59.287019",
  "last_change": "2019-10-24 12:21:26.984997",
  "last_active": "2019-10-24 14:40:59.287019",
  "last_peered": "2019-10-24 14:40:59.287019",
  "last_clean": "2019-10-24 14:40:59.287019",
  "last_became_active": "2019-10-18 16:02:06.007865",
  "last_became_peered": "2019-10-18 16:02:06.007865",
  "last_unstale": "2019-10-24 14:40:59.287019",
  "last_undegraded": "2019-10-24 14:40:59.287019",
  "last_fullsized": "2019-10-24 14:40:59.287019",
  "mapping_epoch": 4672,
  "log_start": "4712'5223781",
  "ondisk_log_start": "4712'5223781",
  "created": 1105,
  "last_epoch_clean": 4673,
  "parent": "0.0",
  "parent_split_bits": 0,
  "last_scrub": "4712'5223077",
  "last_scrub_stamp": "2019-10-24 12:21:26.984947",
  "last_deep_scrub": "4712'5223077",
  "last_deep_scrub_stamp": "2019-10-24 12:21:26.984947",
  "last_clean_scrub_stamp": "2019-10-24 12:21:26.984947",
  "log_size": 3069,
  "ondisk_log_size": 3069,