Re: [ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread J. Eric Ivancich
On 11/22/19 11:50 AM, David Monschein wrote:
> Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.
> 
> We are running into what appears to be a serious bug that is affecting
> our fairly new object storage cluster. While investigating some
> performance issues -- seeing abnormally high IOPS, extremely slow bucket
> stat listings (over 3 minutes) -- we noticed some dynamic bucket
> resharding jobs running. Strangely enough they were resharding buckets
> that had very few objects. Even more worrying was the number of new
> shards Ceph was planning: 65521
> 
> [root@os1 ~]# radosgw-admin reshard list
> [
>     {
>         "time": "2019-11-22 00:12:40.192886Z",
>         "tenant": "",
>         "bucket_name": "redacted",
>         "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
>         "new_instance_id":
> "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
>         "old_num_shards": 1,
>         "new_num_shards": 65521
>     }
> ]
> 
> Upon further inspection we noticed a seemingly impossible number of
> objects (18446744073709551603) in rgw.none for the same bucket:
> [root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
> {
>     "bucket": "redacted",
>     "tenant": "",
>     "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
>     "placement_rule": "default-placement",
>     "explicit_placement": {
>         "data_pool": "",
>         "data_extra_pool": "",
>         "index_pool": ""
>     },
>     "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
>     "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
>     "index_type": "Normal",
>     "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
>     "ver": "0#12623",
>     "master_ver": "0#0",
>     "mtime": "2019-11-22 00:18:41.753188Z",
>     "max_marker": "0#",
>     "usage": {
>         "rgw.none": {
>             "size": 0,
>             "size_actual": 0,
>             "size_utilized": 0,
>             "size_kb": 0,
>             "size_kb_actual": 0,
>             "size_kb_utilized": 0,
>             "num_objects": 18446744073709551603
>         },
>         "rgw.main": {
>             "size": 63410030,
>             "size_actual": 63516672,
>             "size_utilized": 63410030,
>             "size_kb": 61924,
>             "size_kb_actual": 62028,
>             "size_kb_utilized": 61924,
>             "num_objects": 27
>         },
>         "rgw.multimeta": {
>             "size": 0,
>             "size_actual": 0,
>             "size_utilized": 0,
>             "size_kb": 0,
>             "size_kb_actual": 0,
>             "size_kb_utilized": 0,
>             "num_objects": 0
>         }
>     },
>     "bucket_quota": {
>         "enabled": false,
>         "check_on_raw": false,
>         "max_size": -1,
>         "max_size_kb": 0,
>         "max_objects": -1
>     }
> }
> 
> It would seem that the unreal number of objects in rgw.none is driving
> the resharding process, making ceph reshard the bucket 65521 times. I am
> assuming 65521 is the limit.
> 
> I have seen only a couple of references to this issue, none of which had
> a resolution or much of a conversation around them:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
> https://tracker.ceph.com/issues/37942
> 
> For now we are cancelling these resharding jobs since they seem to be
> causing performance issues with the cluster, but this is an untenable
> solution. Does anyone know what is causing this? Or how to prevent
> it/fix it?


2^64 (2 to the 64th power) is 18446744073709551616, which is 13 greater
than your value of 18446744073709551603. So this likely represents the
value of -13, but displayed in an unsigned format.

Obviously it should not calculate a value of -13. I'm guessing the bug is
that when bucket index entries categorized as rgw.none are created we're
not adding them to the stats, but when they're removed they are being
subtracted from the stats.

Interestingly, resharding recalculates these stats, so you'll likely have a
much smaller value once it completes.
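
If you want to force that recalculation rather than let the dynamic
resharder run with 65521 shards, a rough sketch (bucket name and shard
count are placeholders; please verify the commands against your release
before running them):

radosgw-admin reshard cancel --bucket=<bucket>    # drop the queued 65521-shard job
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=2
radosgw-admin bucket stats --bucket=<bucket>      # rgw.none num_objects should be recomputed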

It seems the operations that result in rgw.none bucket index entries are
cancelled operations and removals.

We're currently looking at how best to deal with rgw.none stats here:

https://github.com/ceph/ceph/pull/29062

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-23 Thread Eric Ivancich
Good morning, Vladimir,

Please create a tracker for this 
(https://tracker.ceph.com/projects/rgw/issues/new) and include the link to it 
in an email reply. And if you can include any more potentially relevant 
details, please do so. I’ll add my initial analysis to it.

But the threads do seem to be stuck, at least for a while, in 
get_obj_data::flush despite a lack of traffic. And sometimes it self-resolves, 
so it’s not a true “infinite loop”.

Thank you,

Eric

> On Aug 22, 2019, at 9:12 PM, Eric Ivancich  wrote:
> 
> Thank you for providing the profiling data, Vladimir. There are 5078 threads 
> and most of them are waiting. Here is a list of the deepest call of each 
> thread with duplicates removed.
> 
> + 100.00% epoll_wait
>   + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
> + 100.00% poll
> + 100.00% poll
>   + 100.00% poll
> + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
>   + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
> + 100.00% pthread_cond_wait@@GLIBC_2.3.2
>   + 100.00% pthread_cond_wait@@GLIBC_2.3.2
>   + 100.00% read
> + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
> 
> The only interesting ones are the second and last:
> 
> * get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
> * _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
> 
> They are essentially part of the same call stack that results from processing 
> a GetObj request, and five threads are in this call stack (the only 
> difference is whether or not they include the call into boost intrusive list). 
> Here’s the full call stack of those threads:
> 
> + 100.00% clone
>   + 100.00% start_thread
> + 100.00% worker_thread
>   + 100.00% process_new_connection
> + 100.00% handle_request
>   + 100.00% RGWCivetWebFrontend::process(mg_connection*)
> + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
>   + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
> + 100.00% RGWGetObj::execute()
>   + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
> + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
>   + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
> + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
>   + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
> + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
> 
> So this isn’t background processing but request processing. I’m not clear why 
> these requests are consuming so much CPU for so long.
> 
> From your initial message:
>> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
>> radosgw process on those machines starts consuming 100% of 5 CPU cores for 
>> days at a time, even though the machine is not being used for data transfers 
>> (nothing in radosgw logs, couple of KB/s of network).
>> 
>> This situation can affect any number of our rados gateways, lasts from few 
>> hours to few days and stops if radosgw process is restarted or on its own.
> 
> 
> I’m going to check with others who’re more familiar with this code path.
> 
>> Begin forwarded message:
>> 
>> From: Vladimir Brik <vladimir.b...@icecube.wisc.edu>
>> Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is 
>> being transferred
>> Date: August 21, 2019 at 4:47:01 PM EDT
>> To: "J. Eric Ivancich" mailto:ivanc...@redhat.com>>, 
>> Mark Nelson mailto:mn

[ceph-users] Fwd: radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-22 Thread Eric Ivancich
Thank you for providing the profiling data, Vladimir. There are 5078 threads 
and most of them are waiting. Here is a list of the deepest call of each thread 
with duplicates removed.

+ 100.00% epoll_wait
  + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
+ 100.00% poll
+ 100.00% poll
  + 100.00% poll
+ 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
  + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
+ 100.00% pthread_cond_wait@@GLIBC_2.3.2
  + 100.00% pthread_cond_wait@@GLIBC_2.3.2
  + 100.00% read
+ 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

The only interesting ones are the second and last:

* get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
* _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

They are essentially part of the same call stack that results from processing a 
GetObj request, and five threads are in this call stack (the only difference is 
whether or not they include the call into boost intrusive list). Here’s the full 
call stack of those threads:

+ 100.00% clone
  + 100.00% start_thread
+ 100.00% worker_thread
  + 100.00% process_new_connection
+ 100.00% handle_request
  + 100.00% RGWCivetWebFrontend::process(mg_connection*)
+ 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
  + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
+ 100.00% RGWGetObj::execute()
  + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
+ 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
  + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
+ 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
  + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
+ 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

So this isn’t background processing but request processing. I’m not clear why 
these requests are consuming so much CPU for so long.

From your initial message:
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
> radosgw process on those machines starts consuming 100% of 5 CPU cores for 
> days at a time, even though the machine is not being used for data transfers 
> (nothing in radosgw logs, couple of KB/s of network).
> 
> This situation can affect any number of our rados gateways, lasts from few 
> hours to few days and stops if radosgw process is restarted or on its own.


I’m going to check with others who’re more familiar with this code path.

> Begin forwarded message:
> 
> From: Vladimir Brik 
> Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is 
> being transferred
> Date: August 21, 2019 at 4:47:01 PM EDT
> To: "J. Eric Ivancich" , Mark Nelson 
> , ceph-users@lists.ceph.com
> 
> > Are you running multisite?
> No
> 
> > Do you have dynamic bucket resharding turned on?
> Yes. "radosgw-admin reshard list" prints "[]"
> 
> > Are you using lifecycle?
> I am not sure. How can I check? "radosgw-admin lc list" says "[]"
> 
> > And just to be clear -- sometimes all 3 of your rados gateways are
> > simultaneously in this state?
> Multiple, but I have not seen all 3 being in this state simultaneously. 
> Currently one gateway has 1 thread using 100% of CPU, and another has 5 
> threads each using 100% CPU.
> 
> Here are the fruits of my attempts to capture the call graph using perf and 
> gdbpmp:
> https://icecube.wisc.edu/~vbrik/perf.data
> https://icecube.wisc.edu/~vbrik/gdbpmp.data
> 
> These are the commands that I ran and their outputs (note I couldn't get perf 
> not to generate the warning):
> rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
> Attaching to process 736

Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread J. Eric Ivancich
On 8/21/19 10:22 AM, Mark Nelson wrote:
> Hi Vladimir,
> 
> 
> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>> Hello
>>

[much elided]

> You might want to try grabbing a a callgraph from perf instead of just
> running perf top or using my wallclock profiler to see if you can drill
> down and find out where in that method it's spending the most time.

I agree with Mark -- a call graph would be very helpful in tracking down
what's happening.

There are background tasks that run. Are you running multisite? Do you
have dynamic bucket resharding turned on? Are you using lifecycle? And
garbage collection is another background task.
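
A quick, non-exhaustive way to check each of those background sources (the
commands below exist in recent releases; adjust for your deployment):

radosgw-admin sync status                      # multisite sync state, if any
radosgw-admin reshard list                     # queued/active dynamic reshard jobs
radosgw-admin lc list                          # lifecycle configurations
radosgw-admin gc list --include-all | head     # pending garbage collection entries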

And just to be clear -- sometimes all 3 of your rados gateways are
simultaneously in this state?

But the call graph would be incredibly helpful.
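
For example, something along these lines against one of the busy radosgw
processes should produce a usable call graph (pid and duration are
placeholders; having debug symbols installed makes the output far more
readable):

perf record -g -p <radosgw-pid> -- sleep 60
perf report --stdio | head -100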

Thank you,

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets [EXT]

2019-08-02 Thread J. Eric Ivancich
A few interleaved responses below

On 8/1/19 10:20 AM, Matthew Vernon wrote:
> Hi,
> 
> On 31/07/2019 19:02, Paul Emmerich wrote:
> 
> Some interesting points here, thanks for raising them :)


> We've had some problems with large buckets (from around the 70Mobject
> mark).
> 
> One you don't mention is that multipart uploads break during resharding
> - so if our users are filling up a bucket with many writers uploading
> multipart objects, some of these will fail (rather than blocking) when
> the bucket is resharded.

Is there a tracker for that already? If not, would you mind adding one?

> We've also seen bucket deletion via radosgw-admin failing because of
> oddities in the bucket itself (e.g. missing shadow objects, omap objects
> that still exist when the related object is gone); sorting that was a
> bit fiddly (with some help from Canonical, who I think are working on
> patches).
> 

There was a recently merged PR that addressed bucket deletion with
missing shadow objects:

https://tracker.ceph.com/issues/40590

Thank you for reporting your experience w/ rgw,

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread Eric Ivancich
Hi Paul,

I’ve turned the following idea of yours into a tracker:

https://tracker.ceph.com/issues/41051

> 4. Common prefixes could be filtered in the rgw class on the OSD instead
> of in radosgw
> 
> Consider a bucket with 100 folders with 1000 objects in each and only one 
> shard
> 
> /p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
> 
> 
> Now a user wants to list / with aggregating common prefixes on the
> delimiter / and
> wants up to 1000 results.
> So there'll be 100 results returned to the client: the common prefixes
> p1 to p100.
> 
> How much data will be transferred between the OSDs and radosgw for this 
> request?
> How many omap entries does the OSD scan?
> 
> radosgw will ask the (single) index object to list the first 1000 objects. 
> It'll
> return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000
> 
> radosgw will discard 999 of these and detect one common prefix and continue 
> the
> iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
> The OSD will then return everything in /p2/ in that next request and so on.
> 
> So it'll internally list every single object in that bucket. That will
> be a problem
> for large buckets and having lots of shards doesn't help either.
> 
> 
> This shouldn't be too hard to fix: add an option "aggregate prefixes" to the 
> RGW
> class method and duplicate the fast-forward logic from radosgw in
> cls_rgw. It doesn't
> even need to change the response type or anything, it just needs to
> limit entries in
> common prefixes to one result.
> Is this a good idea or am I missing something?
> 
> IO would be reduced by a factor of 100 for that particular
> pathological case. I've
> unfortunately seen a real-world setup that I think hits a case like that.

Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread Eric Ivancich
the (single) index object to list the first 1000 objects. 
> It'll
> return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000
> 
> radosgw will discard 999 of these and detect one common prefix and continue 
> the
> iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
> The OSD will then return everything in /p2/ in that next request and so on.
> 
> So it'll internally list every single object in that bucket. That will
> be a problem
> for large buckets and having lots of shards doesn't help either.
> 
> 
> This shouldn't be too hard to fix: add an option "aggregate prefixes" to the 
> RGW
> class method and duplicate the fast-forward logic from radosgw in
> cls_rgw. It doesn't
> even need to change the response type or anything, it just needs to
> limit entries in
> common prefixes to one result.
> Is this a good idea or am I missing something?

On the face it looks good. I’ll raise this with other RGW developers. I do know 
that there was a related bug that was recently addressed with this pr:

https://github.com/ceph/ceph/pull/28192

But your suggestion seems to go farther.
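
To make the request pattern concrete, this is the kind of listing being
discussed -- a delimiter listing that returns only common prefixes (aws CLI
shown purely as an illustration; bucket name is a placeholder):

aws s3api list-objects-v2 --bucket example-bucket --delimiter '/' --max-keys 1000

With a layout like the one Paul describes (many prefixes, each holding many
objects), the client gets back only the common prefixes, while the OSD-side
index scan behind the request walks essentially every key in the bucket.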

> IO would be reduced by a factor of 100 for that particular
> pathological case. I've
> unfortunately seen a real-world setup that I think hits a case like that.


Thank you for sharing your experiences and your ideas.

Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW - Multisite setup -> question about Bucket - Sharding, limitations and synchronization

2019-07-31 Thread Eric Ivancich
t’s about right.

> And If I understand it correct, how would look the exact strategy in a 
> multisite - setup to resync e.g. a single bucket at which one zone got 
> corrupted and must be get back into a synchronous state?

Be aware that there are full syncs and incremental syncs. Full syncs just copy 
every object. Incremental syncs use logs to sync selectively. Perhaps Casey 
will weigh in and discuss the state transitions.
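
As a rough sketch of the commands involved (bucket name is a placeholder;
the disable/enable toggle is one common way to push a bucket back through a
full sync, but check its behavior on your release first):

radosgw-admin sync status                             # overall zone sync state
radosgw-admin bucket sync status --bucket=<bucket>    # per-bucket shard sync state
radosgw-admin bucket sync disable --bucket=<bucket>
radosgw-admin bucket sync enable --bucket=<bucket>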

> Hope thats the correct place to ask such questions.
> 
> Best Regards,
> Daly


--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-07-01 Thread J. Eric Ivancich
> On Jun 27, 2019, at 4:53 PM, David Turner  wrote:
> 
> I'm still going at 452M incomplete uploads. There are guides online for 
> manually deleting buckets kinda at the RADOS level that tend to leave data 
> stranded. That doesn't work for what I'm trying to do so I'll keep going with 
> this and wait for that PR to come through and hopefully help with bucket 
> deletion.
> 
> On Thu, Jun 27, 2019 at 2:58 PM Sergei Genchev <sgenc...@gmail.com> wrote:
> @David Turner
> Did your bucket delete ever finish? I am up to 35M incomplete uploads,
> and I doubt that I actually had that many upload attempts. I could be
> wrong though.
> Is there a way to force bucket deletion, even at the cost of not
> cleaning up space?


Just a quick update….

The PR merged and backports are underway for luminous, mimic, and nautilus:

http://tracker.ceph.com/issues/40526

Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot delete bucket

2019-06-25 Thread J. Eric Ivancich
On 6/24/19 1:49 PM, David Turner wrote:
> It's aborting incomplete multipart uploads that were left around. First
> it will clean up the cruft like that and then it should start actually
> deleting the objects visible in stats. That's my understanding of it
> anyway. I'm in the middle of cleaning up some buckets right now doing
> this same thing. I'm up to `WARNING : aborted 108393000 incomplete
> multipart uploads`. This bucket had a client uploading to it constantly
> with a very bad network connection.

There's a PR to better deal with this situation:

https://github.com/ceph/ceph/pull/28724
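
If you mainly want to keep incomplete uploads from piling up in the first
place, a client-side sketch (aws CLI; bucket, key, and upload id are
placeholders):

aws s3api list-multipart-uploads --bucket example-bucket
aws s3api abort-multipart-upload --bucket example-bucket --key path/to/key --upload-id <upload-id>

A lifecycle rule with an AbortIncompleteMultipartUpload action can do the
same thing automatically, assuming your gateways have lifecycle enabled.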

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP object in RGW GC pool

2019-06-11 Thread J. Eric Ivancich
Hi Wido,

Interleaving below

On 6/11/19 3:10 AM, Wido den Hollander wrote:
> 
> I thought it was resolved, but it isn't.
> 
> I counted all the OMAP values for the GC objects and I got back:
> 
> gc.0: 0
> gc.11: 0
> gc.14: 0
> gc.15: 0
> gc.16: 0
> gc.18: 0
> gc.19: 0
> gc.1: 0
> gc.20: 0
> gc.21: 0
> gc.22: 0
> gc.23: 0
> gc.24: 0
> gc.25: 0
> gc.27: 0
> gc.29: 0
> gc.2: 0
> gc.30: 0
> gc.3: 0
> gc.4: 0
> gc.5: 0
> gc.6: 0
> gc.7: 0
> gc.8: 0
> gc.9: 0
> gc.13: 110996
> gc.10: 04
> gc.26: 42
> gc.28: 111292
> gc.17: 111314
> gc.12: 111534
> gc.31: 111956

Casey Bodley mentioned to me that he's seen similar behavior to what
you're describing when RGWs are upgraded but not all OSDs are upgraded
as well. Is it possible that the OSDs hosting gc.13, gc.10, and so forth
are running a different version of ceph?
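
A quick way to check for mixed versions (luminous and later):

ceph versions              # daemon counts per running version
ceph tell osd.\* version   # per-OSD, if you need to pin down specific OSDs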

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP object in RGW GC pool

2019-06-04 Thread J. Eric Ivancich
On 6/4/19 7:37 AM, Wido den Hollander wrote:
> I've set up a temporary machine next to the 13.2.5 cluster with the
> 13.2.6 packages from Shaman.
> 
> On that machine I'm running:
> 
> $ radosgw-admin gc process
> 
> That seems to work as intended! So the PR seems to have fixed it.
> 
> Should be fixed permanently when 13.2.6 is officially released.
> 
> Wido

Thank you, Wido, for sharing the results of your experiment. I'm happy
to learn that it was successful. And v13.2.6 was just released about 2
hours ago.

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP object in RGW GC pool

2019-05-29 Thread J. Eric Ivancich
Hi Wido,

When you run `radosgw-admin gc list`, I assume you are *not* using the
"--include-all" flag, right? If you're not using that flag, then
everything listed should be expired and be ready for clean-up. If after
running `radosgw-admin gc process` the same entries appear in
`radosgw-admin gc list` then gc apparently stalled.
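
A quick way to compare the two views (assuming jq is available, since gc
list emits a JSON array):

radosgw-admin gc list | jq length                 # entries past their expiration
radosgw-admin gc list --include-all | jq length   # everything queued for gc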

There were a few bugs within gc processing that could prevent it from
making forward progress. They were resolved with a PR (master:
https://github.com/ceph/ceph/pull/26601 ; mimic backport:
https://github.com/ceph/ceph/pull/27796). Unfortunately that code was
backported after the 13.2.5 release, but it is in place for the 13.2.6
release of mimic.

Eric


On 5/29/19 3:19 AM, Wido den Hollander wrote:
> Hi,
> 
> I've got a Ceph cluster with this status:
> 
> health: HEALTH_WARN
> 3 large omap objects
> 
> After looking into it I see that the issue comes from objects in the
> '.rgw.gc' pool.
> 
> Investigating it I found that the gc.* objects have a lot of OMAP keys:
> 
> for OBJ in $(rados -p .rgw.gc ls); do
>   echo $OBJ
>   rados -p .rgw.gc listomapkeys $OBJ|wc -l
> done
> 
> I then found out that on average these objects have about 100k of OMAP
> keys each, but two stand out and have about 3M OMAP keys.
> 
> I can list the GC with 'radosgw-admin gc list' and this yields a JSON
> which is a couple of MB in size.
> 
> I ran:
> 
> $ radosgw-admin gc process
> 
> That runs for hours and then finishes, but the large list of OMAP keys
> stays.
> 
> Running Mimic 13.3.5 on this cluster.
> 
> Has anybody seen this before?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker diferent.

2019-05-15 Thread J. Eric Ivancich
Hi Manuel,

My response is interleaved below.

On 5/8/19 3:17 PM, EDH - Manuel Rios Fernandez wrote:
> Eric,
> 
> Yes we do :
> 
> time s3cmd ls s3://[BUCKET]/ --no-ssl and we get near 2min 30 secs for list 
> the bucket.

We're adding an --allow-unordered option to `radosgw-admin bucket list`.
That would likely speed up your listing. If you want to follow the
trackers, they are:

https://tracker.ceph.com/issues/39637 [feature added to master]
https://tracker.ceph.com/issues/39730 [nautilus backport]
https://tracker.ceph.com/issues/39731 [mimic backport]
https://tracker.ceph.com/issues/39732 [luminous backport]
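
Once the backport for your release lands, the listing would look roughly
like this (flag name per the trackers above; bucket name is a placeholder):

radosgw-admin bucket list --bucket=<bucket> --allow-unordered --max-entries=1000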

> If we instantly hit again the query it normally timeouts.

That's interesting. I don't have an explanation for that behavior. I
would suggest creating a tracker for the issue, ideally with the minimal
steps to reproduce the issue. My concern is that your bucket has so many
objects, and if that's related to the issue, it would not be easy to
reproduce.

> Could you explain a little more "
> 
> With respect to your earlier message in which you included the output of 
> `ceph df`, I believe the reason that default.rgw.buckets.index shows as
> 0 bytes used is that the index uses the metadata branch of the object to 
> store its data.
> "

Each object in ceph has three components. The data itself plus two types
of metadata (omap and xattr). The `ceph df` command doesn't count the
metadata.

The bucket indexes that track the objects in each bucket use only the
metadata. So you won't see that reported in `ceph df`.
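
If you want to see that metadata directly, a rough sketch against the index
pool (pool name is the default one; index objects follow the
.dir.<bucket-id>[.<shard>] naming, so the object name below is a
placeholder):

rados -p default.rgw.buckets.index ls | head
rados -p default.rgw.buckets.index listomapkeys <index-object-name> | wc -l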

> I read in IRC today that in Nautilus release now is well calculated and no 
> show more 0B. Is it correct?

I don't know. I wasn't aware of any changes in nautilus that report
metadata in `ceph df`.

> Thanks for your response.

You're welcome,

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker diferent.

2019-05-09 Thread J. Eric Ivancich
Hi Manuel,

I’ve interleaved responses below.

> On May 8, 2019, at 3:17 PM, EDH - Manuel Rios Fernandez 
>  wrote:
> 
> Eric,
> 
> Yes we do :
> 
> time s3cmd ls s3://[BUCKET]/ --no-ssl and we get near 2min 30 secs for list 
> the bucket.
> 
> If we instantly hit again the query it normally timeouts.
> 
> 
> Could you explain a little more "
> 
> With respect to your earlier message in which you included the output of 
> `ceph df`, I believe the reason that default.rgw.buckets.index shows as
> 0 bytes used is that the index uses the metadata branch of the object to 
> store its data.
> “

Each object stored in ceph is composed of 3 distinct parts — the data, the 
xattr metadata (older), and the omap metadata (newer). For the system objects 
that manage RGW on top of ceph we often use the omap metadata. We use this for 
bucket indexes and for various types of logs, for example.

`ceph df` reports only the data’s size and not the two types of metadata sizes. 
So that would explain why you see 0B for the bucket index objects.

> I read in IRC today that in Nautilus release now is well calculated and no 
> show more 0B. Is it correct?

I am having difficulty understanding that sentence. Would you be so kind as to 
rewrite it? I don’t want to create confusion by guessing.

Eric

> Thanks for your response.
> 
> 
> -----Original Message-----
> From: J. Eric Ivancich
> Sent: Wednesday, May 8, 2019 21:00
> To: EDH - Manuel Rios Fernandez; 'Casey Bodley'; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker 
> diferent.
> 
> Hi Manuel,
> 
> My response is interleaved.
> 
> On 5/7/19 7:32 PM, EDH - Manuel Rios Fernandez wrote:
>> Hi Eric,
>> 
>> This looks like something the software developer must do, not something than 
>> Storage provider must allow no?
> 
> True -- so you're using `radosgw-admin bucket list --bucket=XYZ` to list the 
> bucket? Currently we do not allow for a "--allow-unordered" flag, but there's 
> no reason we could not. I'm working on the PR now, although it might take 
> some time before it gets to v13.
> 
>> Strange behavior is that sometimes bucket is list fast in less than 30 secs 
>> and other time it timeout after 600 secs, the bucket contains 875 folders 
>> with a total object number of 6Millions.
>> 
>> I don’t know how a simple list of 875 folder can timeout after 600 
>> secs
> 
> Burkhard Linke's comment is on target. The "folders" are a trick using 
> delimiters. A bucket is really entirely flat without a hierarchy.
> 
>> We bought several NVMe Optane for do 4 partitions in each PCIe card and get 
>> up 1.000.000 IOPS for Index. Quite expensive because we calc that our index 
>> is just 4GB (100-200M objects),waiting those cards. Any more idea?
> 
> With respect to your earlier message in which you included the output of 
> `ceph df`, I believe the reason that default.rgw.buckets.index shows as
> 0 bytes used is that the index uses the metadata branch of the object to 
> store its data.
> 
>> Regards
> 
> Eric
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker diferent.

2019-05-08 Thread J. Eric Ivancich
Hi Manuel,

My response is interleaved.

On 5/7/19 7:32 PM, EDH - Manuel Rios Fernandez wrote:
> Hi Eric,
> 
> This looks like something the software developer must do, not something than 
> Storage provider must allow no?

True -- so you're using `radosgw-admin bucket list --bucket=XYZ` to list
the bucket? Currently we do not allow for a "--allow-unordered" flag,
but there's no reason we could not. I'm working on the PR now, although
it might take some time before it gets to v13.

> Strange behavior is that sometimes bucket is list fast in less than 30 secs 
> and other time it timeout after 600 secs, the bucket contains 875 folders 
> with a total object number of 6Millions.
> 
> I don’t know how a simple list of 875 folder can timeout after 600 secs

Burkhard Linke's comment is on target. The "folders" are a trick using
delimiters. A bucket is really entirely flat without a hierarchy.

> We bought several NVMe Optane for do 4 partitions in each PCIe card and get 
> up 1.000.000 IOPS for Index. Quite expensive because we calc that our index 
> is just 4GB (100-200M objects),waiting those cards. Any more idea?

With respect to your earlier message in which you included the output of
`ceph df`, I believe the reason that default.rgw.buckets.index shows as
0 bytes used is that the index uses the metadata branch of the object to
store its data.

> Regards

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker diferent.

2019-05-07 Thread J. Eric Ivancich
On 5/7/19 11:24 AM, EDH - Manuel Rios Fernandez wrote:
> Hi Casey
> 
> ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic
> (stable)
> 
> Reshard is something than don’t allow us customer to list index?
> 
> Regards
Listing of buckets with a large number of objects is notoriously slow,
because the entries are not stored in lexical order but the default
behavior is to list the objects in lexical order.

If your use case allows for an unordered listing it would likely perform
better. You can see some documentation here under the S3 API / GET BUCKET:

http://docs.ceph.com/docs/mimic/radosgw/s3/bucketops/

Are you using S3?

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to config mclock_client queue?

2019-03-26 Thread J. Eric Ivancich
So I do not think the mclock_client queue works the way you’re hoping it does. For 
categorization purposes it joins the operation class and the client identifier, 
with the intent of executing operations among clients more evenly 
(i.e., it won’t favor one client over another).

However, it was not designed for per-client distinct configurations, which is 
what it seems that you’re after.

I started an effort to update librados (and the path all the way back to the 
OSDs) to allow per-client QoS configuration. However I got pulled off of that 
for other priorities. I believe Mark Kogan is working on that as he has time. 
That might be closer to what you’re after. See: 
https://github.com/ceph/ceph/pull/20235 .

Eric

> On Mar 26, 2019, at 8:14 AM, Wang Chuanwen  wrote:
> 
> I am now trying to run tests to see how mclock_client queue works on mimic. 
> But when I tried to config tag (r,w,l) of each client, I found there are no 
> options to distinguish different clients.
> All I got are following options for mclock_opclass, which are used to 
> distinguish different types of operations.
> 
> [root@ceph-node1 ~]# ceph daemon osd.0 config show | grep mclock
> "osd_op_queue": "mclock_opclass",
> "osd_op_queue_mclock_client_op_lim": "100.00",
> "osd_op_queue_mclock_client_op_res": "100.00",
> "osd_op_queue_mclock_client_op_wgt": "500.00",
> "osd_op_queue_mclock_osd_subop_lim": "0.00",
> "osd_op_queue_mclock_osd_subop_res": "1000.00",
> "osd_op_queue_mclock_osd_subop_wgt": "500.00",
> "osd_op_queue_mclock_recov_lim": "0.001000",
> "osd_op_queue_mclock_recov_res": "0.00",
> "osd_op_queue_mclock_recov_wgt": "1.00",
> "osd_op_queue_mclock_scrub_lim": "100.00",
> "osd_op_queue_mclock_scrub_res": "100.00",
> "osd_op_queue_mclock_scrub_wgt": "500.00",
> "osd_op_queue_mclock_snap_lim": "0.001000",
> "osd_op_queue_mclock_snap_res": "0.00",
> "osd_op_queue_mclock_snap_wgt": "1.00"
> 
> I am wondering if ceph mimic provide any configuration interfaces for 
> mclock_client queue?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Omap issues - metadata creating too many

2019-01-03 Thread J. Eric Ivancich
If you can wait a few weeks until the next release of luminous there
will be tooling to do this safely. Abhishek Lekshmanan of SUSE
contributed the PR. It adds some sub-commands to radosgw-admin:

radosgw-admin reshard stale-instances list
radosgw-admin reshard stale-instances rm

If you do it manually you should proceed with extreme caution as you
could do some damage that you might not be able to recover from.

Eric

On 1/3/19 11:31 AM, Bryan Stillwell wrote:
> Josef,
> 
> I've noticed that when dynamic resharding is on it'll reshard some of
> our bucket indices daily (sometimes more).  This causes a lot of wasted
> space in the .rgw.buckets.index pool which might be what you are seeing.
> 
> You can get a listing of all the bucket instances in your cluster with
> this command:
> 
> radosgw-admin metadata list bucket.instance | jq -r '.[]' | sort
> 
> Give that a try and see if you see the same problem.  It seems that once
> you remove the old bucket instances the omap dbs don't reduce in size
> until you compact them.
> 
> Bryan
> 
> From: Josef Zelenka
> Date: Thursday, January 3, 2019 at 3:49 AM
> To: "J. Eric Ivancich"
> Cc: "ceph-users@lists.ceph.com", Bryan Stillwell
> Subject: Re: [ceph-users] Omap issues - metadata creating too many
> 
> Hi, i had the default - so it was on(according to ceph kb). turned it
> off, but the issue persists. i noticed Bryan Stillwell(cc-ing him) had
> the same issue (reported about it yesterday) - tried his tips about
> compacting, but it doesn't do anything, however i have to add to his
> last point, this happens even with bluestore. Is there anything we can
> do to clean up the omap manually?
> 
> Josef
> 
> On 18/12/2018 23:19, J. Eric Ivancich wrote:
> 
> On 12/17/18 9:18 AM, Josef Zelenka wrote:
> 
> Hi everyone, i'm running a Luminous 12.2.5 cluster with 6 hosts on
> ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs(three
> nodes have an additional SSD i added to have more space to rebalance the
> metadata). CUrrently, the cluster is used mainly as a radosgw storage,
> with 28tb data in total, replication 2x for both the metadata and data
> pools(a cephfs isntance is running alongside there, but i don't think
> it's the perpetrator - this happenned likely before we had it). All
> pools aside from the data pool of the cephfs and data pool of the
> radosgw are located on the SSD's. Now, the interesting thing - at random
> times, the metadata OSD's fill up their entire capacity with OMAP data
> and go to r/o mode and we have no other option currently than deleting
> them and re-creating. The fillup comes at a random time, it doesn't seem
> to be triggered by anything and it isn't caused by some data influx. It
> seems like some kind of a bug to me to be honest, but i'm not certain -
> anyone else seen this behavior with their radosgw? Thanks a lot
> 
> Hi Josef,
> 
> Do you have rgw_dynamic_resharding turned on? Try turning it off and see
> if the behavior continues.
> 
> One theory is that dynamic resharding is triggered and possibly not
> completing. This could add a lot of data to omap for the incomplete
> bucket index shards. After a delay it tries resharding again, possibly
> failing again, and adding more data to the omap. This continues.
> 
> If this is the ultimate issue we have some commits on the upstream
> luminous branch that are designed to address this set of issues.
> 
> But we should first see if this is the cause.
> 
> Eric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing orphaned radosgw bucket indexes from pool

2018-12-18 Thread J. Eric Ivancich
On 11/29/18 6:58 PM, Bryan Stillwell wrote:
> Wido,
> 
> I've been looking into this large omap objects problem on a couple of our 
> clusters today and came across your script during my research.
> 
> The script has been running for a few hours now and I'm already over 100,000 
> 'orphaned' objects!
> 
> It appears that ever since upgrading to Luminous (12.2.5 initially, followed 
> by 12.2.8) this cluster has been resharding the large bucket indexes at least 
> once a day and not cleaning up the previous bucket indexes:
> 
> for instance in $(radosgw-admin metadata list bucket.instance | jq -r '.[]' | 
> grep go-test-dashboard); do
>   mtime=$(radosgw-admin metadata get bucket.instance:${instance} | grep mtime)
>   num_shards=$(radosgw-admin metadata get bucket.instance:${instance} | grep 
> num_shards)
>   echo "${instance}: ${mtime} ${num_shards}"
> done | column -t | sort -k3
> go-test-dashboard:default.188839135.327804:  "mtime":  "2018-06-01  
> 22:35:28.693095Z",  "num_shards":  0,
> go-test-dashboard:default.617828918.2898:"mtime":  "2018-06-02  
> 22:35:40.438738Z",  "num_shards":  46,
> go-test-dashboard:default.617828918.4:   "mtime":  "2018-06-02  
> 22:38:21.537259Z",  "num_shards":  46,
> go-test-dashboard:default.617663016.10499:   "mtime":  "2018-06-03  
> 23:00:04.185285Z",  "num_shards":  46,
> [...snip...]
> go-test-dashboard:default.891941432.342061:  "mtime":  "2018-11-28  
> 01:41:46.777968Z",  "num_shards":  7,
> go-test-dashboard:default.928133068.2899:"mtime":  "2018-11-28  
> 20:01:49.390237Z",  "num_shards":  46,
> go-test-dashboard:default.928133068.5115:"mtime":  "2018-11-29  
> 01:54:17.788355Z",  "num_shards":  7,
> go-test-dashboard:default.928133068.8054:"mtime":  "2018-11-29  
> 20:21:53.733824Z",  "num_shards":  7,
> go-test-dashboard:default.891941432.359004:  "mtime":  "2018-11-29  
> 20:22:09.201965Z",  "num_shards":  46,
> 
> The num_shards is typically around 46, but looking at all 288 instances of 
> that bucket index, it has varied between 3 and 62 shards.
> 
> Have you figured anything more out about this since you posted this 
> originally two weeks ago?
> 
> Thanks,
> Bryan
> 
> From: ceph-users  on behalf of Wido den 
> Hollander 
> Date: Thursday, November 15, 2018 at 5:43 AM
> To: Ceph Users 
> Subject: [ceph-users] Removing orphaned radosgw bucket indexes from pool
> 
> Hi,
> 
> Recently we've seen multiple messages on the mailinglists about people
> seeing HEALTH_WARN due to large OMAP objects on their cluster. This is
> due to the fact that starting with 12.2.6 OSDs warn about this.
> 
> I've got multiple people asking me the same questions and I've done some
> digging around.
> 
> Somebody on the ML wrote this script:
> 
> for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort`; do
>   actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
>   for instance in `radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep ${bucket}: | cut -d ':' -f 2`
>   do
>     if [ "$actual_id" != "$instance" ]
>     then
>       radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
>       radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
>     fi
>   done
> done
> 
> That partially works, but 'orphaned' objects in the index pool do not work.
> 
> So I wrote my own script [0]:
> 
> #!/bin/bash
> INDEX_POOL=$1
> 
> if [ -z "$INDEX_POOL" ]; then
> echo "Usage: $0 "
> exit 1
> fi
> 
> INDEXES=$(mktemp)
> METADATA=$(mktemp)
> 
> trap "rm -f ${INDEXES} ${METADATA}" EXIT
> 
> radosgw-admin metadata list bucket.instance|jq -r '.[]' > ${METADATA}
> rados -p ${INDEX_POOL} ls > $INDEXES
> 
> for OBJECT in $(cat ${INDEXES}); do
> MARKER=$(echo ${OBJECT}|cut -d '.' -f 3,4,5)
> grep ${MARKER} ${METADATA} > /dev/null
> if [ "$?" -ne 0 ]; then
> echo $OBJECT
> fi
> done
> 
> It does not remove anything, but for example, it returns these objects:
> 
> .dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10406917.5752
> .dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6162
> .dir.eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6186
> 
> The output of:
> 
> $ radosgw-admin metadata list|jq -r '.[]'
> 
> Does not contain:
> - eb32b1ca-807a-4867-aea5-ff43ef7647c6.10406917.5752
> - eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6162
> - eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6186
> 
> So for me these objects do not seem to be tied to any bucket and seem to
> be leftovers which were not cleaned up.
> 
> For example, I see these objects tied to a bucket:
> 
> - eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6160
> - eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6188
> - eb32b1ca-807a-4867-aea5-ff43ef7647c6.10289105.6167
> 
> But notice the difference: 6160, 6188, 6167, but not 6162 nor 6186
> 
> Before I remove these objects I want to verify with other users if they
> see the same and if my thinking is correct.
> 
> Wido
> 
> [0]: 

Re: [ceph-users] Omap issues - metadata creating too many

2018-12-18 Thread J. Eric Ivancich
On 12/17/18 9:18 AM, Josef Zelenka wrote:
> Hi everyone, i'm running a Luminous 12.2.5 cluster with 6 hosts on
> ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs(three
> nodes have an additional SSD i added to have more space to rebalance the
> metadata). CUrrently, the cluster is used mainly as a radosgw storage,
> with 28tb data in total, replication 2x for both the metadata and data
> pools(a cephfs isntance is running alongside there, but i don't think
> it's the perpetrator - this happenned likely before we had it). All
> pools aside from the data pool of the cephfs and data pool of the
> radosgw are located on the SSD's. Now, the interesting thing - at random
> times, the metadata OSD's fill up their entire capacity with OMAP data
> and go to r/o mode and we have no other option currently than deleting
> them and re-creating. The fillup comes at a random time, it doesn't seem
> to be triggered by anything and it isn't caused by some data influx. It
> seems like some kind of a bug to me to be honest, but i'm not certain -
> anyone else seen this behavior with their radosgw? Thanks a lot

Hi Josef,

Do you have rgw_dynamic_resharding turned on? Try turning it off and see
if the behavior continues.

One theory is that dynamic resharding is triggered and possibly not
completing. This could add a lot of data to omap for the incomplete
bucket index shards. After a delay it tries resharding again, possibly
failing again, and adding more data to the omap. This continues.

If this is the ultimate issue we have some commits on the upstream
luminous branch that are designed to address this set of issues.

But we should first see if this is the cause.

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inexplicably slow bucket listing at top level

2018-11-05 Thread J. Eric Ivancich
I did make an inquiry and someone here does have some experience w/ the
mc command -- minio client. We're curious how "ls -r" is implemented
under mc. Does it need to get a full listing and then do some path
parsing to produce nice output? If so, it may be playing a role in the
delay as well.

Eric

On 9/26/18 5:27 PM, Graham Allan wrote:
> I have one user bucket, where inexplicably (to me), the bucket takes an
> eternity to list, though only on the top level. There are two
> subfolders, each of which lists individually at a completely normal
> speed...
> 
> eg (using minio client):
> 
>> [~] % time ./mc ls fried/friedlab/
>> [2018-09-26 16:15:48 CDT] 0B impute/
>> [2018-09-26 16:15:48 CDT] 0B wgs/
>>
>> real    1m59.390s
>>
>> [~] % time ./mc ls -r fried/friedlab/
>> ...
>> real 3m18.013s
>>
>> [~] % time ./mc ls -r fried/friedlab/impute
>> ...
>> real 0m13.512s
>>
>> [~] % time ./mc ls -r fried/friedlab/wgs
>> ...
>> real 0m6.437s
> 
> The bucket has about 55k objects total, with 32 index shards on a
> replicated ssd pool. It shouldn't be taking this long but I can't
> imagine what could be causing this. I haven't found any others behaving
> this way. I'd think it has to be some problem with the bucket index, but
> what...?
> 
> I did naively try some "radosgw-admin bucket check [--fix]" commands
> with no change.
> 
> Graham

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inexplicably slow bucket listing at top level

2018-11-05 Thread J. Eric Ivancich
The numbers you're reporting strike me as surprising as well. Which version are 
you running?

In case you're not aware, listing of buckets is not a very efficient operation 
given that the listing is required to return with objects in lexical order. 
They are distributed across the shards via a hash, which is not in lexical 
order. So every shard has to have a chunk read and brought to the rgw and the 
top elements are sorted and returned. For example, in order to return the first 
1000 object names, it asks each of the 32 shards for their first 1000 object 
names, and then does a selection process to get the first 1000 among the 32000. 
It returns that, and the process is then repeated.
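
One way to sanity-check the index layout behind a slow listing (bucket and
instance id are placeholders; `bucket limit check` is available in recent
releases):

radosgw-admin bucket limit check
radosgw-admin metadata get bucket.instance:<bucket>:<instance-id> | grep num_shards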

I'm unfamiliar with your mc script/command, so I don't know if that might be 
contributing to the issue.

We have added the ability to list buckets in unsorted order and made that 
accessible via s3 and swift and that's been backported all the way to upstream 
luminous.

Eric

On 9/26/18 5:27 PM, Graham Allan wrote:
> I have one user bucket, where inexplicably (to me), the bucket takes an
> eternity to list, though only on the top level. There are two
> subfolders, each of which lists individually at a completely normal
> speed...
>
> eg (using minio client):
>
> > [~] % time ./mc ls fried/friedlab/
> > [2018-09-26 16:15:48 CDT] 0B impute/
> > [2018-09-26 16:15:48 CDT] 0B wgs/
> >
> > real    1m59.390s
> >
> > [~] % time ./mc ls -r fried/friedlab/
> > ...
> > real 3m18.013s
> >
> > [~] % time ./mc ls -r fried/friedlab/impute
> > ...
> > real 0m13.512s
> >
> > [~] % time ./mc ls -r fried/friedlab/wgs
> > ...
> > real 0m6.437s
>
> The bucket has about 55k objects total, with 32 index shards on a
> replicated ssd pool. It shouldn't be taking this long but I can't
> imagine what could be causing this. I haven't found any others behaving
> this way. I'd think it has to be some problem with the bucket index, but
> what...?
>
> I did naively try some "radosgw-admin bucket check [--fix]" commands
> with no change.
>
> Graham


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com