[ceph-users] OSD Crash During Deep-Scrub

2021-03-29 Thread Dave Hall
Hello,

A while back, I was having an issue with an OSD repeatedly crashing.  I
ultimately reweighted it to zero and then marked it 'out'.  I have since found
that the logs for those crashes match https://tracker.ceph.com/issues/46490.

Since the OSD is now in a 'safe-to-destroy' state, I'm wondering about the best
course of action - should I just mark it back in, or should I destroy and
rebuild it?  If clearing it the way I have, in combination with updating
to 14.2.16, will prevent it from misbehaving, why go through the trouble of
destroying and rebuilding?
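
For context, the two paths I'm weighing look roughly like this (the OSD id and
CRUSH weight below are placeholders, not the actual values from my cluster):

# confirm the OSD really is safe to destroy
ceph osd safe-to-destroy osd.12

# option 1: put it back into service after the 14.2.16 update
ceph osd crush reweight osd.12 <original-crush-weight>
ceph osd in 12

# option 2: destroy and rebuild it
ceph osd destroy 12 --yes-i-really-mean-it
# then zap the device and re-create the OSD with ceph-volume on that host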

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus - PG Autoscaler Global vs Pool Setting

2021-03-29 Thread Dave Hall
All,

In looking at the options for setting the default pg autoscale mode, I
notice that there is a global option setting and a per-pool option
setting.  It seems that the options at the pool level are off, warn, and
on.  The same, I assume, applies to the global setting.

Is there a way to get rid of the per-pool setting and have the pool honor
the global setting?  I think I'm looking for 'off, warn, on, or global'.
It seems that once the per-pool option is set for all of one's pools, the
global value is irrelevant.  This also implies that if one wanted to
temporarily suspend autoscaling, it would be necessary to modify the
setting for each pool and then modify it back afterward.
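
For reference, the knobs I'm juggling look something like this (the pool name
is a placeholder):

# show the current mode per pool (AUTOSCALE column)
ceph osd pool autoscale-status

# per-pool setting - only 'off', 'warn' and 'on' seem to be accepted
ceph osd pool set <pool> pg_autoscale_mode warn

# global default, which as far as I can tell only applies to newly created pools
ceph config set global osd_pool_default_pg_autoscale_mode warn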

Thoughts?

Thanks

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu


On Mon, Mar 29, 2021 at 1:44 PM Anthony D'Atri 
wrote:

> Yes, the PG autoscaler has a way of reducing PG count way too far.  There’s
> a claim that it’s better in Pacific, but I tend to recommend disabling it
> and calculating / setting pg_num manually.
>
> > On Mar 29, 2021, at 9:06 AM, Dave Hall  wrote:
> >
> > Eugen,
> >
> > I didn't really think my cluster was eating itself, but I also didn't
> want
> > to be in denial.
> >
> > Regarding the autoscaler, I really thought that it only went up - I
> didn't
> > expect that it would decrease the number of PGs.  Plus, I thought I had
> it
> > turned off.  I see now that it's off globally but enabled for this
> > particular pool.  Also, I see that the target PG count is lower than the
> > current.
> >
> > I guess you learn something new every day.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdh...@binghamton.edu
> > 607-760-2328 (Cell)
> > 607-777-4641 (Office)
> >
> >
> > On Mon, Mar 29, 2021 at 7:52 AM Eugen Block  wrote:
> >
> >> Hi,
> >>
> >> that sounds like the pg_autoscaler is doing its work. Check with:
> >>
> >> ceph osd pool autoscale-status
> >>
> >> I don't think ceph is eating itself or that you're losing data. ;-)
> >>
> >>
> >> Zitat von Dave Hall :
> >>
> >>> Hello,
> >>>
> >>> About 3 weeks ago I added a node and increased the number of OSDs in my
> >>> cluster from 24 to 32, and then marked one old OSD down because it was
> >>> frequently crashing.
> >>>
> >>> After adding the new OSDs the PG count jumped fairly dramatically, but
> >> ever
> >>> since, amidst a continuous low level of rebalancing, the number of PGs
> >> has
> >>> gradually decreased by about 25% from its max value.  Although I
> don't
> >>> have specific notes, my perception is that the current number of PGs is
> >>> actually lower than it was before I added OSDs.
> >>>
> >>> So what's going on here?  It is possible to imagine that my cluster is
> >>> slowly eating itself, and that I'm about to lose 200TB of data. It's
> also
> >>> possible to imagine that this is all due to the gradual optimization of
> >> the
> >>> pools.
> >>>
> >>> Note that the primary pool is an EC 8 + 2 containing about 124TB.
> >>>
> >>> Thanks.
> >>>
> >>> -Dave
> >>>
> >>> --
> >>> Dave Hall
> >>> Binghamton University
> >>> kdh...@binghamton.edu
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Resolving LARGE_OMAP_OBJECTS

2021-03-29 Thread David Orman
Response inline:

On Fri, Mar 5, 2021 at 11:00 AM Benoît Knecht  wrote:
>
> On Friday, March 5th, 2021 at 15:20, Drew Weaver  
> wrote:
> > Sorry to sound clueless but no matter what I search for on El Goog I can't 
> > figure out how to answer the question as to whether dynamic sharding is 
> > enabled in our environment.
> >
> > It's not configured as true in the config files, but it is the default.
> >
> > Is there a radosgw-admin command to determine whether or not it's enabled 
> > in the running environment?
>
> If `rgw_dynamic_resharding` is not explicitly set to `false` in your 
> environment, I think we can assume dynamic resharding is enabled. And if any 
> of your buckets have more than one shard and you didn't reshard them 
> manually, you'll know for sure dynamic resharding is working; you can check 
> the number of shards on a bucket with `radosgw-admin bucket stats 
> --bucket=<bucket>`, there's a `num_shards` field. You can also check with 
> `radosgw-admin bucket limit check` if any of your buckets are about to be 
> resharded.
>
> Assuming dynamic resharding is enabled and none of your buckets are about to 
> be resharded, I would then find out which object has too many OMAP keys by 
> grepping the logs. The name of the object will contain the bucket ID (also 
> found in the output of `radosgw-admin bucket stats`), so you'll know which 
> bucket is causing the issue. And you can check how many OMAP keys are in each 
> shard of that bucket index using
>
> ```
> for obj in $(rados -p default.rgw.buckets.index ls | grep 
> eaf0ece5-9f4a-4aa8-9d67-8c6698f7919b.88726492.4); do
>   printf "%-60s %7d\n" $obj $(rados -p default.rgw.buckets.index listomapkeys 
> $obj | wc -l)
> done
> ```
>
> (where `eaf0ece5-9f4a-4aa8-9d67-8c6698f7919b.88726492.4` is your bucket ID). 
> If the number of keys are very uneven amongst the shards, there's probably an 
> issue that needs to be addressed. If they are relatively even but 
> slightly above the warning threshold, it's probably a versioned bucket, and 
> it should be safe to simply increase the threshold.

As this is somewhat relevant, jumping in here... we're seeing the same
"large omap objects" warning and this is only happening with versioned
buckets/objects. Looking through logs, we can find a few instances:

cluster 2021-03-29T14:22:12.822291+ osd.55 (osd.55) 1074 : cluster
[WRN] Large omap object found. Object:
18:7004a547:::.dir.d99b34b6-5e94-4b64-a189-e23a3fabd712.326812.1.10:head
PG: 18.e2a5200e (18.e) Key count: 264199 Size (bytes): 107603375

We check the bucket (we do have dynamic sharding enabled):

"num_shards": 23,
"num_objects": 1524017

Doing the math, something seems off with that key count (23 shards
for 1.52 million objects shouldn't come out to 260k+ keys per shard). We check:

root@ceph01:~# rados -p res22-vbo1a.rgw.buckets.index listomapkeys
.dir.d99b34b6-5e94-4b64-a189-e23a3fabd712.326812.1.10 | wc -l
264239

Sure enough, it is more than 200,000, just as the alert indicates.
However, why did it not reshard further? Here's the kicker - we _only_
see this with versioned buckets/objects. I don't see anything in the
documentation that indicates this is a known issue with sharding, but
perhaps there is something going on with versioned buckets/objects. Is
there any clarity here/suggestions on how to deal with this? It sounds
like you expect this behavior with versioned buckets, so we must be
missing something.

root@ceph01:~# ceph config get osd rgw_dynamic_resharding
true
root@ceph01:~# ceph config get osd rgw_max_objs_per_shard
10
root@ceph01:~# ceph config get osd rgw_max_dynamic_shards
1999
root@ceph01:~#

Config should be sharding further based on the key counts in each of
the shards. I checked all 23 shards and they are all ~260,000 keys.
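
If it comes to it, I assume we can force the issue manually with something
like the following (the bucket name and shard count are just examples), though
we'd still like to understand why dynamic resharding isn't kicking in on its own:

radosgw-admin bucket reshard --bucket=<bucket> --num-shards=41
radosgw-admin reshard list                      # queued/in-progress reshards
radosgw-admin reshard status --bucket=<bucket>  # progress for this bucket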

Thanks,
David

> Cheers,
>
> --
> Ben
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus - PG count decreasing after adding OSDs

2021-03-29 Thread Dave Hall
Eugen,

I didn't really think my cluster was eating itself, but I also didn't want
to be in denial.

Regarding the autoscaler, I really thought that it only went up - I didn't
expect that it would decrease the number of PGs.  Plus, I thought I had it
turned off.  I see now that it's off globally but enabled for this
particular pool.  Also, I see that the target PG count is lower than the
current.

I guess you learn something new every day.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On Mon, Mar 29, 2021 at 7:52 AM Eugen Block  wrote:

> Hi,
>
> that sounds like the pg_autoscaler is doing its work. Check with:
>
> ceph osd pool autoscale-status
>
> I don't think ceph is eating itself or that you're losing data. ;-)
>
>
> Zitat von Dave Hall :
>
> > Hello,
> >
> > About 3 weeks ago I added a node and increased the number of OSDs in my
> > cluster from 24 to 32, and then marked one old OSD down because it was
> > frequently crashing.
> >
> > After adding the new OSDs the PG count jumped fairly dramatically, but
> ever
> > since, amidst a continuous low level of rebalancing, the number of PGs
> has
> > gradually decreased by about 25% from its max value.  Although I don't
> > have specific notes, my perception is that the current number of PGs is
> > actually lower than it was before I added OSDs.
> >
> > So what's going on here?  It is possible to imagine that my cluster is
> > slowly eating itself, and that I'm about to lose 200TB of data. It's also
> > possible to imagine that this is all due to the gradual optimization of
> the
> > pools.
> >
> > Note that the primary pool is an EC 8 + 2 containing about 124TB.
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdh...@binghamton.edu
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster suspends when Add Mon or stop and start after a while.

2021-03-29 Thread Frank Schilder
Please use the correct list: ceph-users@ceph.io

Probably same problem I had. Try reducing mon_sync_max_payload_size=4096 and 
start a new MON. Should just take a few seconds to boot up.
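
For example (a sketch - assuming the surviving MONs still have quorum so the
centralized config is reachable; otherwise set it under [mon] in ceph.conf):

ceph config set mon mon_sync_max_payload_size 4096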

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 29 March 2021 16:58:19
To: by morphin; ceph-us...@ceph.com
Subject: Re: [ceph-users] Cluster suspends when Add Mon or stop and start after 
a while.

Probably same problem I had. Try reducing mon_sync_max_payload_size=4096 and 
start a new MON. Should just take a few seconds to boot up.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: by morphin 
Sent: 28 March 2021 11:25:02
To: ceph-us...@ceph.com
Subject: [ceph-users] Cluster suspends when Add Mon or stop and start after a 
while.

Hello!

I have a cluster with a datacenter crushmap (A+B; 9+9 = 18 servers).
The cluster started with v12.2.0 Luminous 4 years ago.
Over the years I upgraded the cluster Luminous > Mimic > v14.2.16 Nautilus.
Now I have a weird issue. When I add a mon, or shut one down for a while and
start it up again, the whole cluster suspends: ceph -s does not respond,
and the other two monitors start an election while the booting mon is syncing.
(logs below)



2021-03-28 00:18:23.482 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:23.782 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:24.292 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:26.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 3919 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:29.782 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7899) init, last seen epoch 7899,
mid-election, bumping
2021-03-28 00:18:29.812 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:31.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 3951 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:31.872 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:32.072 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:32.482 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:33.282 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:34.812 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7901) init, last seen epoch 7901,
mid-election, bumping
2021-03-28 00:18:34.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:34.872 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:35.072 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:35.492 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:36.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 3989 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:36.292 7fe2f2e0f700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:39.842 7fe2f160c700  1
mon.SRV-SB-1@1(electing).elector(7903) init, last seen epoch 7903,
mid-election, bumping
2021-03-28 00:18:39.872 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:40.872 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:41.082 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:41.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9
get_health_metrics reporting 4027 slow ops, oldest is log(1 entries
from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:41.492 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:41.812 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9
failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:42.312 7fe2f3610700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:43.882 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:44.082 7fe2ede05700  1 mon.SRV-SB-1@1(electing) e9
handle_auth_request failed to assign global_id
2021-03-28 00:18:44.492 

[ceph-users] Re: Nautilus - PG Autoscaler Global vs Pool Setting

2021-03-29 Thread Eugen Block

Or you could just disable the mgr module. Something like

ceph mgr module disable pg_autoscaler



Zitat von Dave Hall :


All,

In looking at the options for setting the default pg autoscale option, I
notice that there is a global option setting and a per-pool option
setting.  It seems that the options at the pool level are off, warn, and
on.  The same, I assume for the global setting.

Is there a way to get rid of the per-pool setting and set the pool to honor
the global setting?  I think I'm looking for 'off, warn, on, or global'.
 It seems that once the per-pool option is set for all of one's pools, the
global value is irrelevant.  This also implies that in a circumstance where
one would want to temporarily suspend autoscaling it would be required to
modify the setting for each pool and then to modify it back afterward.

Thoughts?

Thanks

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu


On Mon, Mar 29, 2021 at 1:44 PM Anthony D'Atri 
wrote:


Yes, the PG autoscaler has a way of reducing PG count way too far.  There’s
a claim that it’s better in Pacific, but I tend to recommend disabling it
and calculating / setting pg_num manually.

> On Mar 29, 2021, at 9:06 AM, Dave Hall  wrote:
>
> Eugen,
>
> I didn't really think my cluster was eating itself, but I also didn't
want
> to be in denial.
>
> Regarding the autoscaler, I really thought that it only went up - I
didn't
> expect that it would decrease the number of PGs.  Plus, I thought I had
it
> turned off.  I see now that it's off globally but enabled for this
> particular pool.  Also, I see that the target PG count is lower than the
> current.
>
> I guess you learn something new every day.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
>
>
> On Mon, Mar 29, 2021 at 7:52 AM Eugen Block  wrote:
>
>> Hi,
>>
>> that sounds like the pg_autoscaler is doing its work. Check with:
>>
>> ceph osd pool autoscale-status
>>
>> I don't think ceph is eating itself or that you're losing data. ;-)
>>
>>
>> Zitat von Dave Hall :
>>
>>> Hello,
>>>
>>> About 3 weeks ago I added a node and increased the number of OSDs in my
>>> cluster from 24 to 32, and then marked one old OSD down because it was
>>> frequently crashing.
>>>
>>> After adding the new OSDs the PG count jumped fairly dramatically, but
>> ever
>>> since, amidst a continuous low level of rebalancing, the number of PGs
>> has
>>> gradually decreased by about 25% from its max value.  Although I
don't
>>> have specific notes, my perception is that the current number of PGs is
>>> actually lower than it was before I added OSDs.
>>>
>>> So what's going on here?  It is possible to imagine that my cluster is
>>> slowly eating itself, and that I'm about to lose 200TB of data. It's
also
>>> possible to imagine that this is all due to the gradual optimization of
>> the
>>> pools.
>>>
>>> Note that the primary pool is an EC 8 + 2 containing about 124TB.
>>>
>>> Thanks.
>>>
>>> -Dave
>>>
>>> --
>>> Dave Hall
>>> Binghamton University
>>> kdh...@binghamton.edu
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Nautilus - PG count decreasing after adding OSDs

2021-03-29 Thread Dave Hall
Hello,

About 3 weeks ago I added a node and increased the number of OSDs in my
cluster from 24 to 32, and then marked one old OSD down because it was
frequently crashing.

After adding the new OSDs the PG count jumped fairly dramatically, but ever
since, amidst a continuous low level of rebalancing, the number of PGs has
gradually decreased by about 25% from its max value.  Although I don't
have specific notes, my perception is that the current number of PGs is
actually lower than it was before I added OSDs.

So what's going on here?  It is possible to imagine that my cluster is
slowly eating itself, and that I'm about to lose 200TB of data. It's also
possible to imagine that this is all due to the gradual optimization of the
pools.

Note that the primary pool is an EC 8 + 2 containing about 124TB.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: [Suspicious newsletter] bucket index and WAL/DB

2021-03-29 Thread Marcelo
That's true - from the slide I understood that this configuration should be
used. Thanks for the answer.
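
For anyone finding this thread later, the bluestore equivalent of the layout
discussed in the quoted messages below would look roughly like this (device
names are placeholders, not from our hosts; with bluestore the WAL lives on
the DB device unless you point it at something even faster):

# 12 HDD data devices sharing one NVMe for RocksDB/WAL; add --report to preview
ceph-volume lvm batch --bluestore /dev/sd[a-l] --db-devices /dev/nvme0n1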

On Fri, Mar 26, 2021 at 10:26, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:

> Makes sense what you are talking about, I had the same confusing like you,
> finally went with redhat setup:
>
>
> https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize%20Ceph%20object%20storage%20for%20production%20in%20multisite%20clouds.pdf
>
> Slide 27.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> -Original Message-
> From: Marcelo 
> Sent: Friday, March 26, 2021 7:35 PM
> Cc: ceph-users@ceph.io
> Subject: [Suspicious newsletter] [ceph-users] Re: [Suspicious newsletter]
> bucket index and WAL/DB
>
> This is exactly the problem, so we thought about not risking losing the
> entire host by using both NVMes.
>
> From what I understand the bucket index data is stored in the omap, which
> is stored in the block.db, making it unnecessary to create a separate OSD
> for the bucket index. But I didn't find anything in the documentation about
> it.
> It is also unclear whether, if it is necessary to create a separate index
> pool, it would be recommended to place the OSD that serves that pool with
> wal / DB.
>
> On Thu, Mar 25, 2021 at 22:42, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
>
> > Based on a couple of documents that I've read, I finally made the decision
> > to separate the index from wal+db.
> > However, don't you think that the density is a bit high with 12 HDDs per
> > NVMe? If you lose the NVMe you effectively lose your complete host, and
> > a lot of data movement will happen.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > -Original Message-
> > From: Marcelo 
> > Sent: Thursday, March 25, 2021 11:15 PM
> > To: ceph-users@ceph.io
> > Subject: [Suspicious newsletter] [ceph-users] bucket index and WAL/DB
> >
> > Hello everybody.
> >
> > I searched in several places and I couldn't find any information about
> > what the best bucket index and WAL / DB organization would be.
> >
> > I have several hosts consisting of 12 HDDs and 2 NVMes, and currently
> > one of the NVMes serves as WAL / DB for the 10 OSDs and the other NVMe
> > is partitioned in two, serving as 2 OSDs to serve the S3 index pool.
> >
> > I saw in ceph-ansible a playbook (infrastructure-playbooks /
> > lv-create.yml) that creates a division where we have an OSD living
> > with a journal on the same NVMe. The problem is that in lv-vars.yaml
> > used by lv-create.yml it is said that this only applies to the
> > filestore. Is this correct or can I use this same structure with
> bluestore?
> >
> > Thank you all,
> > Marcelo.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> >
> > 
> > This message is confidential and is for the sole use of the intended
> > recipient(s). It may also be privileged or otherwise protected by
> > copyright or other legal rules. If you have received it by mistake
> > please let us know by reply email and delete it from your system. It
> > is prohibited to copy this message or disclose its content to anyone.
> > Any confidentiality or privilege is not waived or lost by any mistaken
> > delivery or unauthorized disclosure of the message. All messages sent
> > to and from Agoda may be monitored to ensure compliance with company
> > policies, to protect the company's interests and to remove potential
> > malware. Electronic messages may be intercepted, amended, lost or
> deleted, or contain viruses.
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
> 
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy
> this message or disclose its content to anyone. Any confidentiality or
> privilege is not waived or lost by any mistaken delivery or unauthorized
> disclosure of the message. All messages sent to and from Agoda may be
> monitored to ensure compliance with company policies, to protect the
> company's interests and to remove potential malware. Electronic messages
> may be intercepted, amended, lost or deleted, or contain viruses.
>

[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-03-29 Thread Dan van der Ster
Hi,

Saw that, looks scary!

I have no experience with that particular crash, but I was thinking
that if you have already backfilled the degraded PGs, and can afford
to try another OSD, you could try:

"bluestore_fsck_quick_fix_threads": "1",  # because
https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
and the dev said it occurs because WriteBatch is not thread safe.

"bluestore_fsck_quick_fix_on_mount": "false", # should disable the
fsck during upgrade. See https://github.com/ceph/ceph/pull/40198
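
For example, something like this before restarting the next OSD host you
upgrade (or set them in that host's ceph.conf):

ceph config set osd bluestore_fsck_quick_fix_on_mount false
ceph config set osd bluestore_fsck_quick_fix_threads 1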

-- Dan

On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:
>
> Hi!
>
> After upgrading MONs and MGRs successfully, the first OSD host I upgraded on 
> Ubuntu Bionic from 14.2.16 to 15.2.10
> shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
> RocksDB complains "Corruption: unknown WriteBatch tag".
>
> The initial crash/corruption occurred when the automatic fsck ran, and 
> when it committed the changes for a lot of "zombie spanning blobs".
>
> Tracker issue with logs: https://tracker.ceph.com/issues/50017
>
>
> Anyone else encountered this error? I've "suspended" the upgrade for now :)
>
> -- Jonas
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-03-29 Thread Jonas Jelten
Hi!

After upgrading MONs and MGRs successfully, the first OSD host I upgraded on 
Ubuntu Bionic from 14.2.16 to 15.2.10
shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
RocksDB complains "Corruption: unknown WriteBatch tag".

The initial crash/corruption occurred when the automatic fsck ran, and when 
it committed the changes for a lot of "zombie spanning blobs".

Tracker issue with logs: https://tracker.ceph.com/issues/50017


Anyone else encountered this error? I've "suspended" the upgrade for now :)

-- Jonas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus - PG count decreasing after adding OSDs

2021-03-29 Thread Eugen Block

Hi,

that sounds like the pg_autoscaler is doing its work. Check with:

ceph osd pool autoscale-status

I don't think ceph is eating itself or that you're losing data. ;-)


Zitat von Dave Hall :


Hello,

About 3 weeks ago I added a node and increased the number of OSDs in my
cluster from 24 to 32, and then marked one old OSD down because it was
frequently crashing.

After adding the new OSDs the PG count jumped fairly dramatically, but ever
since, amidst a continuous low level of rebalancing, the number of PGs has
gradually decreased by about 25% from its max value.  Although I don't
have specific notes, my perception is that the current number of PGs is
actually lower than it was before I added OSDs.

So what's going on here?  It is possible to imagine that my cluster is
slowly eating itself, and that I'm about to lose 200TB of data. It's also
possible to imagine that this is all due to the gradual optimization of the
pools.

Note that the primary pool is an EC 8 + 2 containing about 124TB.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus: Reduce the number of managers

2021-03-29 Thread Stefan Kooman

On 3/28/21 3:52 PM, Dave Hall wrote:

Hello,

We are in the process of bringing new hardware online that will allow us 
to get all of the MGRs, MONs, MDSs, etc.  off of our OSD nodes and onto 
dedicated management nodes.   I've created MGRs and MONs on the new 
nodes, and I found procedures for disabling the MONs from the OSD nodes.


Now I'm looking for the correct procedure to remove the MGRs from the 
OSD nodes.  I haven't found any reference to this in the documentation. 
Is it as simple as stopping and disabling the systemd service/target? Or 
are there Ceph commands?  Do I need to clean up /var/lib/ceph/mgr?


Yes, just disable it. And afterward remove the key for that manager: 
ceph auth rm mgr.$daemon and clean up the mgr directory like you mentioned.
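
For a package-based (non-cephadm) deployment that would look roughly like this
(the daemon name is a placeholder):

systemctl disable --now ceph-mgr@<name>.service
ceph auth rm mgr.<name>
rm -rf /var/lib/ceph/mgr/ceph-<name>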




Same questions about MDS in the near term, but I haven't searched the 
docs yet.


Same thing.

Those daemons, if they need to store state, do so in the cluster. You 
can create / remove these daemons as appropriate.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: memory consumption by osd

2021-03-29 Thread Stefan Kooman

On 3/28/21 4:58 AM, Tony Liu wrote:

I don't see any problems yet. All OSDs are working fine.
Just that 1.8GB free memory concerns me.
I know 256GB memory for 10 OSDs (16TB HDD) is a lot, I am planning to
reduce it or increase osd_memory_target (if that's what you meant) to
boost performance. But before doing that, I'd like to understand what's
taking so much buff/cache and if there is any option to control it.


You can enable/disable bluefs_buffered_io (true / false). It was enabled by
default for a long time, then disabled for a while due to an RGW issue, but
I believe it has since been re-enabled. I would leave it on.


I would consider a lot of unused memory an issue. Linux will generally use as 
much memory to buffer / cache as possible.
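
You can check and change it like this (I believe the OSDs need a restart for a
new value to take effect):

ceph config get osd bluefs_buffered_io
ceph config set osd bluefs_buffered_io true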


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Suspicious newsletter] Re: How to clear Health Warning status?

2021-03-29 Thread Szabo, Istvan (Agoda)
Restart the osd.
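
For example, on the host holding each OSD from the warning (assuming a
package-based deployment):

systemctl restart ceph-osd@29
systemctl restart ceph-osd@16

Newer releases also have 'ceph tell osd.<id> clear_shards_repaired', but as
noted in the quoted message below it is not available on 14.2.9.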

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: jinguk.k...@ungleich.ch 
Sent: Monday, March 29, 2021 10:41 AM
To: Anthony D'Atri 
Cc: ceph-users@ceph.io
Subject: [Suspicious newsletter] [ceph-users] Re: How to clear Health Warning 
status?

Hello there,

Thank you for your response.
There are no errors in syslog, dmesg, or SMART.

# ceph health detail
HEALTH_WARN Too many repaired reads on 2 OSDs OSD_TOO_MANY_REPAIRS Too many 
repaired reads on 2 OSDs
osd.29 had 38 reads repaired
osd.16 had 17 reads repaired

How can I clear this warning?
My Ceph version is 14.2.9 (clear_shards_repaired is not supported).



/dev/sdh1 on /var/lib/ceph/osd/ceph-16 type xfs 
(rw,relatime,attr2,inode64,noquota)

# cat dmesg | grep sdh
[   12.990728] sd 5:2:3:0: [sdh] 19531825152 512-byte logical blocks: (10.0 
TB/9.09 TiB)
[   12.990728] sd 5:2:3:0: [sdh] Write Protect is off
[   12.990728] sd 5:2:3:0: [sdh] Mode Sense: 1f 00 00 08
[   12.990728] sd 5:2:3:0: [sdh] Write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA
[   13.016616]  sdh: sdh1 sdh2
[   13.017780] sd 5:2:3:0: [sdh] Attached SCSI disk

# ceph tell osd.29 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 6.464404,
"bytes_per_sec": 166100668.21318716,
"iops": 39.60148530320815
}
# ceph tell osd.16 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 9.61689458,
"bytes_per_sec": 111651617.26584397,
"iops": 26.619819942914003
}

Thank you


> On 26 Mar 2021, at 16:04, Anthony D'Atri  wrote:
>
> Did you look at syslog, dmesg, or SMART?  Mostly likely the drives are 
> failing.
>
>
>> On Mar 25, 2021, at 9:55 PM, jinguk.k...@ungleich.ch wrote:
>>
>> Hello there,
>>
>> Thank you in advance.
>> My ceph is ceph version 14.2.9
>> I have a repair issue too.
>>
>> ceph health detail
>> HEALTH_WARN Too many repaired reads on 2 OSDs OSD_TOO_MANY_REPAIRS
>> Too many repaired reads on 2 OSDs
>>   osd.29 had 38 reads repaired
>>   osd.16 had 17 reads repaired
>>
>> ~# ceph tell osd.16 bench
>> {
>>   "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "elapsed_sec": 7.148673815996,
>>   "bytes_per_sec": 150201541.10217974,
>>   "iops": 35.81083800844663
>> }
>> ~# ceph tell osd.29 bench
>> {
>>   "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "elapsed_sec": 6.924432750002,
>>   "bytes_per_sec": 155065672.9246161,
>>   "iops": 36.970537406114602
>> }
>>
>> But it looks like those OSDs are OK. How can I clear this warning?
>>
>> Best regards
>> JG
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>> email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Do I need to update ceph.conf and restart each OSD after adding more MONs?

2021-03-29 Thread Josh Baergen
As was mentioned in this thread, all of the mon clients (OSDs included)
learn about other mons through monmaps, which are distributed when mon
membership and election changes. Thus, your OSDs should already know about
the new mons.

mon_host indicates the list of mons that mon clients should try to contact
at boot. Thus, it's important to have it correct in the config, but it doesn't
need to be updated after the process starts.

At least that's how I understand it; the config docs aren't terribly clear
on this behaviour.
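
A couple of ways to sanity-check what a running daemon actually knows (osd.0 is
just an example):

ceph mon dump                                   # the monmap as the cluster currently sees it
ceph daemon osd.0 config show | grep mon_host   # run on osd.0's host, via the admin socket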

Josh


On Sat., Mar. 27, 2021, 2:07 p.m. Tony Liu,  wrote:

> Just realized that all config files (/var/lib/ceph/<fsid>/<daemon>/config)
> on all nodes are already updated properly. It must be handled as part of
> adding
> MONs. But "ceph config show" shows only single host.
>
> mon_host   [v2:
> 10.250.50.80:3300/0,v1:10.250.50.80:6789/0]  file
>
> That means I still need to restart all services to apply the update, right?
> Is this supposed to be part of adding MONs as well, or additional manual
> step?
>
>
> Thanks!
> Tony
> 
> From: Tony Liu 
> Sent: March 27, 2021 12:53 PM
> To: Stefan Kooman; ceph-users@ceph.io
> Subject: [ceph-users] Re: Do I need to update ceph.conf and restart each
> OSD after adding more MONs?
>
> # ceph config set osd.0 mon_host [v2:
> 10.250.50.80:3300/0,v1:10.250.50.80:6789/0,v2:10.250.50.81:3300/0,v1:10.250.50.81:6789/0,v2:10.250.50.82:3300/0,v1:10.250.50.82:6789/0
> ]
> Error EINVAL: mon_host is special and cannot be stored by the mon
>
> It seems that the only option is to update ceph.conf and restart service.
>
>
> Tony
> 
> From: Tony Liu 
> Sent: March 27, 2021 12:20 PM
> To: Stefan Kooman; ceph-users@ceph.io
> Subject: [ceph-users] Re: Do I need to update ceph.conf and restart each
> OSD after adding more MONs?
>
> I expanded MON from 1 to 3 by updating orch service "ceph orch apply".
> "mon_host" in all services (MON, MGR, OSDs) is not updated. It's still
> single
> host from source "file".
> What's the guidance here to update "mon_host" for all services? I am
> talking
> about Ceph services, not client side.
> Should I update ceph.conf for all services and restart all of them?
> Or I can update it on-the-fly by "ceph config set"?
> In the latter case, where the updated configuration is stored? Is it going
> to
> be overridden by ceph.conf when restart service?
>
>
> Thanks!
> Tony
>
> 
> From: Stefan Kooman 
> Sent: March 26, 2021 12:22 PM
> To: Tony Liu; ceph-users@ceph.io
> Subject: Re: [ceph-users] Do I need to update ceph.conf and restart each
> OSD after adding more MONs?
>
> On 3/26/21 6:06 PM, Tony Liu wrote:
> > Hi,
> >
> > Do I need to update ceph.conf and restart each OSD after adding more
> MONs?
>
> This should not be necessary, as the OSDs should learn about these
> changes through monmaps. Updating the ceph.conf after the mons have been
> updated is advised.
>
> > This is with 15.2.8 deployed by cephadm.
> >
> > When adding MON, "mon_host" should be updated accordingly.
> > Given [1], is that update "the monitor cluster’s centralized
> configuration
> > database" or "runtime overrides set by an administrator"?
>
> No need to put that in the centralized config database. I *think* they
> mean ceph.conf file on the clients and hosts. At least, that's what you
> would normally do (if not using DNS).
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: memory consumption by osd

2021-03-29 Thread Josh Baergen
Linux will automatically make use of all available memory for the buffer
cache, freeing buffers when it needs more memory for other things. This is
why MemAvailable is more useful than MemFree; the former indicates how much
memory could be used between Free, buffer cache, and anything else that
could be freed up. If you'd like to learn more about the buffer cache and
Linux's management of it, there are plenty of resources a search away.

My guess is that you're using a Ceph release that has bluefs_buffered_io
set to true by default, which will cause the OSDs to use the buffer cache
for some of their IO. What you're seeing is normal behaviour in this case.
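
If you want to compare the kernel's numbers against what the OSDs are allowed
to use, something like this should do it:

ceph config get osd osd_memory_target
grep -E 'MemFree|MemAvailable' /proc/meminfo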

Josh

On Sat., Mar. 27, 2021, 8:59 p.m. Tony Liu,  wrote:

> I don't see any problems yet. All OSDs are working fine.
> Just that 1.8GB free memory concerns me.
> I know 256GB memory for 10 OSDs (16TB HDD) is a lot, I am planning to
> reduce it or increase osd_memory_target (if that's what you meant) to
> boost performance. But before doing that, I'd like to understand what's
> taking so much buff/cache and if there is any option to control it.
>
>
> Thanks!
> Tony
> 
> From: Anthony D'Atri 
> Sent: March 27, 2021 07:27 PM
> To: ceph-users
> Subject: [ceph-users] Re: memory consumption by osd
>
>
> Depending on your kernel version, MemFree can be misleading.  Attend to
> the value of MemAvailable instead.
>
> Your OSDs all look to be well below the target, I wouldn’t think you have
> any problems.  In fact 256GB for just 10 OSDs is an embarassment of
> riches.  What type of drives are you using, and what’s the cluster used
> for?  If anything I might advise *raising* the target.
>
> You might check tcmalloc usage
>
>
> https://ceph-devel.vger.kernel.narkive.com/tYp0KkIT/ceph-daemon-memory-utilization-heap-release-drops-use-by-50
>
> but I doubt this is an issue for you.
>
> > What's taking that much buffer?
> > # free -h
> >                total        used        free      shared  buff/cache   available
> > Mem:           251Gi        31Gi       1.8Gi       1.6Gi       217Gi       215Gi
> >
> > # cat /proc/meminfo
> > MemTotal:   263454780 kB
> > MemFree: 2212484 kB
> > MemAvailable:   226842848 kB
> > Buffers:219061308 kB
> > Cached:  2066532 kB
> > SwapCached:  928 kB
> > Active: 142272648 kB
> > Inactive:   109641772 kB
> > ..
> >
> >
> > Thanks!
> > Tony
> > 
> > From: Tony Liu 
> > Sent: March 27, 2021 01:25 PM
> > To: ceph-users
> > Subject: [ceph-users] memory consumption by osd
> >
> > Hi,
> >
> > Here is a snippet from top on a node with 10 OSDs.
> > ===
> > MiB Mem : 257280.1 total,   2070.1 free,  31881.7 used, 223328.3
> buff/cache
> > MiB Swap: 128000.0 total, 126754.7 free,   1245.3 used. 221608.0 avail
> Mem
> >
> >PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+
> COMMAND
> >  30492 167   20   0 4483384   2.9g  16696 S   6.0   1.2 707:05.25
> ceph-osd
> >  35396 167   20   0 952   2.8g  16468 S   5.0   1.1 815:58.52
> ceph-osd
> >  33488 167   20   0 4161872   2.8g  16580 S   4.7   1.1 496:07.94
> ceph-osd
> >  36371 167   20   0 4387792   3.0g  16748 S   4.3   1.2 762:37.64
> ceph-osd
> >  39185 167   20   0 5108244   3.1g  16576 S   4.0   1.2 998:06.73
> ceph-osd
> >  38729 167   20   0 4748292   2.8g  16580 S   3.3   1.1 895:03.67
> ceph-osd
> >  34439 167   20   0 4492312   2.8g  16796 S   2.0   1.1 921:55.50
> ceph-osd
> >  31473 167   20   0 4314500   2.9g  16684 S   1.3   1.2 680:48.09
> ceph-osd
> >  32495 167   20   0 4294196   2.8g  16552 S   1.0   1.1 545:14.53
> ceph-osd
> >  37230 167   20   0 4586020   2.7g  16620 S   1.0   1.1 844:12.23
> ceph-osd
> > ===
> > Does it look OK with 2GB free?
> > I can't tell how that 220GB is used for buffer/cache.
> > Is that used by OSDs? Is it controlled by configuration or auto scaling
> based
> > on physical memory? Any clarifications would be helpful.
> >
> >
> > Thanks!
> > Tony
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io