[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Thanks for the suggestions, I will try this.

/Z

On Fri, 7 Oct 2022 at 18:13, Konstantin Shalygin  wrote:

> Zakhar, try looking at the top of the slow ops in the daemon socket for this
> osd; you may find 'snapc' operations, for example. From the rbd head you can
> find the rbd image, and then check how many snapshots are in the chain for
> that image. More than 10 snaps for one image can increase client op latency
> to tens of milliseconds, even for NVMe drives that usually operate at usecs
> or 1-2 ms.
>
>
> k
> Sent from my iPhone
>
> > On 7 Oct 2022, at 14:35, Zakhar Kirpichenko  wrote:
> >
> > The drive doesn't show increased utilization on average, but it does
> > sporadically get more I/O than other drives, usually in short bursts. I
> am
> > now trying to find a way to trace this to a specific PG, pool and object
> > (s) – not sure if that is possible.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] every rgw stuck on "RGWReshardLock::lock found lock"

2022-10-07 Thread Haas, Josh
I've observed this occur on v14.2.22 and v15.2.12. Wasn't able to find anything 
obviously relevant in changelogs, bug tickets, or existing mailing list threads.


In both cases, every RGW in the cluster starts spamming logs with lines that 
look like the following:


2022-09-04 14:20:45.231 7fc7b28c7700  0 INFO: RGWReshardLock::lock found lock 
on $BUCKET:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.582171072.3067 to be held by 
another RGW process; skipping for now
2022-09-04 14:20:45.281 7fc7ca0f6700  0 block_while_resharding ERROR: bucket is 
still resharding, please retry
2022-09-04 14:20:45.283 7fc7ca0f6700  0 NOTICE: resharding operation on bucket 
index detected, blocking


The buckets in question were growing very quickly (hundreds of uploads per 
second, in the ballpark of 10 million objects when the bug hit), so it makes 
sense they got picked up for resharding. What doesn't make sense is every rgw 
stopping all other processing (not responding over HTTP) and logging nothing 
but these locking messages. Something seems to be going pretty badly wrong if 
we're not just backing off the lock and trying again later. Only one rgw should 
be trying to reshard a bucket at a time, right?


The other weird part is that it cycled between a complete outage for ~7.5 
minutes and responding to a low volume of requests for a couple of minutes. 
Here you can see the outage in terms of HTTP status codes logged by 
our frontends for the second occurrence (aggregated across all rgws in the 
cluster):

https://jhaas.us-east-1.linodeobjects.com/public/rgw-lock/http-codes.jpg


We can see the same trend (though basically reversed) if I graph the frequency 
of all log lines containing "starting new request" which should happen any time 
the rgw begins servicing a new request:

https://jhaas.us-east-1.linodeobjects.com/public/rgw-lock/starting-new-request.jpg


I don't have an explanation for that; 7.5 minutes is ~450 seconds which doesn't 
sound like a default timeout or something to me. After all that was over, the 
buckets appear to have resharded successfully, and my current assumption is the 
issue resolved by itself once the resharding operation completed.
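
For anyone else who runs into this, these are the knobs I'd poke at to inspect
or abort a reshard by hand; treat it as a rough sketch (the bucket name is a
placeholder) rather than a verified recovery procedure:

# buckets currently queued for / undergoing dynamic resharding
radosgw-admin reshard list

# per-bucket reshard state
radosgw-admin reshard status --bucket $BUCKET

# cancel a pending/stuck reshard for a single bucket
radosgw-admin reshard cancel --bucket $BUCKET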


I'll be trying to reliably reproduce this or observe it more closely in the 
wild, hopefully on v17, but was hoping someone might have some insight in the 
meantime.


Thanks,

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Josh Baergen
As of Nautilus+, when you set pg_num, it actually internally sets
pg(p)_num_target, and then slowly increases (or decreases, if you're
merging) pg_num and then pgp_num until it reaches the target. The
amount of backfill scheduled into the system is controlled by
target_max_misplaced_ratio.
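
For reference, a rough way to watch and tune this from the CLI (a sketch only;
the exact fields shown vary a bit by release, and 0.07 is just an example
value):

# the pool line shows the current pg/pgp counts and, on recent releases, the targets
ceph osd pool ls detail | grep wizard_data

# the mgr option that limits how much backfill gets scheduled at once
ceph config get mgr target_max_misplaced_ratio
ceph config set mgr target_max_misplaced_ratio 0.07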

Josh

On Fri, Oct 7, 2022 at 3:50 AM Nicola Mori  wrote:
>
> The situation got solved by itself, since probably there was no error. I
> manually increased the number of PGs and PGPs to 128 some days ago, and
> the PGP count was being updated step by step. Actually after a bump from
> 5% to 7% in the count of misplaced object I noticed that the number of
> PGPs was updated to 126, and after a last bump it is now at 128 with a
> ~4% of misplaced objects currently decreasing.
> Sorry for the noise,
>
> Nicola
>
> On 07/10/22 09:15, Nicola Mori wrote:
> > Dear Ceph users,
> >
> > my cluster is stuck since several days with some PG backfilling. The
> > number of misplaced objects slowly decreases down to 5%, and at that
> > point jumps up again to about 7%, and so on. I found several possible
> > reasons for this behavior. One is related to the balancer, which anyway
> > I think is not operating:
> >
> > # ceph balancer status
> > {
> >  "active": false,
> >  "last_optimize_duration": "0:00:00.000938",
> >  "last_optimize_started": "Thu Oct  6 16:19:59 2022",
> >  "mode": "upmap",
> >  "optimize_result": "Too many objects (0.071539 > 0.05) are
> > misplaced; try again later",
> >  "plans": []
> > }
> >
> > (the last optimize result is from yesterday when I disabled it, and
> > since then the backfill loop has happened several times).
> > Another possible reason seems to be an imbalance of PG and PGP numbers.
> > Effectively I found such an imbalance on one of my pools:
> >
> > # ceph osd pool get wizard_data pg_num
> > pg_num: 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > but I cannot fix it:
> > # ceph osd pool set wizard_data pgp_num 128
> > set pool 3 pgp_num to 128
> > # ceph osd pool get wizard_data pgp_num
> > pgp_num: 123
> >
> > The autoscaler is off for that pool:
> >
> > POOL         SIZE   TARGET SIZE  RATE            RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> > wizard_data  8951G               1.333730697632  152.8T        0.0763                                 1.0   128                 off        False
> >
> > so I don't understand why the PGP number is stuck at 123.
> > Thanks in advance for any help,
> >
> > Nicola
>
> --
> Nicola Mori, Ph.D.
> INFN sezione di Firenze
> Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
> +390554572660
> m...@fi.infn.it
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Konstantin Shalygin
Zakhar, try looking at the top of the slow ops in the daemon socket for this 
osd; you may find 'snapc' operations, for example. From the rbd head you can 
find the rbd image, and then check how many snapshots are in the chain for that 
image. More than 10 snaps for one image can increase client op latency to tens 
of milliseconds, even for NVMe drives that usually operate at usecs or 1-2 ms.
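
A sketch of what that looks like in practice (osd id, pool and image names are
placeholders):

# current and recent slow ops via the OSD's admin socket
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_slow_ops

# once the rbd header object is mapped back to an image,
# count the snapshots in its chain
rbd snap ls $POOL/$IMAGE | wc -l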


k
Sent from my iPhone

> On 7 Oct 2022, at 14:35, Zakhar Kirpichenko  wrote:
> 
> The drive doesn't show increased utilization on average, but it does
> sporadically get more I/O than other drives, usually in short bursts. I am
> now trying to find a way to trace this to a specific PG, pool and object
> (s) – not sure if that is possible.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inherited CEPH nightmare

2022-10-07 Thread Josef Johansson
Hi,

You also want to check disk_io_weighted via some kind of metrics system.
That will detect which SSDs are hogging the system, if there are any
specific ones. Also check their error levels and endurance.
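
If there is no metrics stack in place yet, a quick ad-hoc check per host could
look like this (device name is an example):

# per-device utilisation, await and queue depth, refreshed every 5 seconds
iostat -x 5

# SMART error log plus wear/endurance indicators for a suspect SSD
smartctl -a /dev/sda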

On Fri, 7 Oct 2022 at 17:05, Stefan Kooman  wrote:

> On 10/7/22 16:56, Tino Todino wrote:
> > Hi folks,
> >
> > The company I recently joined has a Proxmox cluster of 4 hosts with a
> CEPH implementation that was set-up using the Proxmox GUI.  It is running
> terribly, and as a CEPH newbie I'm trying to figure out if the
> configuration is at fault.  I'd really appreciate some help and guidance on
> this please.
>
> Can you send output of these commands:
> ceph -s
> ceph osd df
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inherited CEPH nightmare

2022-10-07 Thread Robert Sander

Hi Tino,

Am 07.10.22 um 16:56 schrieb Tino Todino:

I know some of these are consumer class, but I'm working on replacing these.


This would be your biggest issue. SSD performance can vary drastically.
Ceph needs "multi-use" enterprise SSDs, not read-optimized consumer ones.


All 4 hosts are set as Monitors


Remove one of the MONs. There has to be an odd number of MONs in the 
cluster.



I also think the DB/WAL should be on dedicated disks or partitions, but have no 
idea what procedure to follow to do this.


You only use dedicated devices for DB and WAL if these are faster than 
the data devices. You don't have that. Keep DB and WAL on the data 
devices; it makes operations easier.
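
If you want to double-check how the OSDs were deployed, something along these
lines shows whether a separate DB/WAL device is in use (exact metadata field
names vary between releases, so treat the grep pattern as a starting point):

# device layout as reported by the OSD itself
ceph osd metadata 0 | grep -Ei 'bluefs|devices'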


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inherited CEPH nightmare

2022-10-07 Thread Stefan Kooman

On 10/7/22 16:56, Tino Todino wrote:

Hi folks,

The company I recently joined has a Proxmox cluster of 4 hosts with a CEPH 
implementation that was set-up using the Proxmox GUI.  It is running terribly, 
and as a CEPH newbie I'm trying to figure out if the configuration is at fault. 
 I'd really appreciate some help and guidance on this please.


Can you send output of these commands:
ceph -s
ceph osd df

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Inherited CEPH nightmare

2022-10-07 Thread Tino Todino
Hi folks,

The company I recently joined has a Proxmox cluster of 4 hosts with a CEPH 
implementation that was set-up using the Proxmox GUI.  It is running terribly, 
and as a CEPH newbie I'm trying to figure out if the configuration is at fault. 
 I'd really appreciate some help and guidance on this please.

The symptoms:


  *   Really slow read/write performance
  *   Really Really slow rebalancing/backfill
  *   High Apply/Commit latency on a couple of the SSDs when under load
  *   Knock on performance hit on key VM's (particularly AD/DNS services) that 
affect user experience

The setup is as follows:

4 x hosts, 3 hosts are Dell R820s which have 4-socket Xeons with 96 cores and 
1.5 TB RAM.  The other (host 4) has a Ryzen 7 5800 processor with 64 GB RAM.  
All servers are running on a simple 10 GbE network with dedicated NICs on a 
separate subnet.

The SSDs in use are a combination of new Seagate IronWolf 125 1TB SSDs and 
older Crucial MX500 1TB, and WDC Blue 1TB drives.  I know some of these are 
consumer class, but I'm working on replacing these.

I believe the OSDs were added to ProxMox's CEPH implementation with the default 
settings, i.e. DB and WAL on the same OSD.  All 4 hosts are set as Monitors, and 
the 3 beefy ones set as Managers and metadata servers.

Ceph version is 16.2.7

Here is the config:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.8.4/24
fsid = 4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
mon_allow_pool_delete = true
mon_host = 192.168.8.4 192.168.8.6 192.168.8.5 192.168.8.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_memory_target = 2147483648
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.8.4/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cl1-h1-lv]
host = cl1-h1-lv
mds_standby_for_name = pve

[mds.cl1-h2-lv]
host = cl1-h2-lv
mds_standby_for_name = pve

[mds.cl1-h3-lv]
host = cl1-h3-lv
mds_standby_for_name = pve

[mon.cl1-h1-lv]
public_addr = 192.168.8.3

[mon.cl1-h2-lv]
public_addr = 192.168.8.4

[mon.cl1-h3-lv]
public_addr = 192.168.8.5

[mon.cl1-h4-lv]
public_addr = 192.168.8.6



And the Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cl1-h2-lv {
id -3  # do not change unnecessarily
id -4 class ssd   # do not change unnecessarily
# weight 2.729
alg straw2
hash 0  # rjenkins1
item osd.0 weight 0.910
item osd.5 weight 0.910
item osd.10 weight 0.910
}
host cl1-h3-lv {
id -5  # do not change unnecessarily
id -6 class ssd   # do not change unnecessarily
# weight 2.729
alg straw2
hash 0  # rjenkins1
item osd.1 weight 0.910
item osd.6 weight 0.910
item osd.11 weight 0.910
}
host cl1-h4-lv {
id -7  # do not change unnecessarily
id -8 class ssd   # do not change unnecessarily
# weight 1.819
alg straw2
hash 0  # rjenkins1
item osd.7 weight 0.910
item osd.2 weight 0.910
}
host cl1-h1-lv {
id -9  # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
# weight 3.639
alg straw2
hash 0  # rjenkins1
item osd.4 weight 0.910
item osd.8 weight 0.910
item osd.9 weight 0.910
item osd.12 weight 0.910
}
root default {
id -1  # do not change unnecessarily
id -2 class ssd   # do not change unnecessarily
# weight 10.916
alg straw2
hash 0  # rjenkins1
item cl1-h2-lv weight 2.729
item cl1-h3-lv weight 

[ceph-users] Slow monitor responses for rbd ls etc.

2022-10-07 Thread Sven Barczyk
Hello,

 

we are encountering a strange behavior on our Ceph. (All Ubuntu 20 / All
mons Quincy 17.2.4 / Oldest OSD Quincy 17.2.0 )
Administrative commands like rbd ls or rbd create are so slow that libvirtd
runs into timeouts, and creating new VMs on our CloudStack (which involves
creating new volumes on our pool) takes up to 10 minutes.

Already running VMs are not affected and show no slow responses on their
filesystems.
The slowness really only appears when services need to interact with rbd commands.


Has anyone encountered behavior like this?


Regards
Sven

 

--

BRINGE Informationstechnik GmbH

Zur Seeplatte 12

D-76228 Karlsruhe

Germany

 

Fon: +49 721 94246-0

Fax: +49 721 94246-66

Web:   http://www.bringe.de/

 

Geschäftsführer: Dipl.-Ing. (FH) Martin Bringe

Ust.Id: DE812936645, HRB 108943 Mannheim

 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iscsi deprecation

2022-10-07 Thread Maged Mokhtar



You can try PetaSAN

www.petasan.org

We are an open-source solution on top of Ceph. We provide scalable 
active/active iSCSI which supports VMware VAAI and Microsoft Clustered 
Shared Volumes for Hyper-V clustering.


Cheers /maged

On 30/09/2022 19:36, Filipe Mendes wrote:

Hello!


I'm considering switching my current storage solution to CEPH. Today we use
iscsi as a communication protocol and we use several different hypervisors:
VMware, hyper-v, xcp-ng, etc.


I was reading that the current version of Ceph has discontinued iSCSI
support in favor of RBD or NVMe-oF. I imagine there are thousands of
projects in production using different hypervisors connecting to Ceph via
iSCSI, so I was surprised that I did not find much discussion of the topic in
forums or mailing lists, since so many projects depend on both Ceph and iSCSI,
RBD only communicates well with Proxmox or OpenStack, and NVMe-oF
is not yet fully supported by Ceph or by many other popular hypervisors.


So is the trend that other hypervisors will start to support RBD over time,
or that they will start to support NVMe-oF at the same time that Ceph
implements it stably?


Am I missing or maybe mixing something?

Filipe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Eugen Block

Hi,

I’d look for deep-scrubs on that OSD; those are logged, and maybe those  
timestamps match your observations.
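
A minimal way to check that, assuming osd.12 is the affected one (a sketch;
the column layout of ceph pg ls differs slightly between releases):

# deep-scrub start/finish lines in the OSD log
grep -i 'deep-scrub' /var/log/ceph/ceph-osd.12.log

# PGs hosted on that OSD, including their last (deep-)scrub stamps
ceph pg ls-by-osd 12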


Quoting Zakhar Kirpichenko :


Thanks for this!

The drive doesn't show increased utilization on average, but it does
sporadically get more I/O than other drives, usually in short bursts. I am
now trying to find a way to trace this to a specific PG, pool and object
(s) – not sure if that is possible.

/Z

On Fri, 7 Oct 2022, 12:17 Dan van der Ster,  wrote:


Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have
HDDs performing very slowly even though all smart tests come back
clean. Besides ceph osd perf showing a high latency, you could see
high ioutil% with iostat.

We normally replace those HDDs -- usually by draining and zeroing
them, then putting them back in prod (e.g. in a different cluster or
some other service). I don't have statistics on how often those sick
drives come back to full performance or not -- that could indicate it
was a poor physical connection, vibrations, ... , for example. But I
do recall some drives came back repeatedly as "sick" but not dead w/
clean SMART tests.

If you have time you can dig deeper with increased bluestore debug
levels. In our environment this happens often enough that we simply
drain, replace, move on.

Cheers, dan




On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko 
wrote:
>
> Unfortunately, that isn't the case: the drive is perfectly healthy and,
> according to all measurements I did on the host itself, it isn't any
> different from any other drive on that host size-, health- or
> performance-wise.
>
> The only difference I noticed is that this drive sporadically does more
I/O
> than other drives for a split second, probably due to specific PGs placed
> on its OSD, but the average I/O pattern is very similar to other drives
and
> OSDs, so it's somewhat unclear why the specific OSD is consistently
showing
> much higher latency. It would be good to figure out what exactly is
causing
> these I/O spikes, but I'm not yet sure how to do that.
>
> /Z
>
> On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:
>
> > Hi,
> >
> > When you see one of 100 drives perf is unusually different, this may
mean
> > 'this drive is not like the others' and should be replaced
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko 
wrote:
> > >
> > > Anyone, please?
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Szabo, Istvan (Agoda)
Finally, how is your PG distribution? How many PGs per disk?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Frank Schilder 
Sent: Friday, October 7, 2022 6:50 PM
To: Igor Fedotov ; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus



Hi all,

trying to respond to 4 past emails :)

We started using manual conversion and, if the conversion fails, it fails in 
the last step. So far, we have had a failure on 1 out of 8 OSDs. The OSD can be 
repaired by running a compaction plus another repair, which will complete the 
last step. Looks like we are just on the edge and can get away with 
double-compaction.

For the interested future reader, we have subdivided 400G high-performance SSDs 
into 4x100G OSDs for our FS meta data pool. The increased concurrency improves 
performance a lot. But yes, we are on the edge. OMAP+META is almost 50%.

In our case, just merging 2x100 into 1x200 will probably not improve things as 
we will end up with an even more insane number of objects per PG than what we 
have already today. I will plan for having more OSDs for the meta-data pool 
available and also plan for having the infamous 60G temp space available with a 
bit more margin than what we have now.

Thanks to everyone who helped!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 13:21:29
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Frank,

there are no tools to defragment an OSD at the moment. The only way to 
defragment an OSD is to redeploy it...


Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:
> Hi Igor,
>
> sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
> de-fragment the OSD. It doesn't look like the fsck command does that. Is 
> there any such tool?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 07 October 2022 01:53:20
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus
>
> Hi Igor,
>
> I added a sample of OSDs on identical disks. The usage is quite well 
> balanced, so the numbers I included are representative. I don't believe that 
> we had one such extreme outlier. Maybe it ran full during conversion. Most of 
> the data is OMAP after all.
>
> I can't dump the free-dumps into paste bin, they are too large. Not sure if 
> you can access ceph-post-files. I will send you a tgz in a separate e-mail 
> directly to you.
>
>> And once again - do other non-starting OSDs show the same ENOSPC error?
>> Evidently I'm unable to make any generalization about the root cause
>> due to lack of the info...
> As I said before, I need more time to check this and give you the answer you 
> actually want. The stupid answer is they don't, because the other 3 are taken 
> down the moment 16 crashes and don't reach the same point. I need to take 
> them out of the grouped management and start them by hand, which I can do 
> tomorrow. I'm too tired now to play on our production system.
>
> The free-dumps are on their separate way. I included one for OSD 17 as well 
> (on the same disk).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Igor Fedotov 
> Sent: 07 October 2022 01:19:44
> To: Frank Schilder; ceph-users@ceph.io
> Cc: Stefan Kooman
> Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus
>
> The log I inspected was for osd.16  so please share that OSD
> utilization... And honestly I trust allocator's stats more so it's
> rather CLI stats are incorrect if any. Anyway free dump should provide
> additional proofs..
>
> And once again - do other non-starting OSDs show the same ENOSPC error?
> Evidently I'm unable to make any generalization about the root cause
> due to lack of the info...
>
>
> W.r.t. fsck - you can try to run it - since fsck opens the DB in read-only
> mode there is some chance it will work.
>
>
> Thanks,
>
> Igor
>
>
> On 10/7/2022 1:59 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> I suspect there is something wrong with the data reported. These OSDs are 
>> only 50-60% used. For example:
>>
>> ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
>> 29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
>>

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Hi Frank,

there are no tools to defragment an OSD at the moment. The only way to 
defragment an OSD is to redeploy it...
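
Fragmentation can at least be measured, though; a rough sketch for osd.16
(the admin-socket variant may not exist on every release):

# offline, OSD stopped: score between 0 (no fragmentation) and 1 (heavily fragmented)
ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-16

# offline: full dump of the free extents
ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-16

# online equivalent through the admin socket
ceph daemon osd.16 bluestore allocator score block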



Thanks,

Igor


On 10/7/2022 3:04 AM, Frank Schilder wrote:

Hi Igor,

sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to 
de-fragment the OSD. It doesn't look like the fsck command does that. Is there 
any such tool?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 07 October 2022 01:53:20
To: Igor Fedotov; ceph-users@ceph.io
Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.


And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16, so please share that OSD's
utilization... And honestly I trust the allocator's stats more, so it's
more likely the CLI stats that are incorrect, if any. Anyway, the free dump
should provide additional proof..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t. fsck - you can try to run it - since fsck opens the DB in read-only
mode there is some chance it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. It's for simplifying disk management, and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates lack of free space for
additional bluefs space allocations which prevents osd from startup.

   From the following log line one can see that bluefs needs ~1M more
space while the total available one is approx 622M. the problem is that
bluefs needs continuous(!) 64K chunks though. Which apparently aren't
available due to high disk fragmentation.

   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) allocate_bluefs_freespace failed to
allocate on 0x11 min_size 0x11 > allocated total 0x3

[ceph-users] Re: Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Nicola Mori
The situation resolved itself, since there was probably no error. I 
manually increased the number of PGs and PGPs to 128 some days ago, and 
the PGP count was being updated step by step. After a bump from 
5% to 7% in the count of misplaced objects I noticed that the number of 
PGPs had been updated to 126, and after a final bump it is now at 128 with 
~4% of misplaced objects, currently decreasing.

Sorry for the noise,

Nicola

On 07/10/22 09:15, Nicola Mori wrote:

Dear Ceph users,

my cluster is stuck since several days with some PG backfilling. The 
number of misplaced objects slowly decreases down to 5%, and at that 
point jumps up again to about 7%, and so on. I found several possible 
reasons for this behavior. One is related to the balancer, which anyway 
I think is not operating:


# ceph balancer status
{
     "active": false,
     "last_optimize_duration": "0:00:00.000938",
     "last_optimize_started": "Thu Oct  6 16:19:59 2022",
     "mode": "upmap",
     "optimize_result": "Too many objects (0.071539 > 0.05) are 
misplaced; try again later",

     "plans": []
}

(the last optimize result is from yesterday when I disabled it, and 
since then the backfill loop has happened several times).
Another possible reason seems to be an imbalance of PG and PGP numbers. 
Effectively I found such an imbalance on one of my pools:


# ceph osd pool get wizard_data pg_num
pg_num: 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123

but I cannot fix it:
# ceph osd pool set wizard_data pgp_num 128
set pool 3 pgp_num to 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123

The autoscaler is off for that pool:

POOL         SIZE   TARGET SIZE  RATE            RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
wizard_data  8951G               1.333730697632  152.8T        0.0763                                 1.0   128                 off        False


so I don't understand why the PGP number is stuck at 123.
Thanks in advance for any help,

Nicola


--
Nicola Mori, Ph.D.
INFN sezione di Firenze
Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
+390554572660
m...@fi.infn.it
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Just FYI:

the standalone ceph-bluestore-tool's quick-fix behaves pretty similarly to the 
action performed on start-up with bluestore_fsck_quick_fix_on_mount = true
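
So the on-mount variant is just a config switch, roughly (a sketch; the option
can equally be set in ceph.conf, and it should be unset again once the
conversions are done):

# perform the omap conversion automatically on the next OSD start
ceph config set osd bluestore_fsck_quick_fix_on_mount true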




On 10/7/2022 10:18 AM, Frank Schilder wrote:

Hi Stefan,

super thanks!

I found a quick-fix command in the help output:

# ceph-bluestore-tool -h
[...]
Positional options:
   --command arg  fsck, repair, quick-fix, bluefs-export,
  bluefs-bdev-sizes, bluefs-bdev-expand,
  bluefs-bdev-new-db, bluefs-bdev-new-wal,
  bluefs-bdev-migrate, show-label, set-label-key,
  rm-label-key, prime-osd-dir, bluefs-log-dump,
  free-dump, free-score, bluefs-stats

but it's not documented in https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/. I 
guess I will stick with the tested command "repair". Nothing I found mentions 
what exactly is executed on start-up with bluestore_fsck_quick_fix_on_mount = true.

Thanks for your quick answer!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 07 October 2022 09:07:37
To: Frank Schilder; Igor Fedotov; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/7/22 09:03, Frank Schilder wrote:

Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with recovery and I 
would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's 
emails I could find the command for manual compaction:

ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact

Unfortunately, I can't find the command for performing the omap conversion. It 
is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
 even though it does mention the option to skip conversion in step 5. How to 
continue with an off-line conversion is not mentioned. I know it has been 
posted before, but I seem unable to find it on this list. If someone could send 
me the command, I would be most grateful.

for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair --path
   /var/lib/ceph/osd/$osd;done

That's what I use.

Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
For format updates one can use the quick-fix command instead of repair; it 
might work a bit faster.
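
i.e. a variant of Stefan's loop, untested here, with the OSDs stopped:

for osd in `ls /var/lib/ceph/osd/`; do
    ceph-bluestore-tool quick-fix --path /var/lib/ceph/osd/$osd
done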


On 10/7/2022 10:07 AM, Stefan Kooman wrote:

On 10/7/22 09:03, Frank Schilder wrote:

Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with 
recovery and I would like to switch to off-line conversion of the SSD 
OSDs. In one of Stefan's emails I could find the command for manual compaction:


ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" 
compact


Unfortunately, I can't find the command for performing the omap 
conversion. It is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus 
even though it does mention the option to skip conversion in step 5. 
How to continue with an off-line conversion is not mentioned. I know 
it has been posted before, but I seem unable to find it on this list. 
If someone could send me the command, I would be most grateful.


for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair 
--path  /var/lib/ceph/osd/$osd;done


That's what I use.

Gr. Stefan


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov

Hi Frank,

one more thing I realized during the night :)

When performing the conversion, the DB gets a significant amount of new data 
(approx. on par with the original OMAP volume) without the old data being 
immediately removed. Hence one should expect the DB size to grow dramatically 
at this point, which should go away after compaction (either an enforced or a 
regular background one).


But the point is that during that peak usage one might (temporarily) run 
out of free space, and I believe that's the root cause of your 
outage. So please be careful when doing further conversions; I think 
your OSDs are exposed to this issue due to the limited space available ...
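
A rough way to gauge the headroom before converting an OSD (offline, with the
OSD stopped; osd.16 is just an example):

# how much space bluefs / the DB currently occupy
ceph-bluestore-tool bluefs-stats --path /var/lib/ceph/osd/ceph-16

# how fragmented the remaining free space is
ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-16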



Thanks,

Igor

On 10/7/2022 2:53 AM, Frank Schilder wrote:

Hi Igor,

I added a sample of OSDs on identical disks. The usage is quite well balanced, 
so the numbers I included are representative. I don't believe that we had one 
such extreme outlier. Maybe it ran full during conversion. Most of the data is 
OMAP after all.

I can't dump the free-dumps into paste bin, they are too large. Not sure if you 
can access ceph-post-files. I will send you a tgz in a separate e-mail directly 
to you.


And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...

As I said before, I need more time to check this and give you the answer you 
actually want. The stupid answer is they don't, because the other 3 are taken 
down the moment 16 crashes and don't reach the same point. I need to take them 
out of the grouped management and start them by hand, which I can do tomorrow. 
I'm too tired now to play on our production system.

The free-dumps are on their separate way. I included one for OSD 17 as well (on 
the same disk).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 01:19:44
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

The log I inspected was for osd.16, so please share that OSD's
utilization... And honestly I trust the allocator's stats more, so it's
more likely the CLI stats that are incorrect, if any. Anyway, the free dump
should provide additional proof..

And once again - do other non-starting OSDs show the same ENOSPC error?
Evidently I'm unable to make any generalization about the root cause due
to lack of the info...


W.r.t. fsck - you can try to run it - since fsck opens the DB in read-only
mode there is some chance it will work.


Thanks,

Igor


On 10/7/2022 1:59 AM, Frank Schilder wrote:

Hi Igor,

I suspect there is something wrong with the data reported. These OSDs are only 
50-60% used. For example:

ID   CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP    META    AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
29   ssd    0.09099  1.0       93 GiB  49 GiB   17 GiB  16 GiB  15 GiB  44 GiB  52.42  1.91  104  up      osd.29
44   ssd    0.09099  1.0       93 GiB  50 GiB   23 GiB  10 GiB  16 GiB  43 GiB  53.88  1.96  121  up      osd.44
58   ssd    0.09099  1.0       93 GiB  49 GiB   16 GiB  15 GiB  18 GiB  44 GiB  52.81  1.92  123  up      osd.58
984  ssd    0.09099  1.0       93 GiB  57 GiB   26 GiB  13 GiB  17 GiB  37 GiB  60.81  2.21  133  up      osd.984

Yes, these drives are small, but it should be possible to find 1M more. It 
sounds like some stats data/counters are incorrect/corrupted. Is it possible to 
run an fsck on a bluestore device to have it checked for that? Any idea how an 
incorrect utilisation might come about?

I will look into starting these OSDs individually. This will be a bit of work 
as our deployment method is to start/stop all OSDs sharing the same disk 
simultaneously (OSDs are grouped by disk). If one fails all others also go 
down. It's for simplifying disk management, and this debugging is a new use case 
we never needed before.

Thanks for your help at this late hour!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Igor Fedotov 
Sent: 07 October 2022 00:37:34
To: Frank Schilder; ceph-users@ceph.io
Cc: Stefan Kooman
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

Hi Frank,

the abort message "bluefs enospc" indicates a lack of free space for
additional bluefs allocations, which prevents the osd from starting up.

From the following log line one can see that bluefs needs ~1M more
space while the total available space is approx 622M. The problem is that
bluefs needs contiguous(!) 64K chunks, though, which apparently aren't
available due to high disk fragmentation.

   -4> 2022-10-06T23:22:49.267+0200 7f669d129700 -1
bluestore(/var/lib/ceph/osd/ceph-16) 

[ceph-users] Re: octopus 15.2.17 RGW daemons begin to crash regularly

2022-10-07 Thread Boris Behrens
Hi Casey,
thanks a lot. I added the full stack trace from our ceph-client log.

Cheers
 Boris

On Thu, 6 Oct 2022 at 19:21, Casey Bodley  wrote:

> hey Boris,
>
> that looks a lot like https://tracker.ceph.com/issues/40018 where an
> exception was thrown when trying to read a socket's remote_endpoint().
> i didn't think that local_endpoint() could fail the same way, but i've
> opened https://tracker.ceph.com/issues/57784 to track this and the fix
> should look the same
>
> On Thu, Oct 6, 2022 at 12:12 PM Boris Behrens  wrote:
> >
> > Any ideas on this?
> >
> > On Sun, 2 Oct 2022 at 00:44, Boris Behrens  wrote:
> >
> > > Hi,
> > > we are experiencing rgw daemon crashes and I don't understand why.
> > > Maybe someone here can point me to where I can dig further.
> > >
> > > {
> > > "backtrace": [
> > > "(()+0x43090) [0x7f143ca06090]",
> > > "(gsignal()+0xcb) [0x7f143ca0600b]",
> > > "(abort()+0x12b) [0x7f143c9e5859]",
> > > "(()+0x9e911) [0x7f1433441911]",
> > > "(()+0xaa38c) [0x7f143344d38c]",
> > > "(()+0xaa3f7) [0x7f143344d3f7]",
> > > "(()+0xaa6a9) [0x7f143344d6a9]",
> > > "(boost::asio::detail::do_throw_error(boost::system::error_code
> > > const&, char const*)+0x96) [0x7f143ce73c76]",
> > > "(boost::asio::basic_socket > > boost::asio::io_context::executor_type>::local_endpoint() const+0x134)
> > > [0x7f143cf3d914]",
> > > "(()+0x36e355) [0x7f143cf23355]",
> > > "(()+0x36fa59) [0x7f143cf24a59]",
> > > "(()+0x36fbbc) [0x7f143cf24bbc]",
> > > "(make_fcontext()+0x2f) [0x7f143d69958f]"
> > > ],
> > > "ceph_version": "15.2.17",
> > > "crash_id":
> > > "2022-10-01T09:55:55.134763Z_dfb496e9-a789-4471-a087-2a6405aa07df",
> > > "entity_name": "",
> > > "os_id": "ubuntu",
> > > "os_name": "Ubuntu",
> > > "os_version": "20.04.4 LTS (Focal Fossa)",
> > > "os_version_id": "20.04",
> > > "process_name": "radosgw",
> > > "stack_sig":
> > > "29b20e8702f17ff69135a92fc83b17dbee9b12ba5756ad5992c808c783c134ca",
> > > "timestamp": "2022-10-01T09:55:55.134763Z",
> > > "utsname_hostname": "",
> > > "utsname_machine": "x86_64",
> > > "utsname_release": "5.4.0-100-generic",
> > > "utsname_sysname": "Linux",
> > > "utsname_version": "#113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022"
> > >
> > > --
> > > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend
> im
> > > groüen Saal.
> > >
> >
> >
> > --
> > The "UTF-8 problems" self-help group will meet in the large hall this
> > time, as an exception.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
The "UTF-8 problems" self-help group will meet in the large hall this time,
as an exception.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-07 Thread Boris Behrens
Hi,
I just wanted to reshard a bucket but mistyped the number of shards. As a
reflex I hit ctrl-c and waited. It looked like the resharding did not
finish, so I canceled it, and now the bucket is in the state below.
How can I fix it? It does not show up in the stale-instances list. It's also
a multisite environment (we only sync metadata).

$ radosgw-admin reshard status --bucket bucket
[
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
}
]

$ radosgw-admin bucket stats --bucket bucket
{
"bucket": "bucket",
*"num_shards": 0,*
...
*"id": "ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2296333939.14",*
"marker": "ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2296333939.14",
...
}

$ radosgw-admin metadata get
bucket.instance:bucket:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2368407345.1
{
"key":
"bucket.instance:bucket:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2368407345.1",
"ver": {
"tag": "QndcbsKPFDjs6rYKKDHde9bM",
"ver": 2
},
"mtime": "2022-10-07T07:16:49.231685Z",
"data": {
"bucket_info": {
"bucket": {
"name": "bucket",
"marker":
"ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2296333939.14",
*"bucket_id":
"ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2368407345.1",*
...
},
...
*"num_shards": 211,*
...
},
}


Cheers
 Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Dan van der Ster
Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have
HDDs performing very slowly even though all smart tests come back
clean. Besides ceph osd perf showing a high latency, you could see
high ioutil% with iostat.

We normally replace those HDDs -- usually by draining and zeroing
them, then putting them back in prod (e.g. in a different cluster or
some other service). I don't have statistics on how often those sick
drives come back to full performance or not -- that could indicate it
was a poor physical connection, vibrations, ... , for example. But I
do recall some drives came back repeatedly as "sick" but not dead w/
clean SMART tests.

If you have time you can dig deeper with increased bluestore debug
levels. In our environment this happens often enough that we simply
drain, replace, move on.
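
For completeness, the drain-and-zero workflow is roughly the following (osd id
and device are placeholders; double-check before purging anything):

# stop scheduling new data to the OSD and let it drain
ceph osd out 12
# ... wait until 'ceph -s' shows backfill/recovery finished ...

# remove it from the cluster, then wipe the device for reuse elsewhere
ceph osd purge 12 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX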

Cheers, dan




On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko  wrote:
>
> Unfortunately, that isn't the case: the drive is perfectly healthy and,
> according to all measurements I did on the host itself, it isn't any
> different from any other drive on that host size-, health- or
> performance-wise.
>
> The only difference I noticed is that this drive sporadically does more I/O
> than other drives for a split second, probably due to specific PGs placed
> on its OSD, but the average I/O pattern is very similar to other drives and
> OSDs, so it's somewhat unclear why the specific OSD is consistently showing
> much higher latency. It would be good to figure out what exactly is causing
> these I/O spikes, but I'm not yet sure how to do that.
>
> /Z
>
> On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:
>
> > Hi,
> >
> > When you see one of 100 drives perf is unusually different, this may mean
> > 'this drive is not like the others' and should be replaced
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> > >
> > > Anyone, please?
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Dan van der Ster
Hi Jan,

It looks like you got into this situation by not setting
require-osd-release to pacific while you were running 16.2.7.
The code has that expectation, and unluckily for you if you had
upgraded to 16.2.8 you would have had a HEALTH_WARN that pointed out
the mismatch between require_osd_release and the running version:
https://tracker.ceph.com/issues/53551
https://github.com/ceph/ceph/pull/44259
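
For anyone following along, the check and the fix look roughly like this (only
raise it once every OSD actually runs at least that release):

# what the cluster currently requires
ceph osd dump | grep require_osd_release

# raise it to match the running OSDs, e.g. during/after a pacific upgrade
ceph osd require-osd-release pacific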

Cheers, Dan

On Fri, Oct 7, 2022 at 10:05 AM Jan Marek  wrote:
>
> Hello,
>
> I've now cluster healthy.
>
> I've studied OSDMonitor.cc file and I've found, that there is
> some problematic logic.
>
> Assumptions:
>
> 1) require_osd_release can be only raise.
>
> 2) ceph-mon in version 17.2.3 can set require_osd_release to
> minimal value 'octopus'.
>
> I have two variants:
>
> 1) If I can set require_osd_release to octopus, I have to have
> set require_osd_release actually to 'nautilus' (I will raise
> require_osd_release from nautilus to octopus). Then I have to
> have on line 11618 in OSDMonitor.cc this line:
>
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
>
> 2) If I would have to preserve on line 11618 in file
> OSDMonitor.cc line:
>
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
>
> it is nonsense to can set require_osd_release parameter to
> 'octopus', because this line ensures, that I alredy set
> require_osd_release parameter to octopus.
>
> I suggest to use variant 1) and I've sendig attached patch.
>
> There is another question, if MON daemon have to check
> require_osd_release, when it is joining to the cluster, when it
> cannot raise it's value.
>
> It is potentially dangerous situation, see my old e-mail below...
>
> Sincerely
> Jan Marek
>
> On Mon, Oct 03, 2022 at 11:26:51 CEST, Jan Marek wrote:
> > Hello,
> >
> > I've problem with our ceph cluster - I've stucked in upgrade
> > process between versions 16.2.7 and 17.2.3.
> >
> > My problem is, that I have upgraded MON, MGR, MDS processes, and
> > when I started upgrade OSDs, ceph tell me, that I cannot add OSD
> > with that version to cluster, because I have problem with
> > require_osd_release.
> >
> > In my osdmap I have:
> >
> > # ceph osd dump | grep require_osd_release
> > require_osd_release nautilus
> >
> > When I tried set this to octopus or pacific, my MON daemon crashed with
> > assertion:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> >
> > in OSDMonitor.cc on line 11618.
> >
> > Please, is there a way to repair it?
> >
> > Can I (temporary) change ceph_assert to this line:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> >
> > and set require_osd_release to, say, pacific?
> >
> > I've tried to downgrade ceph-mon process back to version 16.2,
> > but it cannot join to cluster...
> >
> > Sincerely
> > Jan Marek
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
>
>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Jan Marek
Hello,

The cluster is now healthy.

I've studied the OSDMonitor.cc file and found some problematic logic.

Assumptions:

1) require_osd_release can only be raised.

2) ceph-mon in version 17.2.3 can set require_osd_release to a
minimum value of 'octopus'.

I see two variants:

1) If I am allowed to set require_osd_release to octopus, my current
require_osd_release must be 'nautilus' (I will raise
require_osd_release from nautilus to octopus). Then line 11618 in
OSDMonitor.cc has to read:

ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);

2) If line 11618 in OSDMonitor.cc has to stay as:

ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);

then it makes no sense to allow setting the require_osd_release parameter to
'octopus', because this line already ensures that I have set
require_osd_release to octopus.

I suggest using variant 1), and I'm sending the attached patch.

There is another question: should the MON daemon check
require_osd_release when it joins the cluster, given that it
cannot raise its value?

It is a potentially dangerous situation, see my old e-mail below...

Sincerely
Jan Marek

On Mon, Oct 03, 2022 at 11:26:51 CEST, Jan Marek wrote:
> Hello,
> 
> I've got a problem with our ceph cluster - I got stuck in the upgrade
> process between versions 16.2.7 and 17.2.3.
> 
> My problem is that I have upgraded the MON, MGR and MDS processes, and
> when I started upgrading the OSDs, ceph told me that I cannot add an OSD
> with that version to the cluster, because of a problem with
> require_osd_release.
> 
> In my osdmap I have:
> 
> # ceph osd dump | grep require_osd_release
> require_osd_release nautilus
> 
> When I tried to set this to octopus or pacific, my MON daemon crashed with
> an assertion:
> 
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> 
> in OSDMonitor.cc on line 11618.
> 
> Please, is there a way to repair it?
> 
> Can I (temporary) change ceph_assert to this line:
> 
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> 
> and set require_osd_release to, say, pacific?
> 
> I've tried to downgrade the ceph-mon process back to version 16.2,
> but it cannot join the cluster...
> 
> Sincerely
> Jan Marek
> -- 
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html


> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Unfortunately, that isn't the case: the drive is perfectly healthy and,
according to all measurements I did on the host itself, it isn't any
different from any other drive on that host size-, health- or
performance-wise.

The only difference I noticed is that this drive sporadically does more I/O
than other drives for a split second, probably due to specific PGs placed
on its OSD, but the average I/O pattern is very similar to other drives and
OSDs, so it's somewhat unclear why the specific OSD is consistently showing
much higher latency. It would be good to figure out what exactly is causing
these I/O spikes, but I'm not yet sure how to do that.

/Z

On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:

> Hi,
>
> When you see one of 100 drives perf is unusually different, this may mean
> 'this drive is not like the others' and should be replaced
>
>
> k
>
> Sent from my iPhone
>
> > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> >
> > Anyone, please?
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Stefan,

super thanks!

I found a quick-fix command in the help output:

# ceph-bluestore-tool -h
[...]
Positional options:
  --command arg  fsck, repair, quick-fix, bluefs-export,
 bluefs-bdev-sizes, bluefs-bdev-expand,
 bluefs-bdev-new-db, bluefs-bdev-new-wal,
 bluefs-bdev-migrate, show-label, set-label-key,
 rm-label-key, prime-osd-dir, bluefs-log-dump,
 free-dump, free-score, bluefs-stats

but it's not documented in 
https://docs.ceph.com/en/octopus/man/8/ceph-bluestore-tool/. I guess I will 
stick with the tested command "repair". Nothing I found mentions what exactly 
is executed on start-up with bluestore_fsck_quick_fix_on_mount = true.

Thanks for your quick answer!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 07 October 2022 09:07:37
To: Frank Schilder; Igor Fedotov; ceph-users@ceph.io
Subject: Re: [ceph-users] OSD crashes during upgrade mimic->octopus

On 10/7/22 09:03, Frank Schilder wrote:
> Hi Igor and Stefan,
>
> thanks a lot for your help! Our cluster is almost finished with recovery and 
> I would like to switch to off-line conversion of the SSD OSDs. In one of 
> Stefan's emails I could find the command for manual compaction:
>
> ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact
>
> Unfortunately, I can't find the command for performing the omap conversion. 
> It is not mentioned here 
> https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
>  even though it does mention the option to skip conversion in step 5. How to 
> continue with an off-line conversion is not mentioned. I know it has been 
> posted before, but I seem unable to find it on this list. If someone could 
> send me the command, I would be most grateful.

for osd in `ls /var/lib/ceph/osd/`; do ceph-bluestore-tool repair --path
  /var/lib/ceph/osd/$osd;done

That's what I use.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Nicola Mori

Dear Ceph users,

my cluster is stuck since several days with some PG backfilling. The 
number of misplaced objects slowly decreases down to 5%, and at that 
point jumps up again to about 7%, and so on. I found several possible 
reasons for this behavior. One is related to the balancer, which anyway 
I think is not operating:


# ceph balancer status
{
"active": false,
"last_optimize_duration": "0:00:00.000938",
"last_optimize_started": "Thu Oct  6 16:19:59 2022",
"mode": "upmap",
"optimize_result": "Too many objects (0.071539 > 0.05) are 
misplaced; try again later",

"plans": []
}

(the last optimize result is from yesterday when I disabled it, and 
since then the backfill loop has happened several times).
Another possible reason seems to be an imbalance of PG and PGP numbers. 
Effectively I found such an imbalance on one of my pools:


# ceph osd pool get wizard_data pg_num
pg_num: 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123

but I cannot fix it:
# ceph osd pool set wizard_data pgp_num 128
set pool 3 pgp_num to 128
# ceph osd pool get wizard_data pgp_num
pgp_num: 123

The autoscaler is off for that pool:

POOL         SIZE   TARGET SIZE  RATE            RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
wizard_data  8951G               1.333730697632  152.8T        0.0763                                 1.0   128                 off        False


so I don't understand why the PGP number is stuck at 123.
Thanks in advance for any help,

Nicola
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Igor and Stefan,

thanks a lot for your help! Our cluster is almost finished with recovery and I 
would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's 
emails I could find the command for manual compaction:

ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact

Unfortunately, I can't find the command for performing the omap conversion. It 
is not mentioned here 
https://docs.ceph.com/en/quincy/releases/octopus/#upgrading-from-mimic-or-nautilus
 even though it does mention the option to skip conversion in step 5. How to 
continue with an off-line conversion is not mentioned. I know it has been 
posted before, but I seem unable to find it on this list. If someone could send 
me the command, I would be most grateful.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Konstantin Shalygin
Hi,

When you see that the perf of one drive out of 100 is unusually different, this 
may mean 'this drive is not like the others', and it should be replaced.


k

Sent from my iPhone

> On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> 
> Anyone, please?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io