[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Massimo Sgaravatto
If it can help, I have recently updated my Ceph cluster (composed of 3
mon-mgr nodes and n OSD nodes) from Nautilus on CentOS 7 to Pacific on
CentOS 8 Stream.

First I reinstalled the mon-mgr nodes with CentOS 8 Stream (removing them
from the cluster and then re-adding them with the new operating system).
This was needed because the Octopus mgr only runs on RHEL 8 and its derivatives.

Then I migrated the cluster to Octopus (so mon-mgr nodes running CentOS 8
Stream and OSD nodes still running CentOS 7)

Then I reinstalled each OSD node with CentOS 8 Stream, without draining the
node [*]

Then I migrated the cluster from Octopus to Pacific

[*]
ceph osd set noout
Reinstall the node with CentOS 8 Stream
Install the Ceph packages
ceph-volume lvm activate --all
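
A slightly fuller sketch of that per-node sequence, in case it is useful (it
assumes LVM-based OSDs and a package install; repository setup and exact
release names are left out and need adjusting to your environment):

  ceph osd set noout                 # avoid rebalancing while the node is down
  # reinstall the node with CentOS 8 Stream, leaving the OSD data devices untouched
  dnf install -y ceph-osd            # same Ceph release as the rest of the cluster
  # put /etc/ceph/ceph.conf (and keyrings) back in place - not shown above,
  # but the OSDs need it to find the mons
  systemctl enable ceph-osd.target
  ceph-volume lvm activate --all     # recreates the tmpfs mounts and starts the OSD units
  ceph -s                            # wait for the OSDs to rejoin and PGs to go active+clean
  ceph osd unset noout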


Cheers, Massimo


On Tue, Dec 6, 2022 at 3:58 PM David C  wrote:

> Hi All
>
> I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> cluster is primarily used for CephFS, mix of Filestore and Bluestore
> OSDs, mons/osds collocated, running on CentOS 7 nodes
>
> My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
>
> I assume the cleanest way to update the node OS would be to drain the
> node and remove from the cluster, install Rocky 8, add back to cluster
> as effectively a new node
>
> I have a relatively short maintenance window and was hoping to speed
> up OS upgrade with the following approach on each node:
>
> - back up ceph config/systemd files etc.
> - set noout etc.
> - deploy Rocky 8, being careful not to touch OSD block devices
> - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> - copy ceph config back over
>
> In theory I could then start up the daemons and they wouldn't care
> that we're now running on a different OS
>
> Does anyone see any issues with that approach? I plan to test on a dev
> cluster anyway but would be grateful for any thoughts
>
> Thanks,
> David
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Fox, Kevin M
We took a couple of clusters from ceph-deploy+centos7+nautilus to 
cephadm+rocky8+pacific, using ELevate as one of the steps, and went through 
Octopus as well. ELevate wasn't perfect for us either, but it got the job 
done. We had to test it carefully on the test clusters multiple times to get 
the procedure just right. There were some bumps even then, but we were able 
to get things finished up.

Thanks,
Kevin


From: Wolfpaw - Dale Corse 
Sent: Tuesday, December 6, 2022 8:18 AM
To: 'David C'
Cc: 'ceph-users'
Subject: [ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS 
upgrade



Hi David,

  > Good to hear you had success with the ELevate tool, I'd looked at that but 
seemed a bit risky. The tool supports Rocky so I may give it a look.

Elevate wasn't perfect - we had to manually upgrade some packages from outside 
repos (ceph, opennebula and salt if memory serves). That said, it was certainly 
manageable.

> This one is surprising since in theory Pacific still supports Filestore, 
> there is at least one thread on the list where someone upgraded to Pacific 
> and is still running some Filestore OSDs -
> on the other hand, there's also a recent thread where someone ran into 
> problems and was  forced to upgrade to Bluestore - did you experience issues 
> yourself or was this advice you
> picked up? I do ultimately want to get all my OSDs on Bluestore but was 
> hoping to do that after the Ceph version upgrade.

Sorry - I was mistaken about RocksDB/LevelDB and Filestore upgrades being required 
for Pacific. Apologies!
I do remember doing all of ours when we upgraded from Luminous -> Nautilus, but 
I can't remember why to be honest. Might have been advice at the time, or 
something I read when looking into the upgrade :)

Cheers,
D.

-Original Message-
From: David C [mailto:dcsysengin...@gmail.com]
Sent: Tuesday, December 6, 2022 8:56 AM
To: Wolfpaw - Dale Corse 
Cc: ceph-users 
Subject: [SPAM] [ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to 
Pacific with OS upgrade

Hi Wolfpaw, thanks for the response

- I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>

Good to hear you had success with the ELevate tool, I'd looked at that but 
seemed a bit risky. The tool supports Rocky so I may give it a look.

>
> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.
>

This one is surprising since in theory Pacific still supports Filestore, there 
is at least one thread on the list where someone upgraded to Pacific and is 
still running some Filestore OSDs - on the other hand, there's also a recent 
thread where someone ran into problems and was forced to upgrade to Bluestore - 
did you experience issues yourself or was this advice you picked up? I do 
ultimately want to get all my OSDs on Bluestore but was hoping to do that after 
the Ceph version upgrade.


> - You may need to upgrade monitors to RocksDB too.


Thanks, I wasn't aware of this  - I suppose I'll do that when I'm on Nautilus


On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse 
wrote:

> We did this (over a longer timespan).. it worked ok.
>
> A couple things I’d add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then
> used AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky
> has a similar path I think.
>
> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C  wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific
> > 16.2.10, cluster is primarily used for CephFS, mix of Filestore and
> > Bluestore OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade
> > to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain
> > the node and remove from the cluster, install Rocky 8, add back to
> > cluster as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS
> upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care
> > that we're now running on a different OS

[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Wolfpaw - Dale Corse
Hi David,

  > Good to hear you had success with the ELevate tool, I'd looked at that but 
seemed a bit risky. The tool supports Rocky so I may give it a look.
 
Elevate wasn't perfect - we had to manually upgrade some packages from outside 
repos (ceph, opennebula and salt if memory serves). That said, it was certainly 
manageable.

> This one is surprising since in theory Pacific still supports Filestore, 
> there is at least one thread on the list where someone upgraded to Pacific 
> and is still running some Filestore OSDs - 
> on the other hand, there's also a recent thread where someone ran into 
> problems and was  forced to upgrade to Bluestore - did you experience issues 
> yourself or was this advice you
> picked up? I do ultimately want to get all my OSDs on Bluestore but was 
> hoping to do that after the Ceph version upgrade.

Sorry - I was mistaken about RocksDB/LevelDB and Filestore upgrades being required 
for Pacific. Apologies!
I do remember doing all of ours when we upgraded from Luminous -> Nautilus, but 
I can't remember why to be honest. Might have been advice at the time, or 
something I read when looking into the upgrade :)

Cheers,
D.

-Original Message-
From: David C [mailto:dcsysengin...@gmail.com] 
Sent: Tuesday, December 6, 2022 8:56 AM
To: Wolfpaw - Dale Corse 
Cc: ceph-users 
Subject: [SPAM] [ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to 
Pacific with OS upgrade

Hi Wolfpaw, thanks for the response

- I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>

Good to hear you had success with the ELevate tool, I'd looked at that but 
seemed a bit risky. The tool supports Rocky so I may give it a look.

>
> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.
>

This one is surprising since in theory Pacific still supports Filestore, there 
is at least one thread on the list where someone upgraded to Pacific and is 
still running some Filestore OSDs - on the other hand, there's also a recent 
thread where someone ran into problems and was forced to upgrade to Bluestore - 
did you experience issues yourself or was this advice you picked up? I do 
ultimately want to get all my OSDs on Bluestore but was hoping to do that after 
the Ceph version upgrade.


> - You may need to upgrade monitors to RocksDB too.


Thanks, I wasn't aware of this  - I suppose I'll do that when I'm on Nautilus


On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse 
wrote:

> We did this (over a longer timespan).. it worked ok.
>
> A couple things I’d add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then
> used AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky
> has a similar path I think.
>
> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C  wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 
> > 16.2.10, cluster is primarily used for CephFS, mix of Filestore and 
> > Bluestore OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade 
> > to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain 
> > the node and remove from the cluster, install Rocky 8, add back to 
> > cluster as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed 
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS
> upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care 
> > that we're now running on a different OS
> >
> > Does anyone see any issues with that approach? I plan to test on a 
> > dev cluster anyway but would be grateful for any thoughts
> >
> > Thanks,
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> > email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread David C
>
> I don't think this is necessary. It _is_ necessary to convert all
> leveldb to rocksdb before upgrading to Pacific, on both mons and any
> filestore OSDs.


Thanks, Josh, I guess that explains why some people had issues with
Filestore OSDs post Pacific upgrade

On Tue, Dec 6, 2022 at 4:07 PM Josh Baergen 
wrote:

> > - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This takes
> some time if I remember correctly.
>
> I don't think this is necessary. It _is_ necessary to convert all
> leveldb to rocksdb before upgrading to Pacific, on both mons and any
> filestore OSDs.
>
> Quincy will warn you about filestore OSDs, and Reef will no longer
> support filestore.
>
> Josh
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Josh Baergen
> - you will need to move those Filestore OSDs to Bluestore before hitting 
> Pacific, might even be part of the Nautilus upgrade. This takes some time if 
> I remember correctly.

I don't think this is necessary. It _is_ necessary to convert all
leveldb to rocksdb before upgrading to Pacific, on both mons and any
filestore OSDs.

Quincy will warn you about filestore OSDs, and Reef will no longer
support filestore.
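
In case it helps, a quick way to check what a given mon is currently using
(path assumes a default package-based install):

  # prints "leveldb" or "rocksdb"
  cat /var/lib/ceph/mon/ceph-$(hostname -s)/kv_backend

A freshly created mon defaults to rocksdb, so one common way to convert is to
remove and re-add the affected mons one at a time. For Filestore OSDs the omap
backend can be eyeballed from the store files under
/var/lib/ceph/osd/ceph-<id>/current/omap/ (rocksdb stores usually contain
OPTIONS-* files, leveldb ones *.ldb files) - treat that heuristic as an
assumption and double-check against the docs for your release.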

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread David C
Hi Wolfpaw, thanks for the response

- I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>

Good to hear you had success with the ELevate tool, I'd looked at that but
seemed a bit risky. The tool supports Rocky so I may give it a look.

>
> - you will need to move those Filestore OSDs to Bluestore before hitting
> Pacific, might even be part of the Nautilus upgrade. This takes some time
> if I remember correctly.
>

This one is surprising since in theory Pacific still supports Filestore,
there is at least one thread on the list where someone upgraded to Pacific
and is still running some Filestore OSDs - on the other hand, there's also
a recent thread where someone ran into problems and was forced to upgrade
to Bluestore - did you experience issues yourself or was this advice you
picked up? I do ultimately want to get all my OSDs on Bluestore but was
hoping to do that after the Ceph version upgrade.


> - You may need to upgrade monitors to RocksDB too.


Thanks, I wasn't aware of this  - I suppose I'll do that when I'm on
Nautilus


On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse 
wrote:

> We did this (over a longer timespan).. it worked ok.
>
> A couple things I’d add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>
> - you will need to move those Filestore OSDs to Bluestore before hitting
> Pacific, might even be part of the Nautilus upgrade. This takes some time
> if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C  wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> > cluster is primarily used for CephFS, mix of Filestore and Bluestore
> > OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain the
> > node and remove from the cluster, install Rocky 8, add back to cluster
> > as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS
> upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care
> > that we're now running on a different OS
> >
> > Does anyone see any issues with that approach? I plan to test on a dev
> > cluster anyway but would be grateful for any thoughts
> >
> > Thanks,
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Stefan Kooman

On 12/6/22 15:58, David C wrote:

Hi All

I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
cluster is primarily used for CephFS, mix of Filestore and Bluestore
OSDs, mons/osds collocated, running on CentOS 7 nodes

My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
EL8 on the nodes (probably Rocky) -> Upgrade to Pacific

I assume the cleanest way to update the node OS would be to drain the
node and remove from the cluster, install Rocky 8, add back to cluster
as effectively a new node

I have a relatively short maintenance window and was hoping to speed
up OS upgrade with the following approach on each node:

- back up ceph config/systemd files etc.
- set noout etc.
- deploy Rocky 8, being careful not to touch OSD block devices
- install Nautilus binaries (ensuring I use same version as pre OS upgrade)
- copy ceph config back over

In theory I could then start up the daemons and they wouldn't care
that we're now running on a different OS

Does anyone see any issues with that approach? I plan to test on a dev
cluster anyway but would be grateful for any thoughts


That would work. Just run:

  systemctl enable ceph-osd.target
  ceph-volume lvm activate --all

on them and you should be good to go. I have done a re-install from 16.04 
to 20.04 this way and it just worked (TM).
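
Optionally, a few sanity checks afterwards:

  ceph-volume lvm list     # confirm the host's OSDs were discovered from their LVM tags
  ceph osd tree down       # should be empty once this host's OSDs are back up
  ceph versions            # all daemons should still report the same release
  ceph osd unset noout     # once the node is healthy again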


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread Wolfpaw - Dale Corse
We did this (over a longer timespan).. it worked ok.

A couple things I’d add:

- I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used 
AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a similar 
path I think.

- you will need to move those Filestore OSDs to Bluestore before hitting 
Pacific, might even be part of the Nautilus upgrade. This takes some time if I 
remember correctly.

- You may need to upgrade monitors to RocksDB too.

Sent from my iPhone

> On Dec 6, 2022, at 7:59 AM, David C  wrote:
> 
> Hi All
> 
> I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> cluster is primarily used for CephFS, mix of Filestore and Bluestore
> OSDs, mons/osds collocated, running on CentOS 7 nodes
> 
> My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> 
> I assume the cleanest way to update the node OS would be to drain the
> node and remove from the cluster, install Rocky 8, add back to cluster
> as effectively a new node
> 
> I have a relatively short maintenance window and was hoping to speed
> up OS upgrade with the following approach on each node:
> 
> - back up ceph config/systemd files etc.
> - set noout etc.
> - deploy Rocky 8, being careful not to touch OSD block devices
> - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> - copy ceph config back over
> 
> In theory I could then start up the daemons and they wouldn't care
> that we're now running on a different OS
> 
> Does anyone see any issues with that approach? I plan to test on a dev
> cluster anyway but would be grateful for any thoughts
> 
> Thanks,
> David
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread David C
Hi All

I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
cluster is primarily used for CephFS, mix of Filestore and Bluestore
OSDs, mons/osds collocated, running on CentOS 7 nodes

My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
EL8 on the nodes (probably Rocky) -> Upgrade to Pacific

I assume the cleanest way to update the node OS would be to drain the
node and remove from the cluster, install Rocky 8, add back to cluster
as effectively a new node

I have a relatively short maintenance window and was hoping to speed
up OS upgrade with the following approach on each node:

- back up ceph config/systemd files etc.
- set noout etc.
- deploy Rocky 8, being careful not to touch OSD block devices
- install Nautilus binaries (ensuring I use same version as pre OS upgrade)
- copy ceph config back over

In theory I could then start up the daemons and they wouldn't care
that we're now running on a different OS
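
Roughly the per-node sketch I have in mind (the ceph-volume simple step and the
exact paths are assumptions on my part, since some of these Filestore OSDs were
originally created with ceph-disk):

  # before the reinstall - copy the archive off the node, the root disk gets wiped
  ceph-volume simple scan              # legacy ceph-disk OSDs only: writes /etc/ceph/osd/*.json
  tar czf ceph-$(hostname -s).tgz /etc/ceph /var/lib/ceph/bootstrap-*
  ceph osd set noout
  # N.B. collocated mons keep their store under /var/lib/ceph/mon on the root disk -
  # either stop the mon and back that up too, or plan to re-add the mon afterwards

  # after installing Rocky 8 and the matching Nautilus 14.2.22 packages
  tar xzf ceph-$(hostname -s).tgz -C /
  systemctl daemon-reload
  ceph-volume lvm activate --all       # ceph-volume/LVM OSDs
  ceph-volume simple activate --all    # ceph-disk OSDs, from the json files saved above
  ceph osd unset noout                 # once everything is back to active+clean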

Does anyone see any issues with that approach? I plan to test on a dev
cluster anyway but would be grateful for any thoughts

Thanks,
David
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-06 Thread Boris Behrens
Hi Janne,
that is a really good idea. Thank you.

I just saw that our only Ubuntu 20.04 node got very high %util (all 8TB disks):
Device   r/s     rkB/s     rrqm/s  %rrqm  r_await rareq-sz  w/s      wkB/s      wrqm/s   %wrqm  w_await wareq-sz  aqu-sz  %util
sdc      19.00   112.00    0.00    0.00   0.32    5.89      1535.00  68768.00   1260.00  45.08  1.33    44.80     1.44    76.00
sdd      62.00   5892.00   43.00   40.95  2.82    95.03     1196.00  78708.00   1361.00  53.23  2.35    65.81     2.31    72.00
sde      33.00   184.00    0.00    0.00   0.33    5.58      1413.00  102592.00  1709.00  54.74  1.70    72.61     1.68    84.40
sdf      62.00   8200.00   63.00   50.40  9.32    132.26    1066.00  74372.00   1173.00  52.39  1.68    69.77     1.80    70.00
sdg      5.00    40.00     0.00    0.00   0.40    8.00      1936.00  128188.00  2172.00  52.87  2.18    66.21     3.21    92.80
sdh      133.00  8636.00   44.00   24.86  4.14    64.93     1505.00  87820.00   1646.00  52.24  0.95    58.35     1.09    78.80
(the discard columns d/s, dkB/s, drqm/s, %drqm, d_await and dareq-sz were all 0.00 and are omitted here)

I've cross-checked the other 8TB disks in our cluster, which sit at around
30-50% %util with roughly the same IOPS.
Maybe I am missing some optimization that is done on the CentOS 7 nodes but
not on the Ubuntu 20.04 node (if you know something off the top of your head,
I am happy to hear it).
Maybe it is also just measured differently on Ubuntu.

But this was the first node where I restarted the OSDs, and it is where I
waited the longest to see whether anything got better. The problem nearly
disappeared within a couple of seconds after the last OSD was restarted. So I
would not blame that node in particular, but I will investigate in this
direction.
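
If anyone wants to sanity-check along with me, these are the block-layer
settings I plan to compare first between the CentOS 7 and Ubuntu 20.04 hosts
(device names taken from the iostat output above):

  for d in sdc sdd sde sdf sdg sdh; do
    echo "== $d"
    cat /sys/block/$d/queue/scheduler      # e.g. none vs. mq-deadline
    cat /sys/block/$d/queue/read_ahead_kb
    cat /sys/block/$d/queue/nr_requests
    cat /sys/block/$d/device/queue_depth
  done
  smartctl -g wcache /dev/sdc              # volatile write cache setting (SATA devices)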


On Tue, Dec 6, 2022 at 10:08 AM Janne Johansson <icepic...@gmail.com> wrote:

> Perhaps run "iostat -xtcy  5" on the OSD hosts to
> see if any of the drives have weirdly high utilization despite low
> iops/requests?
>
>
> On Tue, Dec 6, 2022 at 10:02 AM Boris Behrens wrote:
> >
> > Hi Sven,
> > I am searching really hard for defective hardware, but I am currently out of
> > ideas:
> > - checked prometheus stats, but in all that data I don't know what to look
> > for (osd apply latency is very low at the mentioned point and went up to
> > 40ms after all OSDs were restarted)
> > - smartctl shows nothing
> > - dmesg show nothing
> > - network data shows nothing
> > - osd and clusterlogs show nothing
> >
> > If anybody got a good tip what I can check, that would be awesome. A
> string
> > in the logs (I made a copy from that days logs), or a tool to fire
> against
> > the hardware. I am 100% out of ideas what it could be.
> > In a time frame of 20s 2/3 of our OSDs went from "all fine" to "I am
> > waiting for the replicas to do their work" (log message 'waiting for sub
> > ops'). But there was no alert that any OSD had connection problems to
> other
> > OSDs. Additional the cluster_network is the same interface, switch,
> > everything as public_network. Only difference is the VLAN id (I plan to
> > remove the cluster_network because it does not provide anything for us).
> >
> > I am also planning to update all hosts from centos7 to ubuntu 20.04
> (newer
> > kernel, standardized OS config and so on).
> >
> > On Mon, Dec 5, 2022 at 2:24 PM Sven Kieske <s.kie...@mittwald.de> wrote:
> >
> > > On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > > > hi,
> > > > maybe someone here can help me to debug an issue we faced today.
> > > >
> > > > Today one of our clusters came to a grinding halt with 2/3 of our
> OSDs
> > > > reporting slow ops.
> > > > Only option to get it back to work fast, was to restart all OSDs
> daemons.
> > > >
> > > > The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last
> work
> > > > on the cluster: synced in a node 4 days ago.
> > > >
> > > > The only health issue, that was reported, was the SLOW_OPS. No slow
> pings
> > > > on the networks. No restarting OSDs. Nothing.
> > > >
> > > > I was able to pin it down to a 20s timeframe and I read ALL the logs in a 20
> > > > minute timeframe around this issue.
> > > >
> > > > I haven't found any clues.
> > > >
> > > > Maybe someone encountered this in the past?
> > >
> > > do you happen to run your rocksdb on a dedicated caching device (nvme
> ssd)?
> > >
> > > I observed slow ops in octopus after a faulty nvme ssd was inserted in
> one
> > > ceph server.
> > > as was said in other mails, try to isolate your root cause.
> > >
> > > maybe the node added 4 days ago was the culprit here?
> > >
> > > we were able to pinpoint the nvme by monitoring the slow osds
> > > and the commonality in this case was the same nvme cache 

[ceph-users] Orchestrator hanging on 'stuck' nodes

2022-12-06 Thread Ewan Mac Mahon
Dear all,

We're having an odd problem with a recently installed Quincy/cephadm cluster on 
CentOS 8 Stream with Podman, where the orchestrator appears to get wedged and 
just won't implement any changes. 

The overall cluster was installed and working for a few weeks, then we added an 
NFS export which worked for a bit, then we had some problems with that and 
tried to restart/redeploy it and found that the orchestrator wouldn't deploy 
new NFS server containers. We then made an attempt to restart the MGR 
process(es) by stopping one and having the Orchestrator redeploy it, but it 
didn't. The overall effect looks like orchestrator won't try to start 
containers - it knows what it's supposed to be doing (and you can tell it to do 
new things, e.g. deploy a new NFS cluster, and that's reflected correctly in 
both CLI and web control panel), but it just doesn't actually deploy things.

This looks a bit like this Reddit post: 
https://www.reddit.com/r/ceph/comments/v3kdix/cephadm_not_deploying_new_mgr_daemons_to_match/
And this mailing list post: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YREK7HUBNIMTKR5GU5L5E5CNFI7FDKLF/

We've found that in our case this appears to be due to one node in the cluster 
being in a strange state; it's not always the same node, and it doesn't have to 
be either the node running the MGR, or the node(s) being targeted to start the 
new containers on, *any* node in the system being in this state will wedge the 
orchestrator. A 'stuck' node can't start a new local container with './cephadm 
shell', and it sometimes but not always appears in the Cluster->Hosts section 
of the web interface with a blank machine model name and 'NaN' for its capacity 
(I'm guessing that these values are cached and then time out after a while?). 
Already running  containers on the node (e.g. OSDs) appear to carry on working.

As well as failing to start containers, while in this state the orchestrator 
will also fail to copy /etc/ceph/* to a new node with the '_admin' tag. 
Rebooting the 'stuck' node instantly unwedges the orchestrator as soon as the 
'stuck' node goes down - it doesn't have to come up working, it just has to 
stop - as soon as the 'stuck' node is down the orchestrator catches up on 
outstanding requests, starts new containers and brings everything into line 
with the requested state. 

My current best guess is that the first thing the orchestrator tries to do when 
it needs to change something is to enumerate the nodes to see where it can 
start things, it hangs trying to query the 'stuck' node, and can't recover on 
its own.
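
For what it's worth, the checks I plan to run the next time it wedges (standard
cephadm/Quincy commands; I'm not certain any of them gets at the root cause):

  ceph orch host ls                 # does the stuck host show up as offline / with stale info?
  ceph log last 50 debug cephadm    # recent cephadm/orchestrator log entries from the mgr
  cephadm check-host                # run locally on the suspect node: podman, chrony, ssh checks
  ceph mgr fail                     # failing over to a standby mgr might also un-wedge it - untested here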

I've included more details and command outputs below, but:

- Does that sound feasible?
- Does this sound familiar to anyone?
- Does anyone know how to fix it?
- Or how to narrow down the root cause to turn this into a proper bug report?

Ewan



While in the wedged state the orchestrator thinks it's fine:

# ceph orch status --detail
Backend: cephadm
Available: Yes
Paused: No
Host Parallelism: 10

The cluster overall health is fine:

# ceph -s
  cluster:
id: 58140ed2-4ed4-11ed-b4db-5c6f69756a60
health: HEALTH_OK

  services:
mon: 5 daemons, quorum ceph-r3n4,ceph-r1n4,ceph-r2n4,ceph-r1n5,ceph-r2n5 
(age 2w)
mgr: ceph-r1n4.mgqrwx(active, since 2d)
mds: 1/1 daemons up, 3 standby
osd: 294 osds: 294 up (since 2w), 294 in (since 5w)

  data:
volumes: 1/1 healthy
pools:   5 pools, 9281 pgs
objects: 252.26M objects, 649 TiB
usage:   2.1 PiB used, 2.2 PiB / 4.3 PiB avail
pgs: 9269 active+clean
 12   active+clean+scrubbing+deep

This was tested by deploying a new NFS service in the existing cluster and then a 
whole new NFS cluster (nfs.cephnfstwo) and removing the original NFS cluster 
(nfs.cephnfsone) - the changes made from the CLI were reflected in the web dashboard, 
but not actioned; e.g. only one MGR running of a target of two, nothing running for 
NFS cluster 'cephnfstwo', and the original 'nfs.cephnfsone' cluster shown as 
'deleting' but not actually gone:

# ceph orch ls
NAMEPORTSRUNNING  REFRESHED   AGE  PLACEMENT
alertmanager?:9093,9094  1/1  2d ago  5w   count:1
crash  21/21  4w ago  5w   *
grafana ?:3000   1/1  2d ago  5w   count:1
mds.mds1 4/4  4w ago  5w   count:4
mgr  1/2  2w ago  2w   count:2
mon  5/5  2w ago  5w   count:5
nfs.cephnfsone   0/12w   
ceph-r1n5;ceph-r2n5;count:1
nfs.cephnfstwo  ?:2049   0/1  -   2d   count:1
node-exporter   ?:9100 21/21  4w ago  5w   *
osd  294  4w ago  -
prometheus  ?:9095   1/1  2d ago  5w   count:1

The 'stuck' node is continuing to run services including OSDs, and in cases 
where it's also had the active MGR then the web interface remains accessible, 
but SSHing in and trying to start a cephadm shell 

[ceph-users] Re: cephfs snap-mirror stalled

2022-12-06 Thread Venky Shankar
On Tue, Dec 6, 2022 at 6:34 PM Holger Naundorf  wrote:
>
>
>
> On 06.12.22 09:54, Venky Shankar wrote:
> > Hi Holger,
> >
> > On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf  
> > wrote:
> >>
> >> Hello,
> >> we have set up a snap-mirror for a directory on one of our clusters -
> >> running ceph version
> >>
> >> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
> >> (stable)
> >>
> >> to get mirrored to our other cluster - running ceph version
> >>
> >> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> >> (stable)
> >>
> >> The initial setup went ok, when the first snapshot was created data
> >> started to flow at a decent (for our HW) rate of 100-200MB/s. As the
> >> directory contains  ~200TB this was expected to take some time - but now
> >> the process has stalled completely after ~100TB were mirrored and ~7d
> >> running.
> >>
> >> Up to now I do not have any hints why it has stopped - I do not see any
> >> error messages from the cephfs-mirror daemon. Can the small version
> >> mismatch be a problem?
> >>
> >> Any hints where to look to find out what has got stuck are welcome.
> >
> > I'd look at the mirror daemon logs for any errors to start with. You
> > might want to crank up the log level for debugging (debug
> > cephfs_mirror=20).
> >
>
> Even on max debug I do not see anything which looks like an error - but
> as this is the first time I try to dig into any cephfs-mirror logs I
> might not notice (as long as it is not red and flashing).
>
> The log basically shows this type of sequence, repeating forever:
>
> (...)
> cephfs::mirror::MirrorWatcher handle_notify
> cephfs::mirror::Mirror update_fs_mirrors
> cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror
> update (0x556fe3a7f130) after 2 seconds
> cephfs::mirror::Watcher handle_notify: notify_id=751516198184655,
> handle=93939050205568, notifier_id=25504530
> cephfs::mirror::MirrorWatcher handle_notify
> cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run:
> trying to pick from 1 directories
> cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1)
> pick_directory
> cephfs::mirror::Watcher handle_notify: notify_id=751516198184656,
> handle=93939050205568, notifier_id=25504530
> cephfs::mirror::MirrorWatcher handle_notify
> cephfs::mirror::Mirror update_fs_mirrors
> cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror
> update (0x556fe3a7fc70) after 2 seconds
> cephfs::mirror::Watcher handle_notify: notify_id=751516198184657,
> handle=93939050205568, notifier_id=25504530
> cephfs::mirror::MirrorWatcher handle_notify
> (...)

Basically, the interesting bit is not captured since it probably
happened sometime back. Could you please set the following:

debug cephfs_mirror = 20
debug client = 20

and restart the mirror daemon? The daemon would start synchronizing
again. When synchronizing stalls, please share the daemon logs. If the
log is huge, you could upload them via ceph-post-file.
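
For example, something along these lines (assuming the mirror daemon
authenticates as client.cephfs-mirror.<id> and runs under the packaged systemd
unit - adjust both names to your deployment, e.g. check 'ceph auth ls' for the
actual entity):

  ceph config set client.cephfs-mirror.<id> debug_cephfs_mirror 20
  ceph config set client.cephfs-mirror.<id> debug_client 20
  systemctl restart ceph-cephfs-mirror@<id>.service   # or the cephadm-managed unit, if containerized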

>
>
>
> >>
> >> Regards,
> >> Holger
> >>
> >> --
> >> Dr. Holger Naundorf
> >> Christian-Albrechts-Universität zu Kiel
> >> Rechenzentrum / HPC / Server und Storage
> >> Tel: +49 431 880-1990
> >> Fax:  +49 431 880-1523
> >> naund...@rz.uni-kiel.de
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
>
> --
> Dr. Holger Naundorf
> Christian-Albrechts-Universität zu Kiel
> Rechenzentrum / HPC / Server und Storage
> Tel: +49 431 880-1990
> Fax:  +49 431 880-1523
> naund...@rz.uni-kiel.de



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs snap-mirror stalled

2022-12-06 Thread Holger Naundorf



On 06.12.22 09:54, Venky Shankar wrote:

Hi Holger,

On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf  wrote:


Hello,
we have set up a snap-mirror for a directory on one of our clusters -
running ceph version

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)

to get mirrored to our other cluster - running ceph version

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
(stable)

The initial setup went ok, when the first snapshot was created data
started to flow at a decent (for our HW) rate of 100-200MB/s. As the
directory contains  ~200TB this was expected to take some time - but now
the process has stalled completely after ~100TB were mirrored and ~7d
running.

Up to now I do not have any hints why it has stopped - I do not see any
error messages from the cephfs-mirror daemon. Can the small version
mismatch be a problem?

Any hints where to look to find out what has got stuck are welcome.


I'd look at the mirror daemon logs for any errors to start with. You
might want to crank up the log level for debugging (debug
cephfs_mirror=20).



Even on max debug I do not see anything which looks like an error - but 
as this is the first time I try to dig into any cephfs-mirror logs I 
might not notice (as long as it is not red and flashing).


The log basically shows this type of sequence, repeating forever:

(...)
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::Mirror update_fs_mirrors
cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror 
update (0x556fe3a7f130) after 2 seconds
cephfs::mirror::Watcher handle_notify: notify_id=751516198184655, 
handle=93939050205568, notifier_id=25504530

cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run: 
trying to pick from 1 directories
cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) 
pick_directory
cephfs::mirror::Watcher handle_notify: notify_id=751516198184656, 
handle=93939050205568, notifier_id=25504530

cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::Mirror update_fs_mirrors
cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror 
update (0x556fe3a7fc70) after 2 seconds
cephfs::mirror::Watcher handle_notify: notify_id=751516198184657, 
handle=93939050205568, notifier_id=25504530

cephfs::mirror::MirrorWatcher handle_notify
(...)





Regards,
Holger

--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax:  +49 431 880-1523
naund...@rz.uni-kiel.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io






--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax:  +49 431 880-1523
naund...@rz.uni-kiel.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pacific: ceph-mon services stopped after OSDs are out/down

2022-12-06 Thread Mevludin Blazevic

Hi all,

I'm running Pacific with cephadm.

After installation, Ceph automatically provisioned 5 monitor nodes across the 
cluster. After a few OSDs crashed due to a hardware-related issue with the SAS 
interface, 3 monitor services stopped and won't restart again. Is this related 
to the OSD crash problem?

Thanks,
Mevludin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What to expect on rejoining a host to cluster?

2022-12-06 Thread Frank Schilder
Hi Matt,

That I'm using re-weights does not mean I would recommend it. There seems to be 
something seriously broken with reweights, see this message 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/E5BYQ27LRWFNT4M34OYKI2KM27Q3DUY6/
 and the thread around it. I have to wait for a client update before considering 
changing that to upmaps.

I'm pretty sure the script https://github.com/TheJJ/ceph-balancer linked by 
Stefan can be configured/tweaked to look only at the fullest and maybe the 
emptiest OSDs to compute a moderately sized list of re-mappings that will 
eliminate the outliers only. What I would not recommend is to go all balanced 
and 95% OSD utilisation. You will see serious performance loss after some OSDs 
reach 80%, and if you lose an OSD or host you will have to combat the fallout 
of deleted upmaps.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: [MGR] Only 60 trash removal tasks are processed per minute

2022-12-06 Thread sea you
Hi all,

Our cluster contains 12 nodes, 120 OSDs (all NVMe), and - currently -
4096 PGs in total. We're currently testing a scenario of having 20
thousand 10G volumes and then taking snapshots of each one of
them. These 20k snapshots are created in just under 2 hours.

When we delete one snapshot of each volume - so again 20k - it
usually takes more than 2 hours to move them to trash and create the
deletion tasks.

Now the tasks that remove them from the trash are pretty slow. According
to my calculations, it's around 1 removal per second. Doing the math,
it's around 5 and a half hours to empty the trash at this pace...

Looking at the 
https://github.com/ceph/ceph/blob/main/src/pybind/mgr/rbd_support/task.py
module, it's clear that this is a sequential operation, but is there
anything we could do to improve the speed here?

Neither the MGR nor any other components are CPU/memory bound, ceph is
basically just chilling :)

Any thoughts?

Doma
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-06 Thread Janne Johansson
Perhaps run "iostat -xtcy  5" on the OSD hosts to
see if any of the drives have weirdly high utilization despite low
iops/requests?


On Tue, Dec 6, 2022 at 10:02 AM Boris Behrens wrote:
>
> Hi Sven,
> I am searching really hard for defective hardware, but I am currently out of
> ideas:
> - checked prometheus stats, but in all that data I don't know what to look
> for (osd apply latency is very low at the mentioned point and went up to
> 40ms after all OSDs were restarted)
> - smartctl shows nothing
> - dmesg show nothing
> - network data shows nothing
> - osd and clusterlogs show nothing
>
> If anybody got a good tip what I can check, that would be awesome. A string
> in the logs (I made a copy from that days logs), or a tool to fire against
> the hardware. I am 100% out of ideas what it could be.
> In a time frame of 20s 2/3 of our OSDs went from "all fine" to "I am
> waiting for the replicas to do their work" (log message 'waiting for sub
> ops'). But there was no alert that any OSD had connection problems to other
> OSDs. Additionally, the cluster_network is the same interface, switch,
> everything as public_network. Only difference is the VLAN id (I plan to
> remove the cluster_network because it does not provide anything for us).
>
> I am also planning to update all hosts from centos7 to ubuntu 20.04 (newer
> kernel, standardized OS config and so on).
>
> On Mon, Dec 5, 2022 at 2:24 PM Sven Kieske wrote:
>
> > On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > > hi,
> > > maybe someone here can help me to debug an issue we faced today.
> > >
> > > Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> > > reporting slow ops.
> > > Only option to get it back to work fast, was to restart all OSDs daemons.
> > >
> > > The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> > > on the cluster: synced in a node 4 days ago.
> > >
> > > The only health issue, that was reported, was the SLOW_OPS. No slow pings
> > > on the networks. No restarting OSDs. Nothing.
> > >
> > > I was able to pin it down to a 20s timeframe and I read ALL the logs in a 20
> > > minute timeframe around this issue.
> > >
> > > I haven't found any clues.
> > >
> > > Maybe someone encountered this in the past?
> >
> > do you happen to run your rocksdb on a dedicated caching device (nvme ssd)?
> >
> > I observed slow ops in octopus after a faulty nvme ssd was inserted in one
> > ceph server.
> > as was said in other mails, try to isolate your root cause.
> >
> > maybe the node added 4 days ago was the culprit here?
> >
> > we were able to pinpoint the nvme by monitoring the slow osds
> > and the commonality in this case was the same nvme cache device.
> >
> > you should always benchmark new hardware/perform burn-in tests imho, which
> > is not always possible due to environment constraints.
> >
> > --
> > Mit freundlichen Grüßen / Regards
> >
> > Sven Kieske
> > Systementwickler / systems engineer
> >
> >
> > Mittwald CM Service GmbH & Co. KG
> > Königsberger Straße 4-6
> > 32339 Espelkamp
> >
> > Tel.: 05772 / 293-900
> > Fax: 05772 / 293-333
> >
> > https://www.mittwald.de
> >
> > Geschäftsführer: Robert Meyer, Florian Jürgens
> >
> > St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
> > Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen
> >
> > Informationen zur Datenverarbeitung im Rahmen unserer Geschäftstätigkeit
> > gemäß Art. 13-14 DSGVO sind unter www.mittwald.de/ds abrufbar.
> >
> >
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-06 Thread Boris Behrens
Hi Sven,
I am searching really hard for defective hardware, but I am currently out of
ideas:
- checked prometheus stats, but in all that data I don't know what to look
for (osd apply latency is very low at the mentioned point and went up to
40ms after all OSDs were restarted)
- smartctl shows nothing
- dmesg show nothing
- network data shows nothing
- osd and clusterlogs show nothing

If anybody has a good tip on what I can check, that would be awesome - a string
to look for in the logs (I made a copy of that day's logs), or a tool to fire
against the hardware. I am 100% out of ideas what it could be.
Within a time frame of 20s, 2/3 of our OSDs went from "all fine" to "I am
waiting for the replicas to do their work" (log message 'waiting for sub
ops'). But there was no alert that any OSD had connection problems to other
OSDs. Additionally, the cluster_network is the same interface, switch,
everything as the public_network. The only difference is the VLAN id (I plan to
remove the cluster_network because it does not provide anything for us).

I am also planning to update all hosts from centos7 to ubuntu 20.04 (newer
kernel, standardized OS config and so on).

On Mon, Dec 5, 2022 at 2:24 PM Sven Kieske wrote:

> On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > hi,
> > maybe someone here can help me to debug an issue we faced today.
> >
> > Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> > reporting slow ops.
> > Only option to get it back to work fast, was to restart all OSDs daemons.
> >
> > The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> > on the cluster: synced in a node 4 days ago.
> >
> > The only health issue, that was reported, was the SLOW_OPS. No slow pings
> > on the networks. No restarting OSDs. Nothing.
> >
> > I was able to pin it down to a 20s timeframe and I read ALL the logs in a 20
> > minute timeframe around this issue.
> >
> > I haven't found any clues.
> >
> > Maybe someone encountered this in the past?
>
> do you happen to run your rocksdb on a dedicated caching device (nvme ssd)?
>
> I observed slow ops in octopus after a faulty nvme ssd was inserted in one
> ceph server.
> as was said in other mails, try to isolate your root cause.
>
> maybe the node added 4 days ago was the culprit here?
>
> we were able to pinpoint the nvme by monitoring the slow osds
> and the commonality in this case was the same nvme cache device.
>
> you should always benchmark new hardware/perform burn-in tests imho, which
> is not always possible due to environment constraints.
>
> --
> Mit freundlichen Grüßen / Regards
>
> Sven Kieske
> Systementwickler / systems engineer
>
>
> Mittwald CM Service GmbH & Co. KG
> Königsberger Straße 4-6
> 32339 Espelkamp
>
> Tel.: 05772 / 293-900
> Fax: 05772 / 293-333
>
> https://www.mittwald.de
>
> Geschäftsführer: Robert Meyer, Florian Jürgens
>
> St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
> Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen
>
> Informationen zur Datenverarbeitung im Rahmen unserer Geschäftstätigkeit
> gemäß Art. 13-14 DSGVO sind unter www.mittwald.de/ds abrufbar.
>
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs snap-mirror stalled

2022-12-06 Thread Venky Shankar
Hi Holger,

On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf  wrote:
>
> Hello,
> we have set up a snap-mirror for a directory on one of our clusters -
> running ceph version
>
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
> (stable)
>
> to get mirrored to our other cluster - running ceph version
>
> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> (stable)
>
> The initial setup went ok, when the first snapshot was created data
> started to flow at a decent (for our HW) rate of 100-200MB/s. As the
> directory contains  ~200TB this was expected to take some time - but now
> the process has stalled completely after ~100TB were mirrored and ~7d
> running.
>
> Up to now I do not have any hints why it has stopped - I do not see any
> error messages from the cephfs-mirror daemon. Can the small version
> mismatch be a problem?
>
> Any hints where to look to find out what has got stuck are welcome.

I'd look at the mirror daemon logs for any errors to start with. You
might want to crank up the log level for debugging (debug
cephfs_mirror=20).

>
> Regards,
> Holger
>
> --
> Dr. Holger Naundorf
> Christian-Albrechts-Universität zu Kiel
> Rechenzentrum / HPC / Server und Storage
> Tel: +49 431 880-1990
> Fax:  +49 431 880-1523
> naund...@rz.uni-kiel.de
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs snap-mirror stalled

2022-12-06 Thread Holger Naundorf

Hello,
we have set up a snap-mirror for a directory on one of our clusters - 
running ceph version


ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific 
(stable)


to get mirrored to our other cluster - running ceph version

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific 
(stable)


The initial setup went ok, when the first snapshot was created data 
started to flow at a decent (for our HW) rate of 100-200MB/s. As the 
directory contains  ~200TB this was expected to take some time - but now 
the process has stalled completely after ~100TB were mirrored and ~7d 
running.


Up to now I do not have any hints why it has stopped - I do not see any 
error messages from the cephfs-mirror daemon. Can the small version 
mismatch be a problem?


Any hints where to look to find out what has got stuck are welcome.

Regards,
Holger

--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax:  +49 431 880-1523
naund...@rz.uni-kiel.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io