[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
Got it. I don't have any specific throttling set up for RBD-backed storage.
I also previously tested several different backends and found that virtio
consistently produced better performance than virtio-scsi in different
scenarios, thus my VMs run virtio.

/Z
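
(For anyone comparing backends: a quick, hedged way to confirm which emulation a
running guest actually uses - the domain name "vm01" below is only a placeholder:)

# list the guest's block devices; vdX targets usually mean virtio-blk, sdX usually SCSI/SATA
virsh domblklist vm01
# inspect the disk bus and any virtio-scsi controller directly in the domain XML
virsh dumpxml vm01 | grep -E "target dev|virtio-scsi"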

On Wed, Oct 6, 2021 at 7:10 AM Anthony D'Atri 
wrote:

> To be clear, I’m suspecting explicit throttling as described here:
>
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques
>
> not impact from virtualization as such, though depending on the versions
> of software involved, the device emulation chosen can make a big
> difference, eg. virtio-scsi vs virtio-blk vs IDE.
>
> If one has Prometheus / Grafana set up to track throughput and iops per
> volume / attachment / VM, or enables the client-side admin socket, that
> sort of throttling can be very visually apparent.
>
>
> > On Oct 5, 2021, at 8:35 PM, Zakhar Kirpichenko  wrote:
> >
> > Hi!
> >
> > The clients are KVM VMs, there's QEMU/libvirt impact for sure. I will
> test
> > with a baremetal client and see whether it performs much better.
> >
> > /Z
> >
> >
> > On Wed, 6 Oct 2021, 01:29 Anthony D'Atri, 
> wrote:
> >
> >> The lead PG handling ops isn’t a factor, with RBD your volumes touch
> >> dozens / hundreds of PGs.   But QD=1 and small block sizes are going to
> >> limit your throughput.
> >>
> >> What are your clients?  Are they bare metal?  Are they VMs?  If they’re
> >> VMs, do you have QEMU/libvirt throttling in play?  I see that a lot.
> >>
> >>> On Oct 5, 2021, at 2:06 PM, Zakhar Kirpichenko 
> wrote:
> >>>
> >>> I'm not sure, fio might be showing some bogus values in the summary,
> I'll
> >>> check the readings again tomorrow.
> >>>
> >>> Another thing I noticed is that writes seem bandwidth-limited and don't
> >>> scale well with block size and/or number of threads. I.e. one clients
> >>> writes at about the same speed regardless of the benchmark settings. A
> >>> person on reddit, where I asked this question as well, suggested that
> in
> >> a
> >>> replicated pool writes and reads are handled by the primary PG, which
> >> would
> >>> explain this write bandwidth limit.
> >>>
> >>> /Z
> >>>
> >>> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <
> >> christian.wuer...@gmail.com>
> >>> wrote:
> >>>
>  Maybe some info is missing but 7k write IOPs at 4k block size seem
> >> fairly
>  decent (as you also state) - the bandwidth automatically follows from
> >> that
>  so not sure what you're expecting?
>  I am a bit puzzled though - by my math 7k IOPS at 4k should only be
>  27MiB/sec - not sure how the 120MiB/sec was achieved
>  The read benchmark seems in line with 13k IOPS at 4k making around
>  52MiB/sec bandwidth which again is expected.
> 
> 
>  On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko 
> >> wrote:
> 
> > Hi,
> >
> > I built a CEPH 16.2.x cluster with relatively fast and modern
> hardware,
> > and
> > its performance is kind of disappointing. I would very much
> appreciate
> >> an
> > advice and/or pointers :-)
> >
> > The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >
> > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > 384 GB RAM
> > 2 x boot drives
> > 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> > 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> > 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >
> > All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> > apparmor is disabled, energy-saving features are disabled. The
> network
> > between the CEPH nodes is 40G, CEPH access network is 40G, the
> average
> > latencies are < 0.15 ms. I've personally tested the network for
> > throughput,
> > latency and loss, and can tell that it's operating as expected and
> >> doesn't
> > exhibit any issues at idle or under load.
> >
> > The CEPH cluster is set up with 2 storage classes, NVME and HDD,
> with 2
> > smaller NVME drives in each node used as DB/WAL and each HDD
> allocated
> >> .
> > ceph osd tree output:
> >
> > ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
> >> PRI-AFF
> > -1 288.37488  root default
> > -13 288.37488  datacenter ste
> > -14 288.37488  rack rack01
> > -7  96.12495  host ceph01
> > 0hdd9.38680  osd.0up   1.0
> >> 1.0
> > 1hdd9.38680  osd.1up   1.0
> >> 1.0
> > 2hdd9.38680  osd.2up   1.0
> >> 1.0
> > 3hdd9.38680  osd.3up   1.0
> >> 1.0
> > 4hdd9.38680   

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Anthony D'Atri
To be clear, I’m suspecting explicit throttling as described here:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques

not impact from virtualization as such, though depending on the versions of 
software involved, the device emulation chosen can make a big difference, eg. 
virtio-scsi vs virtio-blk vs IDE.

If one has Prometheus / Grafana set up to track throughput and iops per volume 
/ attachment / VM, or enables the client-side admin socket, that sort of 
throttling can be very visually apparent.
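
(A couple of hedged ways to check for that on the hypervisor - the domain,
device and socket names below are placeholders:)

# show any per-device QEMU/libvirt I/O limits currently applied to a guest disk
virsh blkdeviotune vm01 vda
# with an RBD client admin socket enabled, per-image performance counters can be dumped directly
ceph --admin-daemon /var/run/ceph/ceph-client.libvirt.1234.asok perf dump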


> On Oct 5, 2021, at 8:35 PM, Zakhar Kirpichenko  wrote:
> 
> Hi!
> 
> The clients are KVM VMs, there's QEMU/libvirt impact for sure. I will test
> with a baremetal client and see whether it performs much better.
> 
> /Z
> 
> 
> On Wed, 6 Oct 2021, 01:29 Anthony D'Atri,  wrote:
> 
>> The lead PG handling ops isn’t a factor, with RBD your volumes touch
>> dozens / hundreds of PGs.   But QD=1 and small block sizes are going to
>> limit your throughput.
>> 
>> What are your clients?  Are they bare metal?  Are they VMs?  If they’re
>> VMs, do you have QEMU/libvirt throttling in play?  I see that a lot.
>> 
>>> On Oct 5, 2021, at 2:06 PM, Zakhar Kirpichenko  wrote:
>>> 
>>> I'm not sure, fio might be showing some bogus values in the summary, I'll
>>> check the readings again tomorrow.
>>> 
>>> Another thing I noticed is that writes seem bandwidth-limited and don't
>>> scale well with block size and/or number of threads. I.e. one clients
>>> writes at about the same speed regardless of the benchmark settings. A
>>> person on reddit, where I asked this question as well, suggested that in
>> a
>>> replicated pool writes and reads are handled by the primary PG, which
>> would
>>> explain this write bandwidth limit.
>>> 
>>> /Z
>>> 
>>> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <
>> christian.wuer...@gmail.com>
>>> wrote:
>>> 
 Maybe some info is missing but 7k write IOPs at 4k block size seem
>> fairly
 decent (as you also state) - the bandwidth automatically follows from
>> that
 so not sure what you're expecting?
 I am a bit puzzled though - by my math 7k IOPS at 4k should only be
 27MiB/sec - not sure how the 120MiB/sec was achieved
 The read benchmark seems in line with 13k IOPS at 4k making around
 52MiB/sec bandwidth which again is expected.
 
 
 On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko 
>> wrote:
 
> Hi,
> 
> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
> and
> its performance is kind of disappointing. I would very much appreciate
>> an
> advice and/or pointers :-)
> 
> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> 
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> 
> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> apparmor is disabled, energy-saving features are disabled. The network
> between the CEPH nodes is 40G, CEPH access network is 40G, the average
> latencies are < 0.15 ms. I've personally tested the network for
> throughput,
> latency and loss, and can tell that it's operating as expected and
>> doesn't
> exhibit any issues at idle or under load.
> 
> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> smaller NVME drives in each node used as DB/WAL and each HDD allocated
>> .
> ceph osd tree output:
> 
> ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
>> PRI-AFF
> -1 288.37488  root default
> -13 288.37488  datacenter ste
> -14 288.37488  rack rack01
> -7  96.12495  host ceph01
> 0hdd9.38680  osd.0up   1.0
>> 1.0
> 1hdd9.38680  osd.1up   1.0
>> 1.0
> 2hdd9.38680  osd.2up   1.0
>> 1.0
> 3hdd9.38680  osd.3up   1.0
>> 1.0
> 4hdd9.38680  osd.4up   1.0
>> 1.0
> 5hdd9.38680  osd.5up   1.0
>> 1.0
> 6hdd9.38680  osd.6up   1.0
>> 1.0
> 7hdd9.38680  osd.7up   1.0
>> 1.0
> 8hdd9.38680  osd.8up   1.0
>> 1.0
> 9   nvme5.82190  osd.9up   1.0
>> 1.0
> 10   nvme5.82190  osd.10   up   1.0
>> 1.0
> 

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
Hi!

The clients are KVM VMs, there's QEMU/libvirt impact for sure. I will test
with a baremetal client and see whether it performs much better.

/Z


On Wed, 6 Oct 2021, 01:29 Anthony D'Atri,  wrote:

> The lead PG handling ops isn’t a factor, with RBD your volumes touch
> dozens / hundreds of PGs.   But QD=1 and small block sizes are going to
> limit your throughput.
>
> What are your clients?  Are they bare metal?  Are they VMs?  If they’re
> VMs, do you have QEMU/libvirt throttling in play?  I see that a lot.
>
> > On Oct 5, 2021, at 2:06 PM, Zakhar Kirpichenko  wrote:
> >
> > I'm not sure, fio might be showing some bogus values in the summary, I'll
> > check the readings again tomorrow.
> >
> > Another thing I noticed is that writes seem bandwidth-limited and don't
> > scale well with block size and/or number of threads. I.e. one clients
> > writes at about the same speed regardless of the benchmark settings. A
> > person on reddit, where I asked this question as well, suggested that in
> a
> > replicated pool writes and reads are handled by the primary PG, which
> would
> > explain this write bandwidth limit.
> >
> > /Z
> >
> > On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <
> christian.wuer...@gmail.com>
> > wrote:
> >
> >> Maybe some info is missing but 7k write IOPs at 4k block size seem
> fairly
> >> decent (as you also state) - the bandwidth automatically follows from
> that
> >> so not sure what you're expecting?
> >> I am a bit puzzled though - by my math 7k IOPS at 4k should only be
> >> 27MiB/sec - not sure how the 120MiB/sec was achieved
> >> The read benchmark seems in line with 13k IOPS at 4k making around
> >> 52MiB/sec bandwidth which again is expected.
> >>
> >>
> >> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko 
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
> >>> and
> >>> its performance is kind of disappointing. I would very much appreciate
> an
> >>> advice and/or pointers :-)
> >>>
> >>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >>>
> >>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> >>> 384 GB RAM
> >>> 2 x boot drives
> >>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> >>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> >>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> >>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >>>
> >>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> >>> apparmor is disabled, energy-saving features are disabled. The network
> >>> between the CEPH nodes is 40G, CEPH access network is 40G, the average
> >>> latencies are < 0.15 ms. I've personally tested the network for
> >>> throughput,
> >>> latency and loss, and can tell that it's operating as expected and
> doesn't
> >>> exhibit any issues at idle or under load.
> >>>
> >>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> >>> smaller NVME drives in each node used as DB/WAL and each HDD allocated
> .
> >>> ceph osd tree output:
> >>>
> >>> ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
> PRI-AFF
> >>> -1 288.37488  root default
> >>> -13 288.37488  datacenter ste
> >>> -14 288.37488  rack rack01
> >>> -7  96.12495  host ceph01
> >>>  0hdd9.38680  osd.0up   1.0
> 1.0
> >>>  1hdd9.38680  osd.1up   1.0
> 1.0
> >>>  2hdd9.38680  osd.2up   1.0
> 1.0
> >>>  3hdd9.38680  osd.3up   1.0
> 1.0
> >>>  4hdd9.38680  osd.4up   1.0
> 1.0
> >>>  5hdd9.38680  osd.5up   1.0
> 1.0
> >>>  6hdd9.38680  osd.6up   1.0
> 1.0
> >>>  7hdd9.38680  osd.7up   1.0
> 1.0
> >>>  8hdd9.38680  osd.8up   1.0
> 1.0
> >>>  9   nvme5.82190  osd.9up   1.0
> 1.0
> >>> 10   nvme5.82190  osd.10   up   1.0
> 1.0
> >>> -10  96.12495  host ceph02
> >>> 11hdd9.38680  osd.11   up   1.0
> 1.0
> >>> 12hdd9.38680  osd.12   up   1.0
> 1.0
> >>> 13hdd9.38680  osd.13   up   1.0
> 1.0
> >>> 14hdd9.38680  osd.14   up   1.0
> 1.0
> >>> 15hdd9.38680  osd.15   up   1.0
> 1.0
> >>> 16hdd9.38680  osd.16   up   1.0
> 1.0
> >>> 17hdd9.38680  osd.17   up   1.0
> 1.0
> >>> 18hdd9.38680  osd.18   up   1.0
> 1.0
> >>> 19hdd9.38680  osd.19   up   1.0
> 

[ceph-users] Re: ceph-iscsi issue after upgrading from nautilus to octopus

2021-10-05 Thread icy chan
Hi,

This issue also happens on another platform that runs with Pacific
(v16.2.6) on Ubuntu 20.04.
# ceph health detail
HEALTH_WARN 20 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 20 stray daemon(s) not managed by cephadm
stray daemon tcmu-runner.sds-ctt-gw1:cached-hdd/iscsi_pool_01 on host
sds-ctt-gw1 not managed by cephadm
stray daemon tcmu-runner.sds-ctt-gw1:cached-hdd/iscsi_pool_02 on host
sds-ctt-gw1 not managed by cephadm
...
...

This warning can be removed by setting
mgr/cephadm/warn_on_stray_daemons to false, but I don't really think I
should.
A related bug report: https://tracker.ceph.com/issues/5

Regs,
Icy


On Fri, 16 Apr 2021 at 10:44, icy chan  wrote:

> Hi,
>
> I had several clusters running as nautilus and pending upgrading to
> octopus.
>
> I am now testing the upgrade steps for ceph cluster from nautilus
> to octopus using cephadm adopt in lab referred to below link:
> - https://docs.ceph.com/en/octopus/cephadm/adoption/
>
> Lab environment:
> 3 all-in-one nodes.
> OS: CentOS 7.9.2009 with podman 1.6.4.
>
> After the adoption, ceph health keeps warning about tcmu-runner not
> managed by cephadm.
> # ceph health detail
> HEALTH_WARN 12 stray daemon(s) not managed by cephadm; 1 pool(s) have no
> replicas configured
> [WRN] CEPHADM_STRAY_DAEMON: 12 stray daemon(s) not managed by cephadm
> stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_01 on host
> ceph-aio1 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_02 on host
> ceph-aio1 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_03 on host
> ceph-aio1 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_test on host
> ceph-aio1 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_01 on host
> ceph-aio2 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_02 on host
> ceph-aio2 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_03 on host
> ceph-aio2 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_test on host
> ceph-aio2 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_01 on host
> ceph-aio3 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_02 on host
> ceph-aio3 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_03 on host
> ceph-aio3 not managed by cephadm
> stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_test on host
> ceph-aio3 not managed by cephadm
>
> And tcmu-runner is still running with the old version.
> # ceph versions
> {
> "mon": {
> "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02)
> octopus (stable)": 3
> },
> "mgr": {
> "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02)
> octopus (stable)": 1
> },
> "osd": {
> "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02)
> octopus (stable)": 9
> },
> "mds": {},
> "tcmu-runner": {
> "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)
> nautilus (stable)": 12
> },
> "overall": {
> "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)
> nautilus (stable)": 12,
> "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02)
> octopus (stable)": 13
> }
> }
>
> I didn't find any ceph-iscsi related upgrade steps from the above
> reference link.
> Can anyone here point me to the right direction of ceph-iscsi version
> upgrade?
>
> Thanks.
>
> Regs,
> Icy
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw breaking because of too many open files

2021-10-05 Thread shubjero
Found the issue.

Upgrading to Octopus did replace /etc/init.d/radosgw which contained
some changes to the distribution detection and setting ulimits.

New radosgw init script:

-snip-
echo "Starting $name..."
if [ $DEBIAN -eq 1 ]; then
start-stop-daemon --start -u $user -x $RADOSGW -p
/var/run/ceph/client-$name.pid -- -n $name
else
ulimit -n 32768
core_limit=`ceph-conf -n $name 'core file limit'`
if [ -z $core_limit ]; then
DAEMON_COREFILE_LIMIT=$core_limit
fi
daemon --user="$user" "$RADOSGW -n $name"
fi
-snip-

Old radosgw init script (or at least one that we may have customized
over the years):
-snip-
echo "Starting $name..."
if [ $DEBIAN -eq 1 ]; then
ulimit -n 32768
start-stop-daemon --start -u $user -x $RADOSGW -p
/var/run/ceph/client-$name.pid -- -n $name
else
ulimit -n 32768
core_limit=`ceph-conf -n $name 'core file limit'`
if [ -z $core_limit ]; then
DAEMON_COREFILE_LIMIT=$core_limit
fi
daemon --user="$user" "$RADOSGW -n $name"
fi
-snip-

Editing this file to put back the first 'ulimit -n 32768', running
'systemctl daemon-reload', and bouncing the radosgw process got us
humming along nicely again.
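
For radosgw instances managed by systemd rather than the sysvinit script, the
analogous fix would presumably be a unit override; the unit and instance names
below are assumptions, adjust them to your deployment:

mkdir -p /etc/systemd/system/ceph-radosgw@.service.d
cat > /etc/systemd/system/ceph-radosgw@.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=131072
EOF
systemctl daemon-reload
systemctl restart ceph-radosgw@rgw.$(hostname -s)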


On Tue, Oct 5, 2021 at 4:55 PM shubjero  wrote:
>
> Just upgraded from Ceph Nautilus to Ceph Octopus on Ubuntu 18.04 using
> standard ubuntu packages from the Ceph repo.
>
> Upgrade has gone OK but we are having issues with our radosgw service,
> eventually failing after some load, here's what we see in the logs:
>
> 2021-10-05T15:55:16.328-0400 7fa47700 -1 NetHandler create_socket
> couldn't create socket (24) Too many open files
> 2021-10-05T15:55:17.896-0400 7fa484b18700 -1 NetHandler create_socket
> couldn't create socket (24) Too many open files
> 2021-10-05T15:55:17.964-0400 7fa484b18700 -1 NetHandler create_socket
> couldn't create socket (24) Too many open files
> 2021-10-05T15:55:18.148-0400 7fa484b18700 -1 NetHandler create_socket
> couldn't create socket (24) Too many open files
>
> In Ceph Nautilus we used to set in ceph.conf the following which I
> think helped is avoid the situation:
>
> [global]
>   max open files = 131072
>
> This config option seems to be no longer recognized by ceph.
>
>
> Any help would be appreciated.
>
> Jared Baker
> Ontario Institute for Cancer Research
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS replay questions

2021-10-05 Thread Brian Kim
Dear ceph-users,

We have a ceph cluster with 3 MDSs and recently had to replay our cache,
which is taking an extremely long time to complete. Is there some way to
speed up this process, as well as to apply some checkpoint, so it doesn't have
to start all the way from the beginning?
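
(A rough, hedged way to watch the replay from the active MDS host - the daemon
name "myfs-a" below is only a placeholder:)

ceph fs status                            # shows which rank is stuck in up:replay
ceph daemon mds.myfs-a status             # queried via the MDS admin socket on its host
ceph daemon mds.myfs-a perf dump mds_log  # journal position counters give a hint of replay progress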

-- 
Best Wishes,
Brian Kim
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
Hi!

I can post the crush map tomorrow morning, but it definitely isn't
targeting the NVME drives.

I'm having a performance issue specifically with the HDD-backed pool, where
each OSD is an NVME-backed WAL/DB + HDD-backed storage.

/Z
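
(A hedged way to double-check which rule and device class a pool actually
targets, and which devices back an OSD's DB/WAL - the pool, rule and OSD id
below are only examples:)

ceph osd pool get volumes crush_rule
ceph osd crush rule dump replicated_hdd        # the "take" step shows the root / device class used
ceph osd metadata 0 | grep -E "bluefs|devices" # lists the physical devices behind data and DB/WAL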

On Tue, 5 Oct 2021, 22:43 Tor Martin Ølberg,  wrote:

> Hi Zakhar,
>
> Out of curiosity, what does your crushmap look like? Probably a long shot
> but are you sure your crush map is targeting the NVME's for the rados bench
> you are performing?
>
> Tor Martin Ølberg
>
> On Tue, Oct 5, 2021 at 9:31 PM Christian Wuerdig <
> christian.wuer...@gmail.com> wrote:
>
>> Maybe some info is missing but 7k write IOPs at 4k block size seem fairly
>> decent (as you also state) - the bandwidth automatically follows from that
>> so not sure what you're expecting?
>> I am a bit puzzled though - by my math 7k IOPS at 4k should only be
>> 27MiB/sec - not sure how the 120MiB/sec was achieved
>> The read benchmark seems in line with 13k IOPS at 4k making around
>> 52MiB/sec bandwidth which again is expected.
>>
>>
>> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko  wrote:
>>
>> > Hi,
>> >
>> > I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
>> and
>> > its performance is kind of disappointing. I would very much appreciate
>> an
>> > advice and/or pointers :-)
>> >
>> > The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>> >
>> > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>> > 384 GB RAM
>> > 2 x boot drives
>> > 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>> > 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>> > 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
>> > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>> >
>> > All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
>> > apparmor is disabled, energy-saving features are disabled. The network
>> > between the CEPH nodes is 40G, CEPH access network is 40G, the average
>> > latencies are < 0.15 ms. I've personally tested the network for
>> throughput,
>> > latency and loss, and can tell that it's operating as expected and
>> doesn't
>> > exhibit any issues at idle or under load.
>> >
>> > The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
>> > smaller NVME drives in each node used as DB/WAL and each HDD allocated .
>> > ceph osd tree output:
>> >
>> > ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
>> PRI-AFF
>> >  -1 288.37488  root default
>> > -13 288.37488  datacenter ste
>> > -14 288.37488  rack rack01
>> >  -7  96.12495  host ceph01
>> >   0hdd9.38680  osd.0up   1.0
>> 1.0
>> >   1hdd9.38680  osd.1up   1.0
>> 1.0
>> >   2hdd9.38680  osd.2up   1.0
>> 1.0
>> >   3hdd9.38680  osd.3up   1.0
>> 1.0
>> >   4hdd9.38680  osd.4up   1.0
>> 1.0
>> >   5hdd9.38680  osd.5up   1.0
>> 1.0
>> >   6hdd9.38680  osd.6up   1.0
>> 1.0
>> >   7hdd9.38680  osd.7up   1.0
>> 1.0
>> >   8hdd9.38680  osd.8up   1.0
>> 1.0
>> >   9   nvme5.82190  osd.9up   1.0
>> 1.0
>> >  10   nvme5.82190  osd.10   up   1.0
>> 1.0
>> > -10  96.12495  host ceph02
>> >  11hdd9.38680  osd.11   up   1.0
>> 1.0
>> >  12hdd9.38680  osd.12   up   1.0
>> 1.0
>> >  13hdd9.38680  osd.13   up   1.0
>> 1.0
>> >  14hdd9.38680  osd.14   up   1.0
>> 1.0
>> >  15hdd9.38680  osd.15   up   1.0
>> 1.0
>> >  16hdd9.38680  osd.16   up   1.0
>> 1.0
>> >  17hdd9.38680  osd.17   up   1.0
>> 1.0
>> >  18hdd9.38680  osd.18   up   1.0
>> 1.0
>> >  19hdd9.38680  osd.19   up   1.0
>> 1.0
>> >  20   nvme5.82190  osd.20   up   1.0
>> 1.0
>> >  21   nvme5.82190  osd.21   up   1.0
>> 1.0
>> >  -3  96.12495  host ceph03
>> >  22hdd9.38680  osd.22   up   1.0
>> 1.0
>> >  23hdd9.38680  osd.23   up   1.0
>> 1.0
>> >  24hdd9.38680  osd.24   up   1.0
>> 1.0
>> >  25hdd9.38680  osd.25   up   1.0
>> 1.0
>> >  26hdd9.38680  osd.26   up   1.0
>> 1.0
>> >  27hdd9.38680  osd.27   up   

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
I'm not sure, fio might be showing some bogus values in the summary, I'll
check the readings again tomorrow.

Another thing I noticed is that writes seem bandwidth-limited and don't
scale well with block size and/or number of threads, i.e. one client
writes at about the same speed regardless of the benchmark settings. A
person on reddit, where I asked this question as well, suggested that in a
replicated pool writes and reads are handled by the primary PG, which would
explain this write bandwidth limit.

/Z
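
(For tomorrow's re-check, a hedged fio invocation against an RBD image directly,
which makes it easy to vary queue depth and job count - the pool, image and
client names are placeholders, and the run writes to that image, so use a
throwaway one:)

fio --name=rbd-4k-qd32 --ioengine=rbd --clientname=admin --pool=volumes --rbdname=bench-img \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting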

On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, 
wrote:

> Maybe some info is missing but 7k write IOPs at 4k block size seem fairly
> decent (as you also state) - the bandwidth automatically follows from that
> so not sure what you're expecting?
> I am a bit puzzled though - by my math 7k IOPS at 4k should only be
> 27MiB/sec - not sure how the 120MiB/sec was achieved
> The read benchmark seems in line with 13k IOPS at 4k making around
> 52MiB/sec bandwidth which again is expected.
>
>
> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko  wrote:
>
>> Hi,
>>
>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
>> and
>> its performance is kind of disappointing. I would very much appreciate an
>> advice and/or pointers :-)
>>
>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>>
>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>> 384 GB RAM
>> 2 x boot drives
>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>>
>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
>> apparmor is disabled, energy-saving features are disabled. The network
>> between the CEPH nodes is 40G, CEPH access network is 40G, the average
>> latencies are < 0.15 ms. I've personally tested the network for
>> throughput,
>> latency and loss, and can tell that it's operating as expected and doesn't
>> exhibit any issues at idle or under load.
>>
>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
>> smaller NVME drives in each node used as DB/WAL and each HDD allocated .
>> ceph osd tree output:
>>
>> ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT  PRI-AFF
>>  -1 288.37488  root default
>> -13 288.37488  datacenter ste
>> -14 288.37488  rack rack01
>>  -7  96.12495  host ceph01
>>   0hdd9.38680  osd.0up   1.0  1.0
>>   1hdd9.38680  osd.1up   1.0  1.0
>>   2hdd9.38680  osd.2up   1.0  1.0
>>   3hdd9.38680  osd.3up   1.0  1.0
>>   4hdd9.38680  osd.4up   1.0  1.0
>>   5hdd9.38680  osd.5up   1.0  1.0
>>   6hdd9.38680  osd.6up   1.0  1.0
>>   7hdd9.38680  osd.7up   1.0  1.0
>>   8hdd9.38680  osd.8up   1.0  1.0
>>   9   nvme5.82190  osd.9up   1.0  1.0
>>  10   nvme5.82190  osd.10   up   1.0  1.0
>> -10  96.12495  host ceph02
>>  11hdd9.38680  osd.11   up   1.0  1.0
>>  12hdd9.38680  osd.12   up   1.0  1.0
>>  13hdd9.38680  osd.13   up   1.0  1.0
>>  14hdd9.38680  osd.14   up   1.0  1.0
>>  15hdd9.38680  osd.15   up   1.0  1.0
>>  16hdd9.38680  osd.16   up   1.0  1.0
>>  17hdd9.38680  osd.17   up   1.0  1.0
>>  18hdd9.38680  osd.18   up   1.0  1.0
>>  19hdd9.38680  osd.19   up   1.0  1.0
>>  20   nvme5.82190  osd.20   up   1.0  1.0
>>  21   nvme5.82190  osd.21   up   1.0  1.0
>>  -3  96.12495  host ceph03
>>  22hdd9.38680  osd.22   up   1.0  1.0
>>  23hdd9.38680  osd.23   up   1.0  1.0
>>  24hdd9.38680  osd.24   up   1.0  1.0
>>  25hdd9.38680  osd.25   up   1.0  1.0
>>  26hdd9.38680  osd.26   up   1.0  1.0
>>  27hdd9.38680  osd.27   up   1.0  1.0
>>  28hdd9.38680  osd.28   up   1.0  1.0
>>  29hdd9.38680  osd.29   up   1.0  1.0
>>  30hdd9.38680  osd.30   up 

[ceph-users] Re: Orchestrator is internally ignoring applying a spec against SSDs, apparently determining they're rotational.

2021-10-05 Thread Chris
Hi!  So I nuked the cluster, zapped all the disks, and redeployed.

Then I applied this osd spec (this time via the dashboard since I was full
of hope):

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0

--dry-run showed exactly what I hoped to see.

Upon application, hosts 1-4 worked just fine.  Host 5... not so much. I see
logical volumes being created, but no OSDs are coming online.  Moreover,
it has taken cephadm days on host 5 to get just a few LVs built.

I nuked all the LV's on that host, then zapped with sgdisk, then dd'd the
drives with /dev/urandom, then rebooted... the problem persists!
cephadm started making vg/lv but no new OSDs.

This wall of text might have a hint... but it's not true!  There's no
partition on these!  They've been wiped with /dev/urandom!

Here's a dump of a relevant part of /var/log/ceph/cephadm.log.  Since
formatting is stripped, I've spaced out the interesting part.  It's a shame
this process is still so unreliable.

2021-10-05 20:43:41,499 INFO Non-zero exit code 1 from /usr/bin/docker run
--rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint
/usr/sbin/ceph-volume --privileged --group-add=disk --init -e
CONTAINER_IMAGE=
quay.io/ceph/ceph@sha256:5755c3a5c197ef186b8186212e023565f15b799f1ed411207f2c3fcd4a80ab45
-e NODE_NAME=ceph05 -e CEPH_USE_RANDOM_NONCE=1 -e
CEPH_VOLUME_OSDSPEC_AFFINITY=dashboard-admin-1633379370439 -v
/var/run/ceph/23e192fe-221d-11ec-a2cb-a16209e26d65:/var/run/ceph:z -v
/var/log/ceph/23e192fe-221d-11ec-a2cb-a16209e26d65:/var/log/ceph:z -v
/var/lib/ceph/23e192fe-221d-11ec-a2cb-a16209e26d65/crash:/var/lib/ceph/crash:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm -v /tmp/ceph-tmpu5c6jw0u:/etc/ceph/ceph.conf:z
-v /tmp/ceph-tmpk1wgba4u:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
quay.io/ceph/ceph@sha256:5755c3a5c197ef186b8186212e023565f15b799f1ed411207f2c3fcd4a80ab45
lvm batch --no-auto /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
/dev/sdg /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo
/dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx
--wal-devices /dev/sdp /dev/sdq --yes --no-systemd
2021-10-05 20:43:41,499 INFO /usr/bin/docker: stderr --> passed data
devices: 21 physical, 0 LVM
2021-10-05 20:43:41,499 INFO /usr/bin/docker: stderr --> relative data
size: 1.0
2021-10-05 20:43:41,499 INFO /usr/bin/docker: stderr --> passed block_wal
devices: 2 physical, 0 LVM
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr Running command:
/usr/bin/ceph-authtool --gen-print-key
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr Running command:
/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
a97fda7a-586f-4ced-86e0-b0a18e081ec7
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr Running command:
/usr/sbin/vgcreate --force --yes ceph-19158c90-90e6-4a37-98e2-7e0e45cd5e27
/dev/sdn
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr  stdout: Physical
volume "/dev/sdn" successfully created.
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr  stdout: Volume group
"ceph-19158c90-90e6-4a37-98e2-7e0e45cd5e27" successfully created
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr Running command:
/usr/sbin/lvcreate --yes -l 238467 -n
osd-block-a97fda7a-586f-4ced-86e0-b0a18e081ec7
ceph-19158c90-90e6-4a37-98e2-7e0e45cd5e27
2021-10-05 20:43:41,500 INFO /usr/bin/docker: stderr  stdout: Logical
volume "osd-block-a97fda7a-586f-4ced-86e0-b0a18e081ec7" created.
2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr Running command:
/usr/sbin/vgcreate --force --yes ceph-84b7458f-4888-41a7-a6d6-031d85bfc9e4
/dev/sdp

2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr  *stderr: Cannot use
/dev/sdp: device is partitioned*

2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr   Command requires all
devices to be found.
2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr --> Was unable to
complete a new OSD, will rollback changes
2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr Running command:
/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.84
--yes-i-really-mean-it
2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr  stderr: purged osd.84
2021-10-05 20:43:41,501 INFO /usr/bin/docker: stderr -->  RuntimeError:
command returned non-zero exit status: 5
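
(If anyone else hits the "device is partitioned" rollback: a hedged cleanup
sequence that clears leftover signatures more reliably than dd alone - the host
and device names below are taken from this log and will differ elsewhere:)

wipefs -a /dev/sdp           # remove remaining partition-table / LVM / filesystem signatures
sgdisk --zap-all /dev/sdp    # wipe GPT and protective MBR structures
partprobe /dev/sdp           # have the kernel re-read the now-empty partition table
# or let the orchestrator do the whole zap:
ceph orch device zap ceph05 /dev/sdp --force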
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw breaking because of too many open files

2021-10-05 Thread Marc
> In Ceph Nautilus we used to set in ceph.conf the following which I
> think helped is avoid the situation:
> 
> [global]
>   max open files = 131072
> 
> This config option seems to be no longer recognized by ceph.
> 

ceph config set ??? (I would not know, I am still Nautilus)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw breaking because of too many open files

2021-10-05 Thread shubjero
Just upgraded from Ceph Nautilus to Ceph Octopus on Ubuntu 18.04 using
standard ubuntu packages from the Ceph repo.

Upgrade has gone OK but we are having issues with our radosgw service,
eventually failing after some load, here's what we see in the logs:

2021-10-05T15:55:16.328-0400 7fa47700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2021-10-05T15:55:17.896-0400 7fa484b18700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2021-10-05T15:55:17.964-0400 7fa484b18700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2021-10-05T15:55:18.148-0400 7fa484b18700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files

In Ceph Nautilus we used to set the following in ceph.conf, which I
think helped us avoid the situation:

[global]
  max open files = 131072

This config option seems to be no longer recognized by ceph.


Any help would be appreciated.

Jared Baker
Ontario Institute for Cancer Research
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
This 'unable to load table properties' right before the caught signal is also interesting:

  -16> 2021-10-05T20:31:28.484+0700 7f310cce5f00 2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 247222 --- 
NotFound:


   -15> 2021-10-05T20:31:28.484+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 251966 --- 
NotFound:

   -14> 2021-10-05T20:31:28.484+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 247508 --- 
NotFound:

   -13> 2021-10-05T20:31:28.484+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 252237 --- 
NotFound:

   -12> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 249610 --- 
NotFound:

   -11> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 251798 --- 
NotFound:

   -10> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 251799 --- 
NotFound:

-9> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 252235 --- 
NotFound:

-8> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 252236 --- 
NotFound:

-7> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 244769 --- 
NotFound:

-6> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 242684 --- 
NotFound:

-5> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 241854 --- 
NotFound:

-4> 2021-10-05T20:31:28.486+0700 7f310cce5f00  2 rocksdb: 
[db/version_set.cc:1362] Unable to load table properties for file 241191 --- 
NotFound:

-3> 2021-10-05T20:31:28.492+0700 7f310cce5f00  4 rocksdb: 
[db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-241072 
succeeded,manifest_file_number is 241072, next_file_number is 252389, 
last_sequence is 5847989279, log_number is 252336,prev_log_number is 
0,max_column_family is 0,min_log_number_to_keep is 0

-2> 2021-10-05T20:31:28.492+0700 7f310cce5f00  4 rocksdb: 
[db/version_set.cc:3766] Column family [default] (ID 0), log number is 252336

-1> 2021-10-05T20:31:28.501+0700 7f310cce5f00  4 rocksdb: 
[db/db_impl.cc:390] Shutdown: canceling all background work
 0> 2021-10-05T20:31:28.512+0700 7f310cce5f00 -1 *** Caught signal 
(Aborted) **
 in thread 7f310cce5f00 thread_name:ceph-osd




Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

On 2021. Oct 5., at 17:19, Szabo, Istvan (Agoda)  wrote:


Hmm, I’ve removed from the cluster, now data rebalance, I’ll do with the next 
one ☹

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 
Cc: ceph-users@ceph.io; Eugen Block 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.
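
(For reference, one hedged way to capture that - osd.48 is the daemon from this
thread, and the restart command assumes a non-containerized systemd deployment:)

ceph config set osd.48 debug_bluefs 20/20
systemctl restart ceph-osd@48        # then collect /var/log/ceph/ceph-osd.48.log from the startup attempt
ceph config rm osd.48 debug_bluefs   # revert once the log has been captured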


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, tried another one which hasn’t been spilledover disk, still coredumped ☹
Is there any special thing that we need to do before we migrate db next to the 
block? Our osds are using dmcrypt, is it an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",

[ceph-users] Re: 1 MDS report slow metadata IOs

2021-10-05 Thread Abdelillah Asraoui
The OSDs are continuously flapping up/down due to the slow MDS metadata IOs.
What is causing the slow MDS metadata IOs?
Currently, there are 2 MDS and 3 monitors deployed.
Would it help to run just one MDS and one monitor?

thanks!

On Tue, Oct 5, 2021 at 1:42 PM Eugen Block  wrote:

> All your PGs are inactive, if two of four OSDs are down and you
> probably have a pool size of 3 then no IO can be served. You’d need at
> least three up ODSs to resolve that.
>
>
> Zitat von Abdelillah Asraoui :
>
> > Ceph is reporting warning on slow metdataIOs on one of the MDS server,
> > this is
> >
> > a new cluster with no upgrade..
> >
> > Anyone has encountered this and is there a workaround ..
> >
> > ceph -s
> >
> >   cluster:
> >
> > id: 801691e6xx-x-xx-xx-xx
> >
> > health: HEALTH_WARN
> >
> > 1 MDSs report slow metadata IOs
> >
> > noscrub,nodeep-scrub flag(s) set
> >
> > 2 osds down
> >
> > 2 hosts (2 osds) down
> >
> > Reduced data availability: 97 pgs inactive, 66 pgs peering,
> 53
> > pgs stale
> >
> > Degraded data redundancy: 31 pgs undersized
> >
> > 2 slow ops, oldest one blocked for 30 sec, osd.0 has slow ops
> >
> >
> >
> >   services:
> >
> > mon: 3 daemons, quorum a,c,f (age 15h)
> >
> > mgr: a(active, since 17h)
> >
> > mds: myfs:1 {0=myfs-a=up:creating} 1 up:standby
> >
> > osd: 4 osds: 2 up (since 36s), 4 in (since 10h)
> >
> >  flags noscrub,nodeep-scrub
> >
> >
> >
> >   data:
> >
> > pools:   4 pools, 97 pgs
> >
> > objects: 0 objects, 0 B
> >
> > usage:   1.0 GiB used, 1.8 TiB / 1.8 TiB avail
> >
> > pgs: 100.000% pgs not active
> >
> >  44 creating+peering
> >
> >  31 stale+undersized+peered
> >
> >  22 stale+creating+peering
> >
> >
> >
> >   progress:
> >
> > Rebalancing after osd.2 marked in (10h)
> >
> >   []
> >
> > Rebalancing after osd.3 marked in (10h)
> >
> >   []
> >
> >
> > Thanks!
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Tor Martin Ølberg
Hi Zakhar,

Out of curiosity, what does your crushmap look like? Probably a long shot
but are you sure your crush map is targeting the NVME's for the rados bench
you are performing?

Tor Martin Ølberg

On Tue, Oct 5, 2021 at 9:31 PM Christian Wuerdig <
christian.wuer...@gmail.com> wrote:

> Maybe some info is missing but 7k write IOPs at 4k block size seem fairly
> decent (as you also state) - the bandwidth automatically follows from that
> so not sure what you're expecting?
> I am a bit puzzled though - by my math 7k IOPS at 4k should only be
> 27MiB/sec - not sure how the 120MiB/sec was achieved
> The read benchmark seems in line with 13k IOPS at 4k making around
> 52MiB/sec bandwidth which again is expected.
>
>
> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko  wrote:
>
> > Hi,
> >
> > I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
> and
> > its performance is kind of disappointing. I would very much appreciate an
> > advice and/or pointers :-)
> >
> > The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >
> > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > 384 GB RAM
> > 2 x boot drives
> > 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> > 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> > 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >
> > All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> > apparmor is disabled, energy-saving features are disabled. The network
> > between the CEPH nodes is 40G, CEPH access network is 40G, the average
> > latencies are < 0.15 ms. I've personally tested the network for
> throughput,
> > latency and loss, and can tell that it's operating as expected and
> doesn't
> > exhibit any issues at idle or under load.
> >
> > The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> > smaller NVME drives in each node used as DB/WAL and each HDD allocated .
> > ceph osd tree output:
> >
> > ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT  PRI-AFF
> >  -1 288.37488  root default
> > -13 288.37488  datacenter ste
> > -14 288.37488  rack rack01
> >  -7  96.12495  host ceph01
> >   0hdd9.38680  osd.0up   1.0  1.0
> >   1hdd9.38680  osd.1up   1.0  1.0
> >   2hdd9.38680  osd.2up   1.0  1.0
> >   3hdd9.38680  osd.3up   1.0  1.0
> >   4hdd9.38680  osd.4up   1.0  1.0
> >   5hdd9.38680  osd.5up   1.0  1.0
> >   6hdd9.38680  osd.6up   1.0  1.0
> >   7hdd9.38680  osd.7up   1.0  1.0
> >   8hdd9.38680  osd.8up   1.0  1.0
> >   9   nvme5.82190  osd.9up   1.0  1.0
> >  10   nvme5.82190  osd.10   up   1.0  1.0
> > -10  96.12495  host ceph02
> >  11hdd9.38680  osd.11   up   1.0  1.0
> >  12hdd9.38680  osd.12   up   1.0  1.0
> >  13hdd9.38680  osd.13   up   1.0  1.0
> >  14hdd9.38680  osd.14   up   1.0  1.0
> >  15hdd9.38680  osd.15   up   1.0  1.0
> >  16hdd9.38680  osd.16   up   1.0  1.0
> >  17hdd9.38680  osd.17   up   1.0  1.0
> >  18hdd9.38680  osd.18   up   1.0  1.0
> >  19hdd9.38680  osd.19   up   1.0  1.0
> >  20   nvme5.82190  osd.20   up   1.0  1.0
> >  21   nvme5.82190  osd.21   up   1.0  1.0
> >  -3  96.12495  host ceph03
> >  22hdd9.38680  osd.22   up   1.0  1.0
> >  23hdd9.38680  osd.23   up   1.0  1.0
> >  24hdd9.38680  osd.24   up   1.0  1.0
> >  25hdd9.38680  osd.25   up   1.0  1.0
> >  26hdd9.38680  osd.26   up   1.0  1.0
> >  27hdd9.38680  osd.27   up   1.0  1.0
> >  28hdd9.38680  osd.28   up   1.0  1.0
> >  29hdd9.38680  osd.29   up   1.0  1.0
> >  30hdd9.38680  osd.30   up   1.0  1.0
> >  31   nvme5.82190  osd.31   up   1.0  1.0
> >  32   nvme5.82190  osd.32   up   1.0  1.0
> >
> > ceph df:
> >
> > --- RAW 

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Christian Wuerdig
Maybe some info is missing, but 7k write IOPS at a 4k block size seems fairly
decent (as you also state) - the bandwidth automatically follows from that,
so I'm not sure what you're expecting.
I am a bit puzzled though - by my math, 7k IOPS at 4k should only be about
27 MiB/sec, so I'm not sure how the 120 MiB/sec was achieved.
The read benchmark seems in line: 13k IOPS at 4k makes around
52 MiB/sec of bandwidth, which again is expected.
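Roughly, the arithmetic behind those numbers:

 7,000 IOPS x 4 KiB = 28,000 KiB/s, roughly 27 MiB/s
13,000 IOPS x 4 KiB = 52,000 KiB/s, roughly 51-52 MiB/s
120 MiB/s at a 4 KiB block size would instead imply roughly 30,700 IOPS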


On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko  wrote:

> Hi,
>
> I built a CEPH 16.2.x cluster with relatively fast and modern hardware, and
> its performance is kind of disappointing. I would very much appreciate an
> advice and/or pointers :-)
>
> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>
> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> apparmor is disabled, energy-saving features are disabled. The network
> between the CEPH nodes is 40G, CEPH access network is 40G, the average
> latencies are < 0.15 ms. I've personally tested the network for throughput,
> latency and loss, and can tell that it's operating as expected and doesn't
> exhibit any issues at idle or under load.
>
> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> smaller NVME drives in each node used as DB/WAL and each HDD allocated .
> ceph osd tree output:
>
> ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT  PRI-AFF
>  -1         288.37488  root default
> -13         288.37488      datacenter ste
> -14         288.37488          rack rack01
>  -7          96.12495              host ceph01
>   0    hdd    9.38680                  osd.0      up   1.0       1.0
>   1    hdd    9.38680                  osd.1      up   1.0       1.0
>   2    hdd    9.38680                  osd.2      up   1.0       1.0
>   3    hdd    9.38680                  osd.3      up   1.0       1.0
>   4    hdd    9.38680                  osd.4      up   1.0       1.0
>   5    hdd    9.38680                  osd.5      up   1.0       1.0
>   6    hdd    9.38680                  osd.6      up   1.0       1.0
>   7    hdd    9.38680                  osd.7      up   1.0       1.0
>   8    hdd    9.38680                  osd.8      up   1.0       1.0
>   9    nvme   5.82190                  osd.9      up   1.0       1.0
>  10    nvme   5.82190                  osd.10     up   1.0       1.0
> -10          96.12495              host ceph02
>  11    hdd    9.38680                  osd.11     up   1.0       1.0
>  12    hdd    9.38680                  osd.12     up   1.0       1.0
>  13    hdd    9.38680                  osd.13     up   1.0       1.0
>  14    hdd    9.38680                  osd.14     up   1.0       1.0
>  15    hdd    9.38680                  osd.15     up   1.0       1.0
>  16    hdd    9.38680                  osd.16     up   1.0       1.0
>  17    hdd    9.38680                  osd.17     up   1.0       1.0
>  18    hdd    9.38680                  osd.18     up   1.0       1.0
>  19    hdd    9.38680                  osd.19     up   1.0       1.0
>  20    nvme   5.82190                  osd.20     up   1.0       1.0
>  21    nvme   5.82190                  osd.21     up   1.0       1.0
>  -3          96.12495              host ceph03
>  22    hdd    9.38680                  osd.22     up   1.0       1.0
>  23    hdd    9.38680                  osd.23     up   1.0       1.0
>  24    hdd    9.38680                  osd.24     up   1.0       1.0
>  25    hdd    9.38680                  osd.25     up   1.0       1.0
>  26    hdd    9.38680                  osd.26     up   1.0       1.0
>  27    hdd    9.38680                  osd.27     up   1.0       1.0
>  28    hdd    9.38680                  osd.28     up   1.0       1.0
>  29    hdd    9.38680                  osd.29     up   1.0       1.0
>  30    hdd    9.38680                  osd.30     up   1.0       1.0
>  31    nvme   5.82190                  osd.31     up   1.0       1.0
>  32    nvme   5.82190                  osd.32     up   1.0       1.0
>
> ceph df:
>
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
> hdd    253 TiB  241 TiB  13 TiB  13 TiB    5.00
> nvme    35 TiB   35 TiB  82 GiB  82 GiB    0.23
> TOTAL  288 TiB  276 TiB  13 TiB  13 TiB    4.42
>
> --- POOLS ---
> POOL     ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> images   12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
> volumes  13  256  839 GiB  232.16k  2.5 

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
Hmm, I’ve removed it from the cluster and the data is now rebalancing; I’ll try
with the next one ☹

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: Igor Fedotov 
Sent: Tuesday, October 5, 2021 10:02 PM
To: Szabo, Istvan (Agoda) ; 胡 玮文 
Cc: ceph-users@ceph.io; Eugen Block 
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:
Hmm, I tried another one whose disk hasn’t spilled over, and it still coredumped ☹
Is there anything special we need to do before we migrate the DB next to the
block device? Our OSDs are using dmcrypt; is that an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
"timestamp": "2021-10-05T13:31:28.513463Z",
"utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: 胡 玮文 
Sent: Monday, October 4, 2021 12:13 AM
To: Szabo, Istvan (Agoda) 
; Igor Fedotov 

Cc: ceph-users@ceph.io
Subject: 回复: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !

The stack trace (tcmalloc::allocate_full_cpp_throw_oom) seems to indicate you
don’t have enough memory.

发件人: Szabo, Istvan (Agoda)
发送时间: 2021年10月4日 0:46
收件人: Igor Fedotov
抄送: ceph-users@ceph.io
主题: [ceph-users] Re: is it possible to remove the db+wal from an external 
device (nvme)

Seems like it cannot start anymore once migrated ☹

https://justpaste.it/5hkot

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
This one is in messages: https://justpaste.it/3x08z

Buffered_io is turned on by default in 15.2.14 octopus FYI.


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block  
Sent: Tuesday, October 5, 2021 9:52 PM
To: Szabo, Istvan (Agoda) 
Cc: 胡 玮文 ; Igor Fedotov ; 
ceph-users@ceph.io
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Do you see oom killers in dmesg on this host? This line indicates it:

  "(tcmalloc::allocate_full_cpp_throw_oom(unsigned
long)+0x146) [0x7f310b7d8c96]",
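
(Quick, hedged checks for that on the OSD host:)

dmesg -T | grep -i -E "out of memory|oom-killer|killed process"
journalctl -k --since "2 days ago" | grep -i oom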


Zitat von "Szabo, Istvan (Agoda)" :

> Hmm, tried another one which hasn’t been spilledover disk, still 
> coredumped ☹ Is there any special thing that we need to do before we 
> migrate db next to the block? Our osds are using dmcrypt, is it an issue?
>
> {
> "backtrace": [
> "(()+0x12b20) [0x7f310aa49b20]",
> "(gsignal()+0x10f) [0x7f31096aa37f]",
> "(abort()+0x127) [0x7f3109694db5]",
> "(()+0x9009b) [0x7f310a06209b]",
> "(()+0x9653c) [0x7f310a06853c]",
> "(()+0x95559) [0x7f310a067559]",
> "(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
> "(()+0x10b03) [0x7f3109a48b03]",
> "(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
> "(__cxa_throw()+0x3b) [0x7f310a0687eb]",
> "(()+0x19fa4) [0x7f310b7b6fa4]",
> "(tcmalloc::allocate_full_cpp_throw_oom(unsigned
> long)+0x146) [0x7f310b7d8c96]",
> "(()+0x10d0f8e) [0x55ffa520df8e]",
> "(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
> "(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
> "(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a)
> [0x55ffa52efcca]",
> "(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88)
> [0x55ffa52f0568]",
> "(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
> "(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
> "(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
> "(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
> "(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
> "(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
> std::__cxx11::basic_string, 
> std::allocator > const&, 
> std::vector std::allocator > const&, 
> std::vector std::allocator >*, rocksdb::DB**,
> bool)+0x1089) [0x55ffa51a57e9]",
> "(RocksDBStore::do_open(std::ostream&, bool, bool, 
> std::vector std::allocator > const*)+0x14ca) 
> [0x55ffa51285ca]",
> "(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
> "(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
> "(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
> "(OSD::init()+0x380) [0x55ffa4753a70]",
> "(main()+0x47f1) [0x55ffa46a6901]",
> "(__libc_start_main()+0xf3) [0x7f3109696493]",
> "(_start()+0x2e) [0x55ffa46d4e3e]"
> ],
> "ceph_version": "15.2.14",
> "crash_id":
> "2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
> "entity_name": "osd.48",
> "os_id": "centos",
> "os_name": "CentOS Linux",
> "os_version": "8",
> "os_version_id": "8",
> "process_name": "ceph-osd",
> "stack_sig":
> "6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
> "timestamp": "2021-10-05T13:31:28.513463Z",
> "utsname_hostname": "server-2s07",
> "utsname_machine": "x86_64",
> "utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
> "utsname_sysname": "Linux",
> "utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
> }
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> From: 胡 玮文 
> Sent: Monday, October 4, 2021 12:13 AM
> To: Szabo, Istvan (Agoda) ; Igor Fedotov 
> 
> Cc: ceph-users@ceph.io
> Subject: 回复: [ceph-users] Re: is it possible to remove the db+wal from 
> an external device (nvme)
>
> Email received from the internet. If in doubt, don't click any link 
> nor open any attachment !
> 
> The stack trace (tcmalloc::allocate_full_cpp_throw_oom) seems 
> indicating you don’t have enough memory.
>
> 发件人: Szabo, Istvan (Agoda)
> 发送时间: 2021年10月4日 0:46
> 收件人: Igor Fedotov
> 抄送: ceph-users@ceph.io
> 主题: [ceph-users] Re: is it possible to remove the db+wal from an 
> external device (nvme)
>
> Seems like it cannot start anymore once migrated ☹
>
> 

[ceph-users] Re: Erasure coded pool chunk count k

2021-10-05 Thread Christian Wuerdig
A couple of notes to this:

Ideally you should have at least 2 more failure domains than your base
resilience (K+M for EC, or size=N for replicated). The reasoning: maintenance
has to happen, so every now and then you take a host down for a few hours or
possibly days to do an upgrade, fix something broken, etc. During that time
you're running in a degraded state, since only K+M-1 shards are available. If
a drive in another host then dies on you, recovery for it is blocked because
you have insufficient failure domains available, and things start getting
uncomfortable depending on how large M is. Or a whole host dies on you in
that state ...
Generally, planning your cluster resources right along the fault lines is
going to bite you and cause high levels of stress and anxiety. I know budgets
have a limit, but there is plenty of history on this list of desperate calls
for help simply because clusters were only planned for the happy-day case.

Unlike replicated pools, you cannot change the profile of an EC pool after
it has been created - so if you decide to change the EC profile, that means
creating a new pool and migrating the data. Just something to keep in mind.
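
As an illustration, switching profiles means something along these lines (the profile name, pool name and PG count here are only examples):

# define a new EC profile
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
# create a new pool that uses it
ceph osd pool create mypool-ec42 128 128 erasure ec42
# then copy the data over (e.g. rbd migration per image) and switch clients to the new pool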

On Tue, 5 Oct 2021 at 14:58, Anthony D'Atri  wrote:

>
> The larger the value of K relative to M, the more efficient the raw ::
> usable ratio ends up.
>
> There are tradeoffs and caveats.  Here are some of my thoughts; if I’m
> off-base here, I welcome enlightenment.
>
>
>
> When possible, it’s ideal to have at least K+M failure domains — often
> racks, sometimes hosts, chassis, etc.  Thus smaller clusters, say with 5-6
> nodes, aren’t good fits for larger sums of K+M if your data is valuable.
>
> Larger sums of K+M also mean that more drives will be touched by each read
> or write, especially during recovery.  This could be a factor if one is
> IOPS-limited.  Same with scrubs.
>
> When using a pool for, e.g., RGW buckets, larger sums of K+M may result in
> greater overhead when storing small objects, since (AIUI) Ceph / RGW only
> writes full stripes.  So say you have an EC pool of 17,3 on drives with the
> default 4kB bluestore_min_alloc_size.  A 1kB S3 object would thus allocate
> 17+3=20 x 4kB == 80kB of storage, which is 7900% overhead.  This is an
> extreme example to illustrate the point.
>
> Larger sums of K+M may present more IOPs to each storage drive, dependent
> on workload and the EC plugin selected.
>
> With larger objects (including RBD) the modulo factor is dramatically
> smaller.  One’s use-case and dataset per-pool may thus inform the EC
> profiles that make sense; workloads that are predominately smaller objects
> might opt for replication instead.
>
> There was a post ….. a year ago? suggesting that values with small prime
> factors are advantageous, but I never saw a discussion of why that might be.
>
> In some cases where one might be pressured to use replication with only 2
> copies of data, a 2,2 EC profile might achieve the same efficiency with
> greater safety.
>
> Geo / stretch clusters or ones in challenging environments are a special
> case; they might choose values of M equal to or even larger than K.
>
> That said, I think 4,2 is a reasonable place to *start*, adjusted by one’s
> specific needs.  You get a raw :: usable ratio of 1.5 without getting too
> complicated.
>
> ymmv
>
>
>
>
>
>
> >
> > Hi,
> >
> > It depends of hardware, failure domain, use case, overhead.
> >
> > I don’t see an easy way to chose k and m values.
> >
> > -
> > Etienne Menguy
> > etienne.men...@croit.io
> >
> >
> >> On 4 Oct 2021, at 16:57, Golasowski Martin 
> wrote:
> >>
> >> Hello guys,
> >> how does one estimate number of chunks for erasure coded pool ( k = ? )
> ? I see that number of m chunks determines the pool’s resiliency, however I
> did not find clear guideline how to determine k.
> >>
> >> Red Hat states that they support only the following combinations:
> >>
> >> k=8, m=3
> >> k=8, m=4
> >> k=4, m=2
> >>
> >> without any rationale behind them.
> >> The table is taken from
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/storage_strategies_guide/erasure_code_pools
> .
> >>
> >> Thanks!
> >>
> >> Regards,
> >> Martin
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 MDS report slow metadata IOs

2021-10-05 Thread Eugen Block
All your PGs are inactive. If two of four OSDs are down and you probably
have a pool size of 3, then no IO can be served. You'd need at least three
OSDs up to resolve that.
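
A quick way to confirm the pool settings and the stuck PGs (pool names are examples):

# check replication settings of the affected pools
ceph osd pool get myfs-metadata size
ceph osd pool get myfs-metadata min_size
# list PGs that are stuck inactive
ceph pg dump_stuck inactive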



Zitat von Abdelillah Asraoui :


Ceph is reporting a warning about slow metadata IOs on one of the MDS servers;
this is a new cluster with no upgrades.

Has anyone encountered this, and is there a workaround?

ceph -s

  cluster:
    id:     801691e6xx-x-xx-xx-xx
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            noscrub,nodeep-scrub flag(s) set
            2 osds down
            2 hosts (2 osds) down
            Reduced data availability: 97 pgs inactive, 66 pgs peering, 53 pgs stale
            Degraded data redundancy: 31 pgs undersized
            2 slow ops, oldest one blocked for 30 sec, osd.0 has slow ops

  services:
    mon: 3 daemons, quorum a,c,f (age 15h)
    mgr: a(active, since 17h)
    mds: myfs:1 {0=myfs-a=up:creating} 1 up:standby
    osd: 4 osds: 2 up (since 36s), 4 in (since 10h)
         flags noscrub,nodeep-scrub

  data:
    pools:   4 pools, 97 pgs
    objects: 0 objects, 0 B
    usage:   1.0 GiB used, 1.8 TiB / 1.8 TiB avail
    pgs:     100.000% pgs not active
             44 creating+peering
             31 stale+undersized+peered
             22 stale+creating+peering

  progress:
    Rebalancing after osd.2 marked in (10h)
      []
    Rebalancing after osd.3 marked in (10h)
      []


Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 1 MDS report slow metadata IOs

2021-10-05 Thread Abdelillah Asraoui
Ceph is reporting a warning about slow metadata IOs on one of the MDS servers;
this is a new cluster with no upgrades.

Has anyone encountered this, and is there a workaround?

ceph -s

  cluster:
    id:     801691e6xx-x-xx-xx-xx
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            noscrub,nodeep-scrub flag(s) set
            2 osds down
            2 hosts (2 osds) down
            Reduced data availability: 97 pgs inactive, 66 pgs peering, 53 pgs stale
            Degraded data redundancy: 31 pgs undersized
            2 slow ops, oldest one blocked for 30 sec, osd.0 has slow ops

  services:
    mon: 3 daemons, quorum a,c,f (age 15h)
    mgr: a(active, since 17h)
    mds: myfs:1 {0=myfs-a=up:creating} 1 up:standby
    osd: 4 osds: 2 up (since 36s), 4 in (since 10h)
         flags noscrub,nodeep-scrub

  data:
    pools:   4 pools, 97 pgs
    objects: 0 objects, 0 B
    usage:   1.0 GiB used, 1.8 TiB / 1.8 TiB avail
    pgs:     100.000% pgs not active
             44 creating+peering
             31 stale+undersized+peered
             22 stale+creating+peering

  progress:
    Rebalancing after osd.2 marked in (10h)
      []
    Rebalancing after osd.3 marked in (10h)
      []


Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: *****SPAM***** Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
Aren't all writes to bluestore turned into sequential writes?

/Z


On Tue, 5 Oct 2021, 20:05 Marc,  wrote:

>
> Hi Zakhar,
>
> > using 16 threads) are not. Literally every storage device in my setup
> > can read and write at least 200+ MB/s sequentially, so I'm trying to
> > find an explanation for this behavior.
>
> All writes in ceph are random afaik
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Szabo, Istvan (Agoda)
Hmm, I tried another one which hasn’t spilled over, and it still coredumped ☹
Is there anything special that we need to do before we migrate the DB next to the
block device? Our OSDs are using dmcrypt - is that an issue?

{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",
"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) 
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",
"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, std::allocator > 
const&, std::vector > const&, 
std::vector >*, rocksdb::DB**, bool)+0x1089) 
[0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vector 
> const*)+0x14ca) [0x55ffa51285ca]",
"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",
"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",
"timestamp": "2021-10-05T13:31:28.513463Z",
"utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: 胡 玮文 
Sent: Monday, October 4, 2021 12:13 AM
To: Szabo, Istvan (Agoda) ; Igor Fedotov 

Cc: ceph-users@ceph.io
Subject: 回复: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)


The stack trace (tcmalloc::allocate_full_cpp_throw_oom) seems indicating you 
don’t have enough memory.

发件人: Szabo, Istvan (Agoda)
发送时间: 2021年10月4日 0:46
收件人: Igor Fedotov
抄送: ceph-users@ceph.io
主题: [ceph-users] Re: is it possible to remove the db+wal from an external 
device (nvme)

Seems like it cannot start anymore once migrated ☹

https://justpaste.it/5hkot

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: 
istvan.sz...@agoda.com>
---

From: Igor Fedotov mailto:ifedo...@suse.de>>
Sent: Saturday, October 2, 2021 5:22 AM
To: Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>
Cc: ceph-users@ceph.io; Eugen Block 
mailto:ebl...@nde.ag>>; Christian Wuerdig 
mailto:christian.wuer...@gmail.com>>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)



Hi Istvan,

yeah both db and wal to slow migration are supported. And spillover state isn't 
a show stopper for that.
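
For reference, this kind of DB/WAL migration back onto the main device is typically driven by something like the following, run while the OSD is stopped (the OSD id, FSID and LV names below are placeholders):

systemctl stop ceph-osd@48
ceph-volume lvm migrate --osd-id 48 --osd-fsid <osd-fsid> --from db wal --target <vg-of-block>/<lv-of-block>
systemctl start ceph-osd@48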


On 10/2/2021 1:16 AM, Szabo, Istvan (Agoda) wrote:
Dear Igor,

Is the ceph-volume lvm migrate command smart enough in octopus 15.2.14 to be 
able to remove the db (included 

[ceph-users] Re: *****SPAM***** Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Marc


Hi Zakhar,

> using 16 threads) are not. Literally every storage device in my setup
> can read and write at least 200+ MB/s sequentially, so I'm trying to
> find an explanation for this behavior.

All writes in ceph are random afaik

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
Hi Marc,

Many thanks for your comment! As I mentioned, rados bench results are more
or less acceptable and explainable. RBD clients writing at ~120 MB/s tops
(regardless of the number of threads or block size, btw) and reading ~50
MB/s in a single thread (I managed to read over 500 MB/s using 16 threads)
are not. Literally every storage device in my setup can read and write at
least 200+ MB/s sequentially, so I'm trying to find an explanation for this
behavior.
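
For what it's worth, one way to benchmark an RBD image directly from a bare-metal client, bypassing QEMU/libvirt, is something along these lines (the pool, image and cephx client names are placeholders):

# built-in RBD benchmark: 4K random writes with 16 threads
rbd bench --io-type write --io-pattern rand --io-size 4K --io-threads 16 volumes/bench-test
# or fio with its rbd engine
fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=volumes --rbdname=bench-test \
    --rw=randwrite --bs=4k --iodepth=16 --runtime=60 --time_based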

Zakhar

On Tue, 5 Oct 2021, 18:44 Marc,  wrote:

> You are aware of this:
> https://yourcmc.ru/wiki/Ceph_performance
>
> I am having these results with ssd and 2.2GHz xeon and no cpu
> state/freq/cpugovernor optimalization, so your results with hdd look quite
> ok to me.
>
>
> [@c01 ~]# rados -p rbd.ssd bench 30 write
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for up to 30 seconds or 0 objects
> Object prefix: benchmark_data_c01_2752661
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
> 0   0 0 0 0 0   -
>  0
> 1  16   162   146   583.839   584   0.0807733
> 0.106959
> 2  16   347   331   661.868   7400.052621
>  0.0943461
> 3  16   525   509   678.552   712   0.0493101
>  0.0934826
> 4  16   676   660   659.897   6040.107205
>  0.0958496
> ...
>
> Total time run: 30.0622
> Total writes made:  4454
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 592.638
> Stddev Bandwidth:   65.0681
> Max bandwidth (MB/sec): 740
> Min bandwidth (MB/sec): 440
> Average IOPS:   148
> Stddev IOPS:16.267
> Max IOPS:   185
> Min IOPS:   110
> Average Latency(s): 0.107988
> Stddev Latency(s):  0.0610883
> Max latency(s): 0.452039
> Min latency(s): 0.0209312
> Cleaning up (deleting benchmark objects)
> Removed 4454 objects
> Clean up completed and total clean up time :0.732456
>
> > Subject: [ceph-users] CEPH 16.2.x: disappointing I/O performance
> >
> > Hi,
> >
> > I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
> > and
> > its performance is kind of disappointing. I would very much appreciate
> > an
> > advice and/or pointers :-)
> >
> > The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >
> > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > 384 GB RAM
> > 2 x boot drives
> > 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> > 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> > 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >
> > All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> > apparmor is disabled, energy-saving features are disabled. The network
> > between the CEPH nodes is 40G, CEPH access network is 40G, the average
> > latencies are < 0.15 ms. I've personally tested the network for
> > throughput,
> > latency and loss, and can tell that it's operating as expected and
> > doesn't
> > exhibit any issues at idle or under load.
> >
> > The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> > smaller NVME drives in each node used as DB/WAL and each HDD allocated .
> > ceph osd tree output:
> >
> > ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT  PRI-
> > AFF
> >  -1 288.37488  root default
> > -13 288.37488  datacenter ste
> > -14 288.37488  rack rack01
> >  -7  96.12495  host ceph01
> >   0hdd9.38680  osd.0up   1.0
> > 1.0
> >   1hdd9.38680  osd.1up   1.0
> > 1.0
> >   2hdd9.38680  osd.2up   1.0
> > 1.0
> >   3hdd9.38680  osd.3up   1.0
> > 1.0
> >   4hdd9.38680  osd.4up   1.0
> > 1.0
> >   5hdd9.38680  osd.5up   1.0
> > 1.0
> >   6hdd9.38680  osd.6up   1.0
> > 1.0
> >   7hdd9.38680  osd.7up   1.0
> > 1.0
> >   8hdd9.38680  osd.8up   1.0
> > 1.0
> >   9   nvme5.82190  osd.9up   1.0
> > 1.0
> >  10   nvme5.82190  osd.10   up   1.0
> > 1.0
> > -10  96.12495  host ceph02
> >  11hdd9.38680  osd.11   up   1.0
> > 1.0
> >  12hdd9.38680  osd.12   up   1.0
> > 1.0
> >  13hdd9.38680  osd.13   up   1.0
> > 1.0
> >  14hdd9.38680  osd.14   up   1.0
> > 1.0
> >  15hdd9.38680  osd.15   up   

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Marc
You are aware of this:
https://yourcmc.ru/wiki/Ceph_performance

I am getting these results with SSDs and 2.2 GHz Xeons, with no CPU
state/frequency/governor optimization, so your results with HDDs look quite OK
to me.


[@c01 ~]# rados -p rbd.ssd bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
for up to 30 seconds or 0 objects
Object prefix: benchmark_data_c01_2752661
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16       162       146   583.839       584    0.0807733    0.106959
    2      16       347       331   661.868       740     0.052621   0.0943461
    3      16       525       509   678.552       712    0.0493101   0.0934826
    4      16       676       660   659.897       604     0.107205   0.0958496
...

Total time run: 30.0622
Total writes made:  4454
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 592.638
Stddev Bandwidth:   65.0681
Max bandwidth (MB/sec): 740
Min bandwidth (MB/sec): 440
Average IOPS:   148
Stddev IOPS:16.267
Max IOPS:   185
Min IOPS:   110
Average Latency(s): 0.107988
Stddev Latency(s):  0.0610883
Max latency(s): 0.452039
Min latency(s): 0.0209312
Cleaning up (deleting benchmark objects)
Removed 4454 objects
Clean up completed and total clean up time :0.732456

> Subject: [ceph-users] CEPH 16.2.x: disappointing I/O performance
> 
> Hi,
> 
> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
> and
> its performance is kind of disappointing. I would very much appreciate
> an
> advice and/or pointers :-)
> 
> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> 
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> 
> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> apparmor is disabled, energy-saving features are disabled. The network
> between the CEPH nodes is 40G, CEPH access network is 40G, the average
> latencies are < 0.15 ms. I've personally tested the network for
> throughput,
> latency and loss, and can tell that it's operating as expected and
> doesn't
> exhibit any issues at idle or under load.
> 
> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
> smaller NVME drives in each node used as DB/WAL and each HDD allocated .
> ceph osd tree output:
> 
> ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT  PRI-
> AFF
>  -1 288.37488  root default
> -13 288.37488  datacenter ste
> -14 288.37488  rack rack01
>  -7  96.12495  host ceph01
>   0hdd9.38680  osd.0up   1.0
> 1.0
>   1hdd9.38680  osd.1up   1.0
> 1.0
>   2hdd9.38680  osd.2up   1.0
> 1.0
>   3hdd9.38680  osd.3up   1.0
> 1.0
>   4hdd9.38680  osd.4up   1.0
> 1.0
>   5hdd9.38680  osd.5up   1.0
> 1.0
>   6hdd9.38680  osd.6up   1.0
> 1.0
>   7hdd9.38680  osd.7up   1.0
> 1.0
>   8hdd9.38680  osd.8up   1.0
> 1.0
>   9   nvme5.82190  osd.9up   1.0
> 1.0
>  10   nvme5.82190  osd.10   up   1.0
> 1.0
> -10  96.12495  host ceph02
>  11hdd9.38680  osd.11   up   1.0
> 1.0
>  12hdd9.38680  osd.12   up   1.0
> 1.0
>  13hdd9.38680  osd.13   up   1.0
> 1.0
>  14hdd9.38680  osd.14   up   1.0
> 1.0
>  15hdd9.38680  osd.15   up   1.0
> 1.0
>  16hdd9.38680  osd.16   up   1.0
> 1.0
>  17hdd9.38680  osd.17   up   1.0
> 1.0
>  18hdd9.38680  osd.18   up   1.0
> 1.0
>  19hdd9.38680  osd.19   up   1.0
> 1.0
>  20   nvme5.82190  osd.20   up   1.0
> 1.0
>  21   nvme5.82190  osd.21   up   1.0
> 1.0
>  -3  96.12495  host ceph03
>  22hdd9.38680  osd.22   up   1.0
> 1.0
>  23hdd9.38680  osd.23   up   1.0
> 1.0
>  24hdd9.38680  

[ceph-users] CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Zakhar Kirpichenko
Hi,

I built a CEPH 16.2.x cluster with relatively fast and modern hardware, and
its performance is kind of disappointing. I would very much appreciate any
advice and/or pointers :-)

The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:

2 x Intel(R) Xeon(R) Gold 5220R CPUs
384 GB RAM
2 x boot drives
2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
2 x Intel XL710 NICs connected to a pair of 40/100GE switches

All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
apparmor is disabled, energy-saving features are disabled. The network
between the CEPH nodes is 40G, CEPH access network is 40G, the average
latencies are < 0.15 ms. I've personally tested the network for throughput,
latency and loss, and can tell that it's operating as expected and doesn't
exhibit any issues at idle or under load.

The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2
smaller NVME drives in each node used as DB/WAL and each HDD allocated .
ceph osd tree output:

ID   CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
 -1         288.37488  root default
-13         288.37488      datacenter ste
-14         288.37488          rack rack01
 -7          96.12495              host ceph01
  0    hdd    9.38680                  osd.0         up       1.0      1.0
  1    hdd    9.38680                  osd.1         up       1.0      1.0
  2    hdd    9.38680                  osd.2         up       1.0      1.0
  3    hdd    9.38680                  osd.3         up       1.0      1.0
  4    hdd    9.38680                  osd.4         up       1.0      1.0
  5    hdd    9.38680                  osd.5         up       1.0      1.0
  6    hdd    9.38680                  osd.6         up       1.0      1.0
  7    hdd    9.38680                  osd.7         up       1.0      1.0
  8    hdd    9.38680                  osd.8         up       1.0      1.0
  9   nvme    5.82190                  osd.9         up       1.0      1.0
 10   nvme    5.82190                  osd.10        up       1.0      1.0
-10          96.12495              host ceph02
 11    hdd    9.38680                  osd.11        up       1.0      1.0
 12    hdd    9.38680                  osd.12        up       1.0      1.0
 13    hdd    9.38680                  osd.13        up       1.0      1.0
 14    hdd    9.38680                  osd.14        up       1.0      1.0
 15    hdd    9.38680                  osd.15        up       1.0      1.0
 16    hdd    9.38680                  osd.16        up       1.0      1.0
 17    hdd    9.38680                  osd.17        up       1.0      1.0
 18    hdd    9.38680                  osd.18        up       1.0      1.0
 19    hdd    9.38680                  osd.19        up       1.0      1.0
 20   nvme    5.82190                  osd.20        up       1.0      1.0
 21   nvme    5.82190                  osd.21        up       1.0      1.0
 -3          96.12495              host ceph03
 22    hdd    9.38680                  osd.22        up       1.0      1.0
 23    hdd    9.38680                  osd.23        up       1.0      1.0
 24    hdd    9.38680                  osd.24        up       1.0      1.0
 25    hdd    9.38680                  osd.25        up       1.0      1.0
 26    hdd    9.38680                  osd.26        up       1.0      1.0
 27    hdd    9.38680                  osd.27        up       1.0      1.0
 28    hdd    9.38680                  osd.28        up       1.0      1.0
 29    hdd    9.38680                  osd.29        up       1.0      1.0
 30    hdd    9.38680                  osd.30        up       1.0      1.0
 31   nvme    5.82190                  osd.31        up       1.0      1.0
 32   nvme    5.82190                  osd.32        up       1.0      1.0

ceph df:

--- RAW STORAGE ---
CLASS    SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd      253 TiB  241 TiB  13 TiB  13 TiB    5.00
nvme     35 TiB   35 TiB   82 GiB  82 GiB    0.23
TOTAL    288 TiB  276 TiB  13 TiB  13 TiB    4.42

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
images                 12  256  24 GiB   3.15k    73 GiB   0.03   76 TiB
volumes                13  256  839 GiB  232.16k  2.5 TiB  1.07   76 TiB
backups                14  256  31 GiB   8.56k    94 GiB   0.04   76 TiB
vms                    15  256  752 GiB  198.80k  2.2 TiB  0.96   76 TiB
device_health_metrics  16  32   35 MiB   39       106 MiB  0      76 TiB
volumes-nvme           17  256  28 GiB   7.21k    81 GiB   0.24   11 TiB
ec-volumes-meta        18  256  27 KiB   4        92 KiB   0      76 TiB
ec-volumes-data        19  256  8 KiB    1        12 KiB   0      152 TiB

Please disregard the ec-pools, as they're not currently in use. All other
pools are configured with min_size=2, size=3. All pools are bound to HDD

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Igor Fedotov

Not sure dmcrypt is a culprit here.

Could you please set debug-bluefs to 20 and collect an OSD startup log.
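
A hedged example of how that is usually done on a non-containerized host (the OSD id is the one from the crash report, adjust as needed):

# raise the bluefs debug level for the failing OSD, then capture a startup attempt
ceph config set osd.48 debug_bluefs 20/20
systemctl restart ceph-osd@48
# the startup log typically ends up in /var/log/ceph/ceph-osd.48.log (or journalctl -u ceph-osd@48)
# lower the level again once the log is collected
ceph config set osd.48 debug_bluefs 1/5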


On 10/5/2021 4:43 PM, Szabo, Istvan (Agoda) wrote:


Hmm, tried another one which hasn’t been spilledover disk, still 
coredumped ☹


Is there any special thing that we need to do before we migrate db 
next to the block? Our osds are using dmcrypt, is it an issue?


{

"backtrace": [

"(()+0x12b20) [0x7f310aa49b20]",

"(gsignal()+0x10f) [0x7f31096aa37f]",

"(abort()+0x127) [0x7f3109694db5]",

"(()+0x9009b) [0x7f310a06209b]",

"(()+0x9653c) [0x7f310a06853c]",

"(()+0x95559) [0x7f310a067559]",

"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",

"(()+0x10b03) [0x7f3109a48b03]",

"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",

"(__cxa_throw()+0x3b) [0x7f310a0687eb]",

"(()+0x19fa4) [0x7f310b7b6fa4]",

"(tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0x146) 
[0x7f310b7d8c96]",


"(()+0x10d0f8e) [0x55ffa520df8e]",

  "(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",

"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",

"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a) [0x55ffa52efcca]",

"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88) [0x55ffa52f0568]",

"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",

"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",

"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",

"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",

"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",

"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&, 
std::__cxx11::basic_string, 
std::allocator > const&, 
std::vectorstd::allocator > const&, 
std::vectorstd::allocator >*, rocksdb::DB**, 
bool)+0x1089) [0x55ffa51a57e9]",


"(RocksDBStore::do_open(std::ostream&, bool, bool, 
std::vectorstd::allocator > const*)+0x14ca) 
[0x55ffa51285ca]",


"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",

"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",

"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",

"(OSD::init()+0x380) [0x55ffa4753a70]",

"(main()+0x47f1) [0x55ffa46a6901]",

"(__libc_start_main()+0xf3) [0x7f3109696493]",

"(_start()+0x2e) [0x55ffa46d4e3e]"

],

"ceph_version": "15.2.14",

"crash_id": 
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",


"entity_name": "osd.48",

"os_id": "centos",

"os_name": "CentOS Linux",

"os_version": "8",

"os_version_id": "8",

"process_name": "ceph-osd",

"stack_sig": 
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",


"timestamp": "2021-10-05T13:31:28.513463Z",

"utsname_hostname": "server-2s07",

"utsname_machine": "x86_64",

"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",

"utsname_sysname": "Linux",

"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"

}

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com 
---

*From:*胡玮文
*Sent:* Monday, October 4, 2021 12:13 AM
*To:* Szabo, Istvan (Agoda) ; Igor Fedotov 


*Cc:* ceph-users@ceph.io
*Subject:* 回复: [ceph-users] Re: is it possible to remove the db+wal 
from an external device (nvme)






The stack trace (tcmalloc::allocate_full_cpp_throw_oom) seems 
indicating you don’t have enough memory.


*发件人**: *Szabo, Istvan (Agoda) 
*发送时间: *2021年10月4日 0:46
*收件人: *Igor Fedotov 
*抄送: *ceph-users@ceph.io 
*主题: *[ceph-users] Re: is it possible to remove the db+wal from an 
external device (nvme)


Seems like it cannot start anymore once migrated ☹

https://justpaste.it/5hkot 

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com>

---

From: Igor Fedotov mailto:ifedo...@suse.de>>
Sent: Saturday, October 2, 2021 5:22 AM
To: Szabo, Istvan (Agoda) >
Cc: ceph-users@ceph.io ; Eugen Block 
mailto:ebl...@nde.ag>>; Christian Wuerdig 
mailto:christian.wuer...@gmail.com>>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal from 
an external device (nvme)





Hi Istvan,

yeah both db and wal to slow migration are supported. And spillover 
state isn't a show stopper for that.



On 10/2/2021 1:16 AM, Szabo, Istvan (Agoda) wrote:
Dear Igor,

Is the ceph-volume lvm migrate command smart 

[ceph-users] Re: is it possible to remove the db+wal from an external device (nvme)

2021-10-05 Thread Eugen Block

Do you see oom killers in dmesg on this host? This line indicates it:

 "(tcmalloc::allocate_full_cpp_throw_oom(unsigned  
long)+0x146) [0x7f310b7d8c96]",



Zitat von "Szabo, Istvan (Agoda)" :


Hmm, tried another one which hasn’t been spilledover disk, still coredumped ☹
Is there any special thing that we need to do before we migrate db  
next to the block? Our osds are using dmcrypt, is it an issue?


{
"backtrace": [
"(()+0x12b20) [0x7f310aa49b20]",
"(gsignal()+0x10f) [0x7f31096aa37f]",
"(abort()+0x127) [0x7f3109694db5]",
"(()+0x9009b) [0x7f310a06209b]",
"(()+0x9653c) [0x7f310a06853c]",
"(()+0x95559) [0x7f310a067559]",
"(__gxx_personality_v0()+0x2a8) [0x7f310a067ed8]",
"(()+0x10b03) [0x7f3109a48b03]",
"(_Unwind_RaiseException()+0x2b1) [0x7f3109a49071]",
"(__cxa_throw()+0x3b) [0x7f310a0687eb]",
"(()+0x19fa4) [0x7f310b7b6fa4]",
"(tcmalloc::allocate_full_cpp_throw_oom(unsigned  
long)+0x146) [0x7f310b7d8c96]",

"(()+0x10d0f8e) [0x55ffa520df8e]",
"(rocksdb::Version::~Version()+0x104) [0x55ffa521d174]",
"(rocksdb::Version::Unref()+0x21) [0x55ffa521d221]",
"(rocksdb::ColumnFamilyData::~ColumnFamilyData()+0x5a)  
[0x55ffa52efcca]",
"(rocksdb::ColumnFamilySet::~ColumnFamilySet()+0x88)  
[0x55ffa52f0568]",

"(rocksdb::VersionSet::~VersionSet()+0x5e) [0x55ffa520e01e]",
"(rocksdb::VersionSet::~VersionSet()+0x11) [0x55ffa520e261]",
"(rocksdb::DBImpl::CloseHelper()+0x616) [0x55ffa5155ed6]",
"(rocksdb::DBImpl::~DBImpl()+0x83b) [0x55ffa515c35b]",
"(rocksdb::DBImplReadOnly::~DBImplReadOnly()+0x11) [0x55ffa51a3bc1]",
"(rocksdb::DB::OpenForReadOnly(rocksdb::DBOptions const&,  
std::__cxx11::basic_string,  
std::allocator > const&,  
std::vectorstd::allocator > const&,  
std::vectorstd::allocator >*, rocksdb::DB**,  
bool)+0x1089) [0x55ffa51a57e9]",
"(RocksDBStore::do_open(std::ostream&, bool, bool,  
std::vectorstd::allocator > const*)+0x14ca)  
[0x55ffa51285ca]",

"(BlueStore::_open_db(bool, bool, bool)+0x1314) [0x55ffa4bc27e4]",
"(BlueStore::_open_db_and_around(bool)+0x4c) [0x55ffa4bd4c5c]",
"(BlueStore::_mount(bool, bool)+0x847) [0x55ffa4c2e047]",
"(OSD::init()+0x380) [0x55ffa4753a70]",
"(main()+0x47f1) [0x55ffa46a6901]",
"(__libc_start_main()+0xf3) [0x7f3109696493]",
"(_start()+0x2e) [0x55ffa46d4e3e]"
],
"ceph_version": "15.2.14",
"crash_id":  
"2021-10-05T13:31:28.513463Z_b6818598-4960-4ed6-942a-d4a7ff37a758",

"entity_name": "osd.48",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig":  
"6a43b6c219adac393b239fbea4a53ff87c4185bcd213724f0d721b452b81ddbf",

"timestamp": "2021-10-05T13:31:28.513463Z",
"utsname_hostname": "server-2s07",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.19.1.el8_4.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Sep 15 15:39:39 UTC 2021"
}
Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

From: 胡 玮文 
Sent: Monday, October 4, 2021 12:13 AM
To: Szabo, Istvan (Agoda) ; Igor Fedotov  


Cc: ceph-users@ceph.io
Subject: 回复: [ceph-users] Re: is it possible to remove the db+wal  
from an external device (nvme)




The stack trace (tcmalloc::allocate_full_cpp_throw_oom) seems  
indicating you don’t have enough memory.


发件人: Szabo, Istvan (Agoda)
发送时间: 2021年10月4日 0:46
收件人: Igor Fedotov
抄送: ceph-users@ceph.io
主题: [ceph-users] Re: is it possible to remove the db+wal from an  
external device (nvme)


Seems like it cannot start anymore once migrated ☹

https://justpaste.it/5hkot

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e:  
istvan.sz...@agoda.com>

---

From: Igor Fedotov mailto:ifedo...@suse.de>>
Sent: Saturday, October 2, 2021 5:22 AM
To: Szabo, Istvan (Agoda)  
mailto:istvan.sz...@agoda.com>>
Cc: ceph-users@ceph.io; Eugen Block  
mailto:ebl...@nde.ag>>; Christian Wuerdig  
mailto:christian.wuer...@gmail.com>>
Subject: Re: [ceph-users] Re: is it possible to remove the db+wal  
from an external device (nvme)





Hi 

[ceph-users] Re: Can't join new mon - lossy channel, failing

2021-10-05 Thread Konstantin Shalygin
As a last resort we changed the IP address of this host, and the mon successfully joined 
the quorum. When we reverted the IP address, the mon couldn't join again; we think there is 
something on the switch side or on the old mons' side. From the old mons I checked 
connectivity to the new mon process via telnet - that all works.
It would be good to build a reproducer for this network problem, to find out exactly 
which message of the Ceph protocol is getting broken.



k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Daemon Version Mismatch (But Not Really?) After Deleting/Recreating OSDs

2021-10-05 Thread Edward R Huyer
Gotcha.  Thanks for the input regardless.  I suppose I'll continue what I'm 
doing, and plan on doing an upgrade via quay.io in the near future.

-Original Message-
From: Gregory Farnum  
Sent: Monday, October 4, 2021 7:14 PM
To: Edward R Huyer 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Daemon Version Mismatch (But Not Really?) After 
Deleting/Recreating OSDs

On Mon, Oct 4, 2021 at 12:05 PM Edward R Huyer  wrote:
>
> Apparently the default value for container_image in the cluster configuration 
> is "docker.io/ceph/daemon-base:latest-pacific-devel".  I don't know where 
> that came from.  I didn't set it anywhere.  I'm not allowed to edit it, 
> either (from the dashboard, anyway).
>
> The container_image_base for the cephadm module is "docker.io/ceph/ceph".
>
> Also, 16.2.6 is already out, so I'm not sure why I'd be getting 16.2.5 
> development releases.
>
> Is this possibly related to the issues with docker.io and move to quay.io?

A good guess, but like I said this whole area is way outside my wheelhouse. I 
just know how to decode Ceph's URL and git version conventions. ;)

>
> -Original Message-
> From: Gregory Farnum 
> Sent: Monday, October 4, 2021 2:33 PM
> To: Edward R Huyer 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Daemon Version Mismatch (But Not Really?) 
> After Deleting/Recreating OSDs
>
> On Mon, Oct 4, 2021 at 7:57 AM Edward R Huyer  wrote:
> >
> > Over the summer, I upgraded my cluster from Nautilus to Pacific, and 
> > converted to use cephadm after doing so.  Over the past couple weeks, I've 
> > been converting my OSDs to use NVMe drives for db+wal storage.  Schedule a 
> > node's worth of OSDs to be removed, wait for that to happen, delete the PVs 
> > and zap the drives, let the orchestrator do its thing.
> >
> > Over this past weekend, the cluster threw up a HEALTH_WARN due to 
> > mismatched daemon versions.  Apparently the recreated OSDs are reporting 
> > different version information from the old daemons.
> >
> > New OSDs:
> >
> > -  Container Image Name:  
> > docker.io/ceph/daemon-base:latest-pacific-devel
> >
> > -  Container Image ID: d253896d959e
> >
> > -  Version: 16.2.5-226-g7c9eb137
>
> I haven't done any work with cephadm, but this container name and the version 
> tag look like you've installed the in-development next version of Pacific, 
> not the released 16.2.5. Did you perhaps manage to put a phrase similar to 
> "pacific-dev" somewhere instead of "pacific"?
>
> >
> > Old OSDs and other daemons:
> >
> > -  Container Image Name: docker.io/ceph/ceph:v16
> >
> > -  Container Image ID: 6933c2a0b7dd
> >
> > -  Version: 16.2.5
> >
> > I'm assuming this is not actually a problem and will go away when I next 
> > upgrade the cluster, but I figured I'd throw it out here in case someone 
> > with more knowledge than I thinks otherwise.  If it's not a problem, is 
> > there a way to silence it until I next run an upgrade?  Is there an 
> > explanation for why it happened?
> >
> > -
> > Edward Huyer
> > Golisano College of Computing and Information Sciences Rochester 
> > Institute of Technology Golisano 70-2373
> > 152 Lomb Memorial Drive
> > Rochester, NY 14623
> > 585-475-6651
> > erh...@rit.edu
> >
> > Obligatory Legalese:
> > The information transmitted, including attachments, is intended only for 
> > the person(s) or entity to which it is addressed and may contain 
> > confidential and/or privileged material. Any review, retransmission, 
> > dissemination or other use of, or taking of any action in reliance upon 
> > this information by persons or entities other than the intended recipient 
> > is prohibited. If you received this in error, please contact the sender and 
> > destroy any copies of this information.
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> > email to ceph-users-le...@ceph.io
> >
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Broken mon state after (attempted) 16.2.5 -> 16.2.6 upgrade

2021-10-05 Thread Jonathan D. Proulx
In the middle of a normal cephadm upgrade from 16.2.5 to 16.2.6, after the mgrs 
had successfully upgraded, 2/5 mons didn’t come back up (and the upgrade 
stopped at that point). Attempting to manually restart the crashed mons 
resulted in **all** of the other mons crashing too, usually with:

terminate called after throwing an instance of 
'ceph::buffer::v15_2_0::malformed_input' what(): void 
FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer 
understand old encoding version v < 7: Malformed input

After some messing around with the monmaps to try and get the few working mons 
back in a quorum, we’re now in a state where one mon can run fine (but not 
reach a quorum, obviously), but as soon as a second comes up it crashes 
instantly. I also can’t start any mon with a monmap containing only one mon – 
same output as above.

The rest of the cluster is working as expected (with the obvious exception of 
new connections failing). Anyone seen this or have ideas? Happy to provide more 
info from the cluster, just wasn’t sure what would actually be helpful…
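
For anyone following along, the monmap surgery mentioned above is usually done with the mon stopped, roughly like this (mon id and member names are placeholders):

# extract and inspect the current monmap from a stopped mon
ceph-mon -i <mon-id> --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
# remove or re-add entries as needed, then inject the map back
monmaptool --rm <mon-name> /tmp/monmap
ceph-mon -i <mon-id> --inject-monmap /tmp/monmap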


-- 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS not becoming active after migrating to cephadm

2021-10-05 Thread Petr Belyaev
Just tried it: stopped all MDS nodes and created one using orch. Result: 0/1 
daemons up (1 failed), 1 standby. Same as before, and the logs don’t show any 
errors either.

I’ll probably try upgrading the orch-based setup to 16.2.6 over the weekend to 
match the exact non-dockerized MDS version, maybe it will work.
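
For reference, the standby-replay / single-rank steps mentioned in the quoted reply below usually boil down to something like this (the filesystem name is a placeholder):

ceph fs set myfs allow_standby_replay false
ceph fs set myfs max_mds 1
# ...migrate / upgrade the MDS daemons, then scale back up
ceph fs set myfs max_mds 2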


> On 4 Oct 2021, at 13:41, 胡 玮文  wrote:
> 
> By saying upgrade, I mean upgrade from the non-dockerized 16.2.5 to cephadm 
> version 16.2.6. So I think you need to disable standby-replay and reduce the 
> number of ranks to 1, then stop all the non-dockerized mds, deploy new mds 
> with cephadm. Only scaling back up after you finish the migration. Did you 
> also try that?
> 
> In fact, similar issue has been reported several times on this list when 
> upgrade mds to 16.2.6, e.g. [1]. I have faced that too. So I’m pretty 
> confident that you are facing the same issue.
> 
> [1]: 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KQ5A5OWRIUEOJBC7VILBGDIKPQGJQIWN/
>  
> 
> 
>> 在 2021年10月4日,19:00,Petr Belyaev  写道:
>> 
>>  Hi Weiwen,
>> 
>> Yes, we did that during the upgrade. In fact, we did that multiple times 
>> even after the upgrade to see if it will resolve the issue (disabling hot 
>> standby, scaling everything down to a single MDS, swapping it with the new 
>> one, scaling back up).
>> 
>> The upgrade itself went fine, problems started during the migration to 
>> cephadm (which was done after migrating everything to Pacific). 
>> It only occurs when using dockerized MDS. Non-dockerized MDS nodes, also 
>> Pacific, everything runs fine.
>> 
>> Petr
>> 
>>> On 4 Oct 2021, at 12:43, 胡 玮文 >> > wrote:
>>> 
>>> Hi Petr,
>>>  
>>> Please read https://docs.ceph.com/en/latest/cephfs/upgrading/ 
>>>  for MDS upgrade 
>>> procedure.
>>>  
>>> In short, when upgrading to 16.2.6, you need to disable standby-replay and 
>>> reduce the number of ranks to 1.
>>>  
>>> Weiwen Hu
>>>  
>>> 从 Windows 版邮件 发送
>>>  
>>> 发件人: Petr Belyaev 
>>> 发送时间: 2021年10月4日 18:00
>>> 收件人: ceph-users@ceph.io 
>>> 主题: [ceph-users] MDS not becoming active after migrating to cephadm
>>>  
>>> Hi,
>>> 
>>> We’ve recently upgraded from Nautilus to Pacific, and tried moving our 
>>> services to cephadm/ceph orch.
>>> For some reason, MDS nodes deployed through orch never become active (or at 
>>> least standby-replay). Non-dockerized MDS nodes can still be deployed and 
>>> work fine. Non-dockerized mds version is 16.2.6, docker image version is 
>>> 16.2.5-387-g7282d81d (came as a default).
>>> 
>>> In the MDS log, the only related message is monitors assigning MDS as 
>>> standby. Increasing the log level does not help much, it only adds beacon 
>>> messages.
>>> Monitor log also contains no differences compared to a non-dockerized MDS 
>>> startup.
>>> Mds metadata command output is identical to that of a non-dockerized MDS.
>>> 
>>> The only difference I can see in the log is the value in curly braces after 
>>> the node name, e.g. mds.storage{0:1234ff}. For dockerized MDS, the first 
>>> value is , for non-dockerized it’s zero. Compat flags are identical.
>>> 
>>> Could someone please advise me why the dockerized MDS is being stuck as a 
>>> standby? Maybe some config values missing or smth?
>>> 
>>> Best regards,
>>> Petr
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io 
>>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>>> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io