[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread Wesley Dillingham
I suspect this may be a network/firewall issue between the client and one
OSD server. Perhaps the 100MB RBD didn't have an object mapped to a PG with
its primary on this problematic OSD host, but the 2TB RBD does. Just a
theory.
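
One way to test that theory is to ask Ceph which PG (and therefore which
acting set / primary OSD) one of the failing objects maps to. A rough sketch,
assuming the block_name_prefix shown in the rbd info output further down this
digest (the prefix changes whenever the image is recreated):

~~~
# objno 524773 from the kernel log, written as a 16-digit hex suffix (524773 = 0x801e5)
ceph osd map my_pool.data rbd_data.30.294519bf21a1af.00000000000801e5

# the acting set in the output names the primary OSD; then check that the
# client can actually reach that OSD's host (OSDs listen on 6800-7300 by default)
ceph osd find <primary-osd-id>     # prints the host/address of that OSD
nc -zv <osd-host> 6800-7300        # reachability smoke test; syntax varies by netcat flavour
~~~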

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com




On Mon, Mar 25, 2024 at 12:34 AM duluxoz  wrote:

> Hi Alexander,
>
> Already set (and confirmed by running the command again) - no good, I'm
> afraid.
>
> So I just restarted with a brand new image and ran the following commands
> on the ceph cluster and the host respectively. Results are below:
>
> On the ceph cluster:
>
> [code]
>
> rbd create --size 4T my_pool.meta/my_image --data-pool my_pool.data
> --image-feature exclusive-lock --image-feature deep-flatten
> --image-feature fast-diff --image-feature layering --image-feature
> object-map --image-feature data-pool
>
> [/code]
>
> On the host:
>
> [code]
>
> rbd device map my_pool.meta/my_image --id ceph_rbd_user --keyring
> /etc/ceph/ceph.client.ceph_rbd_user.keyring
>
> mkfs.xfs /dev/rbd0
>
> [/code]
>
> Results:
>
> [code]
>
> meta-data=/dev/rbd0              isize=512    agcount=32, agsize=33554432 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =                       bsize=4096   blocks=1073741824, imaxpct=5
>          =                       sunit=16     swidth=16 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> Discarding blocks...Done.
> mkfs.xfs: pwrite failed: Input/output error
> libxfs_bwrite: write failed on (unknown) bno 0x1ff00/0x100, err=5
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: Input/output error
> libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: Input/output error
> libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: pwrite failed: Input/output error
> libxfs_bwrite: write failed on (unknown) bno 0x10080/0x80, err=5
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> mkfs.xfs: read failed: Input/output error
> mkfs.xfs: data size check failed
> mkfs.xfs: filesystem failed to initialize
> [/code]
>
> On 25/03/2024 15:17, Alexander E. Patrakov wrote:
> > Hello Matthew,
> >
> > Is the overwrite enabled in the erasure-coded pool? If not, here is
> > how to fix it:
> >
> > ceph osd pool set my_pool.data allow_ec_overwrites true


[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread duluxoz

Hi Alexander,

Already set (and confirmed by running the command again) - no good, I'm 
afraid.


So I just restarted with a brand new image and ran the following commands
on the ceph cluster and the host respectively. Results are below:


On the ceph cluster:

[code]

rbd create --size 4T my_pool.meta/my_image --data-pool my_pool.data 
--image-feature exclusive-lock --image-feature deep-flatten 
--image-feature fast-diff --image-feature layering --image-feature 
object-map --image-feature data-pool


[/code]

On the host:

[code]

rbd device map my_pool.meta/my_image --id ceph_rbd_user --keyring 
/etc/ceph/ceph.client.ceph_rbd_user.keyring


mkfs.xfs /dev/rbd0

[/code]

Results:

[code]

meta-data=/dev/rbd0              isize=512    agcount=32, agsize=33554432 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=1073741824, imaxpct=5
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.
mkfs.xfs: pwrite failed: Input/output error
libxfs_bwrite: write failed on (unknown) bno 0x1ff00/0x100, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: Input/output error
libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: Input/output error
libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: pwrite failed: Input/output error
libxfs_bwrite: write failed on (unknown) bno 0x10080/0x80, err=5
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
mkfs.xfs: read failed: Input/output error
mkfs.xfs: data size check failed
mkfs.xfs: filesystem failed to initialize
[/code]

On 25/03/2024 15:17, Alexander E. Patrakov wrote:

Hello Matthew,

Is the overwrite enabled in the erasure-coded pool? If not, here is
how to fix it:

ceph osd pool set my_pool.data allow_ec_overwrites true



[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread Alexander E. Patrakov
Hello Matthew,

Is the overwrite enabled in the erasure-coded pool? If not, here is
how to fix it:

ceph osd pool set my_pool.data allow_ec_overwrites true
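
For anyone checking before changing anything, the current value can be read
back first - a small sketch using the pool name from this thread:

~~~
ceph osd pool get my_pool.data allow_ec_overwrites
# prints e.g. "allow_ec_overwrites: true" once the setting is in place
~~~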

On Mon, Mar 25, 2024 at 11:17 AM duluxoz  wrote:
>
> Hi Curt,
>
> Blockdev --getbsz: 4096
>
> Rbd info my_pool.meta/my_image:
>
> ~~~
>
> rbd image 'my_image':
>  size 4 TiB in 1048576 objects
>  order 22 (4 MiB objects)
>  snapshot_count: 0
>  id: 294519bf21a1af
>  data_pool: my_pool.data
>  block_name_prefix: rbd_data.30.294519bf21a1af
>  format: 2
>  features: layering, exclusive-lock, object-map, fast-diff,
> deep-flatten, data-pool
>  op_features:
>  flags:
>  create_timestamp: Sun Mar 24 17:44:33 2024
>  access_timestamp: Sun Mar 24 17:44:33 2024
>  modify_timestamp: Sun Mar 24 17:44:33 2024
> ~~~
>
> On 24/03/2024 21:10, Curt wrote:
> > Hey Matthew,
> >
> > One more thing, out of curiosity: can you send the output of blockdev
> > --getbsz on the rbd dev and rbd info?
> >
> > I'm using 16TB rbd images without issue, but I haven't updated to reef
> > .2 yet.
> >
> > Cheers,
> > Curt
>


-- 
Alexander E. Patrakov


[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread duluxoz

Hi Curt,

Blockdev --getbsz: 4096

Rbd info my_pool.meta/my_image:

~~~

rbd image 'my_image':
    size 4 TiB in 1048576 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 294519bf21a1af
    data_pool: my_pool.data
    block_name_prefix: rbd_data.30.294519bf21a1af
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
    op_features:
    flags:
    create_timestamp: Sun Mar 24 17:44:33 2024
    access_timestamp: Sun Mar 24 17:44:33 2024
    modify_timestamp: Sun Mar 24 17:44:33 2024
~~~

On 24/03/2024 21:10, Curt wrote:

Hey Matthew,

One more thing, out of curiosity: can you send the output of blockdev
--getbsz on the rbd dev and rbd info?


I'm using 16TB rbd images without issue, but I haven't updated to reef 
.2 yet.


Cheers,
Curt



[ceph-users] el7 + nautilus rbd snapshot map + lvs mount crash

2024-03-24 Thread Marc


Looks like this procedure crashes the Ceph node. Tried this now for the 2nd
time after updating, and again a crash.

el7 + nautilus -> rbd snapshot map -> lvs mount -> crash
(lvs are not even duplicate names)
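
For context, the procedure in question is roughly the following - the image,
snapshot and VG names here are placeholders, not the actual ones:

~~~
rbd map mypool/myimage@mysnap            # the snapshot maps read-only, e.g. as /dev/rbd1
pvscan --cache && lvs                    # scan for the LVs sitting on the mapped snapshot
vgchange -ay myvg
mount /dev/myvg/mylv /mnt/restore        # the mount step is where the node crashes
~~~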


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-24 Thread Torkil Svensgaard

On 24-03-2024 13:41, Tyler Stachecki wrote:

On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard  wrote:


Hi

... Using mclock with high_recovery_ops profile.

What is the bottleneck here? I would have expected a huge number of
simultaneous backfills. Backfill reservation logjam?



mClock is very buggy in my experience and frequently leads to issues like
this. Try using regular backfill and see if the problem goes away.


Hi Tyler

Just tried switching to wpq, same thing.

I'm inclined to think it must be a read reservation logjam of some sort, 
given that increasing osd_max_backfills had an immediate effect and we 
have 4 empty hosts as main write targets. Here's the output for one such 
OSD from the script Alexander linked:


"
osd.539: gimpy  =>0 0.0B <=42 1.54T (Δ1.54T)   drive=9.0% 358.72G/3.90T crush=9.0% 358.72G/3.90T
  <-11.54f  waiting   44.8G  539<-421  1070 of 230320, 0.5%
  <-37.538  waiting   22.3G  539<-121  2819 of 556087, 0.5%
  <-11.507  waiting   45.1G  539<-61   912 of 139227, 0.7%
  <-37.450  waiting   22.3G  539<-220  1235 of 632776, 0.2%
  <-11.458  waiting   45.3G  539<-121  178 of 279150, 0.1%
  <-37.47c  waiting   22.2G  539<-83   2281 of 634472, 0.4%
  <-37.434  waiting   22.1G  539<-78   9496 of 316052, 3.0%
  <-11.3f3  waiting   44.9G  539<-109  2375 of 231055, 1.0%
  <-37.3d3  waiting   22.0G  539<-73   2144 of 316508, 0.7%
  <-37.3c5  waiting   22.2G  539<-83   313880 of 313880, 100.0%
  <-11.3c1  waiting   44.8G  539<-223  93878 of 230270, 40.8%
  <-37.3a4  waiting   21.9G  539<-85   4604 of 315504, 1.5%
  <-11.344  waiting   44.5G  539<-63   728 of 182876, 0.4%
  <-6.1ca   waiting  100.9G  539<-443  36076 of 56270, 64.1%
  <-4.1a2   waiting  157.6G  539<-218  508 of 91456, 0.6%
  <-37.1ba  waiting   22.2G  539<-64   316848 of 316848, 100.0%
  <-37.84   waiting   22.0G  539<-33   4380 of 237633, 1.8%
  <-37.ad   waiting   22.2G  539<-77   6730 of 396635, 1.7%
  <-37.36   waiting   22.1G  539<-47   2170 of 395955, 0.5%
  <-11.b9   waiting   45.1G  539<-223  0 of 231940, 0.0%
  <-37.11c  waiting   22.1G  539<-33   9952 of 316448, 3.1%
  <-11.144  waiting   45.1G  539<-207  528 of 278094, 0.2%
  <-37.2ae  waiting   22.1G  539<-224  2565 of 712539, 0.4%
  <-37.285  waiting   22.0G  539<-65   441 of 315336, 0.1%
  <-37.2ef  waiting   22.0G  539<-414  2124 of 475410, 0.4%
  <-37.674  waiting   22.0G  539<-56   60 of 236511, 0.0%
  <-37.655  waiting   22.3G  539<-143  237316 of 237381, 100.0%
  <-11.6b0  waiting   44.9G  539<-282  1131 of 277122, 0.4%
  <-37.71a  waiting   22.2G  539<-49   82865 of 315684, 26.2%
  <-11.789  waiting   45.0G  539<-196  736 of 277584, 0.3%
  <-11.7cf  waiting   44.8G  539<-127  143 of 276582, 0.1%
  <-11.7f2  waiting   45.2G  539<-272  145857 of 185680, 78.6%
  <-37.7dd  waiting   22.0G  539<-72   0 of 393475, 0.0%
  <-37.7d9  waiting   22.2G  539<-37   930 of 237831, 0.4%
  <-11.7fb  waiting   45.2G  539<-78   1062 of 279042, 0.4%
  <-37.7d2  waiting   22.0G  539<-71   2409 of 631368, 0.4%
  <-11.8db  waiting   44.9G  539<-84   108 of 277182, 0.0%
  <-11.9b6  waiting   44.8G  539<-74   772 of 184432, 0.4%
  <-11.b0b  waiting   45.0G  539<-166  2569 of 231430, 1.1%
  <-11.d42  waiting   45.2G  539<-118  15428 of 46429, 33.2%
  <-11.d5f  waiting   44.8G  539<-64   4 of 184356, 0.0%
  <-11.d98  waiting   45.1G  539<-418  0 of 278568, 0.0%
"

All waiting for something.
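
To see what they are actually waiting on, the reservation state can be dumped
on the receiving OSD directly - a sketch, assuming osd.539 from above and that
the command is exposed via tell on this release (otherwise it is available on
the daemon's admin socket):

~~~
ceph tell osd.539 dump_recovery_reservations   # local/remote backfill reservations, granted vs queued
ceph config get osd osd_max_backfills          # the per-OSD reservation limit currently in effect
~~~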

Mvh.

Torkil


Tyler






--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark


[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-24 Thread Torkil Svensgaard
No latency spikes seen in the last 24 hours after manually compacting all
the OSDs, so that seems to have solved it for us at least. Thanks all.


Mvh.

Torkil

On 23-03-2024 12:32, Torkil Svensgaard wrote:

Hi guys

Thanks for the suggestions, we'll do the offline compaction and see how 
big an impact it will have.


Even if compact-on-iteration should take care of it, doing it offline 
should avoid I/O problems during the compaction, correct? That's why 
offline is preferred to online?


Mvh.

Torkil

On 22-03-2024 16:47, Joshua Baergen wrote:

Personally, I don't think the compaction is actually required. Reef
has compact-on-iteration enabled, which should take care of this
automatically. We see this sort of delay pretty often during PG
cleaning, at the end of a PG being cleaned, when the PG has a high
count of objects, whether or not OSD compaction has been keeping up
with tombstones. It's unfortunately just something to ride through
these days until backfill completes.

https://github.com/ceph/ceph/pull/49438 is a recent attempt to improve
things in this area, but I'm not sure whether it would eliminate this
issue. We've considered going to higher PG counts (and thus fewer
objects per PG) as a possible mitigation as well.

Josh

On Fri, Mar 22, 2024 at 2:59 AM Alexander E. Patrakov
 wrote:


Hello Torkil,

The easiest way (in my opinion) to perform offline compaction is a bit
different than what Igor suggested. We had a prior off-list
conversation indicating that the results would be equivalent.

1. ceph config set osd osd_compact_on_start true
2. Restart the OSD that you want to compact (or the whole host at
once, if you want to compact the whole host and your failure domain
allows for that)
3. ceph config set osd osd_compact_on_start false

The OSD will restart, but will not show as "up" until the compaction
process completes. In your case, I would expect it to take up to 40
minutes.
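
On a cephadm deployment like the one in this thread, that sequence might look
roughly like the following for a single OSD (the daemon name is just an example):

~~~
ceph config set osd osd_compact_on_start true
ceph orch daemon restart osd.112      # the OSD stays "down" while it compacts
ceph osd tree down                    # watch until the OSD reports up again
ceph config set osd osd_compact_on_start false
~~~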

On Fri, Mar 22, 2024 at 3:46 PM Torkil Svensgaard  
wrote:



On 22-03-2024 08:38, Igor Fedotov wrote:

Hi Torkil,


Hi Igor

highly likely you're facing a well known issue with RocksDB performance
drop after bulk data removal. The latter might occur at source OSDs
after PG migration completion.


Aha, thanks.

You might want to use DB compaction (preferably offline one using
ceph-kvstore-tool) to get OSD out of this "degraded" state or as a
preventive measure. I'd recommend to do that for all the OSDs right now.
And once again after rebalancing is completed. This should improve things
but unfortunately no 100% guarantee.


Why is offline preferred? With offline the easiest way would be
something like stop all OSDs one host at a time and run a loop over
/var/lib/ceph/$id/osd.*?
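
(Something along those lines, presumably - although on a containerised/cephadm
host the tool has to run inside the daemon's environment, so this is only a
sketch with assumed daemon names and paths:)

~~~
ceph orch daemon stop osd.112
cephadm shell --name osd.112 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-112 compact
ceph orch daemon start osd.112
~~~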

Also curious if you have DB/WAL on fast (SSD or NVMe) drives? This might
be crucial..


We do, 22 HDDs and 2 DB/WAL NVMes per host.

Thanks.

Mvh.

Torkil



Thanks,

Igor

On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:

Good morning,

Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure
domain from host to datacenter which is the reason for the large
misplaced percentage.

We were seeing some pretty crazy spikes in "OSD Read Latencies" and
"OSD Write Latencies" on the dashboard. Most of the time everything is
well but then for periods of time, 1-4 hours, latencies will go to 10+
seconds for one or more OSDs. This also happens outside scrub hours
and it is not the same OSDs every time. The OSDs affected are HDD with
DB/WAL on NVMe.

Log snippet:

"
...
2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:22.864+ 7fb169898700  0 bluestore(/var/lib/ceph/osd/ceph-112) log_latency slow operation observed for submit_transact, latency = 17.716707230s
2024-03-22T06:48:22.880+ 7fb1748ae700  0 bluestore(/var/lib/ceph/osd/ceph-112) log_latency_fn slow operation observed for _txc_committed_kv, latency = 17.732601166s, txc = 0x55a5bcda0f00
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
...
"

"
[root@dopey ~]# ceph -s
  cluster:
    id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
    health: HEALTH_WARN
            1 failed cephadm daemon(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull

  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
    mgr: jolly.tpgixt(active, since 10d), standbys: 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-24 Thread Tyler Stachecki
On Sat, Mar 23, 2024, 4:26 AM Torkil Svensgaard  wrote:

> Hi
>
> ... Using mclock with high_recovery_ops profile.
>
> What is the bottleneck here? I would have expected a huge number of
> simultaneous backfills. Backfill reservation logjam?
>

mClock is very buggy in my experience and frequently leads to issues like
this. Try using regular backfill and see if the problem goes away.
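
Switching the scheduler back is a config change plus an OSD restart - a sketch:

~~~
ceph config set osd osd_op_queue wpq   # revert from mclock_scheduler to wpq
# osd_op_queue is only read at startup, so restart the OSDs (rolling) for it to take effect
~~~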

Tyler

>


[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread duluxoz

Hi, Alwin,

Command (as requested): rbd create --size 4T my_pool.meta/my_image 
--data-pool my_pool.data --image-feature exclusive-lock --image-feature 
deep-flatten --image-feature fast-diff --image-feature layering 
--image-feature object-map --image-feature data-pool


On 24/03/2024 22:53, Alwin Antreich wrote:

Hi,


March 24, 2024 at 8:19 AM, "duluxoz"  wrote:

Hi,

Yeah, I've been testing various configurations since I sent my last
email - all to no avail.

So I'm back to the start with a brand new 4T image which is rbdmapped to
/dev/rbd0.

It's not formatted (yet) and so not mounted.

Every time I attempt a mkfs.xfs /dev/rbd0 (or mkfs.xfs
/dev/rbd/my_pool/my_image) I get the errors I previously mentioned and the
resulting image then becomes unusable (in every sense of the word).

If I run a fdisk -l (before trying the mkfs.xfs) the rbd image shows up
in the list - no, I don't actually do a full fdisk on the image.

An rbd info my_pool:my_image shows the same expected values on both the
host and ceph cluster.

I've tried this with a whole bunch of different sized images from 100G
to 4T and all fail in exactly the same way. (My previous successful 100G
test I haven't been able to reproduce).

I've also tried all of the above using an "admin" CephX(sp?) account - I
can always connect via rbdmap, but as soon as I try an mkfs.xfs it
fails. This failure also occurs with mkfs.ext4 as well (all size drives).

The Ceph Cluster is good (self reported and there are other hosts
happily connected via CephFS) and this host also has a CephFS mapping
which is working.

Between running experiments I've gone over the Ceph Doco (again) and I
can't work out what's going wrong.

There's also nothing obvious/helpful jumping out at me from the
logs/journal (sample below):

~~~

Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
524773 0~65536 result -1
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
524772 65536~4128768 result -1
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
Mar 24 17:38:29 my_host.my_net.local kernel: blk_print_req_error: 119
callbacks suppressed
Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector
4298932352 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
524774 0~65536 result -1
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
524773 65536~4128768 result -1
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector
4298940544 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2

~~~

Any ideas what I should be looking at?

Could you please share the command you've used to create the RBD?

Cheers,
Alwin

--

*Matthew J BLACK*
  M.Inf.Tech.(Data Comms)
  MBA
  B.Sc.
  MACS (Snr), CP, IP3P

When you want it done /right/ ‒ the first time!

Phone:  +61 4 0411 0089
Email:  matt...@peregrineit.net 
Web:    www.peregrineit.net







[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-24 Thread Matt Vandermeulen

Hi,

I would expect that almost every PG in the cluster is going to have to 
move once you start standardizing CRUSH weights, and I wouldn't want to 
move data twice. My plan would look something like:


- Make sure the cluster is healthy (no degraded PGs)
- Set nobackfill, norebalance flags to prevent any data from moving
- Set your CRUSH weights (this will cause PGs to re-peer, which will 
stall IO during the peering process; I think this could be done in one 
large operation/osdmap update by changing the CRUSH map directly)

- Wait for peering to settle and IO rates to recover
- Use pgremapper[1] to cancel backfill, which will insert upmaps to keep 
the data where it is today (pgremapper cancel-backfill --verbose --yes)
- You could simply enable the balancer at this point if you want a "set 
it and forget it" type of thing, or if you want more control you can use 
pgremapper undo-upmaps in a loop


With a ~5P cluster, this is going to take a while, and I'd probably 
expect to lose some drives while data is moving.


[1] https://github.com/digitalocean/pgremapper
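
As a rough command-level sketch of the plan above - the 3.63898 target weight
comes from Denis' mail below, everything else is illustrative:

~~~
ceph osd set nobackfill && ceph osd set norebalance
for id in $(ceph osd ls); do ceph osd crush reweight osd.${id} 3.63898; done   # or edit the CRUSH map in one go
# ...wait for peering to settle and IO to recover...
pgremapper cancel-backfill --verbose --yes    # pin PGs where they are today via upmaps
ceph osd unset nobackfill && ceph osd unset norebalance
ceph balancer on                              # or drain the upmaps gradually with pgremapper undo-upmaps
~~~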

On 2024-03-24 08:06, Denis Polom wrote:

Hi guys,

recently I took over care of a Ceph cluster that is extremely 
unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus -> 
Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on 
it.


Crush failure domain is datacenter (there are 3), data pool is EC 3+3.

This cluster had and has balancer disabled for years. And was 
"balanced" manually by changing OSDs crush weights. So now it is 
complete mess and I would like to change it to have OSDs crush weight 
same (3.63898)  and to enable balancer with upmap.


From `ceph osd df ` sorted from the least used to most used OSDs:

ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
             TOTAL             5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
 428   hdd  3.63898   1.0      3.6 TiB  2.0 TiB  2.0 TiB    1 KiB  5.6 GiB  1.7 TiB  54.55  0.76   96   up
 223   hdd  3.63898   1.0      3.6 TiB  2.0 TiB  2.0 TiB    3 KiB  5.6 GiB  1.7 TiB  54.58  0.76   95   up

...

...

...

 591   hdd  3.53999   1.0      3.6 TiB  3.0 TiB  3.0 TiB    1 KiB  7.0 GiB  680 GiB  81.74  1.14  125   up
 832   hdd  3.5       1.0      3.6 TiB  3.0 TiB  3.0 TiB    4 KiB  6.9 GiB  680 GiB  81.75  1.14  114   up
 248   hdd  3.63898   1.0      3.6 TiB  3.0 TiB  3.0 TiB    3 KiB  7.2 GiB  646 GiB  82.67  1.16  121   up
 559   hdd  3.63799   1.0      3.6 TiB  3.0 TiB  3.0 TiB      0 B  7.0 GiB  644 GiB  82.70  1.16  123   up
             TOTAL             5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97


crush rule:

{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -2,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "choose_indep",
    "num": 2,
    "type": "osd"
    },
    {
    "op": "emit"
    }
    ]
}

My question is: what would be the proper and safest way to make it happen?


* should I first enable balancer and let it do its work and after that 
change the OSDs crush weights to be even?


* or should it otherwise - first to make crush weights even and then 
enable the balancer?


* or is there another safe(r) way?

What are the ideal balancer settings for that?

I'm expecting a large data movement, and this is a production cluster.

I'm also afraid that during the balancing or the crush weight changes some 
OSDs will become full. I've tried that already and had to move some PGs 
manually to other OSDs in the same failure domain.



I would appreciate any suggestion on that.

Thank you!


[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-24 Thread Alexander E. Patrakov
Hi Denis,

My approach would be:

1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K
bluestore_min_alloc_size. If so, you cannot really use the built-in
balancer, as it would result in a bimodal distribution instead of a
proper balance, see https://tracker.ceph.com/issues/64715, but let's
ignore this little issue if you have enough free space.
2. Change the weights as appropriate. Make absolutely sure that there
are no reweights other than 1.0. Delete all dead or destroyed OSDs
from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
warnings that appear, they will be gone during the next step.
3. Run this little script from Cern to stop the data movement that was
just initiated:
https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py,
pipe its output to bash. This should cancel most of the data movement,
but not all - the script cannot stop the situation when two OSDs want
to exchange their erasure-coded shards, like this: [1,2,3,4] ->
[1,3,2,4].
4. Set the "target max misplaced ratio" option for MGR to what you
think is appropriate. The default is 0.05, and this means that the
balancer will enable at most 5% of the PGs to participate in the data
movement. I suggest starting with 0.01 and increasing if there is no
visible impact of the balancing on the client traffic.
5. Enable the balancer.

If you think that https://tracker.ceph.com/issues/64715 is a problem
that would prevent you from using the built-in balancer:

4. Download this script:
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
5. Run it as follows: ./placementoptimizer.py -v balance --osdsize
device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash

This will move at most 500 PGs to better places, starting with the
fullest OSDs. All weights are ignored, and the switches take care of
the bluestore_min_alloc_size overhead mismatch. You will have to do
that weekly until you redeploy all OSDs that were created with 64K
bluestore_min_alloc_size.

A hybrid approach (initial round of balancing with TheJJ, then switch
to the built-in balancer) may also be viable.
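
Condensed into commands, steps 2-5 might look roughly like this - the weight
and ratio values are simply the ones discussed in this thread:

~~~
# 2. even out the CRUSH weights and purge dead/destroyed OSDs (reweights must all be 1.0)
for id in $(ceph osd ls); do ceph osd crush reweight osd.${id} 3.63898; done
ceph osd purge <dead-osd-id> --yes-i-really-mean-it

# 3. freeze the resulting data movement with upmaps (the Cern script linked above)
./upmap-remapped.py | bash

# 4./5. throttle, then enable the balancer
ceph config set mgr target_max_misplaced_ratio 0.01
ceph balancer mode upmap
ceph balancer on
~~~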

On Sun, Mar 24, 2024 at 7:09 PM Denis Polom  wrote:
>
> Hi guys,
>
> recently I took over care of a Ceph cluster that is extremely
> unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus ->
> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.
>
> Crush failure domain is datacenter (there are 3), data pool is EC 3+3.
>
> This cluster had and has balancer disabled for years. And was "balanced"
> manually by changing OSDs crush weights. So now it is complete mess and
> I would like to change it to have OSDs crush weight same (3.63898)  and
> to enable balancer with upmap.
>
>  From `ceph osd df ` sorted from the least used to most used OSDs:
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
>              TOTAL             5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
>  428   hdd  3.63898   1.0      3.6 TiB  2.0 TiB  2.0 TiB    1 KiB  5.6 GiB  1.7 TiB  54.55  0.76   96   up
>  223   hdd  3.63898   1.0      3.6 TiB  2.0 TiB  2.0 TiB    3 KiB  5.6 GiB  1.7 TiB  54.58  0.76   95   up
> ...
>
> ...
>
> ...
>
>  591   hdd  3.53999   1.0      3.6 TiB  3.0 TiB  3.0 TiB    1 KiB  7.0 GiB  680 GiB  81.74  1.14  125   up
>  832   hdd  3.5       1.0      3.6 TiB  3.0 TiB  3.0 TiB    4 KiB  6.9 GiB  680 GiB  81.75  1.14  114   up
>  248   hdd  3.63898   1.0      3.6 TiB  3.0 TiB  3.0 TiB    3 KiB  7.2 GiB  646 GiB  82.67  1.16  121   up
>  559   hdd  3.63799   1.0      3.6 TiB  3.0 TiB  3.0 TiB      0 B  7.0 GiB  644 GiB  82.70  1.16  123   up
>              TOTAL             5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
>
>
> crush rule:
>
> {
>  "rule_id": 10,
>  "rule_name": "ec33hdd_rule",
>  "type": 3,
>  "steps": [
>  {
>  "op": "set_chooseleaf_tries",
>  "num": 5
>  },
>  {
>  "op": "set_choose_tries",
>  "num": 100
>  },
>  {
>  "op": "take",
>  "item": -2,
>  "item_name": "default~hdd"
>  },
>  {
>  "op": "choose_indep",
>  "num": 3,
>  "type": "datacenter"
>  },
>  {
>  "op": "choose_indep",
>  "num": 2,
>  "type": "osd"
>  },
>  {
>  "op": "emit"
>  }
>  ]
> }
>
> My question is: what would be the proper and safest way to make it happen?
>
> * should I first enable balancer and let it do its work and after that
> change the OSDs crush weights to be even?
>
> * or should it otherwise - first to make crush weights even and then
> enable the balancer?
>
> * or 

[ceph-users] ceph cluster extremely unbalanced

2024-03-24 Thread Denis Polom

Hi guys,

recently I took over care of a Ceph cluster that is extremely 
unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus -> 
Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.


Crush failure domain is datacenter (there are 3), data pool is EC 3+3.

This cluster had and has balancer disabled for years. And was "balanced" 
manually by changing OSDs crush weights. So now it is complete mess and 
I would like to change it to have OSDs crush weight same (3.63898)  and 
to enable balancer with upmap.


From `ceph osd df ` sorted from the least used to most used OSDs:

ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
             TOTAL             5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
 428   hdd  3.63898   1.0      3.6 TiB  2.0 TiB  2.0 TiB    1 KiB  5.6 GiB  1.7 TiB  54.55  0.76   96   up
 223   hdd  3.63898   1.0      3.6 TiB  2.0 TiB  2.0 TiB    3 KiB  5.6 GiB  1.7 TiB  54.58  0.76   95   up

...

...

...

 591   hdd  3.53999   1.0      3.6 TiB  3.0 TiB  3.0 TiB    1 KiB  7.0 GiB  680 GiB  81.74  1.14  125   up
 832   hdd  3.5       1.0      3.6 TiB  3.0 TiB  3.0 TiB    4 KiB  6.9 GiB  680 GiB  81.75  1.14  114   up
 248   hdd  3.63898   1.0      3.6 TiB  3.0 TiB  3.0 TiB    3 KiB  7.2 GiB  646 GiB  82.67  1.16  121   up
 559   hdd  3.63799   1.0      3.6 TiB  3.0 TiB  3.0 TiB      0 B  7.0 GiB  644 GiB  82.70  1.16  123   up
             TOTAL             5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5 TiB  1.5 PiB  71.50
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97


crush rule:

{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
    {
    "op": "set_chooseleaf_tries",
    "num": 5
    },
    {
    "op": "set_choose_tries",
    "num": 100
    },
    {
    "op": "take",
    "item": -2,
    "item_name": "default~hdd"
    },
    {
    "op": "choose_indep",
    "num": 3,
    "type": "datacenter"
    },
    {
    "op": "choose_indep",
    "num": 2,
    "type": "osd"
    },
    {
    "op": "emit"
    }
    ]
}

My question is: what would be the proper and safest way to make it happen?

* should I first enable balancer and let it do its work and after that 
change the OSDs crush weights to be even?


* or should it otherwise - first to make crush weights even and then 
enable the balancer?


* or is there another safe(r) way?

What are the ideal balancer settings for that?

I'm expecting a large data movement, and this is a production cluster.

I'm also afraid that during the balancing or the crush weight changes some 
OSDs will become full. I've tried that already and had to move some PGs 
manually to other OSDs in the same failure domain.



I would appreciate any suggestion on that.

Thank you!


[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread Curt
Hey Matthew,

One more thing, out of curiosity: can you send the output of blockdev
--getbsz on the rbd dev and rbd info?

I'm using 16TB rbd images without issue, but I haven't updated to reef .2
yet.

Cheers,
Curt


On Sun, 24 Mar 2024, 11:12 duluxoz,  wrote:

> Hi Curt,
>
> Nope, no dropped packets or errors - sorry, wrong tree  :-)
>
> Thanks for chiming in.
>
> On 24/03/2024 20:01, Curt wrote:
> > I may be barking up the wrong tree, but if you run ip -s link show
> > yourNicID on this server or your OSDs do you see any
> > errors/dropped/missed?
>
>


[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread duluxoz

Hi Curt,

Nope, no dropped packets or errors - sorry, wrong tree  :-)

Thanks for chiming in.

On 24/03/2024 20:01, Curt wrote:
I may be barking up the wrong tree, but if you run ip -s link show 
yourNicID on this server or your OSDs do you see any 
errors/dropped/missed?



[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread Curt
I may be barking up the wrong tree, but if you run ip -s link show
yourNicID on this server or your OSDs do you see any errors/dropped/missed?

On Sun, 24 Mar 2024, 09:20 duluxoz,  wrote:

> Hi,
>
> Yeah, I've been testing various configurations since I sent my last
> email - all to no avail.
>
> So I'm back to the start with a brand new 4T image which is rbdmapped to
> /dev/rbd0.
>
> It's not formatted (yet) and so not mounted.
>
> Every time I attempt a mkfs.xfs /dev/rbd0 (or mkfs.xfs
> /dev/rbd/my_pool/my_image) I get the errors I previously mentioned and the
> resulting image then becomes unusable (in every sense of the word).
>
> If I run a fdisk -l (before trying the mkfs.xfs) the rbd image shows up
> in the list - no, I don't actually do a full fdisk on the image.
>
> An rbd info my_pool:my_image shows the same expected values on both the
> host and ceph cluster.
>
> I've tried this with a whole bunch of different sized images from 100G
> to 4T and all fail in exactly the same way. (My previous successful 100G
> test I haven't been able to reproduce).
>
> I've also tried all of the above using an "admin" CephX(sp?) account - I
> can always connect via rbdmap, but as soon as I try an mkfs.xfs it
> fails. This failure also occurs with mkfs.ext4 as well (all size drives).
>
> The Ceph Cluster is good (self reported and there are other hosts
> happily connected via CephFS) and this host also has a CephFS mapping
> which is working.
>
> Between running experiments I've gone over the Ceph Doco (again) and I
> can't work out what's going wrong.
>
> There's also nothing obvious/helpful jumping out at me from the
> logs/journal (sample below):
>
> ~~~
>
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
> 524773 0~65536 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
> 524772 65536~4128768 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: blk_print_req_error: 119
> callbacks suppressed
> Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector
> 4298932352 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
> 524774 0~65536 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno
> 524773 65536~4128768 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector
> 4298940544 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
> ~~~
>
> Any ideas what I should be looking at?
>
> And thank you for the help  :-)
>
> On 24/03/2024 17:50, Alexander E. Patrakov wrote:
> > Hi,
> >
> > Please test again, it must have been some network issue. A 10 TB RBD
> > image is used here without any problems.
> >


[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread duluxoz

Hi,

Yeah, I've been testing various configurations since I sent my last 
email - all to no avail.


So I'm back to the start with a brand new 4T image which is rbdmapped to 
/dev/rbd0.


It's not formatted (yet) and so not mounted.

Every time I attempt a mkfs.xfs /dev/rbd0 (or mkfs.xfs 
/dev/rbd/my_pool/my_image) I get the errors I previously mentioned and the 
resulting image then becomes unusable (in every sense of the word).


If I run a fdisk -l (before trying the mkfs.xfs) the rbd image shows up 
in the list - no, I don't actually do a full fdisk on the image.


An rbd info my_pool:my_image shows the same expected values on both the 
host and ceph cluster.


I've tried this with a whole bunch of different sized images from 100G 
to 4T and all fail in exactly the same way. (My previous successful 100G 
test I haven't been able to reproduce).


I've also tried all of the above using an "admin" CephX(sp?) account - I 
can always connect via rbdmap, but as soon as I try an mkfs.xfs it 
fails. This failure also occurs with mkfs.ext4 as well (all size drives).


The Ceph Cluster is good (self reported and there are other hosts 
happily connected via CephFS) and this host also has a CephFS mapping 
which is working.


Between running experiments I've gone over the Ceph Doco (again) and I 
can't work out what's going wrong.


There's also nothing obvious/helpful jumping out at me from the 
logs/journal (sample below):


~~~

Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
524773 0~65536 result -1
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
524772 65536~4128768 result -1

Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
Mar 24 17:38:29 my_host.my_net.local kernel: blk_print_req_error: 119 
callbacks suppressed
Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector 
4298932352 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
524774 0~65536 result -1
Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
524773 65536~4128768 result -1

Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector 
4298940544 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2

~~~

Any ideas what I should be looking at?

And thank you for the help  :-)

On 24/03/2024 17:50, Alexander E. Patrakov wrote:

Hi,

Please test again, it must have been some network issue. A 10 TB RBD
image is used here without any problems.




[ceph-users] Re: Mounting A RBD Via Kernel Modules

2024-03-24 Thread Alexander E. Patrakov
Hi,

Please test again, it must have been some network issue. A 10 TB RBD
image is used here without any problems.
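
A quick way to separate an RBD/network problem from a filesystem problem is to
write to the mapped device directly before involving mkfs - a sketch, and note
that it is destructive to anything already on the image:

~~~
rbd device map my_pool.meta/my_image --id ceph_rbd_user --keyring /etc/ceph/ceph.client.ceph_rbd_user.keyring
dd if=/dev/zero of=/dev/rbd0 bs=4M count=256 oflag=direct status=progress
# any I/O error here means the problem is below the filesystem layer
~~~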

On Sun, Mar 24, 2024 at 1:01 PM duluxoz  wrote:
>
> Hi Alexander,
>
> DOH!
>
> Thanks for pointing out my typo - I missed it, and yes, it was my
> issue.  :-)
>
> New issue (sort of): The requirement of the new RBD Image is 2 TB in
> size (it's for a MariaDB Database/Data Warehouse). However, I'm getting
> the following errors:
>
> ~~~
>
> mkfs.xfs: pwrite failed: Input/output error
> libxfs_bwrite: write failed on (unknown) bno 0x7f00/0x100, err=5
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> ~~~
>
> I tested with a 100 GB image in the same pool and was 100% successful,
> so I'm now wondering if there is some sort of Ceph RBD Image size limit
> - although, honestly, that seems to be counter-intuitive to me
> considering CERN uses Ceph for their data storage needs.
>
> Any ideas / thoughts?
>
> Cheers
>
> Dulux-Oz
>
> On 23/03/2024 18:52, Alexander E. Patrakov wrote:
> > Hello Dulux-Oz,
> >
> > Please treat the RBD as a normal block device. Therefore, "mkfs" needs
> > to be run before mounting it.
> >
> > The mistake is that you run "mkfs xfs" instead of "mkfs.xfs" (space vs
> > dot). And, you are not limited to xfs, feel free to use ext4 or btrfs
> > or any other block-based filesystem.
> >
>


-- 
Alexander E. Patrakov