In case anyone was wondering, I figured out the problem...

It is this nasty bug in Pacific 16.2.10: https://tracker.ceph.com/issues/56031 - I 
think it is fixed in the upcoming 16.2.11 release and in Quincy.

This bug causes the computed maximum bluestore DB partition size to be much 
smaller than it should be, so if you request a reasonable size that is larger 
than that incorrectly computed maximum, the DB creation fails.

Our problem was that we added 3 new SSDs that were considered "unused" by the 
system, giving us a total of 8 (5 used, 3 unused).  When the orchestrator 
issues a "ceph-volume lvm batch" command, it passes 40 data devices and 8 DB 
devices.  Normally you would expect it to divide them into 5 slots per DB 
device (40/8), but the problem occurs when it computes the size of those 
slots.
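
For illustration, the batch call has roughly this shape (the device paths below 
are placeholders and the exact flags cephadm adds may differ on your setup):

  # rough shape only - placeholder paths; the real command lists all 40
  # HDDs after "batch" and all 8 SSDs after --db-devices
  ceph-volume lvm batch --yes \
      /dev/sdb /dev/sdc ... /dev/sdao \
      --db-devices /dev/sdap /dev/sdaq ... /dev/sdaw \
      --block-db-size 300G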

ceph-volume first sees the 3 unused devices as one group and incorrectly 
decides that it needs 3 * 5 = 15 slots, then divides the size of a single DB 
device by 15, making the maximum DB size 3x smaller than it should be.  If the 
code had summed the sizes of all of the devices in the group before computing 
the maximum it would have been fine, but it only accounts for the size of the 
1st DB device in the group.

The workaround is to trick Ceph into grouping all of the DB devices into unique 
groups of 1 by putting a minimal VG with a unique name on each of the unused 
SSDs, so that when ceph-volume computes the sizing it sees groups of 1 and thus 
doesn't multiply the number of slots incorrectly.  I used "vgcreate bug1 -s 1M 
/dev/xyz" to create a bogus VG on each of the unused SSDs, and now I have 
properly sized DB devices on the new SSDs (the "bugX" VGs can be removed once 
there are legitimate DB VGs on the device).
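
In concrete terms, the workaround was roughly the following (the /dev/sdX names 
are placeholders for our three unused SSDs):

  # put a tiny placeholder VG with a unique name on each unused SSD
  vgcreate bug1 -s 1M /dev/sdx
  vgcreate bug2 -s 1M /dev/sdy
  vgcreate bug3 -s 1M /dev/sdz

  # once legitimate DB VGs/LVs exist on those SSDs, the placeholder
  # VGs can be removed again
  vgremove bug1 bug2 bug3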

Question - Because our cluster was initially laid out using the buggy 
ceph-volume (16.2.10), we now have hundreds of DB devices that are far smaller 
than they should be (far less than the recommended 1-4% of the data device 
size).  Is it possible to resize the DB devices without destroying and 
recreating the OSD itself?

What are the implications of having bluestore DB devices that are far smaller 
than they should be?
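
(For context, the symptom I would expect from undersized DB devices - assuming 
I understand bluestore correctly - is RocksDB data spilling over onto the slow 
data device.  Something like the following should show whether that is already 
happening; osd.12 is just an example ID and "ceph daemon" has to run on that 
OSD's host:)

  # BLUEFS_SPILLOVER shows up in health output when DB data has
  # spilled onto the slow device
  ceph health detail | grep -i spillover

  # per-OSD view of bluefs usage on the db vs. slow device
  ceph daemon osd.12 perf dump bluefs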


thanks,
  Wyllys Ingersoll


________________________________
From: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
Sent: Friday, January 13, 2023 4:35 PM
To: ceph-users@ceph.io <ceph-users@ceph.io>
Subject: [ceph-users] ceph orch osd spec questions


Ceph Pacific 16.2.9

We have a storage server with multiple 1.7TB SSDs dedicated to bluestore DB 
usage.  The OSD spec was originally misconfigured slightly: it set the "limit" 
parameter on the db_devices to 5 (there are 8 SSDs available) and did not 
specify a block_db_size.  Ceph laid out the original 40 OSDs and put 8 DBs on 
each of the 5 selected SSDs (because of the limit param).  Ceph seems to have 
auto-sized the bluestore DB partitions to about 45GB, which is far less than 
the recommended 1-4% (the data drives are 10TB).  How does ceph-volume 
determine the size of the bluestore DB/WAL partitions when it is not specified 
in the spec?
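
The db_devices part of that original spec was roughly this (quoting from 
memory, so the filter values may not be exact - the limit line is the part that 
has since been removed):

  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
    limit: 5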

We updated the spec and specified a block_db_size of 300G and removed the 
"limit" value.  Now we can see in the cephadm.log that the ceph-volume command 
being issued is using the correct list of SSD devices (all 8) as options to the 
lvm batch (--db-devices ...), but it keeps failing to create the new OSD 
because we are asking for 300G and it thinks there is only 44G available even 
though the last 3 SSDs in the list are empty (1.7T).  So, it appears that 
somehow the orchestrator is ignoring the last 3 SSDs.  I have verified that 
these SSDs are wiped clean, have no partitions or LVM, and no label (sgdisk -Z, 
wipefs -a). They appear as available in the inventory and not locked or 
otherwise in use.
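
(For completeness, this is roughly what was run on each of the three new SSDs; 
the device name is a placeholder:)

  sgdisk -Z /dev/sdx      # zap any GPT/MBR structures
  wipefs -a /dev/sdx      # clear remaining filesystem/LVM signatures
  ceph orch device ls     # the SSD then shows as available in the inventory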

Also, the "db_slots" spec parameter is ignored in pacific due to a bug so there 
is no way to tell the orchestrator to use "block_db_slots". Adding it to the 
spec like "block_db_size" fails since it is not recognized.

Any help figuring out why these SSDs are being ignored would be much 
appreciated.

Our spec for this host looks like this:
---

spec:
  data_devices:
    rotational: 1
    size: '3TB:'
  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
  block_db_size: 300G

---
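
(That is just the spec: section; the full file has the usual service_type: osd 
/ service_id / placement wrapper around it and is applied with something like 
the following, where osd-spec.yml is just an example filename:)

  ceph orch apply -i osd-spec.yml --dry-run   # preview what would be created
  ceph orch apply -i osd-spec.yml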

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io