Anthony,

 Thank you! This is very helpful information, and thanks for the specific
advice on choosing a 64KB min_alloc_size for these drive types. I will do
some more review, as I believe these OSDs are likely at the 4KB
min_alloc_size if that is the default for the `ssd` device-class.

  I will look into using the 64KB *min_alloc_size* as the default, if I can
do so for a new device-class, and then `destroy` each of these OSDs and
recreate them so they pick up the better `min_alloc_size`. These steps
could be done one-by-one for each of the OSDs of this type before creating
the new pool.
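
For reference, a rough per-OSD sketch of that cycle, assuming a
cephadm-managed cluster; the OSD id, hostname, and device path below are
placeholders:

  # have new OSDs on these drives pick up a 64KB min_alloc_size (Quincy+)
  ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true

  # with cephadm: drain one OSD, keep its id reserved for replacement,
  # and zap the device once it is empty
  ceph orch osd rm 12 --replace --zap

  # if no OSD service spec picks the zapped device back up automatically,
  # recreate the OSD explicitly on the same device
  ceph orch daemon add osd myhost:/dev/nvme0n1

  # wait until all PGs are active+clean before repeating on the next OSD
  ceph -s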

  If I cannot do that, then I would use `ceph osd crush rm-device-class
osd.XX` followed by `ceph osd crush set-device-class qlc osd.XX` to
individually reassign the drives to a new class with a simple name like
`qlc`, avoiding issues with special characters in the class name. This
could be done one-by-one, watching that the PGs rebalance onto the other
SSDs in the original pool.
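
As a rough sketch of that path (the `qlc` class follows Anthony's
suggestion; the rule name, pool name, PG count, and `host` failure domain
below are placeholders that would need to match our topology):

  # move one OSD into the new class, then let the PGs rebalance
  ceph osd crush rm-device-class osd.XX
  ceph osd crush set-device-class qlc osd.XX
  ceph pg stat    # move to the next OSD only once all PGs are active+clean

  # once all of the target OSDs carry the new class, add a rule and a pool
  ceph osd crush rule create-replicated qlc_rule default host qlc
  ceph osd pool create qlc_pool 128 128 replicated qlc_rule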

 Thanks again,
   Matt

On Tue, Oct 24, 2023 at 12:11 PM Anthony D'Atri <a...@dreamsnake.net> wrote:

> Ah, our old friend the P5316.
>
> A few things to remember about these:
>
> * 64KB IU means that you'll burn through endurance if you do a lot of
> writes smaller than that.  The firmware will try to coalesce smaller
> writes, especially if they're sequential.  You probably want to keep your
> RGW / CephFS index / metadata pools on other media.
>
>
> * With Quincy or later and a reasonably recent kernel you can set
> bluestore_use_optimal_io_size_for_min_alloc_size to true and OSDs deployed
> on these should automatically be created with a 64KB min_alloc_size.  If
> you're writing a lot of objects smaller than, say, 256KB -- especially if
> using EC -- a more nuanced approach may be warranted.  ISTR that your data
> are large sequential files, so probably you can exploit this.  For sure you
> want these OSDs to not have the default 4KB min_alloc_size; that would
> result in lowered write performance and especially endurance burn.  The
> min_alloc_size cannot be changed after an OSD is created; instead one would
> need to destroy and recreate.
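>
> For example, roughly (the OSD id is a placeholder, and the option only
> takes effect when an OSD is created):
>
>   ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true
>
>   # on recent releases the value baked in at creation time shows up in
>   # the OSD metadata:
>   ceph osd metadata 12 | grep min_alloc_size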
>
> cf. https://github.com/ceph/ceph/pulls?q=is%3Apr+author%3Acurtbruns
>
> Optimizing RGW Object Storage Mixed Media through Storage Classes and Lua
> Scripting: https://www.youtube.com/watch?v=w91e0EjWD6E
>
>
>
>
> On Oct 24, 2023, at 11:42, Matt Larson <larsonma...@gmail.com> wrote:
>
> I am looking to create a new pool that would be backed by a particular set
> of drives that are larger NVMe SSDs (Intel SSDPF2NV153TZ, 15TB drives). In
> particular, I am wondering what the best way is to move devices out of one
> pool and direct them to be used by a new pool that I will create. In this
> case, the documentation suggests I would want to assign them to a new
> device-class and have a placement rule in the new pool that targets that
> device-class.
>
>
> If you're using cephadm / ceph orch you can craft an OSD spec that uses or
> ignores drives based on size or model.
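>
> For example, a rough spec along these lines (the service id, host pattern,
> and model string are placeholders to adjust for your environment):
>
>   # osd-qlc.yaml
>   service_type: osd
>   service_id: qlc_osds
>   placement:
>     host_pattern: '*'
>   spec:
>     data_devices:
>       model: SSDPF2NV153TZ
>
>   ceph orch apply -i osd-qlc.yaml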
>
> Multiple pools can share OSDs, though for your use-case you probably don't
> want them to.
>
>
> Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the
> larger 15TB drives were automatically assigned to the 'ssd' device class
> that is in use by a different pool. The `ssd` device class is used in a
> placement rule targeting that class.
>
>
> The names of device classes are actually semi-arbitrary.  The above
> distinction is made on the basis of whether or not the kernel believes a
> given device to rotate.
>
>
> The documentation describes that I could set a device class for an OSD with
> a command like:
>
> `ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`
>
> Class names can be arbitrary strings like 'big_nvme'.
>
>
> or "qlc"
>
> Before setting a new
> device class on an OSD that already has an assigned device class, one
> should first use `ceph osd crush rm-device-class osd.XX`.
>
>
> Yep.  I suspect that's a guardrail to prevent inadvertently trampling.
>
>
> Can I proceed to directly remove these OSDs from the current device class
> and assign to a new device class?
>
>
> Carpe NAND!
>
> Should they be moved one by one? What is
> the way to safely protect the data in the existing pool that they are
> mapped to?
>
>
> Are there other SSDs in said existing pool?  If you reassign all of these,
> will there be enough survivors to meet replication policy and hold all the
> data?
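>
> For instance, a quick way to sanity-check the remaining per-class capacity
> against the existing pool (the pool name is a placeholder):
>
>   ceph osd df tree
>   ceph df detail
>   ceph osd pool get yourpool size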
>
> One by one would be safe.  Doing more than one might be faster and more
> efficient, depending on your hardware and topology.  For sure you don't
> want to reassign more than one per CRUSH failure domain at a time (host,
> rack, depends on your setup).  If your topology, RAM, and clients are
> amenable, you could do all OSDs in a single failure domain at once, then
> proceed to the next only after all PGs are active+clean.
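>
> A small sketch of that pacing, assuming the common `host` failure domain:
>
>   # after reassigning the OSDs in one failure domain:
>   watch ceph pg stat     # move on only once everything is active+clean
>   ceph health detail     # confirm nothing is degraded or misplaced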
>
>
> Thanks,
>  Matt
>
> --
> Matt Larson, PhD
> Madison, WI  53705 U.S.A.
>
>
>

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
