Anthony,

Thank you! This is very helpful information, and thanks for the specific advice on choosing a 64KB min_alloc_size for these drive types. I will do some more review, as I believe they are likely at the 4KB min_alloc_size if that is the default for the `ssd` device class.
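Before changing anything I plan to check what the existing OSDs report. I believe recent releases expose the value an OSD was created with in its metadata, so something along these lines (ID 12 is just an example):

  # What min_alloc_size was this OSD created with? (exact field name may vary by release)
  ceph osd metadata 12 | grep -i alloc

  # Default a newly created SSD-backed OSD would get right now (not what existing OSDs were built with)
  ceph config get osd bluestore_min_alloc_size_ssd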
I will look at using a 64K `min_alloc_size` as the default, if I can do that for a new device class, and then `destroy` each of these OSDs and create them anew with the better `min_alloc_size`. These steps could then be done 1-by-1 for each of the OSDs of this type prior to creating the new pool. If I cannot do that, then I would use `ceph osd crush rm-device-class osd.XX` followed by `ceph osd crush set-device-class qlc osd.XX` to individually reassign the drives to a new class with a simple name like `qlc`, to avoid issues with special characters in the class name. This could be done 1-by-1, watching that the PGs rebalance to the other SSDs in the original pool; a rough sketch of what I have in mind is below.
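If the device-class reassignment route is viable, I am picturing something like the following, pausing after each OSD until all PGs are back to active+clean. The `qlc_rule` / `qlc_pool` names and PG counts are placeholders, and I am assuming a host failure domain:

  # For each of the 15TB drives, one at a time:
  ceph osd crush rm-device-class osd.XX
  ceph osd crush set-device-class qlc osd.XX
  ceph -s    # wait for all PGs to be active+clean before the next OSD

  # Then a rule and pool that target only the new class:
  ceph osd crush rule create-replicated qlc_rule default host qlc
  ceph osd pool create qlc_pool 128 128 replicated qlc_rule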
Thanks again,
Matt

On Tue, Oct 24, 2023 at 12:11 PM Anthony D'Atri <a...@dreamsnake.net> wrote:

> Ah, our old friend the P5316.
>
> A few things to remember about these:
>
> * 64KB IU means that you'll burn through endurance if you do a lot of writes smaller than that. The firmware will try to coalesce smaller writes, especially if they're sequential. You probably want to keep your RGW / CephFS index / metadata pools on other media.
>
> * With Quincy or later and a reasonably recent kernel you can set bluestore_use_optimal_io_size_for_min_alloc_size to true, and OSDs deployed on these should automatically be created with a 64KB min_alloc_size. If you're writing a lot of objects smaller than, say, 256KB -- especially if using EC -- a more nuanced approach may be warranted. ISTR that your data are large sequential files, so probably you can exploit this. For sure you want these OSDs to not have the default 4KB min_alloc_size; that would result in lowered write performance and especially endurance burn. The min_alloc_size cannot be changed after an OSD is created; instead one would need to destroy and recreate.
>
> cf. https://github.com/ceph/ceph/pulls?q=is%3Apr+author%3Acurtbruns
>
> Optimizing RGW Object Storage Mixed Media through Storage Classes and Lua Scripting: https://www.youtube.com/watch?v=w91e0EjWD6E
>
> On Oct 24, 2023, at 11:42, Matt Larson <larsonma...@gmail.com> wrote:
>
> I am looking to create a new pool that would be backed by a particular set of drives that are larger NVMe SSDs (Intel SSDPF2NV153TZ, 15TB drives). Particularly, I am wondering what the best way is to move devices out of one pool and direct them to be used in a new pool to be created. In this case, the documentation suggests I would want to assign them to a new device-class and have a placement rule that targets that device-class in the new pool.
>
> If you're using cephadm / ceph orch you can craft an OSD spec that uses or ignores drives based on size or model.
>
> Multiple pools can share OSDs; for your use-case, though, you probably don't want to.
>
> Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the larger 15TB drives were automatically assigned to the 'ssd' device class that is in use by a different pool. The `ssd` device class is used in a placement rule targeting that class.
>
> The names of device classes are actually semi-arbitrary. The above distinction is made on the basis of whether or not the kernel believes a given device to rotate.
>
> The documentation describes that I could set a device class for an OSD with a command like:
>
> `ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`
>
> Class names can be arbitrary strings like 'big_nvme'.
>
> or "qlc"
>
> Before setting a new device class on an OSD that already has an assigned device class, one should use `ceph osd crush rm-device-class osd.XX`.
>
> Yep. I suspect that's a guardrail to prevent inadvertently trampling.
>
> Can I proceed to directly remove these OSDs from the current device class and assign them to a new device class?
>
> Carpe NAND!
>
> Should they be moved one by one? What is the way to safely protect the data in the existing pool that they are mapped to?
>
> Are there other SSDs in said existing pool? If you reassign all of these, will there be enough survivors to meet replication policy and hold all the data?
>
> One by one would be safe. Doing more than one might be faster and more efficient, depending on your hardware and topology. For sure you don't want to reassign more than one per CRUSH failure domain at a time (host, rack, depends on your setup). If your topology, RAM, and clients are amenable, you could do all OSDs in a single failure domain at once, then proceed to the next only after all PGs are active+clean.
>
> Thanks,
> Matt
>
> --
> Matt Larson, PhD
> Madison, WI 53705 U.S.A.
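One more note: if I end up going the destroy-and-recreate route via cephadm to pick up the 64K min_alloc_size, I am guessing the pieces would look roughly like this. The service_id and host_pattern are placeholders I made up, and the model string is simply what these drives report; matching by size instead might be just as good:

  # Set before (re)deploying, so new OSDs pick up the drive's preferred IO size:
  ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true

  # osd-qlc.yaml -- apply with: ceph orch apply -i osd-qlc.yaml
  service_type: osd
  service_id: qlc_drives
  placement:
    host_pattern: '*'
  spec:
    data_devices:
      model: SSDPF2NV153TZ
    crush_device_class: qlc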
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io