> Indeed! Every Ceph instance I have seen (not many) and almost every HPC
> storage system I have seen have this problem, and that's because they were
> never set up to have enough IOPS to support the maintenance load, never mind
> the maintenance load plus the user load (and as a rule not even the user
> load).
Yep, this is one of the false economies of spinners. The SNIA TCO calculator includes a performance factor for just this reason.

> There is a simple reason why this happens: when a large Ceph (etc.) storage
> instance is initially set up, it is nearly empty, so it appears to perform
> well even if it was set up with inexpensive but slow/large HDDs; then it
> becomes fuller and therefore heavily congested.

Data fragments over time with organic growth, and the drives spend a larger fraction of their time seeking. I've predicted, then seen, this even on a cluster whose hardware had been blessed by a certain professional services company (*ahem*).

> but whoever set it up has already changed jobs or been promoted because of
> their initial success (or they invent excuses).

`mkfs.xfs -n size=65536` will haunt my nightmares until the end of my days. As will an inadequate LFF HDD architecture I was not permitted to fix, *including the mons*. But I digress.

> A figure-of-merit that matters is IOPS-per-used-TB, and making it large
> enough to support concurrent maintenance (scrubbing, backfilling,
> rebalancing, backup) and user workloads. That is *expensive*, so in my
> experience very few storage instance buyers aim for that.

^^^ This. Moreover, it's all too common to try to band-aid this with expensive, fussy RoC HBAs with cache RAM and a BBU/supercap. The money spent on those, and on jumping through their hoops, can easily debulk the HDD-SSD CapEx gap. Plus, if your solution doesn't do the job it needs to do, it is no bargain at any price. This correlates with IOPS/$, a metric in which HDDs are abysmal.

> The CERN IT people discovered long ago that quotes for storage servers always
> used very slow/large HDDs that performed very poorly if the specs were given
> as mere capacity, so they switched to requiring a different metric: 18MB/s
> transfer rate of *interleaved* read and write per TB of capacity, that is at
> least two parallel access streams per TB.
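To make the CERN-style floor concrete, here is a quick back-of-the-envelope sketch. The 18 MB/s-per-TB figure comes from the quoted text; the drive count and per-drive interleaved rate are illustrative assumptions of mine, not vendor specs or measurements:

```python
# Back-of-the-envelope: does an HDD-based cluster meet an
# 18 MB/s-per-TB interleaved read+write throughput floor?
# All per-drive numbers below are illustrative assumptions.

def required_throughput_mb_s(usable_tb: float, floor_mb_s_per_tb: float = 18.0) -> float:
    """Aggregate interleaved throughput the floor demands for a given capacity."""
    return usable_tb * floor_mb_s_per_tb

def cluster_throughput_mb_s(n_drives: int, per_drive_mb_s: float) -> float:
    """Optimistic aggregate: every drive delivering its interleaved rate at once."""
    return n_drives * per_drive_mb_s

# Hypothetical cluster: 100 x 18 TB HDDs, assuming ~40 MB/s per drive
# once interleaved reads and writes force constant seeking (a generous
# guess for a fragmented, mostly full spinner).
usable = 100 * 18                                # 1800 TB
need = required_throughput_mb_s(usable)          # 32400 MB/s
have = cluster_throughput_mb_s(100, 40.0)        # 4000 MB/s

print(f"need {need:.0f} MB/s, have {have:.0f} MB/s -> "
      f"{'OK' if have >= need else 'short by %.1fx' % (need / have)}")
# -> need 32400 MB/s, have 4000 MB/s -> short by 8.1x
```

Even under these generous assumptions the big-HDD design misses the floor by nearly an order of magnitude, which is why such drives end up treated as "tapes with some random access ability".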
At least one major SSD manufacturer attends specifically to reads under write pressure.

> https://www.sabi.co.uk/blog/13-two.html?131227#131227
> "The issue with disk drives with multi-TB capacities"
>
> BTW I am not sure that a floor of 18MB/s of interleaved read and write per TB
> is high enough to support simultaneous maintenance and user loads for most
> Ceph instances, especially in HPC.
>
> I have seen HPC storage systems "designed" around 10TB and even 18TB HDDs,
> and the best that can be said about those HDDs is that they should be
> considered "tapes" with some random access ability.

Yes! This harks back to DECtape, https://www.vt100.net/timeline/1964.html, which was literally this; people even used it as a filesystem. Some years ago I had Brian Kernighan sign one: "Wow, I haven't seen one of these in YEARS!"

— aad

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io