> Indeed! Every Ceph instance I have seen (not many) and almost every HPC 
> storage system I have seen have this problem, and that's because they were 
> never set up to have enough IOPS to support the maintenance load, never mind 
> the maintenance load plus the user load (and as a rule not even the user 
> load).

Yep, this is one of the false economies of spinners.  The SNIA TCO calculator 
includes a performance factor for just this reason.
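
A back-of-the-envelope sketch of the point above (every figure below is an 
assumption for illustration, not a measurement): take a nominal random-IOPS 
budget for a 7,200 rpm drive, reserve some for scrubbing and backfill, and see 
what is left per used TB for clients.

```python
# Rough sketch of the "maintenance plus user load" IOPS budget per OSD drive.
# All numbers are illustrative assumptions, not vendor specs or measurements.

HDD_RANDOM_IOPS = 150   # assumed random-IOPS budget of a 7,200 rpm LFF HDD
SCRUB_IOPS      = 40    # assumed steady deep-scrub overhead
BACKFILL_IOPS   = 60    # assumed overhead while backfilling/rebalancing
USED_TB         = 14    # assumed used capacity on an 18 TB drive

left_for_clients = HDD_RANDOM_IOPS - SCRUB_IOPS - BACKFILL_IOPS
print(f"Client IOPS left per drive: {left_for_clients}")
print(f"Client IOPS per used TB:    {left_for_clients / USED_TB:.1f}")
# A handful of IOPS per used TB is all that remains once maintenance runs,
# which is exactly the congestion described above.
```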

> There is a simple reason why this happens: when a large Ceph (etc.) storage 
> instance is initially set up, it is nearly empty, so it appears to perform 
> well even if it was set up with inexpensive but slow/large HDDs; then it 
> becomes fuller and therefore heavily congested

Data fragments over time with organic growth, and the drive spends a larger 
fraction of time seeking.  I’ve predicted and then seen this even on a cluster 
whose hardware had been blessed by a certain professional services company 
(*ahem*).

> but whoever set it up has already changed jobs or been promoted because of 
> their initial success (or they invent excuses).

`mkfs.xfs -n size=65536` will haunt my nightmares until the end of my days.  As 
will an inadequate LFF HDD architecture I was not permitted to fix, 
*including the mons*.  But I digress.

> A figure-of-merit that matters is IOPS-per-used-TB, and making it large 
> enough to support concurrent maintenance (scrubbing, backfilling, 
> rebalancing, backup) and user workloads. That is *expensive*, so in my 
> experience very few storage instance buyers aim for that.

^^^ This.  Moreover, it’s all too common to try to band-aid this with 
expensive, fussy RoC HBAs with cache RAM and BBU/supercap.  The money spent on 
those, and spent on jumping through their hoops, can easily debulk the HDD-SSD 
CapEx gap.  Plus if your solution doesn’t do the job it needs to do, it is no 
bargain at any price.

This correlates with IOPS/$, a metric in which HDDs are abysmal.
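
To put rough numbers on both figures of merit, here is a minimal sketch; every 
capacity, IOPS figure, and price below is an assumption picked only to show the 
shape of the gap, not a quote for any real product.

```python
# Hypothetical IOPS-per-TB and IOPS-per-dollar comparison, HDD vs. SSD.
# Every number is an illustrative assumption, not a real product spec or price.

drives = {
    # name:             (capacity_tb, random_read_iops, price_usd)
    "18 TB 7.2k HDD":   (18.0,        200,              350),
    "7.68 TB SATA SSD": (7.68,        90_000,           700),
}

for name, (cap_tb, iops, price) in drives.items():
    print(f"{name:17s}  IOPS/TB = {iops / cap_tb:8.0f}   IOPS/$ = {iops / price:7.1f}")
# The HDD loses by orders of magnitude on both metrics, before a single dollar
# is spent on RoC HBAs, cache RAM, BBUs/supercaps, or the labour of babysitting them.
```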

> The CERN IT people discovered long ago that quotes from storage vendors always 
> used very slow/large HDDs that performed very poorly if the specs were given 
> as mere capacity, so they switched to requiring a different metric, 18 MB/s 
> transfer rate of *interleaved* read and write per TB of capacity, that is, at 
> least two parallel access streams per TB.

At least one major SSD manufacturer attends specifically to reads under write 
pressure.
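
For scale, here is what that floor implies per drive.  The 18 MB/s-per-TB 
figure comes from the quoted post; the drive capacities and the streaming-rate 
remark in the comments are assumptions of mine.

```python
# What "18 MB/s of interleaved read+write per TB of capacity" demands per drive.
# The capacities below are examples; the floor itself is from the quoted post.

FLOOR_MB_S_PER_TB = 18

for capacity_tb in (4, 10, 18):
    required = FLOOR_MB_S_PER_TB * capacity_tb
    print(f"{capacity_tb:>2} TB drive needs {required:>3} MB/s of interleaved read+write")
# A big HDD may stream ~200 MB/s sequentially (an assumed figure), but an
# interleaved read/write mix forces seeks and delivers far less, so a 10 TB or
# 18 TB spinner has little hope of holding even this floor under maintenance.
```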

> https://www.sabi.co.uk/blog/13-two.html?131227#131227
> "The issue with disk drives with multi-TB capacities"
> 
> BTW I am not sure that a floor of 18 MB/s of interleaved read and write per TB 
> is high enough to support simultaneous maintenance and user loads for most 
> Ceph instances, especially in HPC.
> 
> I have seen HPC storage systems "designed" around 10TB and even 18TB HDDs, 
> and the best that can be said about those HDDs is that they should be 
> considered "tapes" with some random access ability.

Yes!  This harks back to DECtape https://www.vt100.net/timeline/1964.html, which 
was literally this; people even used it as a filesystem.  Some years ago I had 
Brian Kernighan sign one: “Wow, I haven’t seen one of these in YEARS!”
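
The "tape with some random access" quip is easy to quantify; both effective 
rates below are assumptions (pure streaming vs. a seek-bound mixed workload), 
but the shape of the result is the point.

```python
# How long does it take just to read (or backfill) one large HDD end to end?
# Both throughput figures are illustrative assumptions, not measurements.

CAPACITY_TB = 18

for label, mb_per_s in (("streaming only", 200), ("mixed R/W with seeks", 40)):
    hours = CAPACITY_TB * 1_000_000 / mb_per_s / 3600
    print(f"{label:21s}: {hours:6.1f} hours ({hours / 24:.1f} days)")
# Roughly a day in the best case and most of a week in the realistic one,
# which is why these drives behave like tape during recovery.
```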

— aad
