[ceph-users] Re: Performance improvement suggestion
> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion, this proposal is the opposite of the current policy, which is to wait for all replicas to be written before writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

"After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully."

A more revolutionary option would be for 'librados' to write in parallel to all the "acting set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is 'k' synchronous (the write completes to the client only when at least 'k' replicas, including the primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
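The proposed 'k' synchronous + 'm' asynchronous acknowledgement policy can be sketched as a toy model (this is a simulation, not Ceph code; all names are made up for illustration):

```python
# Toy model of the proposed k-sync / m-async replication ack policy.
# Not Ceph code: replica commits are simulated as plain function calls.

def write_object(replicas, k):
    """Ack the client once k replicas (primary included) have committed;
    the remaining m writes complete asynchronously in the background."""
    committed = []
    pending = list(replicas)
    # Synchronous phase: wait for the first k commits.
    while len(committed) < k and pending:
        r = pending.pop(0)
        r["committed"] = True          # simulate fsync on this replica
        committed.append(r)
    ack_to_client = len(committed) >= k
    # Asynchronous phase: the rest are flushed later.
    background = pending
    return ack_to_client, committed, background

replicas = [{"osd": n, "committed": False} for n in (1, 2, 3)]
acked, sync_done, async_left = write_object(replicas, k=2)
print(acked, len(sync_done), len(async_left))   # ack after 2 commits, 1 pending
```

With k=2 and 3 replicas this matches the min_size=2 suggestion: the client is acknowledged after the primary plus one replica commit, while the third write proceeds asynchronously.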
[ceph-users] Re: Scrubbing?
> [...] After a few days, I have on our OSD nodes around 90MB/s
> read and 70MB/s write while 'ceph -s' have client io as
> 2,5MB/s read and 50MB/s write. [...]

This is one of my pet peeves: a storage system must have the capacity (principally IOPS) to handle both a maintenance workload and a user workload, and since the former often involves whole-storage or whole-metadata operations it can be quite heavy, especially in the case of Ceph, where rebalancing, scrubbing and checking should be fairly frequent to detect and correct inconsistencies.

> Is this activity OK? [...]

Indeed. Some "clever" people "save money" by "rightsizing" their storage so it cannot run the maintenance and the user workload at the same time, and so turn off the maintenance workload, because they "feel lucky" I guess, but I do not recommend that. :-). I have seen more than one Ceph cluster that did not have the capacity even to run *just* the maintenance workload.
[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?
>> So we were going to replace a Ceph cluster with some hardware we had
>> laying around using SATA HBAs but I was told that the only right way
>> to build Ceph in 2023 is with direct attach NVMe.

My impressions are somewhat different:

* Nowadays it is rather more difficult to find 2.5in SAS or SATA "enterprise" SSDs than most NVMe types. NVMe as a host bus also has much greater bandwidth than SAS or SATA, but Ceph is mostly about IOPS rather than single-device bandwidth. So in general, willingly or not, one has got to move to NVMe.

* Ceph was designed (and most people have forgotten it) for many small-capacity 1-OSD cheap servers, and lots of them, but unfortunately it is not easy to find small cheap "enterprise" SSD servers, in part because many people rather unwisely use capacity per server-price as the figure of merit. Most NVMe servers have many slots, which means either RAID-ing devices into a small number of large OSDs, which goes against all Ceph stands for, or running many OSD daemons on one system, which works, more or less, but is not ideal.

>> Does anyone have any recommendation for a 1U barebones server
>> (we just drop in ram disks and cpus) with 8-10 2.5" NVMe bays
>> that are direct attached to the motherboard without a bridge
>> or HBA for Ceph specifically?

> If you're buying new, Supermicro would be my first choice for
> vendor based on experience.
> https://www.supermicro.com/en/products/nvme

Indeed, SuperMicro does them fairly well, and there are also GigaByte, and Tyan I think; I have not yet seen Intel-based models.

> You said 2.5" bays, which makes me think you have existing
> drives. There are models to fit that, but if you're also
> considering new drives, you can get further density in E1/E3

BTW "NVMe" is a bus specification (something not too different from SCSI-over-PCIe), and there are several different physical specifications, like 2.5in U.2 (SFF-8639), 2.5in U.3 (SFF-TA-1001), and various types of EDSFF (SFF-TA-1006,7,8).
U.3 is still difficult to find but its connector supports SATA, SAS and NVMe U.2; I have not yet seen EDSFF boxes actually available retail without enormous delivery times; I guess the big internet companies buy all the available production.

https://nvmexpress.org/wp-content/uploads/Session-4-NVMe-Form-Factors-Developer-Day-SSD-Form-Factors-v8.pdf
https://media.kingston.com/kingston/content/ktc-content-nvme-general-ssd-form-factors-graph-en-3.jpg
https://media.kingston.com/kingston/pdf/ktc-article-understanding-ssd-technology-en.pdf
https://www.snia.org/sites/default/files/SSSI/OCP%20EDSFF%20JM%20Hands.pdf

> The only caveat is that you will absolutely want to put a
> better NIC in these systems, because 2x10G is easy to saturate
> with a pile of NVME.

That's one reason why Ceph was designed for many small 1-OSD servers (ideally distributed across several racks) :-). Note: to maximize chances of many-to-many traffic instead of many-to-one. Anyhow, Ceph again is all about lots of IOPS more than bandwidth, but if you need bandwidth nowadays many 10Gb NICs support 25Gb/s too, and 40Gb/s and 100Gb/s are no longer that expensive (but the cables are horrible).
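The "easy to saturate" point is simple arithmetic; a back-of-envelope sketch (the per-device bandwidth is a rough assumption for a PCIe 3.0 x4 drive, not a figure from the thread):

```python
# Rough saturation check: aggregate NVMe read bandwidth vs. NIC bandwidth.
# Assumed figures: ~3 GB/s sequential read per NVMe drive (PCIe 3.0 x4),
# and 10 Gb/s ~= 1.25 GB/s per NIC port.

nvme_drives = 8
gbs_per_drive = 3.0           # GB/s, assumed per-device sequential read
nic_ports = 2
gbs_per_port = 10 / 8         # 10 Gb/s expressed in GB/s

drive_bw = nvme_drives * gbs_per_drive    # potential device bandwidth
nic_bw = nic_ports * gbs_per_port         # available network bandwidth
print(f"drives {drive_bw:.1f} GB/s vs NICs {nic_bw:.1f} GB/s "
      f"-> oversubscribed ~{drive_bw / nic_bw:.0f}x")
```

Even 8 drives oversubscribe a bonded 2x10G link by roughly an order of magnitude on sequential bandwidth, though for a typical Ceph IOPS-bound workload the gap is much smaller in practice.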
[ceph-users] Re: CephFS mirror very slow (maybe for small files?)
> the speed of data transfer is varying a lot over time (200KB/s
> – 120MB/s). [...] The FS in question, has a lot of small files
> in it and I suspect this is the cause of the variability – ie,
> the transfer of many small files will be more impacted by
> greater site-site latency.

200KB/s on small files across sites? That's pretty good. I have seen rates of 3-5KB/s on some Ceph instances for reading local small files, never mind remotely.

> If this suspicion is true, what options do I have to improve
> the overall throughput?

In practice not much. Perhaps switching to all-RAM storage (with battery backup) for OSDs might help :-). In one case, by undoing some of the more egregious issues, I managed to improve small-file transfer rates locally by 10 times, that is to 40-60KB/s. In your case a 10-times improvement, if achievable, might get you transfer rates of 2MB/s. Often the question is not just longer network latency, but whether your underlying storage can sustain the IOPS needed for "scan" type operations at the same time as the user workload.

Perhaps it would go a lot faster if you just RSYNC, or even just 'tar -f - -c ... | ssh ... tar -f - -x' (or 'rclone' if you don't use CephFS), and it would be worth doing a test of transferring a directory (or bucket if you don't use CephFS) with small files by RSYNC and/or 'tar' to a non-Ceph remote target and a Ceph remote target to see what you could achieve.

No network/sharded filesystem (and very few local ones) handles small files well. In some cases I have seen, Ceph was used to store a traditional filesystem image of a type more suitable for small files, mounted on a loop device.

https://www.sabi.co.uk/blog/anno05-4th.html?051016#051016
https://www.sabi.co.uk/blog/0909Sep.html?090919#090919
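Why small files dominate: each file costs at least one round trip of metadata work before its data can flow, so throughput collapses as latency grows. A toy model (the latency, size and bandwidth figures are illustrative assumptions, not measurements from the thread):

```python
# Toy model: effective per-stream throughput for many small files,
# assuming each file costs one site-to-site round trip plus transfer time.

def throughput_kbs(file_kb, rtt_ms, link_mbs):
    """Effective KB/s when each file pays rtt_ms before transferring."""
    per_file_s = rtt_ms / 1000 + file_kb / 1024 / link_mbs
    return file_kb / per_file_s

# 4 KiB files over a 30 ms WAN link with 100 MB/s of raw bandwidth:
print(round(throughput_kbs(4, 30, 100)), "KB/s")      # latency-bound
# Same link, 64 MiB files:
print(round(throughput_kbs(64 * 1024, 30, 100)), "KB/s")  # bandwidth-bound
```

With these assumed numbers the small-file case is stuck around 133 KB/s regardless of link speed, while the large-file case approaches the raw bandwidth, which is the shape of the variability reported above.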
[ceph-users] Re: CEPH Cluster performance review
>>> during scrubbing, OSD latency spikes to 300-600 ms,

>> I have seen Ceph clusters spike to several seconds per IO
>> operation as they were designed for the same goals.

>>> resulting in sluggish performance for all VMs. Additionally,
>>> some OSDs fail during the scrubbing process.

>> Most likely they time out because of IO congestion rather than
>> failing.

> bluestore(/var/lib/ceph/osd/ceph-10) log_latency slow operation observed for next, latency = 74835459564ns
> bluestore(/var/lib/ceph/osd/ceph-10) log_latency slow operation observed for next, latency = 42822161884ns

74.8s? 42.8s?
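For the record, converting those BlueStore log_latency values from nanoseconds:

```python
# Convert the quoted BlueStore slow-operation latencies from ns to seconds.
for ns in (74835459564, 42822161884):
    print(f"{ns} ns = {ns / 1e9:.1f} s")
```

Over a minute for a single operation, which is consistent with OSDs appearing to "fail" during scrubbing: they are simply timing out.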
[ceph-users] Re: CEPH Cluster performance review
> during scrubbing, OSD latency spikes to 300-600 ms,

I have seen Ceph clusters spike to several seconds per IO operation as they were designed for the same goals.

> resulting in sluggish performance for all VMs. Additionally,
> some OSDs fail during the scrubbing process.

Most likely they time out because of IO congestion rather than failing.

> In such instances, promptly halting the scrubbing resolves the
> issue.

> (6 SSD node + 6 HDD node) All nodes are connected through 10G
> bonded link, i.e. 10Gx2=20GB for each node.
> 64 SSD 42 HDD 106
> one-ssd 256 active+clean
> one-hdd 512 active+clean
> cloudstack.hdd 512 active+clean

Your Ceph cluster has been optimized for high latency and IO congestion, goals that are surprisingly quite common, and it is performing well given its design parameters (it is far from full; if it becomes fuller it will achieve its goals even better).

https://www.sabi.co.uk/blog/15-one.html?150305#150305 "How many VMs per disk arm?"
https://www.sabi.co.uk/blog/15-one.html?150329#150329 "CERN's old large disk discussion and IOPS-per-TB"
[ceph-users] Re: How do you handle large Ceph object storage cluster?
> [...] (>10k OSDs, >60 PB of data).

6TB on average per OSD? Hopefully SSDs or RAID10 (or low-member-count, 3-5, RAID5).

> It is entirely dedicated to object storage with S3 interface.
> Maintenance and its extension are getting more and more
> problematic and time consuming.

Ah the joys of a single large unified storage pool :-).

https://www.sabi.co.uk/blog/0804apr.html?080417#080417

> We consider to split it to two or more completely separate
> clusters

I would suggest doing it 1-2 years ago...

> create S3 layer of abstraction with some additional metadata
> that will allow us to use these 2+ physically independent
> instances as a one logical cluster.

That's what the bucket hierarchy in a Ceph cluster instance already does. What your layer is going to do is either:

1) Look up the object ID in a list of instances, and fetch the object from the instance that validates the object ID;

2) Maintain a huge table of all object IDs and which instances they are in.

But 1) is basically what CRUSH already does and 2) means giving up the Ceph "decentralized" philosophy based on CRUSH.

BTW, one old practice that few systems follow is to use as object keys neither addresses nor identifiers, but *both*: first access the address, treating it as a hint; check that the identifier matches; if it does not, do a slower lookup using the object-identifier part to find the actual address.

> Additionally, newest data is the most demanded data, so we
> have to spread it equally among clusters to avoid skews in
> cluster load.

I usually do the opposite, but that depends on your application. My practice is to recognize that data is indeed usually stratified by date, and regard filesystem instances as "silos": create a new filesystem instance every some months or years, direct all new file creation to the latest instance, and then progressively get rid of the older instances, copying their "active" data onwards into the new instance and the "inactive" data to offline storage.
http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b

If you really need to keep all data forever online, which is usually not the case (that's why there are laws that expire matters after N years), the second best option is to keep the old silos powered up indefinitely; they will take very little attention beyond refreshing the hardware periodically and migrating the data to new instances when that stops being economical.
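The address-plus-identifier key scheme described above can be sketched as a toy in-memory model (the names and structure here are hypothetical, not any S3/RGW API):

```python
# Toy sketch of "address as hint, identifier as truth" object keys.
# A key carries both a placement hint (a cluster name) and a unique id;
# reads try the hinted cluster first and fall back to a slow search.

clusters = {
    "east": {"id-123": b"payload-a"},
    "west": {"id-456": b"payload-b"},
}

def get(key):
    hint, obj_id = key               # key = (address hint, identifier)
    # Fast path: the hint is usually right, so this is one lookup.
    obj = clusters.get(hint, {}).get(obj_id)
    if obj is not None:
        return obj
    # Slow path: the object moved; search every cluster by identifier.
    for store in clusters.values():
        if obj_id in store:
            return store[obj_id]
    raise KeyError(obj_id)

print(get(("east", "id-123")))       # fast path: hint is correct
print(get(("east", "id-456")))       # stale hint: falls back to slow lookup
```

The point of the scheme is that data can migrate between instances without rewriting every stored key: stale hints degrade to a slower lookup instead of failing.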
[ceph-users] Re: How to deal with increasing HDD sizes ? 1 OSD for 2 LVM-packed HDDs ?
> * Ceph cluster with old nodes having 6TB HDDs
> * Add new node with new 12TB HDDs

Halving the IOPS-per-TB?

https://www.sabi.co.uk/blog/17-one.html?170610#170610
https://www.sabi.co.uk/blog/15-one.html?150329#150329

> Is it supported/recommended to pack 2 6TB HDDs handled by 2
> old OSDs into 1 12TB LVM disk handled by 1 new OSD ?

The OSDs are just random daemons; what matters to chunk distribution in Ceph is buckets, and in this case leaf buckets. So it all depends on the CRUSH map, but I suspect that manipulating it so that two existing leaf buckets become one is not possible, or too tricky to attempt. One option would be to divide the 12TB disk into 2 partitions/LVs of 6TB and run 2 OSDs against it. It is not recommended, but I don't see a big issue in this case other than IOPS-per-TB.
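The "halving IOPS-per-TB" point in numbers, assuming a typical ~150 random IOPS per 7200rpm spindle (that per-drive figure is an assumption, not from the thread):

```python
# IOPS-per-TB for 6TB vs 12TB HDDs, assuming ~150 random IOPS per
# spindle regardless of capacity (roughly true for 7200rpm drives).

hdd_iops = 150
for capacity_tb in (6, 12):
    print(f"{capacity_tb}TB HDD: {hdd_iops / capacity_tb:.1f} IOPS per TB")
```

Doubling per-drive capacity without adding spindles halves the IOPS available per TB of data, which is why the larger drives make the cluster slower per unit of stored data even though each drive is individually no slower.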
[ceph-users] Re: Time Estimation for cephfs-data-scan scan_links
[...]

> What is being done is a serial tree walk and copy in 3
> replicas of all objects in the CephFS metadata pool, so it
> depends on both the read and write IOPS rate for the metadata
> pools, but mostly on the write IOPS. [...] Wild guess:
> metadata is on 10x 3.84TB SSDs without persistent cache, data
> is on 48x 8TB devices probably HDDs. Very cost effective :-).

I do not know if those guesses are right, but in general most Ceph instances I have seen have been designed with the "cost effective" choice of providing enough IOPS to run the user workload (but often not even that), and not enough to also run the admin workload quickly (checking, scanning, scrubbing, migrating, 'fsck' or 'resilvering' of the underlying filesystem). There is often a similar situation for non-HPC filesystem types, but the scale and pressure on instances of those are usually much lower than for HPC filesystem instances, so the consequences are less obvious.
[ceph-users] Re: Time Estimation for cephfs-data-scan scan_links
>> However, I've observed that the cephfs-data-scan scan_links step has
>> been running for over 24 hours on 35 TB of data, which is replicated
>> across 3 OSDs, resulting in more than 100 TB of raw data.

What matters is the number of "inodes" (and secondarily their size), that is the number of metadata objects, which is proportional to the number of files and directories in the CephFS instance.

>> Does anyone have an estimation on the duration for this step?

> scan_links has to iterate through every object in the metadata pool
> and for each object iterate over the omap key/values - so this step
> scales to the amount of objects in the metadata pool, i.e., the number
> of directories and files in the file system.

>> pools: 12 pools, 1475 pgs
>> objects: 50.89M objects, 72 TiB
>> usage: 207 TiB used, 148 TiB / 355 TiB avail
>> pgs: 579358/152674596 objects misplaced (0.379%)

51M objects between data and metadata, average object space used 4MiB, average metadata per object 230KiB (looks like 3-way replication as per default).

>> POOL              TYPE      USED   AVAIL
>> cephfs_metadata   metadata  1045G  35.6T
>> cephfs.c3sl.data  data      114T   35.6T
[...]
>> POOL              TYPE      USED   AVAIL
>> cephfs.c3sl.meta  metadata  28.2G  35.6T
>> cephfs.c3sl.data  data      114T   35.6T

Total between data and metadata 142TiB, so CephFS uses around 2/3 of the 207TiB stored in this Ceph instance, so perhaps 2/3 of the objects too, so maybe 35M objects in the CephFS instance. What is being done is a serial tree walk and copy in 3 replicas of all objects in the CephFS metadata pool, so it depends on both the read and write IOPS rate for the metadata pools, but mostly on the write IOPS. Note: it is somewhat like an 'fsck', but an 'fsck' that makes 3 copies of each inode.

I wonder whether the source (presumably 'cephfs_metadata') and target (presumably 'cephfs.c3sl.meta') pools are on the same physical devices, whether they are SSDs with high small-write rates or not, and what the physical storage properties are.
Wild guess: metadata is on 10x 3.84TB SSDs without persistent cache, data is on 48x 8TB devices, probably HDDs. Very cost effective :-).

> mds: 0/10 daemons up (10 failed), 9 standby
> osd: 48 osds: 48 up (since 32h), 48 in (since 2M); 22 remapped pgs

Overall it looks like 1 day copied 1TB out of 28TB in metadata, so it looks like it will take a month. 1TB of metadata means 1.5M 230KiB metadata objects processed in 1 day, so around 15 metadata objects read and written in 3 copies per second, with a 12MB/s metadata storage write rate, which are plausible numbers for a metadata pool on SSDs with non-persistent cache, so the estimate of 3-4 more weeks looks plausible again. Running something like 'iostat -dk -zy 1' on one of the servers with metadata drives might also help get an idea.
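The back-of-envelope figures above can be roughly reproduced (taking the ~1TB/day progress figure at face value and using decimal units; the results come out in the same ballpark as the 1.5M/day, ~15/s and 12MB/s quoted):

```python
# Back-of-envelope check of the scan_links progress estimate:
# ~1 TB of metadata processed per day, objects averaging ~230 KB,
# each written in 3 replicas.

day_s = 86_400
bytes_per_day = 1e12                     # ~1 TB/day observed progress
obj_bytes = 230 * 1000                   # ~230 KB per metadata object
objects_per_day = bytes_per_day / (obj_bytes * 3)   # 3 copies each
print(f"~{objects_per_day / 1e6:.1f}M objects/day")
print(f"~{objects_per_day / day_s:.0f} logical objects/s")
print(f"~{bytes_per_day / day_s / 1e6:.0f} MB/s raw metadata writes")
```

A write rate of roughly a dozen MB/s sustained on a metadata pool is indeed plausible for SSDs without a persistent write cache doing small serialized writes, which is why the month-long estimate holds up.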
[ceph-users] Re: Decrepit ceph cluster performance
> We recently started experimenting with Proxmox Backup Server,
> which is really cool, but performs enough IO to basically lock
> out the VM being backed up, leading to IO timeouts, leading to
> user complaints. :-(

The two most common things I have had to fix over the years in storage systems I have inherited have been:

* Too low IOPS-per-TB to handle a realistic workload.

* Too few total IOPS to handle the user and sysadmin (checking, scrubbing, backup, balancing, backfilling, ...) workloads.

Both happen because most sysadmins are heavily incentivized to save money now, even if there is a huge price to pay later when the storage capacity fills up.

An SSD-based storage cluster like the one you have to deal with has plenty of IOPS, so your case is strange, in particular that latencies in your tests are low at the same time as IO rates are low; badly overloaded storage complexes have latencies of 1 second and way above. That your test reports small average latencies but a max latency of 37s, and that long pauses with 0 IOPS are reported, is suspicious. It could be that *some* OSD SSDs are not in good condition and they slow down everything, as the Ceph daemons wait for the slowest OSD to respond. 37s looks like retries on a failing SSD.

In an ideal world you would have on the cluster a capacity monitor like Ganglia etc. showing year-long graphs of network bandwidth and IO rates and latencies, but I guess this was not set up like that.

> The SSDs are probably 5-8 years old. The OSDs were rebuilt to
> bluestore around the luminous timeframe. (Nautilus, maybe. It
> was a while ago.)

>> Newer SSD controllers / models are better than older models
>> at housekeeping over time, so the secure-erase might freshen
>> performance.

Indeed, 5-8 year old firmware may not be as sophisticated as more recent firmware, in particular as to needing periodic explicit TRIMs.
As to that I noticed this:

>>> Its primary use is serving RBD VM block devices for Proxmox

A VM workload, and in particular RBD, often involves very small random writes, and "mixed-use" SSDs are not as suitable for that, in particular if the usual and insane practice of having VM operating systems log to virtual disks has been followed. So the physical storage on the SSDs may have become hideously fragmented, thus indeed requiring TRIMs, especially if the endurance levels are low (which is dangerous), and especially if the workload never pauses enough to run the firmware compaction mechanism (which is likely, given that the storage complex cannot sustain both the user workload and backups).

In particular, check the logs of these OSDs to see which specific SSDs are reporting the slowest IO:

>>> 36 slow ops, oldest one blocked for 37 sec, daemons
>>> [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
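A quick way to spot the slow outliers is to compare per-OSD commit latencies against the cluster median; in practice the numbers would come from `ceph osd perf` output, but here they are made-up samples for illustration:

```python
# Flag outlier OSDs from per-OSD commit latencies (values in ms).
# The numbers below are made up for illustration; in practice they
# would be parsed from `ceph osd perf` output.
from statistics import median

latencies = {"osd.10": 870, "osd.12": 15, "osd.13": 12, "osd.14": 930,
             "osd.15": 11, "osd.17": 14, "osd.2": 9, "osd.25": 13}

med = median(latencies.values())
suspects = sorted(o for o, ms in latencies.items() if ms > 10 * med)
print(f"median {med} ms; suspects: {suspects}")
```

With a few failing SSDs in the mix, the median stays low while the outliers stand out by orders of magnitude, matching the pattern of low average latency but a 37s maximum described above.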
[ceph-users] Re: Workload that delete 100 M object daily via lifecycle
> [...] S3 workload, that will need to delete 100M file daily [...]

>> [...] average (what about peaks?) around 1,200 committed
>> deletions per second (across the traditional 3 metadata
>> OSDs) sustained, that may not leave a lot of time for file
>> creation, writing or reading. :-) [...]

>>> [...] So many people seem to think that distributed (or
>>> even local) filesystems (and in particular their metadata
>>> servers) can sustain the same workload as high volume
>>> transactional DBMSes. [...]

> Index pool distributed over a large number of NVMe OSDs?
> Multiple, dedicated RGW instances that only run LC?

As long as that guarantees a total maximum network+write latency of well below 800µs across all of them, that might result in a committed rate of a deletion every 800µs (assuming there are no peaks and the metadata server only does deletions, without creations or opens or any "maintenance" operations like checks and backups). :-)

Sometimes I suggest, somewhat seriously, entirely RAM-based metadata OSDs, which given a suitable environment may be feasible. But I still wonder why "So many people seem to think ... can sustain the same workload as high volume transactional DBMSes" :-).
[ceph-users] Re: Workload that delete 100 M object daily via lifecycle
>>> On Mon, 17 Jul 2023 19:19:34 +0700, Ha Nguyen Van
>>> said:

> [...] S3 workload, that will need to delete 100M file daily [...]

So many people seem to think that distributed (or even local) filesystems (and in particular their metadata servers) can sustain the same workload as high-volume transactional DBMSes.

PS: 100M deletions per day, given 86,400 seconds per day, means on average (what about peaks?) around 1,200 committed deletions per second (across the traditional 3 metadata OSDs) sustained on the metadata server, which may not leave a lot of time for file creation, writing or reading. :-)
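The arithmetic above, spelled out; it also yields the per-deletion latency budget that any sustained solution would have to meet:

```python
# The deletion-rate arithmetic: 100M lifecycle deletions per day.
deletions_per_day = 100_000_000
seconds_per_day = 86_400

rate = deletions_per_day / seconds_per_day
budget_us = 1_000_000 / rate       # microseconds available per deletion
print(f"{rate:.0f} deletions/s, {budget_us:.0f} us budget per deletion")
```

Roughly 1,160 committed deletions per second sustained, leaving well under a millisecond of network-plus-commit latency per deletion on average, before accounting for peaks or any other workload.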
[ceph-users] Re: ls: cannot access '/cephfs': Stale file handle
>>> On Wed, 17 May 2023 16:52:28 -0500, Harry G Coin
>>> said:

> I have two autofs entries that mount the same cephfs file
> system to two different mountpoints. Accessing the first of
> the two fails with 'stale file handle'. The second works
> normally. [...]

Something pretty close to that works for me... I would check the related 'dmesg' lines. Also 'grep cephfs /proc/mounts' to double check the actual mount lines.

Note: mountpoints just under '/' as in '/cephfs' have some downsides:

http://www.sabi.co.uk/blog/23-one.html?230123#230123

Note: given the above I would anyhow mount the CephFS instance once to a directory like '/mnt/cephfs' and use two symlinks to it (or 'bind' mounts if not under '/').
[ceph-users] Re: Deleting millions of objects
> [...] We have this slow and limited delete issue also. [...]

That usually, apart from command-list length limitations, happens because so many Ceph storage backends have too low committed IOPS (write, but not just) for mass metadata (and equivalently small-data) operations, never mind for running them in parallel with the user workload. Such IOPS are very expensive, and many Ceph storage backends are built to minimize cost-per-TB only.
[ceph-users] Re: Deep-scrub much slower than HDD speed
> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers already divided by replication factor), you need 55 days
> to scrub it once.
> That's 8x larger than the default scrub factor [...] Also, even
> if I set the default scrub interval to 8x larger, my disks
> will still be thrashing seeks 100% of the time, affecting the
> cluster's throughput and latency performance.

Indeed! Every Ceph instance I have seen (not many) and almost every HPC storage system I have seen have this problem, and that's because they were never set up to have enough IOPS to support the maintenance load, never mind the maintenance load plus the user load (and as a rule not even the user load).

There is a simple reason why this happens: when a large Ceph (etc.) storage instance is initially set up, it is nearly empty, so it appears to perform well even if it was set up with inexpensive but slow/large HDDs; then it becomes fuller and therefore heavily congested, but whoever set it up has already changed jobs or been promoted because of their initial success (or they invent excuses).

A figure of merit that matters is IOPS-per-used-TB, and making it large enough to support concurrent maintenance (scrubbing, backfilling, rebalancing, backup) and user workloads. That is *expensive*, so in my experience very few storage instance buyers aim for that.

The CERN IT people discovered long ago that quotes for storage servers always used very slow/large HDDs that performed very poorly if the specs were given as mere capacity, so they switched to requiring a different metric: an 18MB/s transfer rate of *interleaved* read and write per TB of capacity, that is at least two parallel access streams per TB.

https://www.sabi.co.uk/blog/13-two.html?131227#131227 "The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18MB/s of interleaved read and write per TB is high enough to support simultaneous maintenance and user loads for most Ceph instances, especially in HPC.
I have seen HPC storage systems "designed" around 10TB and even 18TB HDDs, and the best that can be said about those HDDs is that they should be considered "tapes" with some random access ability.
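For scale, the sustained read rate that deep-scrubbing once per interval demands, per TB of used capacity (idealized sequential numbers; real scrubbing seeks constantly, so the effective cost in IOPS is far higher, and this is on top of the user workload):

```python
# Sustained scrub read rate needed per TB of used capacity to cover
# everything once per interval, next to CERN's 18 MB/s-per-TB floor.

def scrub_mbs_per_tb(interval_days):
    """MB/s of scrub reads needed per TB to cover it once per interval."""
    return 1e6 / (interval_days * 86_400)   # 1 TB = 1e6 MB

for days in (7, 30, 55):
    print(f"once per {days:2d} days -> {scrub_mbs_per_tb(days):.2f} MB/s per TB")
```

Weekly deep-scrubbing needs under 2 MB/s per TB in the ideal sequential case, comfortably inside an 18 MB/s-per-TB budget; the problem is that on large congested HDDs the scrub reads are random, so the real cost is seek-bound, which is exactly the "thrashing seeks 100% of the time" complaint above.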