Hi, we are currently running the latest Firefly (0.80.9) and we have difficulties maintaining good throughput while Ceph is backfilling/recovering and/or deep-scrubbing after an outage. It got to the point where, when the VMs using rbd start misbehaving (load rising, some simple SQL update queries taking several seconds), I use a script that loops through tunable periods of max_backfills/max_recoveries = 1 and 0.
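For reference, a minimal sketch of the kind of loop I mean (assuming toggling osd_max_backfills and osd_recovery_max_active between 1 and 0 via injectargs is what does the job, which is how it works on our Firefly cluster; the periods are just examples):

    #!/bin/bash
    # Alternate between short periods where backfill/recovery is allowed (1)
    # and periods where it is throttled down to nothing (0).
    ON=30    # seconds with backfill/recovery allowed
    OFF=30   # seconds with backfill/recovery paused
    while true; do
        ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
        sleep "$ON"
        ceph tell osd.\* injectargs '--osd_max_backfills 0 --osd_recovery_max_active 0'
        sleep "$OFF"
    done

The nodeep-scrub loop used in step 4/ of the procedure below is the same idea, just with "ceph osd set nodeep-scrub" / "ceph osd unset nodeep-scrub" and 30 seconds unset / 120 seconds set.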
We recently had power outages and couldn't restart all the OSDs right away (one server needed special care), so as we only have 4 servers with 6 OSDs each, there was a fair amount of rebalancing. What seems to work with our current load is the following:
1/ disable deep-scrub and scrub (disabling scrub might not be needed: it doesn't seem to have much impact on performance),
2/ activate the max_backfills/recoveries = 1/0 loop with 30 seconds for each state,
3/ wait for the rebalancing to finish, then reactivate scrub,
4/ activate the (un)set nodeep-scrub loop with 30 seconds unset, 120 seconds set (same kind of loop as the sketch above),
5/ wait for deep-scrubs to catch up (i.e. none active during several consecutive 30-second "unset" periods),
6/ revert to the normal configuration.
This can take about a day for us (we have about 800GB per OSD when in the degraded 3-server configuration).

I have two ideas/questions:

1/ Deep scrub scheduling

Deep scrubs can happen in bursts with the current algorithm, which really harms performance even with CFQ and lower priorities. We have a total of 1216 pgs (1024 for our rbd pool) and an osd deep scrub interval of 2 weeks. This means that on average a deep scrub should happen about every 16 minutes cluster-wide. When recovering from an outage, however, the current algorithm wants to catch up, and even though only one scrub per OSD can run at a time, VM disk accesses involve many OSDs, so having multiple deep-scrubs running across the cluster seems to hurt quite a bit more than a single one at a time. A smoother distribution when catching up could therefore help a lot (at least our experience seems to point in this direction). I'm even considering scheduling deep-scrubs ahead of time myself: keeping the interval at 2 weeks in ceph.conf, but distributing the scrub requests at a rate that targets completion in one week (a rough sketch of the kind of script I mean is at the end of this mail). Does this make any sense? Is there any development in this direction already (feature request #7288 didn't seem to go this far and #7540 had no activity)?

2/ Bigger journals

There's not much said about how to tune the journal size and filestore max sync interval, and I'm not sure what the drawbacks of using much larger journals and sync intervals are. Obviously a sync would be more costly, but if it takes less time to complete than the journal takes to fill up, even while deep-scrubs or backfills are running, I'm not sure how this would hurt performance. In our case we have a bit less than 5GB of data per pg (for the rbd pool) and use 5GB journals (in a separate partition at the beginning of the same disk as the OSD). I'm wondering if we could lower the impact of deep-scrubs by buffering more activity in the journal. If we could also lower the rate at which each OSD does deep-scrubs (in the way I'm thinking of scheduling them in the previous point), it might give each OSD time to catch up on filestore syncs in between, and avoid the contention between deep-scrubs, journal writes and filestore syncs all happening at the same time. I assume deep scrubs and journal writes are mostly sequential, so in our configuration roughly half of the available disk throughput should be available for each of them. So if we want to avoid filestore syncs during deep-scrubs, it seems to me we should have a journal at least twice the size of our largest pgs and set filestore max sync interval to at least the expected duration of a deep scrub.
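To put rough numbers on that rule (using our ~5GB pgs and disks sustaining roughly 120MB/s of sequential throughput, half of it left for deep-scrub reads):

    journal size         >= 2 x largest pg            = 2 x 5 GB        = 10 GB
    deep-scrub duration  ~= pg size / (throughput / 2) = 5 GB / 60 MB/s ~= 85 s, so ~90 s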
What is worrying me is that in our current configuration this would mean at least doubling our journal size (10GB instead of 5GB) and, given half of a ~120MB/s throughput, a max sync interval of ~90 seconds (we use 30 seconds currently). This is far from the default values (and as we only use about 50% of the storage capacity and have a good pg/OSD ratio, we may want to target twice these values to support pgs twice as large as our current ones). Does that make any sense? A ceph.conf sketch with these values is at the end of this mail.

I'm also not sure how backfills and recoveries behave in this respect: I couldn't find a way to make an OSD wait a bit between batches to give the filestore sync a chance to go through. If this idea makes sense for deep-scrubs, I assume it might help for backfills/recoveries too, to smooth out the I/Os (if they can be configured to pause a bit between batches).

Best regards,

Lionel
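PS: a rough sketch of what I mean by pre-scheduling deep-scrubs in 1/ above. It is only an illustration of the idea (in particular the pgid extraction assumes "ceph pg dump pgs_brief" prints the pgid in the first column, which may differ between versions): pace one "ceph pg deep-scrub" per pg so the whole cluster is covered in about a week instead of letting the OSDs catch up in bursts.

    #!/bin/bash
    # Request one deep-scrub per pg, spread evenly over ~7 days.
    PGS=$(ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}')
    COUNT=$(echo "$PGS" | wc -l)
    INTERVAL=$(( 7 * 24 * 3600 / COUNT ))   # ~497s between requests for our 1216 pgs
    for pg in $PGS; do
        ceph pg deep-scrub "$pg"
        sleep "$INTERVAL"
    done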
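PS: for 2/ above, the kind of ceph.conf changes I have in mind (the values are simply the ones derived from our ~5GB pgs, nothing we have tested yet, and growing the journal would of course also mean recreating its partition):

    [osd]
    osd journal size = 10240            ; MB, ~2x our largest pgs
    filestore max sync interval = 90    ; seconds, ~the expected duration of one deep-scrub
    osd deep scrub interval = 1209600   ; seconds, 2 weeks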