Hello ceph users, 

I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and then 
10.2.2). The cluster was running okay after the upgrade. Since ceph status was 
complaining about the straw version and warning that my settings were not 
optimal for Jewel (I don't think I've touched the tunables since the Firefly 
release), I read the release notes and the tunables section of the docs and 
decided to set the crush tunables to optimal. For context, a few weeks earlier 
I had done a reweight-by-utilization, which moved around about 8% of the 
cluster's objects; that process caused no downtime and IO to the virtual 
machines remained available throughout. I have also set several options over 
time to prioritise client IO during repair and backfilling (see the config 
show output below).
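
For reference, the tunables change itself was nothing more exotic than the 
usual commands, roughly:

ceph osd crush show-tunables     # check the current profile/values first
ceph osd crush tunables optimal  # switch to the optimal profile for this release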

Right, so, after I set the tunables to optimal the cluster indicated that it 
needed to move around 61% of the data. The process started and I was seeing 
recovery speeds of between 800MB/s and 1.5GB/s. My cluster is pretty small 
(3 osd servers with 30 osds in total). The load on the osd servers was pretty 
low: a typical load of 4, spiking to around 10. IO wait on the osd servers was 
also pretty reasonable, around 5-15%, and there were around 10-15 backfills in 
progress at any one time.
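
For what it's worth, those recovery and client IO figures are just what ceph 
itself was reporting; I was keeping an eye on things with roughly the 
following:

ceph -w              # streaming cluster log, shows recovery and client IO rates
ceph osd pool stats  # per-pool recovery and client IO
iostat -x 5          # per-disk utilisation and iowait on the osd servers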

About 10 minutes after the optimal tunables were set I noticed that IO wait on 
the vms started to increase. Initially it was around 15%; after another 10 
minutes or so it climbed to around 50%, and 30-40 minutes later iowait was at 
95-100% on all vms. Shortly after that the vms showed a bunch of hung task 
warnings in their dmesg output and soon stopped responding altogether. Nothing 
like this happened after the reweight-by-utilization a few weeks earlier: vm IO 
wait during the reweighting stayed around 15-20%, there were no hung tasks, and 
all vms kept running pretty well.

I wasn't sure how to resolve the problem. I know that recovery and backfilling 
put extra load on the cluster, but they should never break client IO; after 
all, that would defeat one of the key points of ceph - a resilient storage 
cluster. Looking at the ceph -w output, client IO had dropped to 0-20 IOPS, 
whereas the typical load I see at that time of day is around 700-1000 IOPS.

The strange thing is that even after the cluster had finished moving the data 
(it took around 11 hours) client IO was still not available! I was not able to 
start any new vms despite an OK health status and all PGs being active+clean. 
The osd servers had almost zero load, all PGs were active+clean, all osds were 
up and all mons were up, yet there was no client IO. The cluster only became 
operational again after I rebooted one of the osd servers, which seems to have 
brought it back to life.
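
In hindsight I probably should have looked for blocked requests on the osds 
before rebooting, with something along these lines (osd.0 is just a 
placeholder for whichever osd you query via its admin socket on the osd host):

ceph health detail                    # lists any blocked/slow requests
ceph daemon osd.0 dump_ops_in_flight  # ops currently in flight on that osd
ceph daemon osd.0 dump_historic_ops   # recently completed slow ops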

My question to the community is: what ceph options should be set to make sure 
that client IO is _always_ available and has the highest priority during any 
recovery/migration/backfilling operations?

My current settings, which I've gathered over the years from the advice of 
mailing list and irc members, are:

osd_recovery_max_chunk = 8388608 
osd_recovery_op_priority = 1 
osd_max_backfills = 1 
osd_recovery_max_active = 1 
osd_recovery_threads = 1 
osd_disk_thread_ioprio_priority = 7 
osd_disk_thread_ioprio_class = idle 
osd_scrub_chunk_min = 1 
osd_scrub_chunk_max = 5 
osd_deep_scrub_stride = 1048576 
mon_osd_min_down_reporters = 6 
mon_osd_report_timeout = 1800 
mon_osd_min_down_reports = 7 
osd_heartbeat_grace = 60 
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M" 
osd_mkfs_options_xfs = -f -i size=2048 
filestore_max_sync_interval = 15 
filestore_op_threads = 8 
filestore_merge_threshold = 40 
filestore_split_multiple = 8 
osd_disk_threads = 8 
osd_op_threads = 8 
osd_pool_default_pg_num = 1024 
osd_pool_default_pgp_num = 1024 
osd_crush_update_on_start = false 
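
Most of these live in ceph.conf, but the recovery/backfill ones can also be 
changed on a running cluster with injectargs, e.g. something like:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'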

Many thanks 

Andrei 
