Hi Robert and Paul,

sad news. I did a 5-second single-thread fio test after setting osd_op_queue_cut_off=high on all OSDs and MDSs.
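For completeness, this is roughly how I applied and verified the setting via the mon config database (from memory, so treat it as a sketch rather than an exact transcript):

[root@ceph-01 ~]# ceph config set osd osd_op_queue_cut_off high
[root@ceph-01 ~]# ceph config set mds osd_op_queue_cut_off high
[root@ceph-01 ~]# ceph config get osd.0 osd_op_queue_cut_off
high

Both settings only take effect after restarting the respective daemons (more on that below). Here are the current settings on osd.0: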
[root@ceph-01 ~]# ceph config show osd.0
NAME                                      VALUE               SOURCE   OVERRIDES  IGNORES
bluestore_compression_min_blob_size_hdd  262144              file
bluestore_compression_mode                aggressive          file
cluster_addr                              192.168.16.68:0/0   override
cluster_network                           192.168.16.0/20     file
crush_location                            host=c-04-A         file
daemonize                                 false               override
err_to_syslog                             true                file
keyring                                   $osd_data/keyring   default
leveldb_log                                                   default
mgr_initial_modules                       balancer dashboard  file
mon_allow_pool_delete                     false               file
mon_pool_quota_crit_threshold             90                  file
mon_pool_quota_warn_threshold             70                  file
osd_journal_size                          4096                file
osd_max_backfills                         3                   mon
osd_op_queue_cut_off                      high                mon
osd_pool_default_flag_nodelete            true                file
osd_recovery_max_active                   8                   mon
osd_recovery_sleep                        0.050000            mon
public_addr                               192.168.32.68:0/0   override
public_network                            192.168.32.0/19     file
rbd_default_features                      61                  default
setgroup                                  disk                cmdline
setuser                                   ceph                cmdline
[root@ceph-01 ~]# ceph config get osd.0 osd_op_queue
wpq

Unfortunately, the problem is not resolved. The fio job script is:

=====================
[global]
name=fio-rand-write
filename_format=fio-$jobname-${HOSTNAME}-$jobnum-$filenum
rw=randwrite
bs=4K
numjobs=1
time_based=1
runtime=5

[file1]
size=100G
ioengine=sync
=====================

That is a random-write test on a 100G file with a 4K block size. Note that fio uses "direct=0" (buffered I/O) by default. Using "direct=1" is absolutely fine.
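For comparison, the direct-I/O variant I mean is the same job file with one line added (the job name is arbitrary, changed only to keep the test files apart — a sketch, not the exact file I ran):

=====================
# identical to the job above, plus direct=1 to bypass the client page cache
[global]
name=fio-rand-write-direct
filename_format=fio-$jobname-${HOSTNAME}-$jobnum-$filenum
rw=randwrite
bs=4K
numjobs=1
time_based=1
runtime=5
direct=1

[file1]
size=100G
ioengine=sync
=====================

With direct=1 every 4K write is a synchronous round trip, so the client effectively throttles itself; with the buffered default, the page cache presumably flushes large bursts of dirty data at once.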
Running this short burst of load (with the buffered defaults above), I already get the cluster unhealthy.

cluster log:

2019-09-03 20:00:00.000160 [INF] overall HEALTH_OK
2019-09-03 20:08:36.450527 [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2019-09-03 20:08:59.867124 [INF] MDS health message cleared (mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 49 secs
2019-09-03 20:09:00.373050 [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
2019-09-03 20:09:00.373094 [INF] Cluster is now healthy

/var/log/messages: loads of these (on all OSDs!):

Sep 3 20:08:39 ceph-09 journal: 2019-09-03 20:08:39.269 7f6a3d63c700 -1 osd.161 10411 get_health_metrics reporting 354 slow ops, oldest is osd_op(client.4497435.0:38244 5.f7s0 5:ef9f1be4:::100010ed9bd.0000390c:head [write 8192~4096,write 32768~4096,write 139264~4096,write 172032~4096,write 270336~4096,write 512000~4096,write 688128~4096,write 876544~4096,write 1048576~4096,write 1257472~4096,write 1425408~4096,write 1445888~4096,write 1503232~4096,write 1552384~4096,write 1716224~4096,write 1765376~4096] snapc 12e=[] ondisk+write+known_if_redirected e10411)

It looks like the MDS is pushing waaaayyy too many requests onto the HDDs instead of throttling the client. An ordinary user should not have so much power in their hands; this makes it trivial to destroy a ceph cluster. This very short fio test is probably sufficient to reproduce the issue on any test cluster.

Should I open an issue?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: 30 August 2019 12:56
To: Robert LeBlanc; Paul Emmerich
Cc: ceph-users
Subject: [ceph-users] Re: ceph fs crashes on simple fio test

Hi Robert and Paul,

a quick update. I restarted all OSDs today to activate osd_op_queue_cut_off=high and ran into a serious problem right after that: the standby-replay MDS daemons started missing mon beacons and were killed by the mons:

ceph-01 journal: debug [...] log [INF] Standby daemon mds.ceph-12 is not responding, dropping it

Apparently, one also needs to set this on the MDSes:

ceph config set mds osd_op_queue_cut_off high

This also requires a restart to become active. After that, everything seems to work again. The question that remains is: Do I need to change this for any other daemon?

I will repeat the performance tests later and post results. One observation is that an MDS fail-over was a factor of 5-10 faster with the cut-off set to high.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io