Hi Robert and Paul,

sad news. I ran a 5-second single-thread test after setting 
osd_op_queue_cut_off=high on all OSDs and MDSs. Here are the current settings:

[root@ceph-01 ~]# ceph config show osd.0
NAME                                      VALUE               SOURCE    OVERRIDES  IGNORES
bluestore_compression_min_blob_size_hdd   262144              file
bluestore_compression_mode                aggressive          file
cluster_addr                              192.168.16.68:0/0   override
cluster_network                           192.168.16.0/20     file
crush_location                            host=c-04-A         file
daemonize                                 false               override
err_to_syslog                             true                file
keyring                                   $osd_data/keyring   default
leveldb_log                                                   default
mgr_initial_modules                       balancer dashboard  file
mon_allow_pool_delete                     false               file
mon_pool_quota_crit_threshold             90                  file
mon_pool_quota_warn_threshold             70                  file
osd_journal_size                          4096                file
osd_max_backfills                         3                   mon
osd_op_queue_cut_off                      high                mon
osd_pool_default_flag_nodelete            true                file
osd_recovery_max_active                   8                   mon
osd_recovery_sleep                        0.050000            mon
public_addr                               192.168.32.68:0/0   override
public_network                            192.168.32.0/19     file
rbd_default_features                      61                  default
setgroup                                  disk                cmdline
setuser                                   ceph                cmdline
[root@ceph-01 ~]# ceph config get osd.0 osd_op_queue
wpq

Unfortunately, the problem is not resolved. The fio job script is:

=====================
[global]
name=fio-rand-write
filename_format=fio-$jobname-${HOSTNAME}-$jobnum-$filenum
rw=randwrite
bs=4K
numjobs=1
time_based=1
runtime=5

[file1]
size=100G
ioengine=sync
=====================

That's a random-write test on a 100G file with a 4K block size. Note that fio 
uses "direct=0" (buffered I/O) by default. Running the same job with "direct=1" 
is absolutely fine.

Running just this 5-second burst of buffered load already makes the cluster unhealthy:

cluster log:

2019-09-03 20:00:00.000160 [INF]  overall HEALTH_OK
2019-09-03 20:08:36.450527 [WRN]  Health check failed: 1 MDSs report slow 
metadata IOs (MDS_SLOW_METADATA_IO)
2019-09-03 20:08:59.867124 [INF]  MDS health message cleared (mds.0): 2 slow 
metadata IOs are blocked > 30 secs, oldest blocked for 49 secs
2019-09-03 20:09:00.373050 [INF]  Health check cleared: MDS_SLOW_METADATA_IO 
(was: 1 MDSs report slow metadata IOs)
2019-09-03 20:09:00.373094 [INF]  Cluster is now healthy

/var/log/messages: loads of entries like this one (on all OSDs!):

Sep  3 20:08:39 ceph-09 journal: 2019-09-03 20:08:39.269 7f6a3d63c700 -1 
osd.161 10411 get_health_metrics reporting 354 slow ops, oldest is 
osd_op(client.4497435.0:38244 5.f7s0 5:ef9f1be4:::100010ed9bd.0000390c:head 
[write 8192~4096,write 32768~4096,write 139264~4096,write 172032~4096,write 
270336~4096,write 512000~4096,write 688128~4096,write 876544~4096,write 
1048576~4096,write 1257472~4096,write 1425408~4096,write 1445888~4096,write 
1503232~4096,write 1552384~4096,write 1716224~4096,write 1765376~4096] snapc 
12e=[] ondisk+write+known_if_redirected e10411)

It looks like the MDS is pushing far too many requests onto the HDDs 
instead of throttling the client.
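
In case it helps with diagnosing this, the OSD admin socket can show what the 
slow ops are stuck on (osd.161 is just the OSD from the log above; I assume 
dump_historic_slow_ops is available on this release):

  # run on the host carrying the affected OSD
  ceph daemon osd.161 dump_ops_in_flight      # ops currently in flight and the step they are waiting in
  ceph daemon osd.161 dump_historic_slow_ops  # recently completed ops that exceeded the slow-op threshold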

An ordinary user should not have this much power in their hands. It makes it 
trivial to destroy a Ceph cluster.

This very short fio test is probably sufficient to reproduce the issue on any 
test cluster. Should I open an issue?

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: 30 August 2019 12:56
To: Robert LeBlanc; Paul Emmerich
Cc: ceph-users
Subject: [ceph-users] Re: ceph fs crashes on simple fio test

Hi Robert and Paul,

a quick update. I restarted all OSDs today to activate 
osd_op_queue_cut_off=high. I ran into a serious problem right after that: the 
standby-replay MDS daemons started missing mon beacons and were killed by the 
mons:

  ceph-01 journal: debug [...] log [INF] Standby daemon mds.ceph-12 is not 
responding, dropping it

Apparently, one also needs to set this on the MDSes:

  ceph config set mds osd_op_queue_cut_off high

This also requires a restart to become active. After that, everything seems to 
work again. The question that remains is:

  Do I need to change this for any other daemon?
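
At least for checking which daemons have picked up the new value, I assume 
"ceph config show" works for the MDSs the same way as for the OSD above 
(mds.ceph-12 is just the standby from the log message, as an example):

  ceph config show mds.ceph-12 | grep osd_op_queue_cut_off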

I will repeat the performance tests later and post results. One observation is 
that an MDS fail-over was a factor of 5-10 faster with the cut-off set to high.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io