Hi, the flush from the overlay cache for my ec-based cephfs is very, very slow, as are all other operations on the cephfs. The flush accelerates when the mds is stopped.
I think this is due to a large number of files that were deleted all at once, but I'm not sure how to verify that. Are there any counters I can look up that show things like "pending deletions"? How else can I debug the problem? Any insight is very much appreciated.

Philip

(potentially helpful debug output follows)

status:

root@lxt-prod-ceph-mon02:~# ceph -s
  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            102 slow requests are blocked > 32 sec

  services:
    mon: 2 daemons, quorum lxt-prod-ceph-mon01,lxt-prod-ceph-mon02
    mgr: lxt-prod-ceph-mon02(active), standbys: lxt-prod-ceph-mon01
    mds: plexfs-1/1/1 up {0=lxt-prod-ceph-mds01=up:active}
    osd: 13 osds: 7 up, 7 in
         flags noscrub,nodeep-scrub

  data:
    pools:   3 pools, 536 pgs
    objects: 5431k objects, 21056 GB
    usage:   28442 GB used, 5319 GB / 33761 GB avail
    pgs:     536 active+clean

  io:
    client: 687 kB/s wr, 0 op/s rd, 9 op/s wr
    cache:  345 kB/s flush

(Throughput is currently in the kilobyte/low-megabyte range, but could go to 100 MB/s under healthy conditions.)

health:

root@lxt-prod-ceph-mon02:~# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 105 slow requests are blocked > 32 sec
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
REQUEST_SLOW 105 slow requests are blocked > 32 sec
    45 ops are blocked > 262.144 sec
    29 ops are blocked > 131.072 sec
    20 ops are blocked > 65.536 sec
    11 ops are blocked > 32.768 sec
    osds 1,7 have blocked requests > 262.144 sec

(All osds have a high system load, but not a lot of iowait. cephfs/flushing usually performs much better under the same conditions.)

pool configuration:

root@lxt-prod-ceph-mon02:~# ceph osd pool ls detail
pool 6 'cephfs-metadata' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 12515 lfor 0/12412 flags hashpspool stripe_width 0 application cephfs
pool 9 'cephfs-data' erasure size 4 min_size 3 crush_rule 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 12482 lfor 12481/12481 flags hashpspool crash_replay_interval 45 tiers 17 read_tier 17 write_tier 17 stripe_width 4128 application cephfs
pool 17 'cephfs-cache' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 12553 lfor 12481/12481 flags hashpspool,incomplete_clones,noscrub,nodeep-scrub tier_of 9 cache_mode writeback target_bytes 200000000000 target_objects 150000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 180s x1 decay_rate 20 search_last_n 1 min_write_recency_for_promote 1 stripe_width 0

metadata and cache are both on the same ssd osd:

root@lxt-prod-ceph-mon02:~# ceph osd crush tree
ID CLASS WEIGHT   TYPE NAME
-5       0.24399  root ssd
 7   ssd 0.24399      osd.7
-4       5.74399  host sinnlich
 6   hdd 5.50000      osd.6
 7   ssd 0.24399      osd.7
-1      38.40399  root hdd
-2      16.45799      host hn-lxt-ceph01
 1   hdd 5.50000          osd.1
 9   hdd 5.50000          osd.9
12   hdd 5.50000          osd.12
-3      16.44600      host hn-lxt-ceph02
 2   hdd 5.50000          osd.2
 3   hdd 5.50000          osd.3
 4   hdd 2.72299          osd.4
 5   hdd 2.72299          osd.5
 6   hdd 5.50000          osd.6

cache tier settings:

root@lxt-prod-ceph-mon02:~# ceph osd pool get cephfs-cache all
size: 1
min_size: 1
crash_replay_interval: 0
pg_num: 8
pgp_num: 8
crush_rule: replicated_ruleset
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: true
nodeep-scrub: true
hit_set_type: bloom
hit_set_period: 180
hit_set_count: 1
hit_set_fpp: 0.05
use_gmt_hitset: 1
auid: 0
target_max_objects: 150000
target_max_bytes: 200000000000
cache_target_dirty_ratio: 0.01
cache_target_dirty_high_ratio: 0.1
cache_target_full_ratio: 0.8
cache_min_flush_age: 60
cache_min_evict_age: 0
min_read_recency_for_promote: 0
min_write_recency_for_promote: 1
fast_read: 0
hit_set_grade_decay_rate: 20
hit_set_search_last_n: 1

(I'm not sure the values make much sense; I copied them from online examples and adapted them minimally, if at all.)

The mds shows no ops in flight, but the ssd osd shows a lot of these operations that seem to be slow (all of them with the same event timeline, stopping at reached_pg):

root@sinnlich:~# ceph daemon osd.7 dump_ops_in_flight | head -30
{
    "ops": [
        {
            "description": "osd_op(mds.0.3479:170284 17.1 17:98fc84de:::100002a830d.00000000:head [delete] snapc 1=[] ondisk+write+known_if_redirected+full_force e12553)",
            "initiated_at": "2018-05-23 21:27:00.140552",
            "age": 47.611064,
            "duration": 47.611077,
            "type_data": {
                "flag_point": "reached pg",
                "events": [
                    {
                        "time": "2018-05-23 21:27:00.140552",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-05-23 21:27:00.140600",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2018-05-23 21:27:00.140624",
                        "event": "reached_pg"
                    }
                ]
            }
        },
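In case it helps with the "pending deletions" question above: my guess is that the mass-deleted files end up as strays on the mds and then in its purge queue, so I was planning to watch something like the commands below. The section and counter names (mds_cache, purge_queue, num_strays, pq_executing) are only my assumption from the docs, not something I've verified on this cluster, so please correct me if those aren't the right counters:

# on the mds host, via the admin socket (counter names are my guess):
ceph daemon mds.lxt-prod-ceph-mds01 perf dump mds_cache      # hoping for num_strays / strays_enqueued
ceph daemon mds.lxt-prod-ceph-mds01 perf dump purge_queue    # hoping for pq_executing / pq_executed
# on the cache-tier ssd osd, to see how long the delete ops take once they do complete:
ceph daemon osd.7 dump_historic_ops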