Hi,

We seem to be hitting a performance issue with Lustre 2.12.2 and 2.12.3 clients. Over time, the grant size of the OSC shrinks below 1 MB and never grows back. This drops the client's performance to a few MB/s, even into the kB/s range for some OSTs. This does not seem to happen on 2.10.8 clients, since they do not have the "grant_shrink" flag. The servers are running 2.12.3 with ZFS 0.7.9.
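As a sketch of how one might confirm which clients actually negotiated grant shrinking: the flag should appear in the OSC import's connect_flags, visible via `lctl get_param osc.*.import`. The here-doc below is a trimmed, hypothetical sample of that output (the flag names shown are assumptions, not captured from our system):

```shell
# On a live client you would pipe `lctl get_param osc.*.import` into the grep.
# Here we use a trimmed, hypothetical sample of the import listing instead.
sample='import:
    name: lustre04-OST0005-osc-ffff98128d818000
    connect_flags: [ write_grant, grant_shrink, grant_param ]'
if printf '%s\n' "$sample" | grep -q 'grant_shrink'; then
    echo "grant_shrink negotiated"
else
    echo "grant_shrink not negotiated"
fi
```

On a 2.10.8 mount the same check should report the flag as absent, which would match the difference in behaviour we see between client versions.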
Here is the per-OST performance we see with a simple dd test; the worst OST is #5 at 222 kB/s. A 2.10 client writing to the same OST reaches > 800 MB/s.

for i in {0..37}; do lfs setstripe --ost $i --stripe-count 1 ost$i; done
for i in {0..37}; do dd if=/dev/zero of=ost$i bs=1M count=100; done

104857600 bytes (105 MB) copied, 0.142473 s, 736 MB/s
104857600 bytes (105 MB) copied, 9.22021 s, 11.4 MB/s
104857600 bytes (105 MB) copied, 0.0905684 s, 1.2 GB/s
104857600 bytes (105 MB) copied, 6.36873 s, 16.5 MB/s
104857600 bytes (105 MB) copied, 0.0929602 s, 1.1 GB/s
104857600 bytes (105 MB) copied, 471.699 s, 222 kB/s
104857600 bytes (105 MB) copied, 0.177067 s, 592 MB/s
[...]

As an example, this slow client has a grant size of about 0.8 MB after being up for a while:

lctl get_param osc.lustre04-OST0005*.cur_grant_bytes
osc.lustre04-OST0005-osc-ffff98128d818000.cur_grant_bytes=883028

In the debug logs, I can see a request being sent as sync I/O because the grant (avail: 883028) is now too small to cover the 1.7 MB request (need: 1703936):

00000008:00000020:10.0:1585145743.107840:0:116122:0:(osc_cache.c:1590:osc_enter_cache()) lustre04-OST0005-osc-ffff98128d818000: grant { dirty: 0/512000 dirty_pages: 448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 146368, left: 64, waiters: 0 }need:1703936
00000008:00000020:10.0:1585145743.107842:0:116122:0:(osc_cache.c:1539:osc_enter_cache_try()) lustre04-OST0005-osc-ffff98128d818000: grant { dirty: 0/512000 dirty_pages: 448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 146368, left: 64, waiters: 0 }need:1703936
00000008:00000020:10.0:1585145743.107843:0:116122:0:(osc_cache.c:1666:osc_enter_cache()) lustre04-OST0005-osc-ffff98128d818000: grant { dirty: 0/512000 dirty_pages: 448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 146368, left: 64, waiters: 0 }no grant space, fall back to sync i/o

There is currently about 30 GB granted on an OST with
about 22 TB free:

[root@lustre04-oss1 ~]# lctl get_param obdfilter/lustre04-OST0005/tot_granted
obdfilter.lustre04-OST0005.tot_granted=30257446912

Somehow, the client never receives a bigger grant, so it seems to stay under 1 MB forever:

00000008:00000020:4.0:1585145743.107950:0:22701:0:(osc_request.c:705:osc_announce_cached()) dirty: 0 undirty: 2080374783 dropped 0 grant: 883028
00000008:00000020:14.0:1585145743.236923:0:22702:0:(osc_request.c:727:osc_update_grant()) got 0 extra grant

Is this a known issue? I could not find a similar ticket in JIRA, but I do see some references to disabling grant_shrink in LU-12651 and LU-12759.
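For anyone wanting to spot affected OSCs quickly, here is a small sketch that flags any grant below 1 MiB. On a live client you would feed it `lctl get_param osc.*.cur_grant_bytes` (the same command used above); the sample below uses the real value from OST0005 plus a hypothetical healthy value for a second OST:

```shell
# Flag any OSC whose grant has shrunk below 1 MiB.
# Live use: lctl get_param osc.*.cur_grant_bytes | while IFS='=' read ...
# The OST0006 line below is a made-up healthy value for illustration.
THRESHOLD=$((1024 * 1024))
sample='osc.lustre04-OST0005-osc-ffff98128d818000.cur_grant_bytes=883028
osc.lustre04-OST0006-osc-ffff98128d818000.cur_grant_bytes=2080374784'
printf '%s\n' "$sample" | while IFS='=' read -r param bytes; do
    if [ "$bytes" -lt "$THRESHOLD" ]; then
        echo "grant shrunk: $param = $bytes bytes"
    fi
done
```

In our case only the shrunken OSCs (like OST0005 at 883028 bytes) would be reported, which matches the OSTs that show kB/s-level dd throughput.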
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org