Hi,
We seem to be hitting a performance issue with Lustre clients 2.12.2 and
2.12.3. Over time, the grant size of the OSC shrinks below 1MB and does not
grow back. This lowers the performance of the client to a few MB/s, or even
to kB/s for some OSTs. This does not seem to happen on 2.10.8 clients, since
they don’t have the “grant_shrink” flag. The servers are running 2.12.3 with
ZFS 0.7.9.
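For anyone wanting to check whether a given client negotiated grant_shrink, here is a minimal sketch. It assumes the connect flags show up on a "connect_flags:" line in the OSC import file; has_grant_shrink is a hypothetical helper name, not a Lustre tool.

```shell
# Succeeds if the import text on stdin lists grant_shrink among its
# connect flags. Assumes a line of the form "connect_flags: [ a, b, ... ]".
has_grant_shrink() {
    grep 'connect_flags' | grep -qw 'grant_shrink'
}

# Hypothetical usage on a client:
#   lctl get_param -n osc.lustre04-OST0005*.import | has_grant_shrink \
#       && echo "grant_shrink negotiated"
```

The helper only reads text from stdin, so it can also be run against a saved copy of the import file.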
Here is the performance we see per OST with a simple dd test. The worst OST
is #5, at 222 kB/s. A 2.10 client writing to the same OST reaches
> 800MB/s.
for i in {0..37}; do lfs setstripe --ost $i --stripe-count 1 ost$i ; done
for i in {0..37}; do dd if=/dev/zero of=ost$i bs=1M count=100; done
104857600 bytes (105 MB) copied, 0.142473 s, 736 MB/s
104857600 bytes (105 MB) copied, 9.22021 s, 11.4 MB/s
104857600 bytes (105 MB) copied, 0.0905684 s, 1.2 GB/s
104857600 bytes (105 MB) copied, 6.36873 s, 16.5 MB/s
104857600 bytes (105 MB) copied, 0.0929602 s, 1.1 GB/s
104857600 bytes (105 MB) copied, 471.699 s, 222 kB/s
104857600 bytes (105 MB) copied, 0.177067 s, 592 MB/s
[...]
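To make it easier to match each dd result to its OST, the summary lines can be labelled and converted to a single unit. This is only a sketch: dd_rates is a hypothetical helper, and it assumes the summaries arrive in OST index order starting at 0, as in the loop above.

```shell
# Turn dd summary lines (read from stdin) into one "ostN rate" line each.
# Assumes one summary per OST, in index order starting at 0.
dd_rates() {
    awk '/bytes/ && /copied/ {
        # $1 is the byte count; the elapsed seconds sit three fields from the end.
        printf "ost%d %.1f MB/s\n", n++, $1 / $(NF-3) / 1e6
    }'
}

# Hypothetical usage on a client, from the test directory:
#   for i in {0..37}; do dd if=/dev/zero of=ost$i bs=1M count=100 2>&1; done \
#       | dd_rates | sort -k2 -n
```

Sorting on the second column puts the slowest OSTs first.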
As an example, this slow client has a grant size of about 0.8MB after being
up for a while:
lctl get_param osc.lustre04-OST0005*.cur_grant_bytes
osc.lustre04-OST0005-osc-98128d818000.cur_grant_bytes=883028
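The same check can be run across all OSCs to list the ones with a small grant. A minimal sketch, assuming the "name=value" output format shown above; low_grants is a hypothetical helper and the 1 MiB default threshold is an arbitrary choice.

```shell
# Print OSC grant entries (read from stdin) whose value is below a minimum.
# The minimum defaults to 1 MiB and can be passed as the first argument.
low_grants() {
    min="${1:-1048576}"
    awk -F= -v min="$min" \
        '$1 ~ /cur_grant_bytes$/ && $2 + 0 < min + 0 { print $1, $2 }'
}

# Hypothetical usage on a client:
#   lctl get_param osc.*.cur_grant_bytes | low_grants
```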
In the debug logs, I can see a request being sent as sync I/O, since the
grant is now too small to hold the 1.7MB request:
0008:0020:10.0:1585145743.107840:0:116122:0:(osc_cache.c:1590:osc_enter_cache())
lustre04-OST0005-osc-98128d818000: grant { dirty: 0/512000 dirty_pages:
448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight:
0 } lru {in list: 146368, left: 64, waiters: 0 }need:1703936
0008:0020:10.0:1585145743.107842:0:116122:0:(osc_cache.c:1539:osc_enter_cache_try())
lustre04-OST0005-osc-98128d818000: grant { dirty: 0/512000 dirty_pages:
448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight:
0 } lru {in list: 146368, left: 64, waiters: 0 }need:1703936
0008:0020:10.0:1585145743.107843:0:116122:0:(osc_cache.c:1666:osc_enter_cache())
lustre04-OST0005-osc-98128d818000: grant { dirty: 0/512000 dirty_pages:
448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight:
0 } lru {in list: 146368, left: 64, waiters: 0 }no grant space, fall back
to sync i/o
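To spell out the numbers in that log: the request needs 1703936 bytes while only 883028 bytes of grant are available. The sketch below just mirrors that comparison as I read it from the log; grant_check is a hypothetical helper and not the actual logic of osc_enter_cache().

```shell
# Report whether a write of NEED bytes fits in AVAIL bytes of grant.
# Arguments: AVAIL NEED (both in bytes).
grant_check() {
    avail="$1"
    need="$2"
    if [ "$need" -gt "$avail" ]; then
        echo "short by $((need - avail)) bytes"
    else
        echo "fits"
    fi
}

# Values taken from the "avail:" and "need:" fields of the log above.
grant_check 883028 1703936
```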
There is currently 30GB granted on an OST with about 22TB free.
[root@lustre04-oss1 ~]# lctl get_param obdfilter/lustre04-OST0005/tot_granted
obdfilter.lustre04-OST0005.tot_granted=30257446912
Somehow, the client never receives a bigger grant, so it seems to stay
under 1MB forever:
0008:0020:4.0:1585145743.107950:0:22701:0:(osc_request.c:705:osc_announce_cached())
dirty: 0 undirty: 2080374783 dropped 0 grant: 883028
0008:0020:14.0:1585145743.236923:0:22702:0:(osc_request.c:727:osc_update_grant())
got 0 extra grant
Is this a known issue? I could not find a similar ticket in JIRA, but I do
see some references to disabling grant_shrink in LU-12651 and LU-12759.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org