Ellis, I would also check the peer_credits between the server and the client. They should be the same.
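Something like the following, run on both the client and the servers, should show whether they match (the lnetctl output layout varies a bit by version, and the module parameter path assumes the socket LND since this is @tcp):

  lnetctl net show -v | grep peer_credits
  cat /sys/module/ksocklnd/parameters/peer_credits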
On Wed, Jan 19, 2022 at 9:27 AM Patrick Farrell via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> Ellis,
>
> As you may have guessed, that function set just looks like a node which is
> doing buffered I/O and thrashing for memory. No particular insight
> available from the count of functions there.
>
> Would you consider opening a bug report in the Whamcloud JIRA? You should
> have enough for a good report; here are a few things that would be helpful
> as well:
>
> It sounds like you can hang the node on demand. If you could collect
> stack traces with:
>
> echo t > /proc/sysrq-trigger
>
> after creating the hang, that would be useful. (It will print to dmesg.)
>
> You've also collected debug logs - could you include, say, the last 100
> MiB of that log set? That should be reasonable to attach if compressed.
>
> Regards,
> Patrick
>
> ------------------------------
> *From:* lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
> behalf of Ellis Wilson via lustre-discuss <lustre-discuss@lists.lustre.org>
> *Sent:* Wednesday, January 19, 2022 8:32 AM
> *To:* Andreas Dilger <adil...@whamcloud.com>
> *Cc:* lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
> *Subject:* Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O
> (2.14/2.15)
>
> Hi Andreas,
>
> Apologies in advance for the top-post. I'm required to use Outlook for
> work, and it doesn't handle in-line or bottom-posting well.
>
> Client-side defaults prior to any tuning of mine (this is a very minimal
> 1-client, 1-MDS/MGS, 2-OSS cluster):
>
> ~# lctl get_param llite.*.max_cached_mb
> llite.lustrefs-ffff8d52a9c52800.max_cached_mb=
> users: 5
> max_cached_mb: 7748
> used_mb: 0
> unused_mb: 7748
> reclaim_count: 0
> ~# lctl get_param osc.*.max_dirty_mb
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_dirty_mb=1938
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_dirty_mb=1938
> ~# lctl get_param osc.*.max_rpcs_in_flight
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_rpcs_in_flight=8
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_rpcs_in_flight=8
> ~# lctl get_param osc.*.max_pages_per_rpc
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_pages_per_rpc=1024
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_pages_per_rpc=1024
>
> Thus far I've reduced the following to what I felt were really
> conservative values for a 16GB RAM machine:
>
> ~# lctl set_param llite.*.max_cached_mb=1024
> llite.lustrefs-ffff8d52a9c52800.max_cached_mb=1024
> ~# lctl set_param osc.*.max_dirty_mb=512
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_dirty_mb=512
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_dirty_mb=512
> ~# lctl set_param osc.*.max_pages_per_rpc=128
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_pages_per_rpc=128
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_pages_per_rpc=128
> ~# lctl set_param osc.*.max_rpcs_in_flight=2
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_rpcs_in_flight=2
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_rpcs_in_flight=2
>
> This slows down how fast I get to basically OOM from <10 seconds to more
> like 25 seconds, but the trend is identical.
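For the bug report, roughly the following sequence should capture both the stack traces and the debug log Patrick asked for (the debug_mb value is only a guess at what is needed to hold the interesting window, and the file paths are placeholders):

  lctl set_param debug=-1 debug_mb=1024   # full debug flags, larger debug ring buffer
  # ... reproduce the hang with iozone/dd ...
  echo t > /proc/sysrq-trigger            # all-task stack traces go to dmesg
  dmesg > /tmp/sysrq-stacks.txt
  lctl dk /tmp/lustre-debug.log           # dump the kernel debug buffer to a file
  xz -T0 /tmp/lustre-debug.log            # compress before attaching to JIRA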
> As an example of what I'm seeing on the client, you can see below we start
> with most free, and then iozone rapidly (within ~10 seconds) causes all
> memory to be marked used, and that stabilizes at about 140MB free until at
> some point it stalls for 20 or more seconds and then some has been synced
> out:
>
> ~# dstat --mem
> ------memory-usage-----
>  used  free  buff  cach
> 1029M 13.9G 2756k  215M
> 1028M 13.9G 2756k  215M
> 1028M 13.9G 2756k  215M
> 1088M 13.9G 2756k  215M
> 2550M 11.5G 2764k 1238M
> 3989M 10.1G 2764k 1236M
> 5404M 8881M 2764k 1239M
> 6831M 7453M 2772k 1240M
> 8254M 6033M 2772k 1237M
> 9672M 4613M 2772k 1239M
> 10.6G 3462M 2772k 1240M
> 12.1G 1902M 2772k 1240M
> 13.4G  582M 2772k 1240M
> 13.9G  139M 2488k 1161M
> 13.9G  139M 1528k 1174M
> 13.9G  140M  896k 1175M
> 13.9G  139M  676k 1176M
> 13.9G  142M  528k 1177M
> 13.9G  140M  484k 1188M
> 13.9G  139M  492k 1188M
> 13.9G  139M  488k 1188M
> 13.9G  141M  488k 1186M
> 13.9G  141M  480k 1187M
> 13.9G  139M  492k 1188M
> 13.9G  141M  600k 1188M
> 13.9G  139M  580k 1187M
> 13.9G  140M  536k 1186M
> 13.9G  141M  668k 1186M
> 13.9G  139M  580k 1188M
> 13.9G  140M  568k 1187M
> 12.7G 1299M 2064k 1197M missed 20 ticks  <-- client is totally unresponsive
> during this time
> 11.0G 2972M 5404k 1238M^C
>
> Additionally, I've messed with sysctl settings. Defaults:
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 20
> vm.dirty_writeback_centisecs = 500
>
> Revised to conservative values:
> vm.dirty_background_bytes = 1073741824
> vm.dirty_background_ratio = 0
> vm.dirty_bytes = 2147483648
> vm.dirty_expire_centisecs = 200
> vm.dirty_ratio = 0
> vm.dirty_writeback_centisecs = 500
>
> No observed improvement.
>
> I'm going to trawl two logs today side-by-side, one with ldiskfs backing
> the OSTs, and one with zfs backing the OSTs, and see if I can see what the
> differences are since the zfs-backed version never gave us this problem.
> The only other potentially useful thing I can share right now is that when
> I turned on full debug logging and ran the test until I hit OOM, the
> following were the most frequently hit functions in the logs (count,
> descending, is the first column). This was approximately 30s of logs:
>
> 205874 cl_page.c:518:cl_vmpage_page())
> 206587 cl_page.c:545:cl_page_owner_clear())
> 206673 cl_page.c:551:cl_page_owner_clear())
> 206748 osc_cache.c:2483:osc_teardown_async_page())
> 206815 cl_page.c:867:cl_page_delete())
> 206862 cl_page.c:837:cl_page_delete0())
> 206878 osc_cache.c:2478:osc_teardown_async_page())
> 206928 cl_page.c:869:cl_page_delete())
> 206930 cl_page.c:441:cl_page_state_set0())
> 206988 osc_page.c:206:osc_page_delete())
> 207021 cl_page.c:179:__cl_page_free())
> 207021 cl_page.c:193:cl_page_free())
> 207021 cl_page.c:532:cl_vmpage_page())
> 207024 cl_page.c:210:cl_page_free())
> 207075 cl_page.c:430:cl_page_state_set0())
> 207169 osc_cache.c:2505:osc_teardown_async_page())
> 207175 cl_page.c:475:cl_pagevec_put())
> 207202 cl_page.c:492:cl_pagevec_put())
> 207211 cl_page.c:822:cl_page_delete0())
> 207384 osc_page.c:178:osc_page_delete())
> 207422 osc_page.c:177:osc_page_delete())
> 413680 cl_page.c:433:cl_page_state_set0())
> 413701 cl_page.c:477:cl_pagevec_put())
>
> If anybody has any additional suggestions or requests for more info don't
> hesitate to ask.
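For what it's worth, a ranking like that can be pulled out of a debug log dump with something along these lines (this is just how I would generate it, not necessarily how Ellis did):

  grep -oE '[a-z_0-9]+\.c:[0-9]+:[A-Za-z_0-9]+\(\)\)' lustre-debug.log | sort | uniq -c | sort -n | tail -25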
> Best,
>
> ellis
>
> *From:* Andreas Dilger <adil...@whamcloud.com>
> *Sent:* Tuesday, January 18, 2022 9:54 PM
> *To:* Ellis Wilson <elliswil...@microsoft.com>
> *Cc:* lustre-discuss@lists.lustre.org
> *Subject:* [EXTERNAL] Re: [lustre-discuss] Lustre Client Lockup Under
> Buffered I/O (2.14/2.15)
>
> On Jan 18, 2022, at 13:40, Ellis Wilson via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
> Recently we've switched from using ZFS to ldiskfs as the backing
> filesystem to work around some performance issues, and I'm finding that
> when I put the cluster under load (with as little as a single client) I
> can almost completely lock up the client. SSH (even existing sessions)
> stalls, and iostat, top, etc. all freeze for 20 to 200 seconds. This
> alleviates for small windows and recurs as long as I leave the
> I/O-generating process in existence. It reports extremely high CPU and RAM
> usage, and appears to be consumed exclusively doing 'system'-tagged work.
> This is on 2.14.0, but I've reproduced it on more or less HOL for
> master-next. If I do direct I/O, performance is fantastic and I have no
> such issues regarding CPU/memory pressure.
>
> Uname: Linux 85df894e-8458-4aa4-b16f-1d47154c0dd2-lclient-a0-g0-vm
> 5.4.0-1065-azure #68~18.04.1-Ubuntu SMP Fri Dec 3 14:08:44 UTC 2021 x86_64
> x86_64 x86_64 GNU/Linux
>
> In dmesg I see consistent spew on the client about:
> [19548.601651] LustreError: 30918:0:(events.c:208:client_bulk_callback())
> event type 1, status -5, desc 00000000b69b83b0
> [19548.662647] LustreError: 30917:0:(events.c:208:client_bulk_callback())
> event type 1, status -5, desc 000000009ef2fc22
> [19549.153590] Lustre: lustrefs-OST0000-osc-ffff8d52a9c52800: Connection
> to lustrefs-OST0000 (at 10.1.98.7@tcp) was lost; in progress operations
> using this service will wait for recovery to complete
> [19549.153621] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request())
> @@@ Request sent has failed due to network error: [sent 1642535831/real
> 1642535833] req@0000000002361e2d x1722317313374336/t0(0) o4->
> lustrefs-OST0001-osc-ffff8d52a9c52800@10.1.98.10@tcp:6/4 lens 488/448 e 0
> to 1 dl 1642535883 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
> [19549.153623] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request())
> Skipped 4 previous similar messages
>
> But I actually think this is a symptom of extreme memory pressure causing
> the client to time out things, not a cause.
>
> Testing with obdfilter-survey (local) on the OSS side shows expected
> performance of the disk subsystem. Testing with lnet_selftest from client
> to OSS shows expected performance. In neither case do I see the high CPU
> or memory pressure issues.
>
> Reducing a variety of lctl tunables that appear to govern memory
> allowances for Lustre clients does not improve the situation.
>
> What have you reduced here? llite.*.max_cached_mb, osc.*.max_dirty_mb,
> osc.*.max_rpcs_in_flight and osc.*.max_pages_per_rpc?
>
> By all appearances, the running iozone or even simple dd processes
> gradually (i.e., over a span of just 10 seconds or so) consume all 16GB of
> RAM on the client I'm using. I've generated bcc profile graphs for both
> on- and off-CPU analysis, and they are utterly boring -- they basically
> just reflect rampant calls to shrink_inactive_list resulting from
> page_cache_alloc in the presence of extreme memory pressure.
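As a side note, the buffered-vs-direct difference Ellis describes should be reproducible with a pair of dd runs like the ones below (the mount point and sizes are placeholders, adjust for your setup):

  # Buffered write: goes through the client page cache and, per the report, drives it into memory pressure.
  dd if=/dev/zero of=/mnt/lustrefs/ddtest bs=1M count=16384
  # O_DIRECT write: bypasses the client page cache; reportedly no such pressure.
  dd if=/dev/zero of=/mnt/lustrefs/ddtest bs=1M count=16384 oflag=direct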
> We have seen some issues like this that are being looked at, but this is
> mostly only seen on smaller VM clients used in testing and not larger
> production clients. Are you able to test with more RAM on the client?
> Have you tried with 2.12.8 installed on the client?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org