Re: [lustre-discuss] ko2iblnd.conf

2024-04-25 Thread Horn, Chris via lustre-discuss
Ko2iblnd tunings depend on the specific hardware and overall LNet config. I would recommend using the default values unless you find performance or reliability issues. FWIW, DDN wants to update the default values for peer_credits/peer_credits_hiw/concurrent_sends -
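For reference, peer_credits, peer_credits_hiw and concurrent_sends are ko2iblnd kernel module parameters; a minimal sketch of how they could be set, with purely illustrative values (not a recommendation):
# /etc/modprobe.d/ko2iblnd.conf
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64
Changes take effect the next time the module is loaded.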

Re: [lustre-discuss] lnet ip aliases

2024-04-25 Thread Horn, Chris via lustre-discuss
Yeah, I think LNet needs the address to be tagged with a unique label. The patch https://review.whamcloud.com/c/fs/lustre-release/+/53605 may allow you to configure the desired NIDs without the labels. Chris Horn From: lustre-discuss on behalf of Michael DiDomenico via lustre-discuss Date:
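For illustration, an alias address can be tagged with a label using iproute2 (interface name and address below are hypothetical):
# ip addr add 10.0.0.2/24 dev eth0 label eth0:lnet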

Re: [lustre-discuss] LNet Multi-Rail config

2024-02-22 Thread Horn, Chris via lustre-discuss
Can you share the client’s cpt configuration? $ lctl get_param cpu_partition_table cpu_partition_distance Chris Horn From: lustre-discuss on behalf of Gwen Dawes via lustre-discuss Date: Wednesday, February 14, 2024 at 11:19 AM To: lustre-discuss@lists.lustre.org Subject: Re:
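For background, the CPT layout reported by those parameters is derived from the libcfs module parameters cpu_npartitions / cpu_pattern; a hypothetical example that pins two partitions to explicit core lists:
# /etc/modprobe.d/libcfs.conf
options libcfs cpu_pattern="0[0,1,2,3] 1[4,5,6,7]"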

Re: [lustre-discuss] LNet Multi-Rail config - with BODY!

2024-01-17 Thread Horn, Chris via lustre-discuss
NRS only affects Lustre traffic, so it will not factor into lnet_selftest (LST) results. I gave some talks on troubleshooting multi-rail that you may want to review. Overview: https://youtu.be/j3m-mznUdac?feature=shared Demo: https://youtu.be/TLN56cw9Zgs?feature=shared You should probably start
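For anyone who hasn't used LST before, a minimal run looks roughly like this (NIDs are placeholders, and the lnet_selftest module must be loaded on every node involved):
export LST_SESSION=$$
lst new_session rw_test
lst add_group clients 192.168.1.10@o2ib
lst add_group servers 192.168.1.20@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers
lst end_session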

Re: [lustre-discuss] Problems pushing update to patch

2023-11-09 Thread Horn, Chris via lustre-discuss
FWIW, my remote url looks a little different. Namely, it includes my username and the port number. [remote "wc"] url = ssh://ho...@review.whamcloud.com:29418/fs/lustre-release fetch = +refs/heads/*:refs/remotes/wc/* My ssh config: Host review.whamcloud.com Hostname
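For comparison, a typical ~/.ssh/config stanza for Gerrit looks something like the following (the username is a placeholder):
Host review.whamcloud.com
    Hostname review.whamcloud.com
    Port 29418
    User myusername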

Re: [lustre-discuss] How to eliminate zombie OSTs

2023-08-09 Thread Horn, Chris via lustre-discuss
The error message is stating that ‘-P’ is not a valid option to the conf_param command. You may be thinking of lctl set_param -P … Did you follow the documented procedure for removing an OST from the filesystem when you “adjust[ed] the configuration”?
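To sketch the difference (filesystem name and OST index are hypothetical): the documented OST removal procedure deactivates the OST on the MGS with conf_param, whereas persistent tunables are set with lctl set_param -P:
# on the MGS, part of the documented OST removal procedure
lctl conf_param testfs-OST0004.osc.active=0
# persistent parameter setting uses set_param -P instead
lctl set_param -P osc.testfs-*.max_dirty_mb=512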

Re: [lustre-discuss] Lnet config serving multiple routers and clients

2023-04-05 Thread Horn, Chris via lustre-discuss
Do you have the route to o2ib via 10.215.25.76@o2ib2 defined on the client? Chris Horn From: lustre-discuss on behalf of Kumar, Amit via lustre-discuss Date: Wednesday, April 5, 2023 at 12:28 PM To: lustre-discuss@lists.lustre.org Subject: [lustre-discuss] Lnet config serving multiple
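If that route is missing, it can be added on the client with lnetctl; a sketch using the gateway NID from the thread (adjust the remote net name as needed):
lnetctl route add --net o2ib --gateway 10.215.25.76@o2ib2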

Re: [lustre-discuss] LNet nid down after some thing changed the NICs

2023-03-01 Thread Horn, Chris via lustre-discuss
Hi CJ, I don’t know if you ever got an account and ticket opened, but I stumbled upon this change which sounds like it could be your issue - https://jira.whamcloud.com/browse/LU-16378 commit 3c9282a67d73799a03cb1d254275685c1c1e4df2 Author: Cyril Bordage

Re: [lustre-discuss] LNet nid down after some thing changed the NICs

2023-02-17 Thread Horn, Chris via lustre-discuss
If deleting and re-adding it restores the status to up then this sounds like a bug to me. Can you enable debug tracing, reproduce the issue, and add this information to a ticket? To enable/gather debug:
# lctl set_param debug=+net
# lctl dk > /tmp/dk.log
You can create a ticket at

Re: [lustre-discuss] max_frags 257 too large

2022-09-08 Thread Horn, Chris via lustre-discuss
The message is part of normal o2iblnd connection setup. It just means the two peers are negotiating the max number of fragments that will be supported. It is seen because https://jira.whamcloud.com/browse/LU-15092 changed the default max number of fragments from 256 to 257. If one peer has that

Re: [lustre-discuss] network error on bulk WRITE/bad log

2022-08-17 Thread Horn, Chris via lustre-discuss
[66494.575431] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Failed to map mr 1/8 elements
[66494.575446] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Can't map 32768 bytes (8/8)s: -22
These errors originate from a call to ib_map_mr_sg() which is part of the kernel

Re: [lustre-discuss] 'queue depth too large', but connection works

2022-02-03 Thread Horn, Chris via lustre-discuss
No, it is not necessary to tune map_on_demand with modern NICs/MOFED drivers. Latest Lustre can only accept values of ‘0’ or ‘1’. Setting ‘0’ forces the use of global memory regions (when available), but the global MR API was removed (or deprecated?) by Mellanox. A recent change was made to default
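To check the value currently in effect, the module parameter can be read directly (a sketch; assumes the ko2iblnd module is loaded):
cat /sys/module/ko2iblnd/parameters/map_on_demand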

Re: [lustre-discuss] 'queue depth too large', but connection works

2022-01-30 Thread Horn, Chris via lustre-discuss
Yes, this means the server has peer_credits=8, so it can only accept that value. It informs the client of this so that the subsequent client connection attempt uses the lower value. From: lustre-discuss on behalf of Thomas Roth Sent: Saturday, January 29, 2022 11:46 AM

Re: [lustre-discuss] IPoIB best practises

2022-01-19 Thread Horn, Chris via lustre-discuss
Local LNet configuration can be done either via modprobe config or via lnetctl/yaml. We are slowly moving away from modprobe config (kernel module parameters) in favor of lnetctl/yaml because the latter provides more flexibility. For IB and TCP networks, every interface needs an IP address
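As a rough sketch of the lnetctl/yaml style (interface names are hypothetical), /etc/lnet.conf could contain something like:
net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
    - net type: tcp
      local NI(s):
        - interfaces:
              0: eth0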

Re: [lustre-discuss] Disabling multi-rail dynamic discovery

2021-09-14 Thread Horn, Chris via lustre-discuss
When you start LNet via ‘modprobe lnet; lctl net up’, that doesn’t load the configuration from /etc/lnet.conf. It is going to configure LNet based only on kernel module parameters. Since you removed the ‘options lnet networks’ from your modprobe.conf file, it is going to use the default
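To actually load /etc/lnet.conf, the sequence is roughly (a sketch):
modprobe lnet
lnetctl lnet configure
lnetctl import < /etc/lnet.conf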

Re: [lustre-discuss] Disabling multi-rail dynamic discovery

2021-09-13 Thread Horn, Chris via lustre-discuss
I’m not sure why lnetctl import wouldn’t correctly set discovery. Might be a bug. You can try setting the kernel module parameter to disable discovery:
options lnet lnet_peer_discovery_disabled=1
This obviously requires LNet to be reloaded. I would not recommend toggling discovery via the CLI
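After reloading, the effective setting can be verified with lnetctl global show, whose output should include the discovery value (field name assumed here):
lnetctl global show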

Re: [lustre-discuss] non-existing client in logs

2021-07-15 Thread Horn, Chris via lustre-discuss
The peer may be stuck in the LNet health recovery queue. You may be able to check by running this on the MDS nodes:
$ lnetctl debug recovery --peer
You could resolve the issue by deleting the peer entry from the peer tables:
$ lnetctl peer del --prim 10.20.1.237@o2ib5