Remove me from this email distribution. Thanks

-----Original Message-----
From: lustre-discuss <[email protected]> On Behalf Of 
[email protected]
Sent: Wednesday, March 5, 2025 10:43 PM
To: [email protected]
Subject: lustre-discuss Digest, Vol 228, Issue 4

Send lustre-discuss mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        https://urldefense.com/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!NpxR!iLqFg9mHSFdTvFBBdIctwSExoOvHn9aTqkO3lh98X6RuKATpU7m6URNvsD02ThhhDvx7OcvFTi8J-UMgYDTpr5zNWPRU16KLKZxY_Q$
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific than "Re: 
Contents of lustre-discuss digest..."


Today's Topics:

   1. Re: Lustre MDT/OST Mount Failures During Virtual Machine
      Reboot with Pacemaker (Laura Hild)
   2. Re: multi-hop routing (John White)
   3. Re: multi-hop routing (John White)
   4. Re: lustre-discuss Digest, Vol 228, Issue 3 (Berry-Lozano, Erica)


----------------------------------------------------------------------

Message: 1
Date: Wed, 5 Mar 2025 22:12:00 +0000
From: Laura Hild <[email protected]>
To: "[email protected]" <[email protected]>
Cc: lustre-discuss <[email protected]>
Subject: Re: [lustre-discuss] Lustre MDT/OST Mount Failures During
        Virtual Machine Reboot with Pacemaker
Message-ID:
        <blapr09mb62116113a1788bd2d8b6dad2dc...@blapr09mb6211.namprd09.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-2"

I'm not sure what to say about how Pacemaker *should* behave, but I *can* say I 
virtually never try to (cleanly) reboot a host from which I have not already 
evacuated all resources, e.g. with `pcs node standby` or by putting Pacemaker 
in maintenance mode and unmounting/exporting everything manually.  If I can't 
evacuate all resources and complete a lustre_rmmod, the host is getting 
power-cycled.

So my guess would be that, in the host's shutdown process, stopping the 
Pacemaker service happens before filesystems are unmounted, and that Pacemaker 
doesn't want to assume whether its own shutdown means it should go into standby 
or initiate maintenance mode.  The other host therefore ends up knowing only 
that its partner has disappeared, while the filesystems have yet to be 
unmounted.
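
One way to script the evacuate-first habit described above is sketched below. It assumes a pcs version that provides `pcs node standby` with `--wait`, that the script runs on the node being rebooted, and that `systemctl reboot` is the reboot command in use. The names `run`, `evacuate_and_reboot` and `DRY_RUN` are made up for this illustration. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: evacuate a Pacemaker node before a clean reboot.
# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

evacuate_and_reboot() {
    node="$1"
    [ -n "$node" ] || { echo "usage: evacuate_and_reboot <node>" >&2; return 1; }
    # Move all resources off the node and wait for the moves to finish.
    run pcs node standby "$node" --wait || return 1
    # Unload the Lustre modules; this fails if a target is still mounted.
    run lustre_rmmod || {
        echo "lustre_rmmod failed; power-cycle instead of rebooting" >&2
        return 1
    }
    run systemctl reboot
}
```

Calling `evacuate_and_reboot oss1` prints the three commands for review; setting `DRY_RUN=0` runs them. Once the host is back, `pcs node unstandby oss1` would return it to service.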



------------------------------

Message: 2
Date: Wed, 5 Mar 2025 14:29:10 -0800
From: John White <[email protected]>
To: "Horn, Chris" <[email protected]>
Cc: "[email protected]"
        <[email protected]>
Subject: Re: [lustre-discuss] multi-hop routing
Message-ID: <[email protected]>
Content-Type: text/plain;       charset=utf-8

Oh, so don't even tell the client about tcp!  That seems to have immediately 
kicked things into place!
I owe you a beverage of your choice if we ever meet up!

Seriously, the imposter syndrome was getting _bad_ the last few days here.
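
The rule behind this fix, as the kernel message quoted further down suggests ("There is no local interface configured on LNet tcp"), appears to be that a route's gateway must be a NID on a network the node itself is attached to. The sketch below checks a planned route against that rule before it goes into a config file. The function names `gateway_is_local` and `norm_net` are hypothetical helpers, not part of Lustre.

```shell
#!/bin/sh
# Treat a bare network name as network 0, so "tcp" and "tcp0" compare equal.
norm_net() {
    case "$1" in
        *[0-9]) echo "$1" ;;     # already numbered: o2ib2, tcp0
        *)      echo "${1}0" ;;  # bare name: tcp -> tcp0, o2ib -> o2ib0
    esac
}

# Usage: gateway_is_local <gateway NID> <local net>...
# Succeeds only if the gateway's network is one of the node's local networks.
gateway_is_local() {
    gw_net="${1##*@}"            # text after the last "@", e.g. "o2ib2"
    shift
    for net in "$@"; do
        [ "$(norm_net "$net")" = "$(norm_net "$gw_net")" ] && return 0
    done
    return 1
}
```

For a client attached only to o2ib2, `gateway_is_local "10.6.0.[250-251]@o2ib2" o2ib2` succeeds, while `gateway_is_local "10.37.250.[162-163]@tcp0" o2ib2` fails, which matches the route that was rejected at module load.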

> On Mar 5, 2025, at 12:05 PM, Horn, Chris <[email protected]> wrote:
> 
> You need LNet routes configured on all nodes. It should look something like 
> this:
> 
> # pdsh -w n0[0-3] 'lctl list_nids; lctl show_route' | dshbak -c
> ----------------
> server
> ----------------
> 172.18.2.5@o2ib
> net              o2ib2 hops 2 gw                  172.18.2.6@o2ib up pri 0
> ----------------
> router1
> ----------------
> 172.18.2.6@o2ib
> 172.18.2.2@tcp
> net              o2ib2 hops 1 gw                   172.18.2.3@tcp up pri 0
> ----------------
> router2
> ----------------
> 172.18.2.7@o2ib2
> 172.18.2.3@tcp
> net               o2ib hops 1 gw                   172.18.2.2@tcp up pri 0
> ----------------
> client
> ----------------
> 172.18.2.8@o2ib2
> net               o2ib hops 2 gw                 172.18.2.7@o2ib2 up pri 0
> #
> Chris Horn
> 
> From: lustre-discuss <[email protected]> on 
> behalf of John White via lustre-discuss 
> <[email protected]>
> Date: Wednesday, March 5, 2025 at 1:17 PM
> To: [email protected] <[email protected]>
> Subject: [lustre-discuss] multi-hop routing
> 
> Hello folks.  I have a rare situation that I'm told some centers are 
> successfully pulling off and am looking for guidance - multi-hop lnet routing.
> In short, I have 2 distinct o2ib fabrics at disparate geo sites joined by a 
> routed ethernet fabric.  I'm looking to use a 2-lnet-router chain to plumb 
> the two o2ib fabrics together.
> 
> servers on the left, clients on the right
> o2ib0(10.5.0.0/16) <-> router(o2ib0,tcp0) <-> routed eth (10.37.0.0/16, 10.38.0.0/16) <-> router(tcp0,o2ib2) <-> o2ib2(10.6.0.0/16)
> 
> I have both sets of routers up but traffic absolutely fails the 2nd hop in 
> either direction (I can `lctl ping` tcp0 from o2ib2 and o2ib0 but no further).
> 
> I've tried adding a route ON the routers, that didn't help.
> 
> I've tried defining the 2nd hop on the client:
> options lnet routes="tcp0 10.6.0.[250-251]@o2ib2;\
> o2ib0 10.37.250.[162-163]@tcp0"
> 
> but that failed with the following kern message on lnet load:
> 74067:0:(router.c:644:lnet_add_route()) Cannot add route with gateway 
> 10.37.250.162@tcp. There is no local interface configured on LNet tcp
> 
> Does anyone have any hints here?  It feels like I'm a syntax change or a 
> routing hint away from getting this working.
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> https://urldefense.com/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!NpxR!keuGPb7MHd7CQc6Zi_uwIvFahK68FJfbq9MNIXgHpd0W8bi5vOYFHf-IixYY5DiOnJKx0z9-Ht8VqH1ew82XWtaTRaoq$
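
The `lctl show_route` output quoted above could be produced with `lnetctl` along the lines below. This is a sketch that reuses the example NIDs from Chris Horn's reply and assumes an `lnetctl` that accepts `route add --net ... --gateway ... --hop ...` and `set routing 1`; the option spelling is worth checking against the installed version. The function name `route_cmds` is made up here. It prints the commands for one node role so they can be reviewed before being piped to `sh` on the matching node.

```shell
#!/bin/sh
# Print the lnetctl commands for one node role in the example topology:
#   server(o2ib) <-> router1(o2ib,tcp) <-> router2(tcp,o2ib2) <-> client(o2ib2)
route_cmds() {
    case "$1" in
        server)
            # The far fabric is two hops away, via the local router's o2ib NID.
            echo "lnetctl route add --net o2ib2 --gateway 172.18.2.6@o2ib --hop 2"
            ;;
        router1)
            echo "lnetctl set routing 1"
            echo "lnetctl route add --net o2ib2 --gateway 172.18.2.3@tcp --hop 1"
            ;;
        router2)
            echo "lnetctl set routing 1"
            echo "lnetctl route add --net o2ib --gateway 172.18.2.2@tcp --hop 1"
            ;;
        client)
            # The client names only its local router; it never mentions tcp.
            echo "lnetctl route add --net o2ib --gateway 172.18.2.7@o2ib2 --hop 2"
            ;;
        *)
            echo "usage: route_cmds server|router1|router2|client" >&2
            return 1
            ;;
    esac
}
```

Each end node names only the gateway on its own fabric, and each router names the peer router on the shared tcp network as the gateway to the far fabric.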




------------------------------

Message: 3
Date: Wed, 5 Mar 2025 15:47:34 -0800
From: John White <[email protected]>
To: "Horn, Chris" <[email protected]>
Cc: "[email protected]"
        <[email protected]>
Subject: Re: [lustre-discuss] multi-hop routing
Message-ID: <[email protected]>
Content-Type: text/plain;       charset=utf-8

Just a quick follow-up for posterity: I did seem to need to add a route for tcp 
on the server side.  lctl ping was working, but MGS communication was failing, 
saying it couldn't talk back to the router:
[Wed Mar  5 15:28:26 2025] LNetError: 28576:0:(lib-move.c:2078:lnet_handle_find_routed_path()) no route to 10.38.0.250@tcp from 10.5.250.22@o2ib
[Wed Mar  5 15:28:26 2025] LNetError: 28576:0:(lib-move.c:3991:lnet_parse_get()) 10.5.250.22@o2ib: Unable to send REPLY for GET from 12345-10.38.0.250@tcp: -113

Adding a route to tcp through its geo-local router fixed that, and we've got 
mounts passing IO.  Didn't seem to need to do the same for clients at all.
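
In module-parameter form, the end state described here might look like the output of the sketch below. The helper name `lnet_routes_option` is hypothetical. The client-side gateway range is the one quoted earlier in the thread and is presumably what the working client config points at. The server-side router's o2ib0 NID is never stated in the thread, so it is passed in as a placeholder argument rather than guessed.

```shell
#!/bin/sh
# Join one or more "<destination net> <gateway NID>" entries into a single
# "options lnet routes=..." line suitable for a modprobe.d file.
lnet_routes_option() {
    routes=""
    for entry in "$@"; do
        routes="${routes:+$routes; }$entry"
    done
    printf 'options lnet routes="%s"\n' "$routes"
}

# Client on o2ib2: only the far o2ib0 fabric, via the local routers.
#   lnet_routes_option "o2ib0 10.6.0.[250-251]@o2ib2"
#
# Server on o2ib0: the far o2ib2 fabric AND the tcp network, both via the
# geo-local router (SERVER_GW is a placeholder for its o2ib0 NID).
#   lnet_routes_option "o2ib2 $SERVER_GW" "tcp0 $SERVER_GW"
```

The second server entry is the piece added in this follow-up: without a route to tcp, the server had no way to answer traffic arriving from a router's tcp NID.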




------------------------------

Message: 4
Date: Thu, 6 Mar 2025 04:42:18 +0000
From: "Berry-Lozano, Erica" <[email protected]>
To: "[email protected]"
        <[email protected]>
Subject: Re: [lustre-discuss] lustre-discuss Digest, Vol 228, Issue 3
Message-ID:
        <ph0pr84mb1407e5335dca60e74f5b3a02d0...@ph0pr84mb1407.namprd84.prod.outlook.com>
Content-Type: text/plain; charset="us-ascii"

Please remove me from this email distribution list.  Thanks

------------------------------

End of lustre-discuss Digest, Vol 228, Issue 4
**********************************************
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
