We have, and we’ve had it come and go with no clear explanation. I’d watch out 
for MTU and netmask mismatches, sysctl limits that might be relevant (apparently 
the default settings for how much time the kernel spends on Ethernet processing 
are really tuned for <1 Gb links, not so much for anything faster), hot spots 
on the network, etc.
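
For what it's worth, here is a rough Python sketch of how the MTU and netmask checks could be automated on Linux nodes. The function names are made up for illustration. The sysctl names in the comments (net.core.netdev_budget and net.core.netdev_budget_usecs) are only my guess at which knobs might be involved, not settings named in this thread, so treat them as an assumption.

```python
import ipaddress
from collections import Counter
from pathlib import Path


def read_mtu(iface, sys_root="/sys/class/net"):
    """Read an interface's MTU from Linux sysfs."""
    return int(Path(sys_root, iface, "mtu").read_text().strip())


def read_sysctl(name, proc_root="/proc/sys"):
    """Read a sysctl by dotted name, e.g. 'net.core.netdev_budget'.

    Which sysctls matter here is an assumption; netdev_budget and
    netdev_budget_usecs are just plausible candidates.
    """
    return Path(proc_root, *name.split(".")).read_text().strip()


def mtu_outliers(mtu_by_node):
    """Return the nodes whose MTU differs from the most common value."""
    if not mtu_by_node:
        return {}
    common, _ = Counter(mtu_by_node.values()).most_common(1)[0]
    return {node: mtu for node, mtu in mtu_by_node.items() if mtu != common}


def same_subnet(cidr_a, cidr_b):
    """True if two 'address/prefix' strings describe the same network."""
    net_a = ipaddress.ip_interface(cidr_a).network
    net_b = ipaddress.ip_interface(cidr_b).network
    return net_a == net_b
```

The idea would be to gather read_mtu() and the address/prefix of the 10GbE interface from every node, however you normally fan commands out, and then feed the results to mtu_outliers() and same_subnet() to see whether the flapping nodes stand out from the healthy ones.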

|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark

On Oct 10, 2023, at 22:29, James Lam <unison2...@gmail.com> wrote:

We have a cluster of 176 nodes with an InfiniBand switch and 10GbE, and we 
use the 10GbE network for SSH. We currently have two generations of 10GbE cards:
- the older Marvell 10GbE cards, installed at launch
- the current batch of QLogic 10GbE cards

We are using Slurm 20.11.4 as the server, and the node health check daemon is 
also deployed using the OpenHPC method. We have no issues with the Marvell 
10GbE cards: those nodes do not flip between the Slurm down <--> idle states. 
With the QLogic cards, however, we do see nodes flip-flopping between the 
down <--> idle states.

We tried increasing the ARP caching and changing the client to patch release 
20.11.9, but neither helped.

Has anyone faced a similar situation?
