Dear all,
I have some questions regarding the following scenario:
- A large HPC system.
- Let's assume that Job X is running on 1 compute node and is reading
a very large file with a stripe count much greater than 1, up to -1
(i.e., striped across all OSTs); see the sketch after this list.
Alternatively, tons of files are read at once, each with smaller
striping, but distributed across all OSS/OSTs.
- The compute node is connected, for example, with a 100 Gb/s link,
and there are 50 servers, each with a 200 Gb/s link. The servers can
inject up to 50 x 200 Gb/s = 10 Tb/s toward a client that can only
drain 100 Gb/s, i.e. a 100:1 oversubscription at the client link.
- Job Y, which shares the same network and potentially doesn't even
perform any I/O, suffers a lot as a result.
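
To make the striping point concrete, here is a minimal sketch of how
such a wide-striped file comes into being (the path is only an
example):

  # stripe a new file across all available OSTs (-1 means all)
  lfs setstripe -c -1 /lustre/scratch/bigfile.dat
  # check the resulting stripe count; with 50 servers this can
  # easily span every OST in the filesystem
  lfs getstripe -c /lustre/scratch/bigfile.dat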
Does this scenario sound familiar to you?
Is the sequence of events correct?
What could be done in this situation?
One could try to avoid:
a) having such single-/few-node jobs
b) striping large files across up to all OSTs (-1)
c) reading millions of files at once
But I have concerns that the users will persist in doing these
things, either intentionally or accidentally, and that it would only
shift the problem rather than solve it.
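
For (b), one partial measure might be conservative directory
defaults, although users can still override them per file (the path
and stripe count are only examples):

  # set a modest default stripe count for new files in a tree
  lfs setstripe -c 4 /lustre/scratch/project
  # a user can still request -c -1 explicitly for a single file,
  # so this changes the default without enforcing a limit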
One could tweak the network design, reconfigure it, or separate I/O
from communication traffic, but that would hardly optimize all use
cases. Virtual lanes could potentially be a solution as well, though
they might not help if Job Y also involves some I/O.
Wouldn't it be better if Lustre somehow recognized this imbalance
between incoming and outgoing network traffic and loaded the
file(s)/data gradually rather than all at once, saturating or
slightly overloading the consumer's 100 Gb/s connection instead of
oversubscribing it by a factor of 100? Does this sound reasonable,
and is there already a solution for it?
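
The closest existing knobs I am aware of are the per-OSC RPC limits
on the client; if I understand them correctly, they bound RPC
concurrency and size rather than bandwidth, so they only approximate
what I mean:

  # limit outstanding bulk RPCs per OST connection
  lctl set_param osc.*.max_rpcs_in_flight=8
  # limit the size of each bulk RPC (in 4 KiB pages)
  lctl set_param osc.*.max_pages_per_rpc=256

On the server side, the NRS TBF policy can rate-limit RPCs per
client, but that is a static policy rather than the adaptive,
congestion-aware throttling described above.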
I would appreciate any opinions.
Best regards
Anna
--
Anna Fuchs
Universität Hamburg
https://wr.informatik.uni-hamburg.de/people/anna_fuchs