On 05/10/2020 07:27, Jordi Caubet Serrabou wrote:
> Coming to the routing point, is there any reason why you need it? I
> mean, is this because GPFS is trying to connect between compute nodes,
> or for a reason outside GPFS's scope?
> If the reason is GPFS, imho the best approach - without knowledge of
> the licensing you have - would be to use separate clusters: a storage
> cluster and two compute clusters.
The issue is that individual nodes want to talk to one another on the
data interface, which caught me by surprise as the cluster is set to
adminMode central.
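For anyone following along, the setting in question is the Spectrum
Scale `adminMode` configuration option, which can be inspected and set
as below. Worth noting: adminMode only governs *admin command* traffic;
the GPFS daemons still need all-to-all connectivity over the daemon
interface, which would explain the surprise above.

```shell
# Show the current admin mode (central = only designated nodes
# issue admin commands to the rest of the cluster).
mmlsconfig adminMode

# Set it to central for the whole cluster if it is not already.
mmchconfig adminMode=central
```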
The admin interface runs over Ethernet for all nodes on a specific
VLAN, which is given 802.1p priority 5 (that's Voice: < 10 ms latency
and jitter). That saved a bunch of switching and cabling, as you don't
need an extra physical interface for the admin traffic. The cabling
already significantly restricts airflow in a compute rack as it is,
without adding a whole bunch more for a barely used admin interface.
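For reference, the tagged-VLAN-with-priority arrangement described
above can be set up on the host side with iproute2; a sketch, where
the interface name and VLAN id are made-up placeholders:

```shell
# Create a VLAN subinterface for admin traffic and map skb
# priority 0 to 802.1p priority 5 (Voice) in the outgoing
# VLAN header. eth0 and VLAN id 42 are placeholders.
ip link add link eth0 name eth0.42 type vlan id 42 egress-qos-map 0:5
ip link set dev eth0.42 up
```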
To be frank, it's as if the people who wrote the best practice about a
separate interface for the admin traffic know very little about
networking. This is all last-century technology.
The nodes for undergraduate teaching only have a couple of 1Gb
Ethernet ports, which would suck for storage traffic. However, they
also have QDR Infiniband: even though undergraduates can't run
multinode jobs, the Lustre storage on the old cluster was delivered
over Infiniband, so those nodes got Infiniband cards.
> Both compute clusters join the storage cluster using a multicluster
> setup. There is no need for the compute clusters to see each other;
> they only need to see the storage cluster. One cluster uses the 10G
> interface, the other the IPoIB interface.
> You need at least three quorum nodes in each compute cluster, but if
> licensing is per drive on the DSS, that is covered.
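For anyone unfamiliar with it, Jordi's suggestion is the standard GPFS
multicluster (remote mount) setup; roughly as follows, with cluster,
node, key and filesystem names all being placeholders:

```shell
# On each cluster: generate and enable an authentication key.
mmauth genkey new
mmauth update . -l AUTHONLY

# On the storage cluster: authorise a compute cluster and grant
# it access to the filesystem.
mmauth add compute1.example.com -k compute1_id_rsa.pub
mmauth grant compute1.example.com -f gpfs0

# On each compute cluster: define the remote cluster, then add
# and mount its filesystem.
mmremotecluster add storage.example.com -n nsd1,nsd2 -k storage_id_rsa.pub
mmremotefs add gpfs0 -f gpfs0 -C storage.example.com -T /gpfs
mmmount gpfs0
```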
Three clusters is starting to get complicated from an admin
perspective. The biggest issue is coordinating maintenance and keeping
sufficient quorum nodes up.
Maintenance on compute nodes is done via the job scheduler. I know some
people think this is crazy, but it is in reality extremely elegant.
We can schedule a reboot on a node as soon as its current job has
finished (usually used for firmware upgrades), or we can schedule a
job to run as root (usually for applying updates) as soon as the
current job has finished. As such, we have no way of knowing when that
will be for any given node, and there is the potential for all three
quorum nodes to be down at once.
Using this scheme we can seamlessly upgrade the nodes, safe in the
knowledge that a node is either busy and running the current
configuration, or has been upgraded and is running the new one.
Consequently, multinode jobs are guaranteed to have all their nodes
running the same configuration.
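The scheduler isn't named above, but with Slurm the two maintenance
patterns described would look something like this (a sketch; the node
name and update script are placeholders):

```shell
# Reboot a node as soon as its current job finishes (e.g. after
# staging a firmware upgrade), then return it to service.
scontrol reboot ASAP nextstate=RESUME reason="firmware" node123

# Or queue an update script as the next "job" on that node
# (assumes the scheduler is configured to accept root jobs).
sbatch --nodelist=node123 --exclusive update-packages.sh
```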
The alternative is to drain the node, but there is only about a 23%
chance the node will become free during working hours, leading to a
significant loss of compute time when doing maintenance compared to
our existing scheme, where the loss is only as long as the upgrade
takes to install. Pretty much the only time we have idle nodes is when
the scheduler is reserving nodes ready to schedule a multinode job.
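The 23% figure is presumably just working hours as a fraction of the
week, assuming job end times are spread uniformly and a roughly
40-hour, 5-day working week:

```shell
# Fraction of the week inside working hours, assuming ~40 of
# the 168 hours in a week count as working hours.
awk 'BEGIN { printf "%.1f%%\n", 40/168*100 }'
```

which comes out at about 23.8%, in line with the quoted figure.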
Right now we have a single cluster, with the quorum nodes being the
two DSS-G nodes and the node used for backup. It is easy to ensure
quorum is maintained on these; they also all run real RHEL, whereas
the compute nodes run CentOS.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss