> > For example, in MPI, process A knows the HCA GUID on another node.
> > After running for some time, the switch is restarted for
> > some reason, and the whole fabric is re-configured.
>
>
> CQ,
>
> If by "the whole fabric is re-configured" you mean a case
> where the subnet prefix changes while a job runs and a process
> is detached from and reattached to the job, then adapting
> your design to handle it is over-engineering. Why do you want
> to do that?
>

I am concerned about the port LID changing. It is always best if a
process can figure out the information it needs by itself. An SA query
is the right way to do that, and it is defined in the IB spec.
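
To make the idea concrete, here is a rough sketch in Python (for
brevity only, it is not HP-MPI code). It assumes each process keeps the
peer's port GUID, which stays stable, and treats the LID as a cached
value that is looked up again after a failed send. The names
PeerAddressCache, send_with_requery, resolve_lid and send are all
hypothetical; resolve_lid stands in for whatever SA query interface
would be exposed, and is not an existing API.

```python
from typing import Callable, Dict


class PeerAddressCache:
    """Maps a stable port GUID to its current LID, which may change."""

    def __init__(self, resolve_lid: Callable[[int], int]) -> None:
        # resolve_lid is a hypothetical stand-in for an SA query that
        # returns the current LID for a given port GUID.
        self._resolve_lid = resolve_lid
        self._lids: Dict[int, int] = {}

    def lid_for(self, guid: int) -> int:
        # Query only on a cache miss; otherwise reuse the cached LID.
        if guid not in self._lids:
            self._lids[guid] = self._resolve_lid(guid)
        return self._lids[guid]

    def invalidate(self, guid: int) -> None:
        # Forget the cached LID so the next lookup queries again.
        self._lids.pop(guid, None)


def send_with_requery(
    cache: PeerAddressCache,
    guid: int,
    send: Callable[[int], bool],
    max_attempts: int = 2,
) -> int:
    """Try to send; on failure drop the cached LID and look it up again.

    send is a hypothetical callable that takes a LID and returns True
    on success. Returns the LID that finally worked.
    """
    for _ in range(max_attempts):
        lid = cache.lid_for(guid)
        if send(lid):
            return lid
        cache.invalidate(guid)
    raise ConnectionError(f"peer {guid:#x} unreachable after re-query")
```

The point of the sketch is that recovery needs no cooperation from the
remote process and no second channel: the local side alone decides when
to ask again.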

While it is possible to have the processes exchange the information
(port LID) again, there are difficulties: in the middle of a long job
run it is hard to get two processes to coordinate such an exchange, and
it requires a second channel. If the second channel is IPoIB, it is
broken as well, and we would need to re-establish it first.

I am only asking for the SA functionality. If that is not possible, we
will have to use a very complicated scheme to let HP-MPI survive a
network failure.


--CQ



> Or.
>
