> > for example, in MPI, process A know the HCA guid on another node.
> > After running for some time, the switch is restarted for
> > some reason, and the whole fabric is re-configured.
>
> CQ,
>
> If by "the whole fabric is re-configured" you refer to a case
> where a subnet prefix changes while a job runs and a process
> is detached/reattached to the job so now you want to adopt
> your design to handle it, is over engineering, why you want
> to do that?
>
My concern is the port LID changing. It is always best if a process can obtain the information it needs by itself; an SA query is the right way to do that, and it is defined in the IB spec.

It is possible to have the processes exchange their information (port LIDs) again, but there are difficulties: in the middle of a long job run it is hard for two processes to coordinate such an exchange, and it requires a second channel. If that second channel is IPoIB, it is broken as well and has to be re-established first.

I am just asking for the SA functionality. If that is not possible, we will have to use a very complicated scheme to let HP-MPI survive a network failure.

--CQ

> Or.
_______________________________________________
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
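
P.S. The recovery path argued for above (keep the peer's port GUID, which does not change, and re-resolve its LID through the SA when a send fails) can be sketched roughly as follows. This is only a minimal Python illustration, not HP-MPI or OFED code: `PeerLidCache`, `send_with_recovery`, and the `query_sa` and `send` callables are hypothetical names, and `query_sa` merely stands in for whatever real SA query interface is available.

```python
from typing import Callable, Dict


class PeerLidCache:
    """Maps a peer's port GUID (stable) to its current LID (may change)."""

    def __init__(self, query_sa: Callable[[int], int]) -> None:
        # query_sa is a hypothetical stand-in for an SA query that
        # returns the LID currently assigned to the given port GUID.
        self._query_sa = query_sa
        self._lids: Dict[int, int] = {}

    def lid_for(self, guid: int) -> int:
        """Return the cached LID, querying the SA only on first use."""
        if guid not in self._lids:
            self._lids[guid] = self._query_sa(guid)
        return self._lids[guid]

    def refresh(self, guid: int) -> int:
        """Discard the cached LID and ask the SA again."""
        self._lids[guid] = self._query_sa(guid)
        return self._lids[guid]


def send_with_recovery(cache: PeerLidCache, guid: int,
                       send: Callable[[int], int]) -> int:
    """Send using the cached LID; on failure, re-query the SA once and retry.

    `send` is a hypothetical callable that raises ConnectionError when
    the LID it is given no longer reaches the peer.
    """
    try:
        return send(cache.lid_for(guid))
    except ConnectionError:
        # The fabric may have been re-configured: the GUID is still
        # valid, so only the LID has to be resolved again.
        return send(cache.refresh(guid))
```

The point of the sketch is that each process recovers on its own: no coordination with the peer and no second channel are needed, only access to the SA.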