Steve Wise wrote:
Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Fri, 01 Jun 2007 11:00 -0500:
I'm helping a customer who is trying to run mvapich2 over Chelsio's RNIC. They're running a simple program that does an MPI init, 1000 barriers, then a finalize. They're using ofed-1.2-rc3, mpiexec-0.82, and mvapich2-0.9.8-p2 (not the mvapich2 from the OFED kit). Also, they aren't using mpd to start things up; they're using PMI, I guess (I'm not sure what PMI is, but their mpiexec has -comm=pmi). BTW: I can run the same program fine on my 8-node cluster using mpd and the OFA mvapich2 code.
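
For reference, a minimal sketch of that kind of test program, assuming plain MPI calls (their actual source wasn't posted, so this is just an assumption about its shape):

    #include <mpi.h>
    #include <stdio.h>

    /* init, 1000 barriers, finalize */
    int main(int argc, char **argv)
    {
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 1000; i++)
            MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("all barriers done\n");

        MPI_Finalize();
        return 0;
    }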

Hey Steve.  The "customer" contacted me about helping with the
mpiexec aspects of things, assuming we're talking about the same
people.  It's just an alternative to the MPD startup program, but
uses the same PMI mechanisms under the hood as does MPD.  And it's a
much better way to launch parallel jobs, but I'm biased since I
wrote it.  :)

Does the hang in rdma_destroy_id() that you describe happen with
both mpd and mpiexec startup?

I doubt that the mpiexec issue would matter, but I frequently tell
people to try it using straight mpirun just to make sure.  The PMI
protocol under the hood is just a way for processes to exchange
data---mpiexec doesn't know anything about MPI itself or iWARP, it
just moves the information around.  So we generally don't see any
problems with starting up mpich2 programs on all sorts of weird
hardware.
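
To make that concrete, here's a rough sketch of what that exchange looks like from the process side, using the PMI-1 calls from the pmi.h that ships with MPICH2/MVAPICH2 (the "conninfo" key and value names here are made up for illustration):

    #include <stdio.h>
    #include "pmi.h"   /* PMI-1 client interface from MPICH2/MVAPICH2 */

    int main(void)
    {
        int spawned, rank, size, i;
        char kvsname[256], key[64], value[256];

        /* mpd and mpiexec both answer these calls the same way;
         * the process can't tell which launcher is on the other end. */
        PMI_Init(&spawned);
        PMI_Get_rank(&rank);
        PMI_Get_size(&size);
        PMI_KVS_Get_my_name(kvsname, (int)sizeof(kvsname));

        /* Publish this rank's connection info under an illustrative key. */
        snprintf(key, sizeof(key), "conninfo-%d", rank);
        snprintf(value, sizeof(value), "addr-of-rank-%d", rank);
        PMI_KVS_Put(kvsname, key, value);
        PMI_KVS_Commit(kvsname);
        PMI_Barrier();

        /* Fetch every other rank's info back out of the key-value space. */
        for (i = 0; i < size; i++) {
            snprintf(key, sizeof(key), "conninfo-%d", i);
            PMI_KVS_Get(kvsname, key, value, (int)sizeof(value));
        }

        PMI_Finalize();
        return 0;
    }

That's all the launcher does: shuttle key-value pairs between ranks, which is why hardware-specific problems almost never trace back to mpiexec itself.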

I'm happy to help if you have any more information.  I've asked
them to send me debug logs of the mpd and mpiexec startups, but I
don't have an account on their machine yet.

        -- Pete

Thanks Pete.

I've been out of town until today. I think they have it working. I believe the bug they saw was in an older version of mvapich2 that Sundeep fixed a while back. After rebuilding and reinstalling, they don't seem to hit it anymore; the symptoms definitely matched the previous bug he fixed.

Anyway, thanks for helping and for explaining mpiexec. I'll holler if anything else comes up.

Steve.

Ignore this last reply. I hadn't caught up on my email for this issue, and I think there may still be problems with all this.

Steve.