Joshua Baker-LePain wrote:
Do you use a statically linked binary or did you relink it with your
mpich?
Agh. I forgot to mention this little wrinkle. LSTC software
distribution is... interesting.
Yup. Caused us a lot of fun at some customer sites.
For mpp-dyna, they ship dynamically
linked binaries compiled against a specific version of LAM/MPI (7.0.3 in
this case).
Yup. Very hard to come by, that particular build. Very hard.
They also provide the matching pre-compiled LAM/MPI
libraries on their site. For a fun little wrinkle, RHEL/CentOS ships
LAM/MPI 7.0.6. However, the spec file in their RPM does *not* include
the --enable-shared flag. IOW, the OS vendor's LAM/MPI package has no
.so files.
I rebuilt this (the LAM) for our customer. Works nicely now.
It seems like it'd be worth re-compiling the CentOS LAM RPM to include
the shared libraries and run against those to see if it helps.
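A rough sketch of that rebuild, assuming the stock CentOS source RPM; the src.rpm filename and build paths are illustrative and will vary by release:

```shell
# Unpack the stock CentOS LAM source RPM (filename is an example --
# use whatever lam-7.0.6 src.rpm your mirror carries).
rpm -ivh lam-7.0.6-*.src.rpm
cd /usr/src/redhat/SPECS

# Edit lam.spec so the %configure line includes --enable-shared,
# then rebuild and upgrade the installed package:
rpmbuild -ba lam.spec
rpm -Uvh /usr/src/redhat/RPMS/*/lam-7.0.6-*.rpm
```

After that, the rebuilt package should actually contain liblam/libmpi `.so` files for the dynamically linked mpp-dyna binary to pick up.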
Try an ldd against mpp-dyna-big-long-name
We have run LSTC LS-DYNA mpp970 and mpp971 across more than 16 nodes
without any issues on Scyld CW4, which is also CentOS 4 based.
We can run straight structural sims across as many nodes/CPUs as we've
tried, and ditto for straight thermal sims. It's just on coupled
structural/thermal sims that this issue crops up. That, to me, rather
points to a bug in dyna itself. But the fact that the bug manifests
itself (at least in part) by the MPI job trying to talk to a different
network interface than was 'lamboot'ed is what is throwing me off a bit.
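One way to take hostname resolution out of the picture is to boot LAM from a schema that lists the IP addresses of the interface you intend MPI traffic to use, then confirm what LAM actually bound to. Addresses below are purely illustrative:

```shell
# Boot schema listing nodes by the IPs of the intended interface
# (e.g. the private cluster network), not by hostnames that might
# resolve to a different NIC.
cat > lamhosts <<'EOF'
192.168.1.1
192.168.1.2
192.168.1.3
EOF

lamboot -v lamhosts   # boot the LAM universe verbosely
lamnodes              # show which addresses LAM actually booted on
```

If `lamnodes` reports the right addresses but the coupled job still chatters on another interface, that would strengthen the case that the bug is in dyna rather than in the LAM setup.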
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf