On Wed, 14 Mar 2007 at 10:05am, Michael Will wrote
You mentioned your own code does not exhibit the issue but mpp-dyna does.
Yep.
What does the support team from the software vendor think the problem could be?
They say that we have an academic license which does not entitle us to any support, but they'll look at the issue if/when they have some spare time. Which, really, is fair, given what we pay for it.
Do you use a statically linked binary or did you relink it with your mpich?
Agh. I forgot to mention this little wrinkle. LSTC's software distribution is... interesting. For mpp-dyna, they ship dynamically linked binaries compiled against a specific version of LAM/MPI (7.0.3 in this case). They also provide the matching pre-compiled LAM/MPI libraries on their site. For an added twist, RHEL/CentOS ships LAM/MPI 7.0.6. However, the spec file in their RPM does *not* include the --enable-shared flag. IOW, the OS vendor's LAM/MPI package has no .so files.
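Given that mismatch, a quick sanity check is to see which shared libraries the shipped binary actually wants and whether they all resolve. A hedged sketch (/bin/ls stands in for the mpp-dyna binary here; the real path is site-specific):

```shell
# Sketch: confirm the binary is dynamically linked and that every
# shared library it expects resolves. Substitute the actual
# mpp-dyna binary path for /bin/ls.
BIN=/bin/ls
file "$BIN"
if ldd "$BIN" | grep -qi 'not found'; then
    echo "missing shared libraries:"
    ldd "$BIN" | grep -i 'not found'
else
    echo "all shared libraries resolved"
fi
```

Pointing LD_LIBRARY_PATH at LSTC's pre-compiled 7.0.3 libraries (or at a rebuilt 7.0.6) and re-running ldd shows exactly which liblam/libmpi the loader will pick up.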
It seems like it'd be worth re-compiling the centos lam RPM to include the shared libraries and run against those to see if it helps.
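For what it's worth, the spec-file change is a one-liner. A hypothetical sketch below demonstrates the edit on a stand-in fragment; the macro layout of the actual CentOS lam spec may differ:

```shell
# Stand-in fragment mimicking the configure line in the lam spec
# file; the real spec's layout is an assumption.
cat > lam.spec.fragment <<'EOF'
%configure --prefix=%{_prefix}
EOF
# Add --enable-shared so the build produces the .so files:
sed -i 's/%configure/%configure --enable-shared/' lam.spec.fragment
cat lam.spec.fragment
# → %configure --enable-shared --prefix=%{_prefix}
```

After making the equivalent edit to the real spec, `rpmbuild -ba lam.spec` against the 7.0.6 SRPM should yield packages that include the shared objects.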
We have run LSTC LS-DYNA mpp970 and mpp971 across more than 16 nodes without any issues on Scyld CW4, which is also CentOS 4 based.
We can run straight structural sims across as many nodes/CPUs as we've tried, and ditto for straight thermal sims. It's only on coupled structural/thermal sims that this issue crops up. That, to me, rather points to a bug in dyna itself. But the fact that the bug manifests (at least in part) as the MPI job trying to talk to a different network interface than the one that was 'lamboot'ed is what is throwing me off a bit.
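For what it's worth, LAM picks its interfaces by resolving the hostnames listed in the boot schema, so if a node's canonical name resolves to the public interface, traffic can end up on a different network than the one intended at lamboot time. A hypothetical boot schema that pins each node to a cluster-internal hostname (node names here are made up):

```
# Hypothetical lamboot boot schema: list the cluster-internal
# hostnames (which resolve to the private interface), not the
# public ones.
node01-internal cpu=2
node02-internal cpu=2
node03-internal cpu=2
```

Of course, if the coupled solver is resolving hostnames on its own, outside of MPI, that could produce the same symptom no matter what the boot schema says.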
--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
