Joshua Baker-LePain wrote:


> Running a simulation via 'mpirun -np 12' works just fine. Running the
> same sim (on the same virtual machine, even, i.e. in the same
> 'lamboot' session) with -np > 12 leads to the following output:

> [...]
> 
> *** Error the number of solid elements 13929
> defined on the thermal generation control
> card is greater than the total number
> of solids in the model 12985
> connect to address $ADDRESS: Connection timed out
> connect to address $ADDRESS: Connection timed out

When you set up that VM via LAM, you did a lamboot. Could you send the output of

        tping -c 3 N

for the larger VM? Also, what does your machine file look like, and could you share what

        lamboot -d machinefile

returns for N>12? Note that the -d output is quite large, so you might want to send it off-list.
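For reference, a minimal diagnostic session along these lines might look like the following (hostnames and the simulation binary name are hypothetical, not taken from the original report):

        # Hypothetical LAM/MPI diagnostic session -- names are examples only.
        lamboot -d machinefile    # boot the LAM VM, with debug output
        lamnodes                  # list the nodes LAM actually booted
        tping -c 3 N              # echo-ping every node (N) in the VM
        mpirun -np 13 ./sim       # rerun the failing case

If tping already shows timeouts to some nodes, the problem is in the LAM VM itself rather than in the application.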

> where $ADDRESS is the IP address of the *public* interface of the node
> on which the job was launched. Has anybody seen anything like this?

Yes, with a borked DNS server on a head node, coupled with an incorrectly set up queuing system. We have seen this at a few customer sites.

> Any ideas on why it would fail over a specific number of CPUs?

It doesn't sound like it is failing at a specific number of CPUs. More likely there is a public address, probably with iptables rules on it, preventing that node from reaching back into the private address space.
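A quick way to check that theory from the node that printed the timeout (interface names and the private address here are hypothetical examples):

        # Hypothetical checks -- substitute your actual addresses.
        /sbin/ifconfig -a            # which interfaces/addresses does this node have?
        /sbin/iptables -L -n -v      # any DROP/REJECT rules on the public interface?
        ping -c 3 192.168.0.1        # can we reach back into the private space?

If the ping back into the private network hangs or the iptables listing shows a default DROP policy on the public side, that would match the "Connection timed out" messages exactly.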


> Note that the failure is CPU-count dependent, not node-count dependent.
> I've tried on clusters made of both dual-CPU machines and quad-CPU
> machines, and in both cases it took 13 CPUs to create the failure.
> Note also that I *do* have a user writing his own MPI code, and he has
> no issues running on >12 CPUs.

What do the machine files look like? Are they auto-generated?
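For comparison, a LAM boot schema (machine file) lists one host per line, optionally with a CPU count. A hypothetical file for three quad-CPU nodes might look like:

        # Hypothetical LAM boot schema -- hostnames are examples.
        node01 cpu=4
        node02 cpu=4
        node03 cpu=4

One thing worth checking: if an auto-generated schema advertised only 12 CPUs in total, then -np 13 would be the first request that forces LAM to schedule beyond the listed CPUs, which could plausibly produce a failure at exactly that count.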


Thanks.



--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
