Joshua Baker-LePain wrote:
>>> Running a simulation via 'mpirun -np 12' works just fine. Running the
>>> same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
>>> session) with -np > 12 leads to the following output:
>>> [...]
>>> *** Error the number of solid elements 13929
>>> defined on the thermal generation control
>>> card is greater than the total number
>>> of solids in the model 12985
>>> connect to address $ADDRESS: Connection timed out
>>> connect to address $ADDRESS: Connection timed out
>> When you set up that VM via LAM, you did a lamboot ... Could you send
>> the output of
>>   tping -c 3 N
>> for the larger VM? Also, what does your machine file look like, and
>> could you share what
>>   lamboot -d machinefile
>> returns for N>12? Note that this is a fair bit of output, so you may
>> want to send it off-list.
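For readers unfamiliar with LAM, the diagnostics requested above boil down to booting the virtual machine from a machinefile and pinging every booted node. A minimal sketch (hostnames are made up for illustration):

```shell
# Boot the LAM virtual machine from a machinefile, with debug output
lamboot -d machinefile

# Ping all nodes in the booted VM ('N' means every node) three times;
# a node that cannot be reached will show up here before mpirun ever runs
tping -c 3 N
```

If tping already stalls or times out for the extra nodes, the problem is in the VM/boot layer rather than in the application.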
>>> where $ADDRESS is the IP address of the *public* interface of the node
>>> on which the job was launched. Has anybody seen anything like this?
>> Yes, with a borked DNS server on a head node, coupled to an incorrectly
>> set up queuing system. We have seen this at a few customer sites.
>>> Any ideas on why it would fail over a specific number of CPUs?
>> It doesn't sound like it is failing on a specific number of CPUs; it
>> sounds more like there is a public address, which likely has iptables
>> rules on it, preventing that node from reaching back into the private
>> space.
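If the public interface is the suspect, a quick way to check is to inspect the firewall on the launching node and, as a sketch, make sure the cluster's private range is accepted (the 192.168.0.0/24 subnet below is an assumption; substitute your own private range):

```shell
# Show the current filter rules with packet counters, to see whether
# connections coming back from the compute nodes are being dropped
iptables -L -n -v

# Minimal sketch only: accept traffic from the cluster's private
# subnet (assumed here to be 192.168.0.0/24) on the head node
iptables -A INPUT -s 192.168.0.0/24 -j ACCEPT
```

This is a diagnostic sketch, not a recommended production firewall policy.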
> Note that the failure is CPU-count dependent, not node-count dependent.
> I've tried on clusters made of both dual-CPU machines and quad-CPU
> machines, and in both cases it took 13 CPUs to trigger the failure.
> Note also that I *do* have a user writing his own MPI code, and he has
> no issues running on >12 CPUs.
What do the machine files look like? Are they auto-generated?
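For comparison, a LAM machinefile lists one host per line, with a cpu=N count for SMP nodes. One possible explanation of a hard 12-CPU limit is that the per-host cpu counts (or the number of hosts) in the booted machinefile only add up to 12, so 'mpirun -np 13' has to reach beyond the booted VM. A sketch for quad-CPU nodes (hostnames are made up):

```
# LAM/MPI machinefile: one host per line, cpu=N for SMP nodes
node01.cluster cpu=4
node02.cluster cpu=4
node03.cluster cpu=4
node04.cluster cpu=4
```

Counting the cpu= values in the file actually passed to lamboot, and comparing the total to 12, would quickly confirm or rule this out.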
Thanks.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf