I just want to mention (not being a sysadmin professionally, at all) that you could get exactly this result if something were assigning IP addresses sequentially, e.g. node1 = foo.bar.1, node2 = foo.bar.2, ..., and something else had already assigned 13 to a public thing, say, a webserver that is not open on the port that MPI uses. I know nada about addressing a CPU within a multiprocessor machine, but if each CPU has its own IP address then it could choke this way.
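One rough way to test that guess is to try a plain TCP connection to every address the job might use and see which ones time out. The following is only a sketch: the function names are made up for illustration, and the address range and port in the usage comment are placeholders, not anything taken from the cluster in question.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def sweep(hosts, port, timeout=3.0):
    """Map each host to whether it accepted a TCP connection on the given port."""
    return {host: can_connect(host, port, timeout) for host in hosts}


# Example use (hypothetical private addresses and port; substitute your own):
#   hosts = ["192.168.0.%d" % i for i in range(1, 17)]
#   for host, ok in sweep(hosts, 22).items():
#       print(host, "ok" if ok else "UNREACHABLE")
```

If one address in the sequence (the 13th, say) is the only one that fails, that would point at an addressing collision rather than at the solver itself.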
Peter

On 3/14/07, Joshua Baker-LePain <[EMAIL PROTECTED]> wrote:
I have a user trying to run a coupled structural thermal analysis using mpp-dyna (mpp971_d_7600.2.398). The underlying OS is centos-4 on x86_64 hardware. We use our cluster largely as a COW, so all the cluster nodes have both public and private network interfaces. All MPI traffic is passed on the private network.

Running a simulation via 'mpirun -np 12' works just fine. Running the same sim (on the same virtual machine, even, i.e. in the same 'lamboot' session) with -np > 12 leads to the following output:

    Performing Decomposition -- Phase 3 03/12/2007 11:47:53

    *** Error the number of solid elements 13881
    defined on the thermal generation control
    card is greater than the total number
    of solids in the model 12984

    *** Error the number of solid elements 13929
    defined on the thermal generation control
    card is greater than the total number
    of solids in the model 12985
    connect to address $ADDRESS: Connection timed out
    connect to address $ADDRESS: Connection timed out

where $ADDRESS is the IP address of the *public* interface of the node on which the job was launched. Has anybody seen anything like this? Any ideas on why it would fail over a specific number of CPUs?

Note that the failure is CPU dependent, not node-count dependent. I've tried on clusters made of both dual-CPU machines and quad-CPU machines, and in both cases it took 13 CPUs to create the failure. Note also that I *do* have a user writing his own MPI code, and he has no issues running on >12 CPUs.

Thanks.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
