>>>>> "Tziporet" == Tziporet Koren <[EMAIL PROTECTED]> writes:
Tziporet> Roland Fehrenbacher wrote:
>> Hi,
>>
>> when running MPI codes, we have the following error messages
>> coming from some of our servers running 2.6.22.16 with kernel
>> modules from ofa_kernel-1.2.5.4:
>>
>> mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
>>
>> The communication on the corresponding machines is completely
>> blocked, and ibstat is just hanging.
>>
>> Any idea what could be wrong? Just for additional info: When
>> running the kernel with the original 2.6.22 drivers, I had
>> these kinds of error messages at a much higher rate.
>>
>>
>>
Tziporet> What is the FW version you use?
# ibstat
CA 'mlx4_0'
    CA type: MT25418
    Number of ports: 2
    Firmware version: 2.3.0
    Hardware version: 0
    Node GUID: 0x0002c9020025a69c
    System image GUID: 0x0002c9020025a69f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 20
        Base lid: 199
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510868
        Port GUID: 0x0002c9020025a69d
Tziporet> What is the type of machine used?
It is a dual Xeon (Quad core) on a 5000P chipset board.
Tziporet> Can you send us description how to reproduce?
I started a 100 nodes x 8 cores = 800 processes mvapich job
(linpack). The issue occurred after about 1 hour of runtime. A 50 nodes
x 8 cores = 400 processes mvapich job ran fine several times for more
than 36 hours (including the node on which this issue has now occurred).
Roland