>>>>> "Tziporet" == Tziporet Koren <[EMAIL PROTECTED]> writes:

    Tziporet> Roland Fehrenbacher wrote:
    >> Hi,
    >> 
    >> when running MPI codes, we have the following error messages
    >> coming from some of our servers running 2.6.22.16 with kernel
    >> modules from ofa_kernel-1.2.5.4:
    >> 
    >> mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)
    >> 
    >> The communication on the corresponding machines is completely
    >> blocked, and ibstat is just hanging.
    >> 
    >> Any idea what could be wrong? Just for additional info: when
    >> running the kernel with the original 2.6.22 drivers, I had
    >> this kind of error message at a much higher rate.
    >> 
    >> 
    >> 
    Tziporet> What is the FW version you use?

# ibstat
CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.3.0
        Hardware version: 0
        Node GUID: 0x0002c9020025a69c
        System image GUID: 0x0002c9020025a69f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 199
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c9020025a69d



    Tziporet>   What is the type of machine used?

It is a dual Xeon (Quad core) on a 5000P chipset board.

    Tziporet> Can you send us description how to reproduce?

I started a 100-node mvapich job (linpack) with 8 cores per node, i.e.
800 processes. The issue occurred after about 1 hour of runtime. A
50-node job with 8 cores per node (400 processes) ran fine several
times for more than 36 hours (including the node on which this issue
has now occurred).
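
For scanning the kernel logs of many nodes for this message, here is a
minimal sketch. It assumes that the number in parentheses is a negated
Linux errno value, as is the usual kernel convention (under that
assumption, -16 would correspond to EBUSY). The function name
decode_failures and the regular expression are made up for illustration
and are not part of any OFED tool.

```python
import errno
import re

# Matches driver lines of the form quoted above, e.g.
# "mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)"
PATTERN = re.compile(r"mlx4_core (\S+): (\w+) failed \((-?\d+)\)")


def decode_failures(log_text):
    """Return a list of (pci_device, command, errno_name) tuples.

    Assumption: the value in parentheses is a negated errno, so its
    absolute value is looked up in Python's errno table.
    """
    results = []
    for match in PATTERN.finditer(log_text):
        device = match.group(1)
        command = match.group(2)
        code = int(match.group(3))
        name = errno.errorcode.get(abs(code), "UNKNOWN")
        results.append((device, command, name))
    return results


if __name__ == "__main__":
    sample = "mlx4_core 0000:08:00.0: SW2HW_MPT failed (-16)"
    for device, command, name in decode_failures(sample):
        print(device, command, name)
```

Feeding it the output of dmesg from each node would show which
adapters reported the failure and with which command.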

Roland