Andrew,

Thanks for looking. These machines are SUN X2200s, and judging by the OUI of
the card it's a generic SUN/Mellanox HCA. This is SuSE SLES10 SP1 with the
QuickSilver (SilverStorm) 4.1.0.0.1 software release.

02:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor 
compatibility mode) (rev a0)
HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
  Primary image is valid, unknown source
  Secondary image is valid, unknown source

  Vital Product Data
    Product Name: Lion cub
    P/N: 375-3382-01
    E/C: A1
    S/N: 1388FMH-0728200266
    Freq/Power: N/A
    Checksum: Ok
    Date Code: N/A

s1471:/proc/iba/mt23108/1/port2 # cat info
Port 2 Info
   PortState: Active           PhysState: LinkUp    DownDefault: Polling
   LID:    0x0392              LMC: 0
   Subnet: 0xfe80000000000000  GUID: 0x0003ba000100430e
   SMLID:  0x0001   SMSL:  0   RespTimeout :  33 ms  SubnetTimeout:  536 ms
   M_KEY:  0x0000000000000000  Lease:     0 s       Protect: Readonly
   MTU:       Active:    2048  Supported:    2048  VL Stall: 0
   LinkWidth: Active:      4x  Supported:    1-4x  Enabled:    1-4x
   LinkSpeed: Active:   2.5Gb  Supported:   2.5Gb  Enabled:   2.5Gb
   VLs:       Active:     4+1  Supported:     4+1  HOQLife: 4096 ns
   Capability 0x02010048: CR CM SL Trap
   Violations: M_Key:     0 P_Key:     0 Q_Key:     0
   ErrorLimits: Overrun: 15 LocalPhys: 15  DiagCode: 0x0000
   P_Key Enforcement: In: Off Out: Off  FilterRaw: In: Off Out: Off

s1471:/proc/iba/mt23108/1/port2 # cat /etc/dat.conf
#
# DAT 1.1 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# [ICS VERSION STRING: @(#) ./config/dat.conf.64 4_1_0_0_1G [10/22/07 19:25]

# Following are examples of valid entries:
#Hca  u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "
#Hca0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 " " "
#Hca1 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost1 " " "
#Hca0Port1 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 ib1" " "
#Hca0Port2 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 ib2" " "
#=======
InfiniHost0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "

Qlogic now say they can reproduce it.

However, as we use the SilverStorm stack a lot, with many compilers and
for things such as IB transport for the Lustre filesystem, we try not to
keep too many flavors of IB/MPI around, but we also sometimes use OFED and
Qlogic's OFED for their PathScale cards. We also throw in Scali, MVAPICH
and MPICH - so we have a real mix to handle.

Regarding the lack of mvapi support in OpenMPI, that leaves just udapl
for stacks such as SilverStorm :-(

Thanks for looking,
Mostyn


On Tue, 6 Nov 2007, Andrew Friedley wrote:



Mostyn Lewis wrote:
Andrew,

Failure looks like:

+ mpirun --prefix /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron -np 8 -machinefile H ./a.out
Process 0 of 8 on s1470
Process 1 of 8 on s1470
Process 4 of 8 on s1469
Process 2 of 8 on s1470
Process 7 of 8 on s1469
Process 5 of 8 on s1469
Process 6 of 8 on s1469
Process 3 of 8 on s1470
30989:a.out *->0 (f=noaffinity,0,1,2,3)
30988:a.out *->0 (f=noaffinity,0,1,2,3)
30990:a.out *->0 (f=noaffinity,0,1,2,3)
30372:a.out *->0 (f=noaffinity,0,1,2,3)
30991:a.out *->0 (f=noaffinity,0,1,2,3)
30370:a.out *->0 (f=noaffinity,0,1,2,3)
30369:a.out *->0 (f=noaffinity,0,1,2,3)
30371:a.out *->0 (f=noaffinity,0,1,2,3)
 get ASYNC ERROR = 6

I thought this might be coming from the uDAPL BTL, but I don't see where
in the code this could possibly be printed from.

[s1469:30369] *** Process received signal ***
[s1469:30369] Signal: Segmentation fault (11)
[s1469:30369] Signal code: Address not mapped (1)
[s1469:30369] Failing at address: 0x110
[s1469:30369] [ 0] /lib64/libpthread.so.0 [0x2b528ceefc10]
[s1469:30369] [ 1] /lib64/libdapl.so(dapl_llist_next_entry+0x25) [0x2b528fba5df5]
[s1469:30369] *** End of error message ***

and in a /var/log/messages I see:

Nov  5 14:46:00 s1469 sshd[30363]: Accepted publickey for mostyn from 10.173.132.37 port 36211 ssh2
Nov  5 14:46:25 s1469 kernel: TVpd: !ERROR! Async Event:TAVOR_EQE_TYPE_CQ_ERR: (CQ Access Error) cqn:641
Nov  5 14:46:25 s1469 kernel: a.out[30374]: segfault at 0000000000000110 rip 00002b528fba5df5 rsp 00000000410010b0 error 4


This makes me wonder if you're using the right DAT libraries.  Take a
look at your dat.conf (usually found in /etc) and make sure that it is
configured properly for the Qlogic stuff and does NOT contain any lines
for other stuff (like OFED-based interfaces).  Usually each line contains
a path to a specific library to use for a particular interface; make sure
it's the library you want.  You might have to contact your uDAPL vendor
for help with that.
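
For example, with the SilverStorm/QuickSilver stack a pared-down dat.conf
might hold just a single provider entry pointing at its libdapl (the
interface name and library path below are only illustrative, mirroring the
entry you quoted above; yours may differ):

InfiniHost0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "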

This is reproducible.

Is this OpenMPI or your libdapl that's doing this, do you think?

I can't be sure -- every uDAPL implementation seems to interpret the
spec differently (or completely change or leave out some functionality),
making it hell to provide portable uDAPL support.  And currently the
uDAPL BTL has seen little/no testing outside of Sun's and OFED's uDAPL.

What kind of interface adapters are you using?  Sounds like some kind of
IB hardware; if possible I recommend using the OFED (openib BTL) or PSM
(PSM MTL) interfaces instead of uDAPL.
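
If the OFED stack is installed on these nodes, one way to steer a run away
from uDAPL is to select the transport explicitly on the mpirun line.  A
minimal sketch (hostfile name and process count copied from the failing run
above; assumes the openib or psm components are actually built):

mpirun --mca btl openib,sm,self -np 8 -machinefile H ./a.out
mpirun --mca pml cm --mca mtl psm -np 8 -machinefile H ./a.out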

Andrew


+ ompi_info
                Open MPI: 1.3a1svn11022007
   Open MPI SVN revision: svn11022007
                Open RTE: 1.3a1svn11022007
   Open RTE SVN revision: svn11022007
                    OPAL: 1.3a1svn11022007
       OPAL SVN revision: svn11022007
                  Prefix: /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: s1471
           Configured by: root
           Configured on: Fri Nov  2 16:20:29 PDT 2007
          Configure host: s1471
                Built by: mostyn
                Built on: Fri Nov  2 16:30:07 PDT 2007
              Built host: s1471
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
         MPI I/O support: yes
            MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.3)
               MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.3)
           MCA paffinity: linux (MCA v1.0, API v1.1, Component v1.3)
            MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.3)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.3)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.3)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.3)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.3)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.1, Component v1.3)
                MCA coll: inter (MCA v1.0, API v1.1, Component v1.3)
                MCA coll: self (MCA v1.0, API v1.1, Component v1.3)
                MCA coll: sm (MCA v1.0, API v1.1, Component v1.3)
                MCA coll: tuned (MCA v1.0, API v1.1, Component v1.3)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.3)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.3)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.3)
                 MCA pml: cm (MCA v1.0, API v1.0, Component v1.3)
                 MCA pml: dr (MCA v1.0, API v1.0, Component v1.3)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.3)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.3)
              MCA rcache: vma (MCA v1.0, API v1.0, Component v1.3)
                 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.3)
                 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.3)
                 MCA btl: udapl (MCA v1.0, API v1.0, Component v1.3)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.3)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.3)
                 MCA osc: rdma (MCA v1.0, API v1.0, Component v1.3)
              MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.3)
              MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.3)
              MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.3)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.3)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.3)
             MCA grpcomm: basic (MCA v1.0, API v2.0, Component v1.3)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.3)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.3)
                  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.3)
                  MCA ns: replica (MCA v1.0, API v2.0, Component v1.3)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA odls: default (MCA v1.0, API v1.3, Component v1.3)
                  MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.3)
                  MCA ras: localhost (MCA v1.0, API v1.3, Component v1.3)
                 MCA ras: slurm (MCA v1.0, API v1.3, Component v1.3)
                  MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.3)
                 MCA rds: proxy (MCA v1.0, API v1.3, Component v1.3)
                MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.3)
                MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.3)
                MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.3)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.3)
              MCA routed: tree (MCA v1.0, API v1.0, Component v1.3)
              MCA routed: unity (MCA v1.0, API v1.0, Component v1.3)
                 MCA pls: proxy (MCA v1.0, API v1.3, Component v1.3)
                 MCA pls: rsh (MCA v1.0, API v1.3, Component v1.3)
                 MCA pls: slurm (MCA v1.0, API v1.3, Component v1.3)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.3)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.3)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.3)
                  MCA sds: singleton (MCA v1.0, API v1.0, Component v1.3)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.3)
               MCA filem: rsh (MCA v1.0, API v1.0, Component v1.3)


Regards,
Mostyn


On Tue, 6 Nov 2007, Andrew Friedley wrote:

All thread support is disabled by default in Open MPI; the uDAPL BTL is
not thread safe, nor does it make use of a threaded uDAPL implementation.
For completeness, the thread support is controlled by the
--enable-mpi-threads and --enable-progress-threads options to the
configure script.
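
For example, a minimal sketch of a configure invocation that forces both
off explicitly (they already default to off, so this is belt-and-braces;
add your usual --prefix and compiler options as needed):

./configure --disable-mpi-threads --disable-progress-threads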

The reference you're seeing to libpthread.so.0 is a side effect of the
way we print backtraces when crashes occur and can be ignored.

How exactly does your MPI program fail?  Make sure you take a look at
http://www.open-mpi.org/community/help/ and provide all relevant
information.

Andrew

Mostyn Lewis wrote:
I'm trying to build a udapl OpenMPI from last Friday's SVN, using
Qlogic/QuickSilver/SilverStorm 4.1.0.0.1 software. I can get it built,
and it works within a single machine. With IB between 2 machines it fails
near the termination of a job. Qlogic says they don't have a threaded
udapl (libpthread is in the traceback).

How do you (can you?) configure pthreads away altogether?

Mostyn