On Tue, Sep 30, 2008 at 6:51 AM, Ramiro Alba Queipo <[EMAIL PROTECTED]> wrote:
> Hello everybody:
>
> We have just started to run a 22-node InfiniBand cluster (44 in a couple
> of months) under Ubuntu 8.04, and after carefully reading and testing the
> OFED 1.3.1 diagnostics packages (ibutils and infiniband-diags), I have
> got some messages I cannot understand:
>
> * ibdiagnet -o . -t file.topo -s jff -pm
>
> -I---------------------------------------------------
> -I- IPoIB Subnets Check
> -I---------------------------------------------------
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps
>
> What does it mean?

This means your subnet is pure DDR (every member runs at 20 Gbps), so the
IPoIB broadcast group can run at a higher rate than the default 10 Gbps.
The group rate is set in the OpenSM configuration; how you do that differs
slightly depending on which OpenSM version you are using.

> * ibchecknet
>
> #warn: counter RcvSwRelayErrors = 259 (threshold 100) lid 4 port 255
> Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies)
> port all: FAILED
>
> I could see that command 'perfquery -a 255' shows its counters, but:
>
> - What is for?
> - ibqueryerrors.pl -a says
>   RcvSwRelayErrors: This counter can increase due to a valid network
>   event
>   Should I worry by switch ports increasing little by little this
>   counter?
>
> I am using IPoIB

Unfortunately, when running IPoIB, RcvSwRelayErrors needs to be ignored,
because multicasts are counted as looping. A slowly increasing count on
switch ports is therefore expected.

> * ibdiagpath -o . -t file.topo -s jff -n jff201
>
> -I---------------------------------------------------
> -I- QoS on Path Check
> -I---------------------------------------------------
> -W- VLArbTableLow Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
> guid=0x0002c90200279295 dev=25204 port:1
> -W- VLArbTableHigh Entries:6 7 VL > 5 at node:"jff/U1" lid=0x0001
> guid=0x0002c90200279295 dev=25204 port:1
> -W- VLArbTableLow Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
> guid=0x000b8cffff0052cf dev=47396 port:1
> -W- VLArbTableHigh Entries:6 7 VL > 5 at node:"switch-1/U1" lid=0x0004
> guid=0x000b8cffff0052cf dev=47396 port:1
> -W- SLs:6 7 14 15 mapped to VL > 5 at node:"switch-1/U1" lid=0x0004
> guid=0x000b8cffff0052cf dev=47396 in-port:23 out-port:1
> -I- The following SLs can be used:0 1 2 3 4 5 8 9 10 11 12 13
>
> What is the meaning of this messages?

I'm not sure, but it looks like it's complaining about an invalid VL. Can
you run:

smpquery portinfo <lid> 1
smpquery sl2vl <lid> 1
smpquery vlarb <lid> 1

for both of these lids?

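If it saves some typing, the three queries can be wrapped in a small shell function. This is only a sketch: query_lids is a made-up helper name, not part of infiniband-diags, and it assumes the two LIDs in question are the ones from the ibdiagpath output above (lid=0x0001 and lid=0x0004, i.e. 1 and 4) and that port 1 is the port of interest on both.

```shell
# Sketch only: query_lids is a hypothetical helper, not an infiniband-diags tool.
# It runs the three smpquery operations (portinfo, sl2vl, vlarb) against
# port 1 of every LID passed as an argument.
query_lids() {
    for lid in "$@"; do
        for op in portinfo sl2vl vlarb; do
            # Header line so the combined output can be told apart later.
            echo "=== smpquery $op $lid 1 ==="
            smpquery "$op" "$lid" 1
        done
    done
}
```

Something like `query_lids 1 4 > smpquery-output.txt` would then collect all six results in one file that can be attached to a reply.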
-- Hal

> Finally, and not related to diagnostics messages, I have to change
> permissions at
>
> crw-rw---- 1 root rdma 231, 192 2008-09-30 09:19 /dev/infiniband/uverbs0
>
> to be 'rw' to everybody.
>
> Should I add users to 'rdma' group instead?
>
> ---
> Thanks in advance
>
> Regards

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general