On Thu, Feb 4, 2010 at 7:13 PM, Ira Weiny <wei...@llnl.gov> wrote:
> On Thu, 4 Feb 2010 15:01:32 -0500
> Hal Rosenstock <hal.rosenst...@gmail.com> wrote:
>
>> On Thu, Feb 4, 2010 at 1:00 PM, Ira Weiny <wei...@llnl.gov> wrote:
>> > On Thu, 4 Feb 2010 09:19:39 -0500
>> > Hal Rosenstock <hal.rosenst...@gmail.com> wrote:
>> >
>> >> On Tue, Feb 2, 2010 at 7:45 PM, Ira Weiny <wei...@llnl.gov> wrote:
>> >> > Sasha,
>> >> >
>
> [snip]
>
>> >> >
>> >> > real    0m2.249s
>> >> > user    0m1.244s
>> >> > sys     0m0.936s
>> >> >
>> >> > 14:40:59 > time ./ibnetdiscover -o 4 --node-name-map /etc/opensm/ib-node-name-map -g > new
>> >> >
>> >> > real    0m2.170s
>> >> > user    0m1.160s
>> >> > sys     0m0.933s
>> >> >
>> >> > 14:41:10 > /usr/sbin/ibqueryerrors  -s RcvErrors,SymbolErrors,RcvSwRelayErrors,XmtWait -r --data
>> >> > Suppressing: RcvErrors SymbolErrors RcvSwRelayErrors XmtWait
>> >> > Errors for 0x66a00d90006fb "SW19"
>> >> >   GUID 0x66a00d90006fb port 9: [VL15Dropped == 3] [XmtData == 25187379] [RcvData == 25196688] [XmtPkts == 349861] [RcvPkts == 349954]
>> >> >       Link info:    139   9[  ] ==( 4X 5.0 Gbps Active/  LinkUp)==>  0x0002c9030001d736    864    1[  ] "hyperion1" ( )
>> >> >
>> >> > Note that there were no additional VL15Dropped packets on the fabric.  
>> >> > I think 4 seems to be a good compromise.  I have not tested when there 
>> >> > are errors on the fabric.  (Right now things seem to be good!)
>> >>
>> >> Is this just with the SM doing light sweeping ?
>> >
>> > Yes.
>>
>> That's not a lot of SMP stress from the SM side. SMP consumers are SM,
>> diags, and the unsolicited traps.
>
> Agreed.  I hope to test this more next week.
>>
>> >
>> >>
>> >> Is there a speedup with 4 rather than 2 ?
>> >
>> > There is a bit of a speedup (~0.5 to 1.0 sec).  But my main reason for
>> > wanting to go to 4 is that if there are issues on the fabric (unresponsive
>> > nodes, etc.), 4 will give us better parallelism to get around them.  I have
>> > not had the chance to test this condition with the new algorithm, but the
>> > original ibnetdiscover would slow way down when there were nodes with
>> > unresponsive SMAs.  With only 2 outstanding we will not get much of a
>> > speedup.  This was my main motivation for improving the library in this way.
>> >
>> > Also, I think you are correct that we should increase OpenSM's default
>> > from 4 to 8, for the same reason as above.  Some of our clusters have
>> > worked better with 8 when we were having issues.  But right now we are
>> > still running with 4.
>>
>> I'm concerned about just increasing ibnetdiscover to 4 rather than 2.
>> I've seen a number of clusters dropping SMPs even with the current
>> lower defaults.
>
> So OpenSM is seeing dropped packets?

OpenSM is seeing timeouts and there are VL15 drops in the subnet.

> With 4 SMPs on the wire?

Yes.

> I do see some
> VL15Dropped errors (maybe 2-3 a day) but I did not think that would be an
> issue.  What kind of rate are you seeing?

> The other question is: do people regularly run the tools which use
> libibnetdisc (ibqueryerrors, iblinkinfo, ibnetdiscover)?

These tools are being used (at least ibnetdiscover and ibqueryerrors).

> We do.  If others
> are not, then I would say this change would have less impact, as they would want
> the diags to have some priority for debugging.  The other option is to change
> the patch to default to 2 and allow the user to change it depending on what
> they are trying to do.  If you think that is best I will change the patch.

FWIW I think 2 is better until we have more exhaustive experience with
4. The other alternative would be to make it 4 and then see if people
start noticing (more) VL15 drops and possibly other issues.
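
As a rough illustration of the speed side of this tradeoff, here is a toy
model (not a measurement, and not libibnetdisc code). It assumes each query
occupies one "slot" until it is answered or times out, that queries are
independent of each other (real discovery has to reach a switch before it can
query that switch's neighbors), and that the node counts and timings below
are made-up numbers. It says nothing about VL15 drops, which is the cost side
of the question.

```python
import heapq


def discovery_time(durations_ms, max_outstanding=2):
    """Return the total time (ms) to finish all queries when at most
    `max_outstanding` of them may be in flight at once.

    `durations_ms` holds the time each query keeps its slot busy: a short
    round trip for a responsive node, or the full timeout-and-retry period
    for an unresponsive one. All values are hypothetical.
    """
    if max_outstanding < 1:
        raise ValueError("max_outstanding must be at least 1")

    # Min-heap of the times at which each slot becomes free again.
    slots = [0] * max_outstanding
    heapq.heapify(slots)

    finish = 0
    for duration in durations_ms:
        start = heapq.heappop(slots)   # earliest free slot takes the query
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


# Hypothetical fabric: 4 unresponsive nodes that each tie up a slot for
# 3000 ms, followed by 96 responsive nodes that each answer in 1 ms.
durations = [3000] * 4 + [1] * 96

for window in (1, 2, 4, 8):
    print(window, discovery_time(durations, window))
# 1 12096
# 2 6048
# 4 3024
# 8 3000
```

In this model, going from 2 to 4 outstanding roughly halves the time lost to
the unresponsive nodes, while going beyond the number of unresponsive nodes
gains almost nothing. Whether the extra SMP load is acceptable on a fabric
that is already dropping VL15 packets is a separate question that the model
does not capture.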

-- Hal

> Ira
>
>>
>> -- Hal
>>
>> > Ira
>> >
>> >>
>> >> -- Hal
>> >>
>> >> >
>> >> > The first patch converts the algorithm and the second adds the 
>> >> > ibnd_set_max_smps_on_wire call.
>> >> >
>> >> > Let me know what you think.  Because the algorithm changed so much, 
>> >> > testing this is a bit difficult: the order of node discovery is 
>> >> > different.  However, I have done some extensive diffing of the 
>> >> > output of ibnetdiscover and things look good.
>> >> >
>> >> > Ira
>> >> >
>> >> > --
>> >> > Ira Weiny
>> >> > Math Programmer/Computer Scientist
>> >> > Lawrence Livermore National Lab
>> >> > 925-423-8008
>> >> > wei...@llnl.gov
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> >> > the body of a message to majord...@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >
>> >>
>> >
>> >
>> >
>>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
