Hi Steve,

You're welcome for the suggestion. I offered it because you mentioned adding a couple of new OSS servers and then noticing those entries in the logs. It would help to know where you are seeing the errors: on the new nodes only, or elsewhere as well?

In general, a network with underlying problems seems to work OK at low bandwidth, but errors start to appear as the load increases; hence the suggestion to check the network for problems. A quick check could be made with LNet selftest between two different sets of nodes, where set 1 is nodes that show the problem and set 2 is nodes that do not.
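For example, a minimal lnet_selftest session might look like the sketch below. The NIDs are placeholders (substitute a pair of your own nodes), and the lnet_selftest module needs to be loaded on the console node and on every node under test:

  # on the console node and all test nodes
  modprobe lnet_selftest

  # tag this shell as the owner of the lst session
  export LST_SESSION=$$
  lst new_session bulk_check

  # one group per endpoint; the NIDs below are placeholders
  lst add_group src 10.128.10.29@tcp1    # first node of the pair
  lst add_group dst 10.128.10.40@tcp1    # second node of the pair

  # 1M bulk reads from src to dst, with simple data checking
  lst add_batch bulk
  lst add_test --batch bulk --from src --to dst brw read check=simple size=1M
  lst run bulk

  # watch the rates for both groups; Ctrl-C to stop the display
  lst stat src dst

  lst stop bulk
  lst end_session

Run it once with a pair of set 1 nodes and again with a pair of set 2 nodes. If the set 1 rates are low or erratic while set 2 is clean, that points at the network path rather than Lustre itself. A raw TCP check without LNet (e.g. iperf3 -s on one node, iperf3 -c <that node> from the other) is another quick data point.

Best,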
On Dec 11, 2016 6:05 PM, "Steve Barnet" <bar...@icecube.wisc.edu> wrote:

> Hi Brett,
>
>
> On 12/11/16 4:46 PM, Brett Lee wrote:
>
>> Steve, It might be the network that LNet is running on.  Have you run
>> some bandwidth tests without LNet to check for network problems?
>>
>
>
> It's running over a 10Gb/s Ethernet network that is carrying
> other OSS traffic successfully. No routers or other fancy LNET
> features in play. However, it is quite possible that there are
> issues with the networking on the host side. Definitely on my
> list of things to test out.
>
>   At this point, I'm just trying to narrow the search space.
> I didn't find anything particularly revealing when I searched
> around, so I'm hoping some expert eyes can shine a bit of
> light on the situation.
>
> Thanks for the tip!
>
> Best,
>
> ---Steve
>
>
>> On Dec 11, 2016 3:37 PM, "Steve Barnet" <bar...@icecube.wisc.edu> wrote:
>>
>>     Hi all,
>>
>>       Seeing something very strange. I recently added two OSSes
>>     and 10 OSTs to one of our filesystems. Things look OK under
>>     light loads, but when we load them up, we start seeing lots
>>     of LNet errors.
>>
>>     OS: Scientific Linux 6.7
>>     Lustre - Server: 2.8.0 Community version
>>     Lustre - Client: 2.5.3
>>
>>     The errors are below. Do these narrow the range of possible
>>     problems?
>>
>>
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LNetError:
>>     7732:0:(socklnd_cb.c:2509:ksocknal_check_peer_timeouts()) Total 4
>>     stale ZC_REQs for peer 10.128.10.29@tcp1 detected; the
>>     oldest(ffff880f6a90e000) timed out 7 secs ago, resid: 0, wmem: 0
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>>     -5, desc ffff8805379f8000
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>>     -5, desc ffff880f375dc000
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>>     8234:0:(ldlm_lib.c:3175:target_bulk_io()) @@@ network error on bulk
>>     READ  req@ffff880e506263c0 x1551187318090340/t0(0)
>>     o3->092e941d-272a-09e3-502b-9338dbf387d3@10.128.10.29@tcp1:587/0
>>     lens 488/432 e 3 to 0 dl 1481476687 ref 1 fl Interpret:/0/0 rc 0/0
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>>     8234:0:(ldlm_lib.c:3175:target_bulk_io()) Skipped 1 previous similar
>>     message
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: Lustre: lfs2-OST0024: Bulk IO
>>     read error with 092e941d-272a-09e3-502b-9338dbf387d3 (at
>>     10.128.10.29@tcp1), client will retry: rc -110
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>>     -5, desc ffff8804db0ce000
>>     Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>>     7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>>     -5, desc ffff880aa4374000
>>
>>
>>     Thanks much!
>>
>>     Best,
>>
>>     ---Steve
>>
>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
