Re: [Lustre-discuss] Questions about the LNET routing

2015-02-20 Thread Rick Wagner
Teng,

It is the mode of your LACP that determines which physical interface the packets 
travel over, which can be configured to hash on client IP and port. Each client 
will open 3 TCP sockets for Lustre traffic to each server, and given a 
reasonable number of clients these will balance over interfaces in the link 
aggregation group. If you have bonded interfaces on the clients, a similar 
thing will happen as they connect to multiple servers. The caveat is that 
performance can be impacted by the NUMA architecture of your server. Basically, 
it's better to have both NICs attached to the same CPU.
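
As a rough illustration (syntax varies by distribution, and the file path and 
interface names here are placeholders), an 802.3ad bond that hashes on IP 
address and port would carry something like:

  # e.g. in /etc/sysconfig/network-scripts/ifcfg-bond0
  BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"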

--Rick


From: lustre-discuss-boun...@lists.lustre.org 
[lustre-discuss-boun...@lists.lustre.org] on behalf of teng wang 
[tzw0...@gmail.com]
Sent: Friday, February 20, 2015 1:40 PM
To: Andrus, Brian Contractor; lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Questions about the LNET routing


Hi Andrus, thanks for your answer.

Without bonding, is there any preference for LNET to route over the two 
interfaces? Even when we bond the two interfaces together, I think LNET should 
still choose between the different interfaces, although they share the same 
address. Is there any preference in this situation?

Thanks,
Teng

On Thu, Feb 19, 2015 at 3:55 PM, Andrus, Brian Contractor 
<bdand...@nps.edu> wrote:
Teng,

I believe it would depend on how you have your interfaces configured.

It seems that you have them both on the same subnet and being accessed by the 
same client.
Is this the case?

If they are on the same subnet, I would expect you would bond them (bond0) 
rather than have two separate IPs for them. Then you get to control how/where 
the data flows at the networking level.

You may want to check on the nodes to see what they see. (lctl peer_list)

If they are on different subnets or networks, you can set that in the options 
for the lnet module.
For example, we have both ib and tcp. I give ib priority for the best 
performance, but if ib is unavailable, it falls back to tcp. That could just as 
easily be two Ethernet cards on two networks.
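
As a sketch only (interface names are placeholders, and this just declares the 
networks rather than expressing the preference itself), the module options take 
a form like:

  # e.g. in /etc/modprobe.d/lustre.conf
  options lnet networks="o2ib0(ib0),tcp0(eth0)"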


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of teng wang
Sent: Thursday, February 19, 2015 1:28 PM
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Questions about the LNET routing

I have a basic question about LNET.  Will data belonging
to the same object be routed through the same interface? For example, suppose
a node has multiple network interfaces and two processes are
running on the same node writing to the same shared file, striped
across 1 OST.
Process 1 writes like:
write chunk1
write chunk2

Process 2 writes like:
write chunk3
write chunk4
If Process 1 and Process 2 are pinned to two different network interfaces,
say Eth0 and Eth1, then from the OSC side, will these chunks be routed
to the OST from the same interface (e.g., all four chunks through
Eth0)?  If so, what if they write different objects that go to the same
OST (e.g., Process 1 writes File1, Process 2 writes File2, and File1 and File2 are
striped over the same OST)?

Thanks,
Teng

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Questions about the LNET routing

2015-02-23 Thread Rick Wagner
Teng,

In our case, the bonded NICs are on the server. The issue is that a task 
handling traffic through a bonded interface can't tell which physical interface 
the data is going through, and can't be guaranteed to run on the NUMA node 
handling the interface. This can cause more traffic between each NUMA node 
(socket) than desired, which drives up latency. Take a look at this ticket [1], 
in particular the slides Liang posted [2,3]. This issue applies to network 
interfaces and IO controllers like HBAs and RAID controllers.
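
A quick way to see which NUMA node a NIC sits on (the interface name is a 
placeholder; -1 means no affinity is reported):

  cat /sys/class/net/eth0/device/numa_node
  numactl --hardware   # shows which CPUs belong to each node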

--Rick

[1] https://jira.hpdd.intel.com/browse/LU-6228
[2] Lustre 2.0 and NUMIOA architectures: 
http://cdn.opensfs.org/wp-content/uploads/2012/12/900-930_Diego_Moreno_LUG_Bull_2011.pdf
[3] High Performance I/O with NUMA Systems in Linux: 
http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf


From: teng wang [tzw0...@gmail.com]
Sent: Monday, February 23, 2015 8:38 AM
To: Rick Wagner
Cc: Andrus, Brian Contractor; lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Questions about the LNET routing


Rick,

Thanks for your answer.  Could you explain more about the NUMA architecture? 
The two NICs attached to the same CPU you mentioned are on the client side or 
the server side? How is the performance impacted by the NUMA architecture, 
given that the client can balance traffic across the interfaces?

Thanks,
Teng


Re: [lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

2015-05-05 Thread Rick Wagner


On May 5, 2015, at 7:16 AM, Wolfgang Baudler  wrote:

>> The Livermore folks leading this effort can correct me if I misspeak, but
>> they (Specifically Brian Behlendorf) presented on this topic at the
>> Developers' Day at LUG 2015 (no video of the DD talks, sorry).
>> 
>> From his discussion, the issues have been identified, but the fixes are
>> between six months and two years away, and may still not fully close the
>> gap.  It'll be a bit yet.
>> 
>> - Patrick
> 
> So, these performance issues are specific to Lustre using ZFS or is it
> problems with ZFS on Linux in general?

It's Lustre on ZFS, especially for metadata operations that create, modify, or 
remove inodes. Native ZFS metadata operations are much faster than what Lustre 
on ZFS is currently providing. That said, we've gone with a ZFS-based patchless 
MDS, since read operations have always been more critical for us, and our 
performance is more than adequate.

--Rick

> 
> Wolfgang
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDT 100% full

2016-07-26 Thread Rick Wagner
Hi Brian,

On Jul 26, 2016, at 5:45 PM, Andrus, Brian Contractor 
<bdand...@nps.edu> wrote:

All,

Ok, I thought 100GB would be sufficient for an MDT.
I have 2 MDTs as well, BUT…

MDT0 is 100% full and now I cannot write anything to my lustre filesystem.
The MDT is on a ZFS backing filesystem.

So, what is the proper way to grow my MDT using ZFS? Do I need to shut the 
filesystem down completely? Can I just add a disk or space to the pool and 
Lustre will see it?

Any advice or direction is appreciated.

We just did this successfully on the two MDTs backing one of our Lustre file 
systems and everything happened at the ZFS layer. We added drives to the pool 
and Lustre immediately saw the additional capacity. Whether you take down the 
file system or do it live is a question of your architecture, skills, and 
confidence. Having a test file system is also worthwhile to go over the steps.
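
For reference, the ZFS side of that is just a zpool add; the pool and device 
names below are placeholders, and the new vdev should match the redundancy of 
the existing ones:

  zpool add mdtpool mirror /dev/disk/by-id/<newdiskA> /dev/disk/by-id/<newdiskB>
  zpool list mdtpool   # the MDT grows as soon as the pool does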

--Rick




Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mounting Lustre over IB-to-Ethernet gateway

2016-08-01 Thread Rick Wagner
Hi Kevin,

You’ll definitely need to have tcp on the server interface used for the clients 
accessing Lustre over the gateway (really an InfiniBand-Ethernet bridge). I 
don’t know of any problem having both protocols (tcp and o2ib) on the same 
interface, but you could also add an alias or child interface (e.g., ib0:0 or 
ib0.8001) if you wanted to keep things separated.
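
Roughly, the server-side module options might then look like the following 
(a sketch only; substitute the alias if you go that route):

  options lnet networks="o2ib0(ib0),tcp0(ib0)"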

—Rick

On Aug 1, 2016, at 4:05 AM, Kevin M. Hildebrand 
<ke...@umd.edu> wrote:

Our Lustre filesystem is currently set up to use the o2ib interface only- all 
of the servers have
options lnet networks=o2ib0(ib0)

We've just added a Mellanox IB-to-Ethernet gateway and would like to be able to 
have clients on the Ethernet side also mount Lustre.  The gateway extends the 
same layer-2 IP range that's being used for IPoIB out to the Ethernet clients.

How should I go about doing this?  Since the clients don't have IB, it doesn't 
appear that I can use o2ib0 to mount.  Do I need to add another lnet network on 
the servers?  Something like
options lnet networks=o2ib0(ib0),tcp0(ib0)?  Can I have both protocols on the 
same interface?
And if I do have to add another lnet network, is there any way to do so without 
restarting the servers?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
Division of IT

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



[lustre-discuss] OpenSFS Handbook - Now Available

2017-05-09 Thread Rick Wagner
Hello Lustre Community, 
 
As a follow-up to an earlier message, we wanted to thank everyone who 
contributed to the review and feedback of the OpenSFS Handbook. The feedback 
received has been incorporated and the first revision of the OpenSFS Handbook 
is now published on the OpenSFS website. This document will be revised 
periodically as needed or as feedback is received. 
 
Best regards,
OpenSFS Administration 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] KMOD vs DKMS

2017-07-18 Thread Rick Wagner
Hi Brian,

I consider build processes somewhat fragile, especially when you expect to get 
the same results across a large number of hosts, like a set of Lustre servers. 
As a result, I favor building a single set of RPMs, testing them, and then 
pushing an update to the production servers. So count me in the kmod camp.
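
As a very rough sketch of that workflow (configure flags and paths are 
placeholders, not a recipe):

  # on a build host matching the target kernel
  ./configure --enable-server --with-zfs=/usr/src/zfs
  make rpms
  # test the resulting RPMs, then push them to production via your repo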

In the case of a small number of NFS servers, I might go with dkms for 
convenience.

—Rick

> On Jul 18, 2017, at 10:22 AM, Brian Andrus  wrote:
> 
> All,
> 
> I have been watching some of the discussions/issues folks have with building 
> lustre and I am wondering what the consensus is on the two approaches.
> 
> Myself, I have been building my own RPMs for some time and it seemed to me 
> that the general direction of linux was to move toward kmod and away from 
> dkms, so I redesigned my build scripts to use zfs/kmod and dropped ldiskfs. 
> Certainly, this has made life easier when there are kernel updates :)
> 
> So if there is a choice between the two, what is preferred and why?
> 
> Hopefully this doesn't start a war or anything...
> 
> Brian Andrus
> Firstspot, Inc.
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Announce: Lustre Systems Administration Guide

2017-11-18 Thread Rick Wagner
Marcin,

Thanks for sketching out the mechanisms we could use to help ensure the quality 
and accuracy of the documentation. If someone in the community is willing to 
work on any or all of these items, I will ask the OpenSFS board to cover the 
costs of any CI/CD cloud services that are needed. I would rather see the 
expertise of those doing the work targeted at improving Lustre than at hosting 
services that are available for modest costs.

—Rick

> On Nov 18, 2017, at 2:47 AM, Marcin Dulak  wrote:
> 
> 
> 
> On Sat, Nov 18, 2017 at 4:20 AM, Stu Midgley wrote:
> Thank you both for the documentation.  I know how hard it is to maintain. 
> 
> I've asked all my admin staff to read it - even if some of it doesn't 
> directly apply to our environment.
> 
> What we would like is well organised, comprehensive, accurate and up to date 
> documentation.  Most of the time when I dive into the manual, or other online 
> material, I find it isn't quite right (paths slightly wrong or outdated 
> etc.).  I also have difficulty finding all the information I want in a single 
> location and in a logical fashion.  These aren't new issues and they blight all 
> documentation, but having the definitive source in a wiki might open it up to 
> more transparency, greater use and thus, ultimately, being kept up to date, 
> even if it's by others outside Intel.
> 
> Documentation should be treated in the same way as code, i.e. automatically 
> tested. This is not a new idea 
> (https://en.wikipedia.org/wiki/Software_documentation#Literate_programming), 
> and with the access to various kinds of virtualization this is feasible now.
> There are Python projects 
> (https://gitlab.com/ase/ase/tree/master/doc/tutorials) that make use of 
> this idea thanks to http://www.sphinx-doc.org, 
> which allows one to execute embedded Python commands 
> during the process of building the documentation in html or pdf formats out 
> of rst (restructured text) files. 
> There is a system that stores LFS (Linux From Scratch) in an xml format for 
> extraction to be executed (http://www.linuxfromscratch.org/alfs/, 
> https://github.com/ojab/jhalfs), but it seems not to be under continuous 
> automatic testing. 
> However, projects like https://docs.openstack.org/install-guide/ 
> surprisingly do not use this idea, 
> and it takes months to correct a small inconsistency in the documentation: 
> https://bugs.launchpad.net/keystone/+bug/1698455
> 
> It is not very difficult to create a virtual setup consisting of several 
> lustre servers in an unattended way 
> (https://github.com/marcindulak/vagrant-lustre-tutorial-centos6) and use that 
> to test the lustre documentation. 
> An alternative to making the lustre documentation executable would be to 
> abstract the basics of lustre using a supported configuration management 
> system (is there any progress 
> about https://www.youtube.com/watch?v=WX00LQLYf2w ?) and test that using the 
> standard CI tools.
> 
> Cheers
> 
> Marcin
>  
> 
> I'd also like a section where people can post their experiences and 
> solutions.  For example, in recent times, we have battled bad interactions 
> with ZFS+lustre which led to poor performance and ZFS corruption.  While we 
> have now tuned both lustre and zfs and the bugs have mostly been fixed, the 
> lessons learned, troubleshooting methods, etc. should be preserved and might 
> help others in the future diagnose tricky problems.
> 
>  
> 
> That's my 5c.
> 
> 
> 
> On Sat, Nov 18, 2017 at 6:03 AM, Dilger, Andreas wrote:
> On Nov 16, 2017, at 22:41, Cowe, Malcolm J wrote:
> >
> > I am pleased to announce the availability of a new systems administration 
> > guide for the Lustre file system, which has been published to 
> > wiki.lustre.org. The content can be accessed 
> > directly from the front page of the wiki, or from the following URL:
> >
> > http://wiki.lustre.org/Category:Lustre_Systems_Administration 
> > 
> >
> > The guide is intended to provide comprehensive instructions for the 
> > installation and configuration of production-ready Lustre storage clusters. 
> > Topics covered:
> >
> >   • Introduction to Lustre
> >   • Lustre File System Components
> >   • Lustre Software Installation
> >   • Lustre Networking (LNet)
> >   • LNet Router Configuration
> >   • Lustre Object Storage Devices (OSDs)
> >   • Creating Lustre Fil

Re: [lustre-discuss] strange time of reading for large file

2017-11-23 Thread Rick Wagner
Hi Rosana,

Without knowing anything about your setup or test, my first question would be 
whether you accounted for caching between your read tests? That can occur at 
several layers within the environment, including the client, server, and 
underlying storage hardware. This is naturally a benefit when in production, 
but needs to understand during performance testing. And I’m assuming that your 
file system was not otherwise utilized during your tests. Another assumption 
would be that the file system was mostly empty so that file fragmentation and 
disk seeks weren’t a problem.
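
For example, one common step between read runs is to flush the client's page 
cache (root required; note this only covers the client side, not the servers 
or the storage controllers):

  sync
  echo 3 > /proc/sys/vm/drop_caches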

—Rick

> On Nov 23, 2017, at 10:52 AM, Rosana Guimaraes Ribeiro 
>  wrote:
> 
> Hi,
> 
> I have some doubts about Lustre, I already sent my issues to forums but no 
> one answer me.
> In our application, during the performance testing on lustre 2.4.2 we got 
> times of reading and writing to test I/O operations with a file of almost 
> 400GB. 
> Running this application many times consecutively, we see that in write 
> operations the I/O time remains in the same range, but in read operations 
> there is a huge difference in time. As you can see below:
> Write time [sec]:
> 325.77
> 318.80
> 325.44
> 458.54
> 316.89
> 327.75
> 344.90
> 340.34
> 383.57
> 316.35
> Read time [sec]:
> 570.48
> 601.11
> 447.14
> 406.39
> 480.44
> 5824.40
> 299.40
> 293.54
> 1049.93
> 4190.47
> We ran on the single client with 1 process and tested on same infrastructure 
> (hardware and network).
> Could you explain why is reading time so distorted? What kind of problem 
> might be occurring?
> 
> Regards,
> Rosana
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org 
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org 
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

2011-07-11 Thread Rick Wagner
Hi,

We are seeing intermittent client evictions from a new Lustre installation that 
we are testing. The errors occur on writes from a parallel job running on 32 client 
nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks 
total). Occasionally, one node will be evicted from an OST, and the code 
running on the client will experience an IO error.

The directory with the data has a stripe count of 1, and a comparable amount is 
read in at the start of the job. Sometimes the evictions occur the first time a 
write is attempted, sometimes after a successful write. There are about 15 
minutes before the first and between subsequent write attempts.

The client and server errors are attached. In the server errors, 
XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, 
here are the server names to match with the NIDS:
  lustre-oss-0-2: 172.25.33.248
  lustre-oss-2-0: 172.25.33.246
  lustre-oss-2-2: 172.25.32.118
I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the 
error codes from errno.h are being used.

We've been experiencing similar problems for a while, and we've never seen IP 
traffic have a problem. But, clients will begin to have trouble communicating 
with the Lustre server (seen because an LNET ping will return an I/O error), 
and things will only recover when an LNET ping is performed from the server to 
the client NID.
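
For reference, the LNET ping mentioned here is lctl ping against the peer NID, 
for example (using the addresses above):

  lctl ping 172.25.33.248@tcp      # from the client to lustre-oss-0-2
  lctl ping XXX.XXX.118.141@tcp    # from a server back to the client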

The filesystem is in testing, so there is no other load on it, and when 
watching the load during writes, the OSS machines hardly notice. The servers 
are running version 1.8.5, and the client 1.8.4.

Any advice, or pointers to possible bugs would be appreciated.

Thanks,
Rick

Jul  6 15:07:45 lustre-oss-2-2.local kernel: LustreError: 
23681:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc 
8105256b3c00 
Jul  6 15:07:45 lustre-oss-2-2.local kernel: LustreError: 
24049:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 
0(40960)  req@8103e1f1dc00 x1372441561093199/t0 
o4->2da500e9-f52c-3978-ce0e-be4518714347@NET_0x2c6ca768d_UUID:0/0 lens 
464/416 e 0 to 0 dl 1309990076 ref 1 fl Interpret:/0/0 rc 0/0 
Jul  6 15:07:45 lustre-oss-2-2.local kernel: LustreError: 
23667:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc 
8101d728 
Jul  6 15:07:45 lustre-oss-2-2.local kernel: LustreError: 
24095:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 
0(1048576)  req@810147243c00 x1372437788293191/t0 
o4->d10b9ac8-f4d2-637c-c3a8-cdccfd5bf07d@NET_0x2c6ca7662_UUID:0/0 lens 
448/416 e 0 to 0 dl 1309990071 ref 1 fl Interpret:/0/0 rc 0/0 
Jul  6 15:07:45 lustre-oss-2-0.local kernel: LustreError: 
24091:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc 
81043032a000 
Jul  6 15:07:45 lustre-oss-2-0.local kernel: LustreError: 
24511:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 
0(1048576)  req@8103b380f400 x1372441561093193/t0 
o4->2da500e9-f52c-3978-ce0e-be4518714347@NET_0x2c6ca768d_UUID:0/0 lens 
464/416 e 0 to 0 dl 1309990071 ref 1 fl Interpret:/0/0 rc 0/0 
Jul  6 15:07:45 lustre-oss-0-2.local kernel: LustreError: 
10295:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc 
81050cf7 
Jul  6 15:07:45 lustre-oss-0-2.local kernel: LustreError: 
10677:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 
0(1048576)  req@8105469cc800 x1372441561093196/t0 
o4->2da500e9-f52c-3978-ce0e-be4518714347@NET_0x2c6ca768d_UUID:0/0 lens 
448/416 e 0 to 0 dl 1309990072 ref 1 fl Interpret:/0/0 rc 0/0 
Jul  6 15:07:52 lustre-oss-2-2.local kernel: LustreError: 
23922:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  
req@810268226c00 x1372437788293751/t0 
o8->d10b9ac8-f4d2-637c-c3a8-cdccfd5bf07d@NET_0x2c6ca7662_UUID:0/0 lens 
368/264 e 0 to 0 dl 1309990172 ref 1 fl Interpret:/0/0 rc -16/0 
Jul  6 15:07:52 lustre-oss-0-2.local kernel: LustreError: 
29773:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  
req@8101a7e73450 x1372441561093716/t0 
o8->2da500e9-f52c-3978-ce0e-be4518714347@NET_0x2c6ca768d_UUID:0/0 lens 
368/264 e 0 to 0 dl 1309990172 ref 1 fl Interpret:/0/0 rc -16/0 
Jul  6 15:07:52 lustre-oss-2-2.local kernel: LustreError: 
24118:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET  
req@8101d5214000 x1372437788293198/t0 
o4->d10b9ac8-f4d2-637c-c3a8-cdccfd5bf07d@NET_0x2c6ca7662_UUID:0/0 lens 
448/416 e 1 to 0 dl 1309990097 ref 1 fl Interpret:/0/0 rc 0/0 
Jul  6 15:07:52 lustre-oss-2-2.local kernel: LustreError: 
24118:0:(ost_handler.c:1064:ost_brw_write()) Skipped 2 previous similar 
messages 
Jul  6 15:07:52 lustre-oss-0-2.local kernel: LustreError: 138-a: 
phase1-OST0009: A client on nid XXX.XXX.118.141@tcp was evicted due to a lock 
blocking callback to XXX.XXX.118.141@tcp timed out: rc -107 
Jul  6 15:07:52 lustre-oss-0-2.local kernel: LustreError: 
10636:0:(ldlm_l

Re: [Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

2011-07-12 Thread Rick Wagner
Hi Kevin,

Thanks very much for the reply; answers to your questions are below.

On Jul 12, 2011, at 9:10 AM, Kevin Van Maren wrote:

> Rick Wagner wrote:
>> Hi,
>> 
>> We are seeing intermittent client evictions from a new Lustre installation 
>> that we are testing. The errors on writes from a parallel job running on 32 
>> client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 
>> tasks total). Occasionally, one nodes will be evicted from an OST, and the 
>> code running on the client will experience an IO error.
>>  
> 
> Yes, evictions are very bad.  Worse than an IO error, however, is the 
> knowledge that a write that previously "succeeded" never made it out of the 
> client cache to disk (eviction forces client to drop any dirty cache on the 
> floor).
> 
>> The directory with the data has a stripe count of 1, and a comparable amount 
>> is read in at the start of the job. Sometimes the evictions occur the first 
>> time a write is attempted, sometimes after a successful write. There is 
>> about 15 minutes before the first and subsequent write attempts.
>>  
> 
> So you have 512 processes on 32 nodes writing to a single file, which exists 
> on a single OST.

No, each task is writing its own HDF5 file of ~40MB; the total amount of data 
per write is 20GB. This avoids the need for synchronizing writes to a single 
file.

> Have you tuned any of the network or Lustre tunables?  For example, 
> max_dirty_mb, max_rpcs_in_flight?  socket buffer sizes?

For network tuning, this is what we have on the clients and servers:

net.core.somaxconn = 1
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 25
net.ipv4.tcp_congestion_control = htcp
net.ipv4.ip_forward = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1

On the Lustre side, max_rpcs_in_flight = 8, max_dirty_mb = 32. We don't have 
as much experience tuning Lustre, so we've tended to use the defaults.
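
For reference, those can be read on a client with lctl, e.g.:

  lctl get_param osc.*.max_rpcs_in_flight
  lctl get_param osc.*.max_dirty_mb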


> What size are the RPCs, application IO sizes?

As I mentioned above, each task is writing a single file, in 5 consecutive 8MB 
chunks. There are a few other files written by the master tasks, but the 
failure hasn't occurred on that particular node (yet).

>> The client and server errors are attached. In the server errors, 
>> XXX.XXX.118.141 refers to the client that gets evicted. In the client 
>> errors, here are the server names to match with the NIDS:
>>  lustre-oss-0-2: 172.25.33.248
>>  lustre-oss-2-0: 172.25.33.246
>>  lustre-oss-2-2: 172.25.32.118
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the 
>> error codes from errno.h are being used.
>> 
>> We've been experiencing similar problems for a while, and we've never seen 
>> IP traffic have a problem. 
> 
> You are using gigabit Ethernet for Lustre?

The servers are using bonded Myricom 10 Gb/s cards. On the client side, the nodes 
have Mellanox QDR InfiniBand HCAs, but we use a Mellanox BridgeX BX 4010, and 
the clients have virtual 10 Gb/s NICs. Hence the use of the tcp driver. We do 
have a problem with setting the MTU on the client side, so currently the 
servers are using an MTU of 9000 and the clients 1500, which means more work 
for the central 10 Gb/s switch and the bridge.

> These errors are indicating issues with IP traffic.  When you say you have 
> never seen IP traffic have a problem, you mean "ssh" and "ping" work, or have 
> you stress-tested the network outside Lustre (run network tests from 32 
> clients to a single server)?

You're right about how I defined IP functionality being available, but that's a 
good point about stressing the fabric. We've run simultaneous iperf tests, but 
only until we reach a desired bandwidth. Given our goals, I think the multiple 
tests will be necessary.
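
A rough sketch of such a test (the OSS address is a placeholder): run iperf -s 
on one server, then from many clients simultaneously:

  iperf -c <oss-address> -P 4 -t 60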

>> But, clients will begin to have trouble communicating with the Lustre server 
>> (seen because an LNET ping will return an I/O error), and things will only 
>> recover when an LNET ping is performed from the server to the client NID.
>> 
>> The filesystem is in testing, so there is no other load on it, and when 
>> watching the load during writes, the OSS machines hardly notice. The servers 
>> are running version 1.8.5, and the client 1.8.4.
>> 
>> Any advice, or pointers to possible bugs would be appreciated.
>>  
> 
> You have provided no information about your network (NICs/drivers, switches, 
> MTU, settings, etc), but it sounds like you are having network issues, which 
> are exhibiting themselves under load.  It is possible a NI

Re: [Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

2011-07-12 Thread Rick Wagner
On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:

> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>> Hi,
>> ..
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the 
>> error codes from errno.h are being used.
>> 
>> We've been experiencing similar problems for a while, and we've never seen 
>> IP traffic have a problem. But, clients will begin to have trouble 
>> communicating with the Lustre server (seen because an LNET ping will return 
>> an I/O error), and things will only recover when an LNET ping is performed 
>> from the server to the client NID.
> 
> I'd suggest to enable console logging of network errors, by 'echo
> +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages
> should show up in 'dmesg' when you have LNET connectivity problems.

Thanks, Isaac, I have put that in place. We have that in the sysctl 
configuration, as part of lnet.debug, and thought that was sufficient. But so 
far, dmesg and /var/log/messages have looked very similar.

[root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
warning error emerg console
[root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
lnet.debug = ioctl neterror net warning error emerg ha config console

--Rick


> 
> - Isaac
> 

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

2011-07-12 Thread Rick Wagner

On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:

> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>> 
>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>> Hi,
>>>> ..
>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the 
>>>> error codes from errno.h are being used.
>>>> 
>>>> We've been experiencing similar problems for a while, and we've never seen 
>>>> IP traffic have a problem. But, clients will begin to have trouble 
>>>> communicating with the Lustre server (seen because an LNET ping will 
>>>> return an I/O error), and things will only recover when an LNET ping is 
>>>> performed from the server to the client NID.
>>> 
>>> I'd suggest to enable console logging of network errors, by 'echo
>>> +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages
>>> should show up in 'dmesg' when you have LNET connectivity problems.
>> 
>> Thanks, Isaac, I have put that in place. We have that in the sysctl 
>> configuration, as part of lnet.debug, and thought that was sufficient. But 
>> so far, dmesg and /var/log/messages have looked very similar.
>> 
>> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
>> warning error emerg console
> 
> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk'
> output after 'echo +neterror > /proc/sys/lnet/printk', otherwise
> it's a bug. This is different from lnet.debug.

I am, sorry if the previous post was misleading.
[root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
neterror warning error emerg console
[root@lustre-oss-0-2 ~]# 

--Rick

> 
>> [root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
>> lnet.debug = ioctl neterror net warning error emerg ha config console
> 
> - Isaac
> 

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

2011-07-12 Thread Rick Wagner
On Jul 12, 2011, at 11:34 AM, Rick Wagner wrote:

> 
> On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:
> 
>> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>>> 
>>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>>> Hi,
>>>>> ..
>>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that 
>>>>> the error codes from errno.h are being used.
>>>>> 
>>>>> We've been experiencing similar problems for a while, and we've never 
>>>>> seen IP traffic have a problem. But, clients will begin to have trouble 
>>>>> communicating with the Lustre server (seen because an LNET ping will 
>>>>> return an I/O error), and things will only recover when an LNET ping is 
>>>>> performed from the server to the client NID.
>>>> 
>>>> I'd suggest to enable console logging of network errors, by 'echo
>>>> +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages
>>>> should show up in 'dmesg' when you have LNET connectivity problems.
>>> 
>>> Thanks, Isaac, I have put that in place. We have that in the sysctl 
>>> configuration, as part of lnet.debug, and thought that was sufficient. But 
>>> so far, dmesg and /var/log/messages have looked very similar.
>>> 
>>> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
>>> warning error emerg console
>> 
>> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk'
>> output after 'echo +neterror > /proc/sys/lnet/printk', otherwise
>> it's a bug. This is different from lnet.debug.
> 
> I am, sorry if the previous post was misleading.
> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk 
> neterror warning error emerg console
> [root@lustre-oss-0-2 ~]# 

Isaac, I think your suggestion led to more information. I re-ran the client 
code, and experienced the same problems. However, this time there were 
additional messages, in particular one about "No usable routes".

  Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes 
to 12345-XXX.XXX.118.137@tcp

In our configuration, the cluster I'm running on uses TCP, and should not have 
traffic going over a router. There is a second test system which does have an 
LNET router configured, so this is what have for our LNET configuration on the 
servers:

  [root@lustre-oss-0-2 ~]# cat /etc/modprobe.d/lnet.conf 
  options lnet networks=tcp(bond0)  routes="o2ib XXX.XXX.81.18@tcp"

For the servers, all of the traffic goes out the bond0 interface.

The dmesg output is below. It shows peer not alive for the client that I saw 
fail. When I logged onto that client, it reported an I/O error when performing 
and LNET ping to the server that I've posted the messages for, and then they 
both recovered.

Thanks again,
Rick

From dmesg on lustre-oss-2-0:

Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 
12345-XXX.XXX.118.137@tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status 
-113, desc 8104d595
LustreError: 24502:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on 
bulk GET 0(1048576)  req@8101cbfb3000 x1373631114768318/t0 
o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x2c6ca7689_UUID:0/0 lens 
448/416 e 0 to 0 dl 1310502335 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: 24502:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring 
bulk IO comm error with 
ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x2c6ca7689_UUID id 
12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0023: 
ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous 
similar messages
Lustre: 24307:0:(ldlm_lib.c:803:target_handle_connect()) phase1-OST0023: exp 
810292b50e00 already connecting
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing 
error (-114)  req@81033c17bc50 x1373631114768595/t0 
o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x2c6ca7689_UUID:0/0 lens 
368/264 e 0 to 0 dl 1310502459 ref 1 fl Interpret:/0/0 rc -114/0
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 338 
previous similar messages
Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 
12345-XXX.XXX.118.137@tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status 
-113, desc 810226052000
LustreError: 24553:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on 
bulk GET 0(

Re: [Lustre-discuss] Kernel panic in ost_rw_prolong_locks

2011-07-28 Thread Rick Wagner
Hi Johann,

On Jul 28, 2011, at 12:18 PM, Johann Lombardi wrote:

> Hi,
> 
> On Thu, Jul 21, 2011 at 12:44:54PM -0700, Rick Wagner wrote:
>> Host info:
>> CentOS 5.4
>> Linux lustre-oss-0-2.local 2.6.18-194.3.1.el5_lustre.1.8.4 #1 SMP Fri Jul 9 
>> 21:55:24 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> I think you hit bugzilla ticket 21804 which is fixed in both 1.8.6 & 
> 1.8.6-wc1.

That's good news. We're testing new servers with 1.8.6-wc1.

Thanks,
Rick

> 
> Cheers,
> Johann
> 
> -- 
> Johann Lombardi
> Whamcloud, Inc.
> www.whamcloud.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Kernel panic in ost_rw_prolong_locks

2011-07-28 Thread Rick Wagner
Repeated post. Please ignore.

--Rick

On Jul 21, 2011, at 12:44 PM, Rick Wagner wrote:

> Hi,
> 
> We've had several OSSes kernel panic during the past week, and all but one 
> occurred in ost_rw_prolong_locks in ost_handler.c. From what I can tell, this 
> file hasn't changed since 1.8.4, which is what we're running in production. 
> We have had no luck in tying these events to load on the file system or 
> errors reported in the logs. Hardware wise, the machines are stable (until 
> they crash and the RAID arrays need to rebuild).
> 
> I've attached a screen shot from the console after the panic; unfortunately, 
> I don't know if the stack trace before the panic is associated with the 
> kernel panic. For the most part, the kernel seems to manage cleaning up hung 
> threads.
> 
> At this point, we would appreciate any insight into what may be causing this. 
> If someone thinks it may be a bug, I would be glad to open a ticket.
> 
> Thanks,
> Rick
> 
> Host info:
> CentOS 5.4
> Linux lustre-oss-0-2.local 2.6.18-194.3.1.el5_lustre.1.8.4 #1 SMP Fri Jul 9 
> 21:55:24 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss