Re: [Lustre-discuss] Where to download Lustre from since 01 Aug?

2011-08-08 Thread Aaron Everett
Thanks Richard!

Using the 1.8.6 CentOS 5 instructions I was able to successfully use git,
configure, make rpms for what I needed. The "Provision Machine" instructions
were followed exactly, and since I only needed client rpms, I skipped to the
"Configure and Build Lustre" section, which produced the rpm's I needed.

Installing the RPM's was smooth, and after install Lustre mounted.

One oddity: there is no /etc/modprobe.conf file to enter lnet options in;
however, everything seems to be mounting without error.
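
Presumably on EL6 the same lnet options just go in a file under /etc/modprobe.d/
instead; something like the following, where the filename and interface name are
only examples, not what we actually tested:

# /etc/modprobe.d/lustre.conf
options lnet networks=tcp0(eth0)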

Aaron

On Fri, Aug 5, 2011 at 11:16 PM, Richard Henwood wrote:

>
> On Fri, Aug 5, 2011 at 9:16 PM, Aaron Everett wrote:
>
>> Is there a whamcloud version of a lustre-1.8.5.tar.gz or 1.8.6 that I can
>> use to build rpms for RedHat/CentOS 6? (Specifically CentOS
>> 6,  2.6.32-71.29.1.el6.x86_64)
>>
>>
> Hi Aaron,
>
> There is a description on building from Whamcloud git on:
>
> http://wiki.whamcloud.com/display/PUB/Walk-thru-+Build+Lustre+2.1+on+RHEL+6.1+from+Whamcloud+git
>
> I recently worked through that page, updating it for RHEL 6.1, so I'm not sure
> about its applicability to CentOS 6. If you make progress (or otherwise),
> I would appreciate feedback.
>
> Also, there is a page on building RPMs with 1.8.6, but that is for CentOS
> 5:
>
> http://wiki.whamcloud.com/display/PUB/Walk-thru-+Build+Lustre+1.8+on+CentOS+5.5+or+5.6+from+Whamcloud+git
> again, feedback is welcome.
>
> richard,
>
> 
>>
>
> --
> richard.henw...@whamcloud.com
> Whamcloud Inc.
> tel: +1 512 410 9612
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Where to download Lustre from since 01 Aug?

2011-08-05 Thread Aaron Everett
Is there a whamcloud version of a lustre-1.8.5.tar.gz or 1.8.6 that I can
use to build rpms for RedHat/CentOS 6? (Specifically CentOS
6,  2.6.32-71.29.1.el6.x86_64)

Thanks!
Aaron

On Fri, Aug 5, 2011 at 6:20 PM, Anthony David wrote:

> On 08/05/2011 11:31 PM, Ray Muno wrote:
> > I see a previous post regarding this.
> >
> > Since Oracle decommissioned the Sun Download Center, all links for
> > downloading Lustre appear to go in circles, always bringing you back to
> > a page that has no path to anything Lustre related.
> >
> > Does anyone have any insights as to where Oracle buried it?
>
> I too went around in circles following the "Lustre download link".
>
> Anyone with a "My Oracle Support" account can download zipped bundles of
> RPMs from the "Patches and Updates" section.
>
> Regards
> Anthony David
> [Not speaking for Oracle]
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] inode tuning on shared mdt/mgs

2011-07-05 Thread Aaron Everett
Thank you both for the explanation. I have spent the morning populating our
Lustre file system with test data and monitoring the inode usage. Having
reformatted with --mkfsoptions="-i 1536", I'm seeing IUsed grow by roughly 8M
for every 1M decrease in IFree. If the ratio holds, this will meet my needs.
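
For anyone repeating this, two quick ways to sanity-check the resulting inode
count (/dev/sdb is our MDT device; the client mount point below is only an
example):

# on the MDS
dumpe2fs -h /dev/sdb | grep -i "inode count"

# from a client
lfs df -i /mnt/fdfs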

Aaron


On Sat, Jul 2, 2011 at 10:54 AM, Kevin Van Maren  wrote:

> Andreas Dilger wrote:
>
>  On 2011-07-01, at 12:03 PM, Aaron Everett <aever...@forteds.com> wrote:
>>
>>> I'm trying to increase the number of inodes available on our shared
>>> mdt/mgs. I've tried reformatting using the following:
>>>
>>>  mkfs.lustre --fsname fdfs --mdt --mgs --mkfsoptions="-i 2048" --reformat
>>> /dev/sdb
>>>
>>> The number of inodes actually decreased when I specified -i 2048 vs.
>>> leaving the number at default.
>>>
>>
>> This is a bit of an anomaly in how 1.8 reports the inode count. You
>> actually do have more inodes on the MDS, but because the MDS might need to
>> use an external block to store the striping layout, it limits the returned
>> inode count to the worst case usage. As the filesystem fills and these
>> external blocks
>>
>
> [trying to complete his sentence:]
> are not used, the free inode count keeps reporting the same number of free
> inodes, as the number of used inodes goes up.
>
> It is pretty weird, but it was doing the same thing in v1.6.
>
>
>  We have a large number of smaller files, and we're nearing our inode limit
>>> on our mdt/mgs. I'm trying to find a solution before simply expanding the
>>> RAID on the server. Since there is plenty of disk space, changing the bytes
>>> per inode seemed like a simple solution.
>>> From the docs:
>>>
>>> Alternately, if you are specifying an absolute number of inodes, use
>>> the -N option. You should not specify the -i option with
>>> an inode ratio below one inode per 1024 bytes in order to avoid
>>> unintentional mistakes. Instead, use the -N option.
>>>
>>> What is the format of the -N flag, and how should I calculate the number
>>> to use? Thanks for your help!
>>>
>>> Aaron
>>>
>>
>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] inode tuning on shared mdt/mgs

2011-07-01 Thread Aaron Everett
Hi list,

I'm trying to increase the number of inodes available on our shared mdt/mgs.
I've tried reformatting using the following:

 mkfs.lustre --fsname fdfs --mdt --mgs --mkfsoptions="-i 2048" --reformat
/dev/sdb

The number of inodes actually decreased when I specified -i 2048 vs. leaving
the number at default.

We have a large number of smaller files, and we're nearing our inode limit
on our mdt/mgs. I'm trying to find a solution before simply expanding the
RAID on the server. Since there is plenty of disk space, changing the bytes
per inode seemed like a simple solution.

From the docs:

Alternately, if you are specifying an absolute number of inodes, use
the -N option. You should not specify the -i option with an inode ratio
below one inode per 1024 bytes in order to avoid unintentional mistakes.
Instead, use the -N option.
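
If -N is simply passed through to mke2fs as an absolute inode count, I'd expect
something like the following to work (the number here is only a placeholder),
but I haven't verified it:

 mkfs.lustre --fsname fdfs --mdt --mgs --mkfsoptions="-N 400000000" --reformat
/dev/sdb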

What is the format of the -N flag, and how should I calculate the number to
use? Thanks for your help!

Aaron
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-20 Thread Aaron Everett
Thanks for the tip. I've already updated with the LU-286 patch, but I'll
build new rpms with both patches and roll that out too. Since updating with
the LU-286 patch, Lustre has been running cleanly. Thanks for the support and
the work!

Aaron

On Fri, May 20, 2011 at 4:40 AM, Johann Lombardi wrote:

> On Thu, May 19, 2011 at 01:57:33PM -0400, Aaron Everett wrote:
> > Sorry for the noise. I cleaned everything up, untarred a fresh copy of
>
> np. BTW, while you are patching the lustre client, you might also want to
> apply the following patch http://review.whamcloud.com/#change,457 which
> fixes a memory leak in the same part of the code.
>
> Johann
> --
> Johann Lombardi
> Whamcloud, Inc.
> www.whamcloud.com
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-19 Thread Aaron Everett
Sorry for the noise. I cleaned everything up, untarred a fresh copy of
lustre1.8.5, applied the patch, configured, and successfully made patches.
I'm not sure what went wrong last time.

Aaron

On Thu, May 19, 2011 at 12:14 PM, Johann Lombardi wrote:

> On Thu, May 19, 2011 at 11:51:49AM -0400, Aaron Everett wrote:
> > I'm getting a build error:
> > make[5]: Entering directory `/usr/src/kernels/2.6.18-238.9.1.el5-x86_64'
> > /usr/src/redhat/BUILD/lustre-1.8.5/lustre/mdc/mdc_lib.c:828: error:
> > conflicting types for 'mdc_getattr_pack'
> > /usr/src/redhat/BUILD/lustre-1.8.5/lustre/mdc/mdc_internal.h:56: error:
> > previous declaration of 'mdc_getattr_pack' was here
>
> Weird, the patch does not modify mdc_getattr_pack() at all.
> It applies cleanly to 1.8.5 for me, and I can successfully build it.
> How did you pick up the patch? Have you made any changes to
> mdc_getattr_pack()?
> Please find attached the patch I used (extracted from git).
>
> >  55 void mdc_getattr_pack(struct ptlrpc_request *req, int offset,
> __u64
> > valid,
> >  56   int flags, struct mdc_op_data *data);
> >
> > mdc_lib.c (patched version downloaded from link above):
> > 828 {
> > 829 if (mdc_req_is_2_0_server(req))
> > 830 mdc_getattr_pack_20(req, offset, valid, flags,
> data,
> > ea_size);
>
> The patch from bugzilla ticket 24048 added a new argument to
> mdc_getattr_pack(), but that's not the patch I pointed you at.
>
> > 831 else
> > 832 mdc_getattr_pack_18(req, offset, valid, flags,
> > data);
> > 833 }
> >
> > Upon closer inspection, it appears this patch is for Lustre 1.8.6, while
> > we're running 1.8.5.
>
> The patch I attached should really work with 1.8.5.
>
> > Is there a download location for lustre source for
> > 1.8.6? I don't see it on lustre.org.
>
> AFAIK, 1.8.6 has not been released yet.
>
> Johann
> --
> Johann Lombardi
> Whamcloud, Inc.
> www.whamcloud.com
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-19 Thread Aaron Everett
I'm getting a build error:
make[5]: Entering directory `/usr/src/kernels/2.6.18-238.9.1.el5-x86_64'
/usr/src/redhat/BUILD/lustre-1.8.5/lustre/mdc/mdc_lib.c:828: error:
conflicting types for 'mdc_getattr_pack'
/usr/src/redhat/BUILD/lustre-1.8.5/lustre/mdc/mdc_internal.h:56: error:
previous declaration of 'mdc_getattr_pack' was here
make[8]: *** [/usr/src/redhat/BUILD/lustre-1.8.5/lustre/mdc/mdc_lib.o] Error
1
make[7]: *** [/usr/src/redhat/BUILD/lustre-1.8.5/lustre/mdc] Error 2
make[7]: *** Waiting for unfinished jobs
make[6]: *** [/usr/src/redhat/BUILD/lustre-1.8.5/lustre] Error 2
make[5]: *** [_module_/usr/src/redhat/BUILD/lustre-1.8.5] Error 2
make[5]: Leaving directory `/usr/src/kernels/2.6.18-238.9.1.el5-x86_64'
make[4]: *** [modules] Error 2
make[4]: Leaving directory `/usr/src/redhat/BUILD/lustre-1.8.5'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/usr/src/redhat/BUILD/lustre-1.8.5'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/usr/src/redhat/BUILD/lustre-1.8.5'
error: Bad exit status from /var/tmp/rpm-tmp.63873 (%build)

mdc_internal.h:
 55 void mdc_getattr_pack(struct ptlrpc_request *req, int offset, __u64
valid,
 56   int flags, struct mdc_op_data *data);

mdc_lib.c (patched version downloaded from link above):
828 {
829 if (mdc_req_is_2_0_server(req))
830 mdc_getattr_pack_20(req, offset, valid, flags, data,
ea_size);
831 else
832 mdc_getattr_pack_18(req, offset, valid, flags,
data);
833 }

Upon closer inspection, it appears this patch is for Lustre 1.8.6, while
we're running 1.8.5. Is there a download location for lustre source for
1.8.6? I don't see it on lustre.org.

Thanks again for your help!
Aaron


On Thu, May 19, 2011 at 11:30 AM, Aaron Everett wrote:

> Excellent. Thanks for the quick reply. Building new rpm's now.
>
> Aaron
>
>
> On Thu, May 19, 2011 at 11:17 AM, Johann Lombardi wrote:
>
>> On Thu, May 19, 2011 at 11:06:08AM -0400, Aaron Everett wrote:
>> > Thanks for the replies. Does the patch need to be applied to clients'
>> > lustre-module rpms only, client and server lustre&lustre-module rpms, or
>> > will I need to build new kernels for the servers as well?
>>
>> You only need to apply the patch to the lustre clients (only the
>> lustre-module rpms will be modified). No need to rebuild the kernel.
>>
>> Johann
>> --
>> Johann Lombardi
>> Whamcloud, Inc.
>> www.whamcloud.com
>>
>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-19 Thread Aaron Everett
Excellent. Thanks for the quick reply. Building new rpm's now.

Aaron

On Thu, May 19, 2011 at 11:17 AM, Johann Lombardi wrote:

> On Thu, May 19, 2011 at 11:06:08AM -0400, Aaron Everett wrote:
> > Thanks for the replies. Does the patch need to be applied to clients'
> > lustre-module rpms only, client and server lustre&lustre-module rpms, or
> > will I need to build new kernels for the servers as well?
>
> You only need to apply the patch to the lustre clients (only the
> lustre-module rpms will be modified). No need to rebuild the kernel.
>
> Johann
> --
> Johann Lombardi
> Whamcloud, Inc.
> www.whamcloud.com
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-19 Thread Aaron Everett
Thanks for the replies. Does the patch need to be applied to clients'
lustre-module rpms only, client and server lustre&lustre-module rpms, or
will I need to build new kernels for the servers as well?

Best regards,
Aaron

On Wed, May 18, 2011 at 9:17 PM, Johann Lombardi wrote:

> On Tue, May 17, 2011 at 08:13:42PM -0400, Aaron Everett wrote:
> > Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
> > RIP   []  :mdc:mdc_exit_request+0x6d/0xb0
> >  RSP  
> > CR2:  3877
> >  <0>Kernel panic - not syncing: Fatal exception
>
> This bug was introduced in 1.8.5, see bugzilla ticket 24508 & jira ticket
> LU-286. A fix is available here: http://review.whamcloud.com/#change,506
>
> Johann
>
> --
> Johann Lombardi
> Whamcloud, Inc.
> www.whamcloud.com
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-18 Thread Aaron Everett
More information:

The frequency of these errors was dramatically reduced by
changing /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/max_rpcs_in_flight from 8
to 32.

Processor, memory, and disk I/O on the servers are not high. Is there a
reason not to increase max_rpcs_in_flight from 32 to 48 or 64? Is there a
limit on how high I can set this value?
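
For reference, the change can be made either by echoing into the /proc file
above or, I believe equivalently, with lctl (the osc name pattern is ours);
either way it is per-client and has to be reapplied after a remount:

lctl set_param osc.fdfs-OST*.max_rpcs_in_flight=32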

Best regards,
Aaron

On Tue, May 17, 2011 at 8:13 PM, Aaron Everett  wrote:

> Hi all,
>
> We've been running Lustre 1.6.6 for several years and are deploying 1.8.5
> on some new hardware. When under load we've been seeing random kernel panics
> on many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on
> the servers (shared MDT/MGS and 4 OST's). We have patchless clients
> running 2.6.18-238.9.1.el5 (all CentOS).
>
> On the MDT, the following is logged in /var/log/messages:
>
> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1368993021040034 sent from fdfs-MDT to NID 172.16.14.219@tcp 7s ago
> has timed out (7s prior to deadline).
> May 17 16:46:44 lustre-mdt-00 kernel:   
> req@8105f140b800x1368993021040034/t0 
> o104->@NET_0x2ac100edb_UUID:15/16 lens 296/384 e 0
> to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
> similar messages
> May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
> client on nid 172.16.14.219@tcp was evicted due to a lock blocking
> callback to 172.16.14.219@tcp timed out: rc -107
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
> client on nid 172.16.14.225@tcp was evicted due to a lock blocking
> callback to 172.16.14.225@tcp timed out: rc -107
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
> 6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
> req@81181ccb6800 x1368993021041016/t0
> o104->@NET_0x2ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
> Rpc:N/0/0 rc 0/0
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
> 6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
> 172.16.14.225@tcp) returned 0 from blocking AST ns: mds-fdfs-MDT_UUID
> lock: 81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
> 35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x420 remote:
> 0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
> client on nid 172.16.14.230@tcp was evicted due to a lock blocking
> callback to 172.16.14.230@tcp timed out: rc -107
> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
> similar messages
> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1368993021041492 sent from fdfs-MDT to NID 172.16.14.229@tcp 7s ago
> has timed out (7s prior to deadline).
> May 17 16:47:07 lustre-mdt-00 kernel:   
> req@81093052b000x1368993021041492/t0 
> o104->@NET_0x2ac100ee5_UUID:15/16 lens 296/384 e 0
> to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
> similar messages
> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
> client on nid 172.16.14.229@tcp was evicted due to a lock blocking
> callback to 172.16.14.229@tcp timed out: rc -107
> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
> similar messages
> May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from
> client c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230@tcp) in 228
> seconds. I think it's dead, and I am evicting it.
>
> On the clients, there is a kernel panic, with the following message on the
> screen:
>
> Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
> RIP   []  :mdc:mdc_exit_request+0x6d/0xb0
>  RSP  
> CR2:  3877
>  <0>Kernel panic - not syncing: Fatal exception
>
> We're running the same set of jobs on both the 1.6.6 lustre filesystem and
> the 1.8.5 lustre filesystem. Only the 1.8.5 clients crash, the 1.6.6 clients
> that are also using the new servers never exhibit this issue. I'm assuming
> there is a setting on the 1.8.5 clients that needs to be adjusted, but I'm
> searching for help.
>
> Best regards,
> Aaron
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

2011-05-17 Thread Aaron Everett
Hi all,

We've been running Lustre 1.6.6 for several years and are deploying 1.8.5 on
some new hardware. When under load we've been seeing random kernel panics on
many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on the
servers (shared MDT/MGS and 4 OST's). We have patchless clients
running 2.6.18-238.9.1.el5 (all CentOS).

On the MDT, the following is logged in /var/log/messages:

May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1368993021040034 sent from fdfs-MDT to NID 172.16.14.219@tcp 7s ago has
timed out (7s prior to deadline).
May 17 16:46:44 lustre-mdt-00 kernel:
req@8105f140b800x1368993021040034/t0
o104->@NET_0x2ac100edb_UUID:15/16 lens 296/384 e 0
to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
similar messages
May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
client on nid 172.16.14.219@tcp was evicted due to a lock blocking callback
to 172.16.14.219@tcp timed out: rc -107
May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
client on nid 172.16.14.225@tcp was evicted due to a lock blocking callback
to 172.16.14.225@tcp timed out: rc -107
May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@81181ccb6800 x1368993021041016/t0
o104->@NET_0x2ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
Rpc:N/0/0 rc 0/0
May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
172.16.14.225@tcp) returned 0 from blocking AST ns: mds-fdfs-MDT_UUID
lock: 81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x420 remote:
0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
client on nid 172.16.14.230@tcp was evicted due to a lock blocking callback
to 172.16.14.230@tcp timed out: rc -107
May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
similar messages
May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1368993021041492 sent from fdfs-MDT to NID 172.16.14.229@tcp 7s ago has
timed out (7s prior to deadline).
May 17 16:47:07 lustre-mdt-00 kernel:
req@81093052b000x1368993021041492/t0
o104->@NET_0x2ac100ee5_UUID:15/16 lens 296/384 e 0
to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
similar messages
May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT: A
client on nid 172.16.14.229@tcp was evicted due to a lock blocking callback
to 172.16.14.229@tcp timed out: rc -107
May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
similar messages
May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from client
c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230@tcp) in 228 seconds.
I think it's dead, and I am evicting it.

On the clients, there is a kernel panic, with the following message on the
screen:

Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
RIP   []  :mdc:mdc_exit_request+0x6d/0xb0
 RSP  
CR2:  3877
 <0>Kernel panic - not syncing: Fatal exception

We're running the same set of jobs on both the 1.6.6 lustre filesystem and
the 1.8.5 lustre filesystem. Only the 1.8.5 clients crash, the 1.6.6 clients
that are also using the new servers never exhibit this issue. I'm assuming
there is a setting on the 1.8.5 clients that needs to be adjusted, but I'm
searching for help.

Best regards,
Aaron
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Kernel Headers for Lustre 1.6.6 / RHEL4

2011-01-18 Thread Aaron Everett
Hi all,

I'm attempting to install some third party software, and I'm being prompted
for kernel headers. However, I don't seem to have the kernel-header rpm for
our (rather outdated) Lustre kernel. Does anyone have, or can point me to,
kernel headers for Lustre 1.6.6 on RHEL4 (64bit)?

Specifically, I'm looking for:  2.6.9-67.0.22.EL_lustre.1.6.6smp

Thanks!
Aaron
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Unbalanced load across OST's

2009-03-26 Thread Aaron Everett
Is there significance to the last number on the OST line in device_list? I'm
assuming it is some sort of status. Notice below that from OST, the last number
is 63. When I run lctl dl from OST0001 and OST0002 (the OST's I cannot write to)
the last number is 61. I just noticed it this morning.

[aever...@lustrefs01 ~]$ lctl dl
  0 UP mgc mgc172.16.14...@tcp c951a722-71ad-368c-d791-c4e0efee7120 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter fortefs-OST fortefs-OST_UUID 63
[aever...@lustrefs01 ~]$ ssh lustrefs02 lctl dl
  0 UP mgc mgc172.16.14...@tcp 5389dc05-63d9-2ae9-1f21-2d9471e82166 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter fortefs-OST0001 fortefs-OST0001_UUID 61
[aever...@lustrefs01 ~]$ ssh lustrefs03 lctl dl
  0 UP mgc mgc172.16.14...@tcp 01325f6d-ee28-aa97-910c-fef806443a99 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter fortefs-OST0002 fortefs-OST0002_UUID 61
[aever...@lustrefs01 ~]$

MDT:
[aever...@lustrefs01 ~]$ ssh lustrefs lctl dl
  0 UP mgs MGS MGS 67
  1 UP mgc mgc172.16.14...@tcp 97c99c52-d7f9-1b74-171e-f71c0707 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov fortefs-mdtlov fortefs-mdtlov_UUID 4
  4 UP mds fortefs-MDT fortefs-MDT_UUID 61
  5 UP osc fortefs-OST-osc fortefs-mdtlov_UUID 5
[aever...@lustrefs01 ~]$

The log files on the OST's are clean. But I did notice that there are errors on 
the MDT. I don't know how I missed these earlier:

Mar 26 09:19:35 lustrefs kernel: LustreError: 
5816:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:57:45 lustrefs kernel: LustreError: 
5823:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:57:48 lustrefs kernel: LustreError: 
5481:0:(lov_ea.c:243:lsm_unpackmd_plain()) OST index 1 missing
Mar 26 10:57:48 lustrefs kernel: LustreError: 
5481:0:(lov_ea.c:243:lsm_unpackmd_plain()) Skipped 156 previous similar messages
Mar 26 10:57:48 lustrefs kernel: LustreError: 
5497:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:57:51 lustrefs kernel: LustreError: 
5482:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:57:51 lustrefs kernel: LustreError: 
5482:0:(lov_ea.c:238:lsm_unpackmd_plain()) Skipped 5 previous similar messages
Mar 26 10:57:56 lustrefs kernel: LustreError: 
7299:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:57:56 lustrefs kernel: LustreError: 
7299:0:(lov_ea.c:238:lsm_unpackmd_plain()) Skipped 60 previous similar messages
Mar 26 10:58:07 lustrefs kernel: LustreError: 
5510:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:58:07 lustrefs kernel: LustreError: 
5510:0:(lov_ea.c:238:lsm_unpackmd_plain()) Skipped 116 previous similar messages
Mar 26 10:58:27 lustrefs kernel: LustreError: 
7320:0:(lov_ea.c:238:lsm_unpackmd_plain()) OST index 2 more than OST count 2
Mar 26 10:58:27 lustrefs kernel: LustreError: 
7320:0:(lov_ea.c:238:lsm_unpackmd_plain()) Skipped 240 previous similar messages
Mar 26 11:00:13 lustrefs kernel: LustreError: 
4200:0:(lov_ea.c:243:lsm_unpackmd_plain()) OST index 1 missing
Mar 26 11:00:13 lustrefs kernel: LustreError: 
4200:0:(lov_ea.c:243:lsm_unpackmd_plain()) Skipped 506 previous similar messages

If I can't get the OST's back online, I will move the data off, recreate the 
filesystem and replace the data. I just wish I understood the cause...

Thanks for all the help
Aaron

-Original Message-
From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
Sent: Wednesday, March 25, 2009 12:51 PM
To: Aaron Everett
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

Yes, you are correct that the issue is with the MDS not seeing the OSTs.

Has restarting the mds or clients made any difference?  If there is 
nothing in the logs, and you haven't made any configuration changes, I 
don't have an answer for you as to why that is the case.

Kevin


Aaron Everett wrote:
> Any advice on this? Any resources I can look at? I've been looking through 
> archives and re-reading the Lustre manual, but I'm still stuck. Clients see all 
> 3 OST's, but the MGS/MDT only sees OST000
>
> Thanks in advance
> Aaron
>
> -Original Message-----
> From: Aaron Everett 
> Sent: Monday, March 23, 2009 10:23 AM
> To: Aaron Everett; kevin.vanma...@sun.com
> Cc: lustre-discuss@lists.lustre.org
> Subject: RE: [Lustre-discuss] Unbalanced load across OST's
>
> Clients still see all 3 OST's, but it seems like the MGS/MDT machine is 
> missing 2 of the 3 OST machines:
>
> [r...@lustrefs ~]# lctl device_list
>   0 UP mgs MGS MGS 63
>   1 UP mgc mgc172.16.14...@tcp 97c99c52-d7f9-1b74-171e-f71c0707 5
>   2 UP mdt MDS MDS_uuid 3
>   3 UP lov fortefs-mdtlov fortefs-mdtlov_UUID 4
>   4 UP mds fortefs-MDT fortefs-MDT_UUID 57
>   5 UP osc fortefs-OST-osc fort

Re: [Lustre-discuss] Unbalanced load across OST's

2009-03-25 Thread Aaron Everett
Any advice on this? Any resources I can look at? I've been looking through 
archives and re-reading the Lustre manual, but I'm still stuck. Clients see all 3 
OST's, but the MGS/MDT only sees OST000

Thanks in advance
Aaron

-Original Message-----
From: Aaron Everett 
Sent: Monday, March 23, 2009 10:23 AM
To: Aaron Everett; kevin.vanma...@sun.com
Cc: lustre-discuss@lists.lustre.org
Subject: RE: [Lustre-discuss] Unbalanced load across OST's

Clients still see all 3 OST's, but it seems like the MGS/MDT machine is missing 
2 of the 3 OST machines:

[r...@lustrefs ~]# lctl device_list
  0 UP mgs MGS MGS 63
  1 UP mgc mgc172.16.14...@tcp 97c99c52-d7f9-1b74-171e-f71c0707 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov fortefs-mdtlov fortefs-mdtlov_UUID 4
  4 UP mds fortefs-MDT fortefs-MDT_UUID 57
  5 UP osc fortefs-OST-osc fortefs-mdtlov_UUID 5
[r...@lustrefs ~]#

What steps should I take to force the OST's to rejoin?

Thanks!
Aaron


-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Aaron Everett
Sent: Friday, March 20, 2009 12:30 PM
To: kevin.vanma...@sun.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

I assume this is wrong:

[r...@lustrefs osc]# pwd
/proc/fs/lustre/osc
[r...@lustrefs osc]# ls
fortefs-OST-osc  num_refs
[r...@lustrefs osc]#

I should be seeing something like this - correct?
fortefs-OST-osc
fortefs-OST0001-osc
fortefs-OST0002-osc
num_refs

This was done from the MDS
Aaron


-Original Message-
From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
Sent: Friday, March 20, 2009 11:55 AM
To: Aaron Everett
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

If Lustre deactivated them, there should be something in the log.  You 
can check the status with something like:
# cat /proc/fs/lustre/osc/*-OST*-osc/active
on the MDS node (or using lctl).
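(The lctl equivalent would presumably be
# lctl get_param osc.*-OST*-osc.active
and a target showing 0 can, I believe, be re-enabled from the MDS with
# lctl --device <devno> activate
where <devno> is the osc device number from "lctl dl" -- a guess at the
incantation, not something verified here.)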

You can also try setting the index to 1 or 2, which should force 
allocations there.

Kevin


Aaron Everett wrote:
> Hello, I tried the suggestion of using lfs setstripe and it appears that 
> everything is still being written to only OST. You mentioned the OST's 
> may have been deactivated. Is it possible that last time we restarted Lustre 
> they came up in a deactivated or read only state? Last week we brought our 
> Lustre machines offline to swap out UPS's. 
>
> [r...@englogin01 teststripe]# pwd
> /lustre/work/aeverett/teststripe
> [r...@englogin01 aeverett]# mkdir teststripe
> [r...@englogin01 aeverett]# cd teststripe/
> [r...@englogin01 teststripe]# lfs setstripe -i -1 .
> [r...@englogin01 teststripe]# cp -R /home/aeverett/RHEL4WS_update/ .
>
> [r...@englogin01 teststripe]# lfs getstripe *
> OBDS:
> 0: fortefs-OST_UUID ACTIVE
> 1: fortefs-OST0001_UUID ACTIVE
> 2: fortefs-OST0002_UUID ACTIVE
> RHEL4WS_update
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/rhn-packagesws.tgz
> obdidx   objid  objidgroup
>  077095451  0x498621b0
>
> RHEL4WS_update/rhn-packages
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/kernel
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/tools.tgz
> obdidx   objid  objidgroup
>  077096794  0x498675a0
>
> RHEL4WS_update/install
> obdidx   objid  objidgroup
>  077096842  0x498678a0
>
> RHEL4WS_update/installlinks
> obdidx   objid  objidgroup
>  077096843  0x498678b0
>
> RHEL4WS_update/ssh_config
> obdidx   objid  objidgroup
>  077096844  0x498678c0
>
> RHEL4WS_update/sshd_config
> obdidx   objid  objidgroup
>  077096845  0x498678d0
>
> .. continues on like this for about 100 files with incrementing 
> objid numbers and obdidx = 0 and group = 0.
>
>
> Thanks for all the help,
> Aaron
>
>
> -Original Message-
> From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
> Sent: Friday, March 20, 2009 8:57 AM
> To: Aaron Everett
> Cc: Brian J. Murrell; lustre-discuss@lists.lustre.org
> Subject: Re: [Lustre-discuss] Unbalanced load across OST's
>
> There are several things that could have been done.  The most likely are:
>
> 1) you deactivated the OSTs on the MSD, using something like:
>
> # lctl 

Re: [Lustre-discuss] Unbalanced load across OST's

2009-03-23 Thread Aaron Everett
Clients still see all 3 OST's, but it seems like the MGS/MDT machine is missing 
2 of the 3 OST machines:

[r...@lustrefs ~]# lctl device_list
  0 UP mgs MGS MGS 63
  1 UP mgc mgc172.16.14...@tcp 97c99c52-d7f9-1b74-171e-f71c0707 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov fortefs-mdtlov fortefs-mdtlov_UUID 4
  4 UP mds fortefs-MDT fortefs-MDT_UUID 57
  5 UP osc fortefs-OST-osc fortefs-mdtlov_UUID 5
[r...@lustrefs ~]#

What steps should I take to force the OST's to rejoin?

Thanks!
Aaron


-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Aaron Everett
Sent: Friday, March 20, 2009 12:30 PM
To: kevin.vanma...@sun.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

I assume this is wrong:

[r...@lustrefs osc]# pwd
/proc/fs/lustre/osc
[r...@lustrefs osc]# ls
fortefs-OST-osc  num_refs
[r...@lustrefs osc]#

I should be seeing something like this - correct?
fortefs-OST-osc
fortefs-OST0001-osc
fortefs-OST0002-osc
num_refs

This was done from the MDS
Aaron


-Original Message-
From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
Sent: Friday, March 20, 2009 11:55 AM
To: Aaron Everett
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

If Lustre deactivated them, there should be something in the log.  You 
can check the status with something like:
# cat /proc/fs/lustre/osc/*-OST*-osc/active
on the MDS node (or using lctl).

You can also try setting the index to 1 or 2, which should force 
allocations there.

Kevin


Aaron Everett wrote:
> Hello, I tried the suggestion of using lfs setstripe and it appears that 
> everything is still being written to only OST. You mentioned the OST's 
> may have been deactivated. Is it possible that last time we restarted Lustre 
> they came up in a deactivated or read only state? Last week we brought our 
> Lustre machines offline to swap out UPS's. 
>
> [r...@englogin01 teststripe]# pwd
> /lustre/work/aeverett/teststripe
> [r...@englogin01 aeverett]# mkdir teststripe
> [r...@englogin01 aeverett]# cd teststripe/
> [r...@englogin01 teststripe]# lfs setstripe -i -1 .
> [r...@englogin01 teststripe]# cp -R /home/aeverett/RHEL4WS_update/ .
>
> [r...@englogin01 teststripe]# lfs getstripe *
> OBDS:
> 0: fortefs-OST_UUID ACTIVE
> 1: fortefs-OST0001_UUID ACTIVE
> 2: fortefs-OST0002_UUID ACTIVE
> RHEL4WS_update
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/rhn-packagesws.tgz
> obdidx   objid  objidgroup
>  077095451  0x498621b0
>
> RHEL4WS_update/rhn-packages
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/kernel
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/tools.tgz
> obdidx   objid  objidgroup
>  077096794  0x498675a0
>
> RHEL4WS_update/install
> obdidx   objid  objidgroup
>  077096842  0x498678a0
>
> RHEL4WS_update/installlinks
> obdidx   objid  objidgroup
>  077096843  0x498678b0
>
> RHEL4WS_update/ssh_config
> obdidx   objid  objidgroup
>  077096844  0x498678c0
>
> RHEL4WS_update/sshd_config
> obdidx   objid  objidgroup
>  077096845  0x498678d0
>
> .. continues on like this for about 100 files with incrementing 
> objid numbers and obdidx = 0 and group = 0.
>
>
> Thanks for all the help,
> Aaron
>
>
> -Original Message-
> From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
> Sent: Friday, March 20, 2009 8:57 AM
> To: Aaron Everett
> Cc: Brian J. Murrell; lustre-discuss@lists.lustre.org
> Subject: Re: [Lustre-discuss] Unbalanced load across OST's
>
> There are several things that could have been done.  The most likely are:
>
> 1) you deactivated the OSTs on the MDS, using something like:
>
> # lctl set_param ost.work-OST0001.active=0
> # lctl set_param ost.work-OST0002.active=0
>
> 2) you set the file stripe on the directory to use only OST0, as with
>
> # lfs setstripe -i 0 .
>
> I would think that you'd remember #1, so my guess would be #2, which 
> could have happened when someone intended to do "lfs setstripe -c 0".  
> Do an "lfs getstripe ."  A simple:
>
> "lfs setstripe -i -1 ." in e

Re: [Lustre-discuss] Unbalanced load across OST's

2009-03-20 Thread Aaron Everett
I assume this is wrong:

[r...@lustrefs osc]# pwd
/proc/fs/lustre/osc
[r...@lustrefs osc]# ls
fortefs-OST-osc  num_refs
[r...@lustrefs osc]#

I should be seeing something like this - correct?
fortefs-OST-osc
fortefs-OST0001-osc
fortefs-OST0002-osc
num_refs

This was done from the MDS
Aaron


-Original Message-
From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
Sent: Friday, March 20, 2009 11:55 AM
To: Aaron Everett
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

If Lustre deactivated them, there should be something in the log.  You 
can check the status with something like:
# cat /proc/fs/lustre/osc/*-OST*-osc/active
on the MDS node (or using lctl).

You can also try setting the index to 1 or 2, which should force 
allocations there.

Kevin


Aaron Everett wrote:
> Hello, I tried the suggestion of using lfs setstripe and it appears that 
> everything is still being written to only OST. You mentioned the OST's 
> may have been deactivated. Is it possible that last time we restarted Lustre 
> they came up in a deactivated or read only state? Last week we brought our 
> Lustre machines offline to swap out UPS's. 
>
> [r...@englogin01 teststripe]# pwd
> /lustre/work/aeverett/teststripe
> [r...@englogin01 aeverett]# mkdir teststripe
> [r...@englogin01 aeverett]# cd teststripe/
> [r...@englogin01 teststripe]# lfs setstripe -i -1 .
> [r...@englogin01 teststripe]# cp -R /home/aeverett/RHEL4WS_update/ .
>
> [r...@englogin01 teststripe]# lfs getstripe *
> OBDS:
> 0: fortefs-OST_UUID ACTIVE
> 1: fortefs-OST0001_UUID ACTIVE
> 2: fortefs-OST0002_UUID ACTIVE
> RHEL4WS_update
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/rhn-packagesws.tgz
> obdidx   objid  objidgroup
>  077095451  0x498621b0
>
> RHEL4WS_update/rhn-packages
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/kernel
> default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
> RHEL4WS_update/tools.tgz
> obdidx   objid  objidgroup
>  077096794  0x498675a0
>
> RHEL4WS_update/install
> obdidx   objid  objidgroup
>  077096842  0x498678a0
>
> RHEL4WS_update/installlinks
> obdidx   objid  objidgroup
>  077096843  0x498678b0
>
> RHEL4WS_update/ssh_config
> obdidx   objid  objidgroup
>  077096844  0x498678c0
>
> RHEL4WS_update/sshd_config
> obdidx   objid  objidgroup
>  077096845  0x498678d0
>
> .. continues on like this for about 100 files with incrementing 
> objid numbers and obdidx = 0 and group = 0.
>
>
> Thanks for all the help,
> Aaron
>
>
> -Original Message-
> From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
> Sent: Friday, March 20, 2009 8:57 AM
> To: Aaron Everett
> Cc: Brian J. Murrell; lustre-discuss@lists.lustre.org
> Subject: Re: [Lustre-discuss] Unbalanced load across OST's
>
> There are several things that could have been done.  The most likely are:
>
> 1) you deactivated the OSTs on the MDS, using something like:
>
> # lctl set_param ost.work-OST0001.active=0
> # lctl set_param ost.work-OST0002.active=0
>
> 2) you set the file stripe on the directory to use only OST0, as with
>
> # lfs setstripe -i 0 .
>
> I would think that you'd remember #1, so my guess would be #2, which 
> could have happened when someone intended to do "lfs setstripe -c 0".  
> Do an "lfs getstripe ."  A simple:
>
> "lfs setstripe -i -1 ." in each directory
>
> should clear it up going forward.  Note that existing files will NOT be 
> re-striped, but new files will be balanced going forward.
>
> Kevin
>
>
> Aaron Everett wrote:
>   
>> Thanks for the reply.
>>
>> File sizes are all <1GB and most files are <1MB. For a test, I copied a 
>> typical result set from a non-lustre mount to my lustre directory. Total 
>> size of the test is 42GB. I included before/after results for lfs df -i from 
>> a client. 
>>
>> Before test:
>> [r...@englogin01 backups]# lfs df 
>> UUID                 1K-blocks      Used  Available Use% Mounted on
>> fortefs-MDT_UUID    1878903960 129326660 1749577300   6% /lustre/work[MDT:0]
>> fortefs-OST_UUID 12644728

Re: [Lustre-discuss] Unbalanced load across OST's

2009-03-20 Thread Aaron Everett
Hello, I tried the suggestion of using lfs setstripe and it appears that 
everything is still being written to only OST. You mentioned the OST's may 
have been deactivated. Is it possible that last time we restarted Lustre they 
came up in a deactivated or read only state? Last week we brought our Lustre 
machines offline to swap out UPS's. 

[r...@englogin01 teststripe]# pwd
/lustre/work/aeverett/teststripe
[r...@englogin01 aeverett]# mkdir teststripe
[r...@englogin01 aeverett]# cd teststripe/
[r...@englogin01 teststripe]# lfs setstripe -i -1 .
[r...@englogin01 teststripe]# cp -R /home/aeverett/RHEL4WS_update/ .

[r...@englogin01 teststripe]# lfs getstripe *
OBDS:
0: fortefs-OST_UUID ACTIVE
1: fortefs-OST0001_UUID ACTIVE
2: fortefs-OST0002_UUID ACTIVE
RHEL4WS_update
default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
RHEL4WS_update/rhn-packagesws.tgz
     obdidx      objid       objid      group
          0   77095451   0x498621b          0

RHEL4WS_update/rhn-packages
default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
RHEL4WS_update/kernel
default stripe_count: 1 stripe_size: 1048576 stripe_offset: 0
RHEL4WS_update/tools.tgz
     obdidx      objid       objid      group
          0   77096794   0x498675a          0

RHEL4WS_update/install
     obdidx      objid       objid      group
          0   77096842   0x498678a          0

RHEL4WS_update/installlinks
     obdidx      objid       objid      group
          0   77096843   0x498678b          0

RHEL4WS_update/ssh_config
     obdidx      objid       objid      group
          0   77096844   0x498678c          0

RHEL4WS_update/sshd_config
     obdidx      objid       objid      group
          0   77096845   0x498678d          0

.. continues on like this for about 100 files with incrementing 
objid numbers and obdidx = 0 and group = 0.


Thanks for all the help,
Aaron


-Original Message-
From: kevin.vanma...@sun.com [mailto:kevin.vanma...@sun.com] 
Sent: Friday, March 20, 2009 8:57 AM
To: Aaron Everett
Cc: Brian J. Murrell; lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

There are several things that could have been done.  The most likely are:

1) you deactivated the OSTs on the MDS, using something like:

# lctl set_param ost.work-OST0001.active=0
# lctl set_param ost.work-OST0002.active=0

2) you set the file stripe on the directory to use only OST0, as with

# lfs setstripe -i 0 .

I would think that you'd remember #1, so my guess would be #2, which 
could have happened when someone intended to do "lfs setstripe -c 0".  
Do an "lfs getstripe ."  A simple:

"lfs setstripe -i -1 ." in each directory

should clear it up going forward.  Note that existing files will NOT be 
re-striped, but new files will be balanced going forward.
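
(As an unverified aside: an existing file can be pushed onto the new layout by
rewriting it, e.g. something like

cp bigfile bigfile.tmp && mv bigfile.tmp bigfile

since the copy allocates fresh objects under the directory's current striping --
at the cost of rewriting the data.)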

Kevin


Aaron Everett wrote:
> Thanks for the reply.
>
> File sizes are all <1GB and most files are <1MB. For a test, I copied a 
> typical result set from a non-lustre mount to my lustre directory. Total size 
> of the test is 42GB. I included before/after results for lfs df -i from a 
> client. 
>
> Before test:
> [r...@englogin01 backups]# lfs df 
> UUID                  1K-blocks       Used  Available Use% Mounted on
> fortefs-MDT_UUID     1878903960  129326660 1749577300   6% /lustre/work[MDT:0]
> fortefs-OST_UUID     1264472876  701771484  562701392  55% /lustre/work[OST:0]
> fortefs-OST0001_UUID 1264472876  396097912  868374964  31% /lustre/work[OST:1]
> fortefs-OST0002_UUID 1264472876  393607384  870865492  31% /lustre/work[OST:2]
>
> filesystem summary:  3793418628 1491476780 2301941848  39% /lustre/work
>
> [r...@englogin01 backups]# lfs df -i
> UUID                     Inodes     IUsed      IFree IUse% Mounted on
> fortefs-MDT_UUID      497433511  33195991  464237520    6% /lustre/work[MDT:0]
> fortefs-OST_UUID       80289792  13585653   66704139   16% /lustre/work[OST:0]
> fortefs-OST0001_UUID   80289792   7014185   73275607    8% /lustre/work[OST:1]
> fortefs-OST0002_UUID   80289792   7013859   73275933    8% /lustre/work[OST:2]
>
> filesystem summary:   497433511  33195991  464237520    6% /lustre/work
>
>
> After test:
>
> [aever...@englogin01 ~]$ lfs df
> UUID                  1K-blocks       Used  Available Use% Mounted on
> fortefs-MDT_UUID     1878903960  129425104 1749478856   6% /lustre/work[MDT:0]
> fortefs-OST_UUID     1264472876  759191664  505281212  60% /lustre/work[OST:0]
> fortefs-OST0001_UUID 1264472876  395929536  868543340  31% /lustre/work[OST:1]
> fortefs-OST0002_UUID 1264472876  393392924  871079952  31% /lustre/work[OST:2]
>
> filesystem summary:  379

Re: [Lustre-discuss] Unbalanced load across OST's

2009-03-19 Thread Aaron Everett
Thanks for the reply.

File sizes are all <1GB and most files are <1MB. For a test, I copied a typical 
result set from a non-lustre mount to my lustre directory. Total size of the 
test is 42GB. I included before/after results for lfs df -i from a client. 

Before test:
[r...@englogin01 backups]# lfs df 
UUID                  1K-blocks       Used  Available Use% Mounted on
fortefs-MDT_UUID     1878903960  129326660 1749577300   6% /lustre/work[MDT:0]
fortefs-OST_UUID     1264472876  701771484  562701392  55% /lustre/work[OST:0]
fortefs-OST0001_UUID 1264472876  396097912  868374964  31% /lustre/work[OST:1]
fortefs-OST0002_UUID 1264472876  393607384  870865492  31% /lustre/work[OST:2]

filesystem summary:  3793418628 1491476780 2301941848  39% /lustre/work

[r...@englogin01 backups]# lfs df -i
UUID                     Inodes     IUsed      IFree IUse% Mounted on
fortefs-MDT_UUID      497433511  33195991  464237520    6% /lustre/work[MDT:0]
fortefs-OST_UUID       80289792  13585653   66704139   16% /lustre/work[OST:0]
fortefs-OST0001_UUID   80289792   7014185   73275607    8% /lustre/work[OST:1]
fortefs-OST0002_UUID   80289792   7013859   73275933    8% /lustre/work[OST:2]

filesystem summary:   497433511  33195991  464237520    6% /lustre/work


After test:

[aever...@englogin01 ~]$ lfs df
UUID                  1K-blocks       Used  Available Use% Mounted on
fortefs-MDT_UUID     1878903960  129425104 1749478856   6% /lustre/work[MDT:0]
fortefs-OST_UUID     1264472876  759191664  505281212  60% /lustre/work[OST:0]
fortefs-OST0001_UUID 1264472876  395929536  868543340  31% /lustre/work[OST:1]
fortefs-OST0002_UUID 1264472876  393392924  871079952  31% /lustre/work[OST:2]

filesystem summary:  3793418628 1548514124 2244904504  40% /lustre/work

[aever...@englogin01 ~]$ lfs df -i
UUID                     Inodes     IUsed      IFree IUse% Mounted on
fortefs-MDT_UUID      497511996  33298931  464213065    6% /lustre/work[MDT:0]
fortefs-OST_UUID       80289792  13665028   66624764   17% /lustre/work[OST:0]
fortefs-OST0001_UUID   80289792   7013783   73276009    8% /lustre/work[OST:1]
fortefs-OST0002_UUID   80289792   7013456   73276336    8% /lustre/work[OST:2]

filesystem summary:   497511996  33298931  464213065    6% /lustre/work






-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Brian J. Murrell
Sent: Thursday, March 19, 2009 3:13 PM
To: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Unbalanced load across OST's

On Thu, 2009-03-19 at 14:33 -0400, Aaron Everett wrote:
> Hello all,

Hi,

> We are running 1.6.6 with a shared mgs/mdt and 3 ost’s. We run a set 
> of tests that write heavily, then we review the results and delete the 
> data. Usually the load is evenly spread across all 3 ost’s. I noticed 
> this afternoon that the load does not seem to be distributed.

Striping, as well as file count and size, affects OST distribution.  Are 
any of the data involved striped?  Are you writing very few large files before 
you measure distribution?

> OST has a load of 50+ with iowait of around 10%
> 
> OST0001 has a load of <1 with >99% idle
> 
> OST0002 has a load of <1 with >99% idle

What does lfs df say before and after such a test that produces the above 
results?  Does it bear out even use amongst the OSTs before and after the test?

> df confirms the lopsided writes:

lfs df [-i] from a client is usually more illustrative of use.  As I say above, 
if you can quiesce the filesystem for the test above, do an lfs df; lfs df -i 
before the test and after.  Assuming you were successful in quiescing, you 
should see the change to the OSTs that your test effected.

> OST:
> 
> FilesystemSize  Used Avail Use% Mounted on
> 
> /dev/sdb1 1.2T  602G  544G  53% /mnt/fortefs/ost0

What's important is what it looked like before the test too.  Your test could 
have, for example, written a single object (i.e. file) of nearly 300G for all we 
can tell from what you've posted so far.

b.


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Unbalanced load across OST's

2009-03-19 Thread Aaron Everett
Hello all,

 

We are running 1.6.6 with a shared mgs/mdt and 3 ost's. We run a set of
tests that write heavily, then we review the results and delete the
data. Usually the load is evenly spread across all 3 ost's. I noticed
this afternoon that the load does not seem to be distributed.

 

OST has a load of 50+ with iowait of around 10%

OST0001 has a load of <1 with >99% idle

OST0002 has a load of <1 with >99% idle

 

From a client all 3 OST's appear online:

 

[aever...@englogin01 ~]$ lctl device_list

  0 UP mgc mgc172.16.14...@tcp 19dde65d-8eba-22b0-b618-f59bfbd36cde 5

  1 UP lov fortefs-clilov-f7cc4800 c86e2947-f2bf-5e47-541f-6ff3f13af9a0
4

  2 UP mdc fortefs-MDT-mdc-f7cc4800
c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5

  3 UP osc fortefs-OST-osc-f7cc4800
c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5

  4 UP osc fortefs-OST0001-osc-f7cc4800
c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5

  5 UP osc fortefs-OST0002-osc-f7cc4800
c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5

[aever...@englogin01 ~]$

 

From the MGS/MDT, Lustre claims to be healthy:

 

[aever...@lustrefs ~]$ cat /proc/fs/lustre/health_check 

healthy

[aever...@lustrefs ~]$

 

df confirms the lopsided writes:

 

OST:

FilesystemSize  Used Avail Use% Mounted on

/dev/sdb1 1.2T  602G  544G  53% /mnt/fortefs/ost0

 

OST0001:

FilesystemSize  Used Avail Use% Mounted on

/dev/sdb1 1.2T  317G  828G  28% /mnt/fortefs/ost0

 

OST0002:

FilesystemSize  Used Avail Use% Mounted on

/dev/sdb1 1.2T  315G  831G  28% /mnt/fortefs/ost0

 

What else should I be checking? Has the MGS/MDT lost track of OST0001
and OST0002 somehow? Clients can still read data that is on OST0001 and
OST0002. I confirmed this using lfs getstripe and cat'ing files on those
devices. If I edit the file, the file is written to OST.

 

Regards,
Aaron

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Permission denied error when mv files

2009-03-06 Thread Aaron Everett
)   = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=448, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7fc3000
read(3, "# This file controls the state o"..., 4096) = 448
close(3)= 0
munmap(0xb7fc3000, 4096)= 0
open("/proc/mounts", O_RDONLY)  = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7fc3000
read(3, "rootfs / rootfs rw 0 0\n/proc /pr"..., 1024) = 1024
close(3)= 0
munmap(0xb7fc3000, 4096)= 0
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=48508720, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7dac000
close(3)= 0
geteuid32() = 680
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
...}) = 0
open("/proc/filesystems", O_RDONLY) = 3
read(3, "nodev\tsysfs\nnodev\trootfs\nnodev\tb"..., 4095) = 326
open("/proc/self/attr/current", O_RDONLY) = 4
read(4, "user_u:system_r:unconfined_t\0", 4095) = 29
close(4)= 0
close(3)= 0
stat64("/lustre/work/aeverett/ae.lustremvtest", 0xbfe79170) = -1 ENOENT
(No such file or directory)
lstat64("/tmp/ae.test", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lstat64("/lustre/work/aeverett/ae.lustremvtest", 0xbfe79060) = -1 ENOENT
(No such file or directory)
rename("/tmp/ae.test", "/lustre/work/aeverett/ae.lustremvtest") = -1
EXDEV (Invalid cross-device link)
unlink("/lustre/work/aeverett/ae.lustremvtest") = -1 ENOENT (No such
file or directory)
lgetxattr("/tmp/ae.test", "security.selinux"...,
"user_u:object_r:tmp_t", 255) = 22
open("/proc/self/attr/fscreate", O_RDWR) = 3
write(3, "user_u:object_r:tmp_t\0", 22) = 22
close(3)= 0
open("/tmp/ae.test", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("/lustre/work/aeverett/ae.lustremvtest",
O_WRONLY|O_CREAT|O_LARGEFILE, 0100644) = -1 EACCES (Permission denied)
open("/usr/share/locale/locale.alias", O_RDONLY) = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7dab000
read(4, "# Locale name alias data base.\n#"..., 4096) = 2528
read(4, "", 4096)   = 0
close(4)= 0
munmap(0xb7dab000, 4096)= 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY)
= -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY)
= -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) =
-1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) =
-1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
write(2, "mv: ", 4mv: )     = 4
write(2, "cannot create regular file `/lus"..., 66cannot create regular
file `/lustre/work/aeverett/ae.lustremvtest') = 66
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT
(No such file or directory)
write(2, ": Permission denied", 19: Permission denied) = 19
write(2, "\n", 1
)   = 1
close(3)= 0
open("/proc/self/attr/fscreate", O_RDWR) = 3
write(3, NULL, 0)   = 0
close(3)= 0
exit_group(1)   = ?
[aever...@rdlogin02 ~]$ ls /tmp

-Original Message-
From: andreas.dil...@sun.com [mailto:andreas.dil...@sun.com] On Behalf
Of Andreas Dilger
Sent: Friday, March 06, 2009 6:05 AM
To: Aaron Everett
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Permission denied error when mv files

On Mar 05, 2009  20:01 -0500, Aaron Everett wrote:
> We are seeing a reproducible error if we try to mv a file from /tmp to
> the lustre filesystem. If we cp the same file, there is no error.
> 
> [aever...@rdlogin02 ~]$ touch /tmp/ae.test
> 
> [aever...@rdlogin02 ~]$ mv /tmp/ae.test /lustre/work/aeverett/ae.file
> 
> mv: cannot create regular file `/lustre/work/aeverett/ae.file':
> Permission denied

Run strace on the "mv" and see what syscall is returning this error.
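(e.g. something along the lines of
  strace -f -e trace=file mv /tmp/ae.test /lustre/work/aeverett/ae.file
and look at the last syscall that fails.)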

> [aever...@rdlogin02 ~]$ cp /tmp/ae.test /lustre/work/aeverett/ae.file
> 
> [aever...@rdlogin02 ~]$ ls -al /lustre/work/aeverett/ae.file 
> 
> -rw-r--r--  1 aeverett users 0 Mar  5 19:57
> /lustre/work/aeverett/ae.file


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Permission denied error when mv files

2009-03-05 Thread Aaron Everett
Hello all,

 

I am running Lustre 1.6.6 on 3 OST's and 1 MDS with 1.6.6 clients. I am
also running Sun Grid Engine and have approximately 70 client nodes, all with
multiple processors. The Lustre machines are running RHEL5
2.6.18-92.1.10.el5_lustre.1.6.6smp and clients are running RHEL4
2.6.9-67.0.22.EL_lustre.1.6.6smp.

 

I am using autofs on the clients to mount the Lustre filesystem. My
/etc/auto.lustre file contains:

 

work    -fstype=lustre  -flock  lustrefs:/lustre
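
(For context, /etc/auto.master maps /lustre to this file with a line along the
lines of "/lustre  /etc/auto.lustre", which is how the "work" key above ends up
mounted at /lustre/work.)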

 

We are seeing a reproducible error if we try to mv a file from /tmp to
the lustre filesystem. If we cp the same file, there is no error.

 

[aever...@rdlogin02 ~]$ 

[aever...@rdlogin02 ~]$ touch /tmp/ae.test

[aever...@rdlogin02 ~]$ mv /tmp/ae.test /lustre/work/aeverett/ae.file

mv: cannot create regular file `/lustre/work/aeverett/ae.file':
Permission denied

[aever...@rdlogin02 ~]$ cp /tmp/ae.test /lustre/work/aeverett/ae.file

[aever...@rdlogin02 ~]$ ls -al /lustre/work/aeverett/ae.file 

-rw-r--r--  1 aeverett users 0 Mar  5 19:57
/lustre/work/aeverett/ae.file

[aever...@rdlogin02 ~]$

 

How does writing the file during a mv differ from a cp? Should I change
my flags for the automount? Any help is appreciated!

 

Regards,

Aaron

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss