[Lustre-discuss] problem with secondary groups

2012-05-25 Thread Temple Jason
Hello,

I am running lustre 2.1.56 on the server side, and 1.8.4 (cray) on the client 
side.

I am having the classic secondary group problem, but when I enable the identity 
upcall on the MDS via

  lctl conf_param lustre-MDT.mdt.identity_upcall=/usr/sbin/l_getidentity

I still have the same permissions problem on the client.

How do I get l_getidentity to work correctly?  When I run it directly myself 
with a valid uid, via:

l_getidentity -v lustre-MDT 21135

I get no output.

However, if I run it with an invalid uid, like this:

[root@mds1 ~]# l_getidentity lustre-MDT 21133
l_getidentity[7805]: no such user 21133
l_getidentity[7805]: partial write ret -1: Invalid argument
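
For reference, a minimal sanity check on the MDS might look like the following 
(a sketch only; it uses the 2.x mdt identity_upcall/identity_flush tunables, and 
lustre-MDT0000 is a placeholder for the actual MDT name):

  # confirm the upcall is set on the running MDT
  lctl get_param mdt.*.identity_upcall

  # flush the identity cache so the next access re-runs l_getidentity
  lctl set_param mdt.lustre-MDT0000.identity_flush=-1

  # retry the upcall by hand with a uid known on the MDS
  l_getidentity -v lustre-MDT0000 21135

l_getidentity resolves users and their supplementary groups through the MDS's 
own user database (nsswitch), so the uid and its groups must be visible on the 
MDS itself.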

Any help would be appreciated.

Thanks,

Jason Temple
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-17 Thread Temple Jason
Hello,

I experimented with swap on Lustre in as many ways as possible (without 
touching the code), keeping the path to swap as short as possible, to no avail.  
The code simply cannot handle it, and the system always hung.

Without serious code rewrites, this isn't going to work for you.

-Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of John Hanks
Sent: Thursday, 18 August 2011 05:55
To: land...@scalableinformatics.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Swap over lustre

On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman wrote:
> On 08/17/2011 10:43 PM, John Hanks wrote:
> As a rule of thumb, you should try to keep the path to swap as simple as
> possible.  No memory/buffer allocations on the way to a paging event if
> you can possibly do this.

I do have a long path there, will try simplifying that and see if it helps.

> The lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk.  You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there).  You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals).  If you can limit the number of buffers they
> allocate, and then force them to allocate the buffers at startup (by
> forcing some activity to the block device, and then pin this memory so
> that they can't be ejected ...) you might have chance to do it as a
> block device.  I think SRP can do this, not sure if iSCSI initiators can
> pin buffers in ram.
>
> You might look at the swapz patches (we haven't integrated them into our
> kernel yet, but have been looking at it) to compress swap pages and
> store them ... in ram.  This may not work for you, but it could be an
> option.

I wasn't aware of swapz, that sounds really interesting. The codes
that run the nodes out of memory tend to be sequencing applications,
which seem like good candidates for memory compression.

> Is there any particular reason you can't use a local drive for this
> (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge
amount of swap, just enough to provide a place for the root filesystem
to page out of the tmpfs so we can squeeze out all the RAM possible
for applications. Since I don't expect it to get heavily used, I'm
considering running vblade on a server and carving out small aoe LUNs.
It seems logical that if a host can boot off of iscsi or aoe, that you
could have a swap space there but I've never tried it with either
protocol.
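
A rough sketch of that aoe variant, untested here (the vblade arguments and the 
/dev/etherd device naming follow the aoetools conventions; the sizes, interface 
and shelf/slot numbers are placeholders):

  # on the server: export a 4 GB backing file as AoE shelf 0, slot 1 on eth0
  dd if=/dev/zero of=/srv/swap-wn01.img bs=1M count=4096
  vblade 0 1 eth0 /srv/swap-wn01.img &

  # on the diskless client
  modprobe aoe
  mkswap /dev/etherd/e0.1
  swapon /dev/etherd/e0.1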

FWIW, mounting a file on lustre via loopback to provide a local
scratch filesystem works really well.
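
In case it helps anyone, a minimal sketch of that loopback setup (paths, size 
and filesystem type are arbitrary examples):

  # create a sparse 10 GB backing file on Lustre and carve a local fs out of it
  dd if=/dev/zero of=/lustre/scratch/$(hostname).img bs=1M count=1 seek=10239
  mkfs.ext3 -F /lustre/scratch/$(hostname).img
  mkdir -p /local_scratch
  mount -o loop /lustre/scratch/$(hostname).img /local_scratch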

jbh
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] software raid

2011-03-24 Thread Temple Jason
I believe the bias against software RAID is largely historical.  I use software 
RAID exclusively for my Lustre installations here, and have never seen any 
problem with it.  The argument used to be that dedicated RAID hardware removed 
overhead from the OS, and that software RAID took too much CPU and memory, but 
the md stack has been drastically improved since those days (over a decade ago), 
and now I see very little evidence of this being a problem.
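
As an illustration of the kind of setup I mean (a sketch only, not a 
recommendation; the device names, RAID level, chunk size and MGS NID are 
placeholders):

  # build an md RAID-10 from four disks and format it as a Lustre OST
  mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=128 /dev/sd[b-e]
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 /dev/md0
  mount -t lustre /dev/md0 /mnt/ost0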

My argument against hardware raid is that if you lose a controller, you lose 
the raid completely.

Just my 2cents.

Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Thursday, 24 March 2011 03:55
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] software raid


This has probably been asked and answered.

Is software RAID (md) still considered bad practice?

I would like to use SSD drives for an MDT, but putting fast SSDs behind a RAID
controller seems to defeat the purpose.

There was some thought that the decision not to support software RAID was
mostly about Sun/Oracle trying to sell hardware RAID.

thoughts?

-- 
Brian O'Connor
---
SGI Consulting
Email: bri...@sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax:  +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA
http://www.sgi.com/support/services
---

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] "up" a router that is marked "down"

2011-01-25 Thread Temple Jason
I've found that even with the Protocol Error, it still works.

-Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Michael Shuey
Sent: Tuesday, 25 January 2011 14:45
To: Michael Kluge
Cc: Lustre Diskussionsliste
Subject: Re: [Lustre-discuss] "up" a router that is marked "down"

You'll want to add the "dead_router_check_interval" lnet module
parameter as soon as you are able.  As near as I can tell, without
that there's no automatic check to make sure the router is alive.

I've had some success in getting machines to recognize that a router
is alive again by doing an lctl ping of their side of a router (e.g.,
on a tcp0 client, `lctl ping @tcp0`, then `lctl ping
@o2ib0` from an o2ib0 client).  If you have a server/client
version mismatch, where lctl ping returns a protocol error, you may be
out of luck.
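
A sketch of both suggestions (the interval values and NIDs below are 
placeholders; dead_router_check_interval and live_router_check_interval are 
LNET module parameters, set alongside the existing options lnet line):

  # e.g. in /etc/modprobe.conf -- extend your existing lnet options
  options lnet networks=tcp0(eth0) dead_router_check_interval=60 live_router_check_interval=60

  # after the router is back, ping its near-side NID from each network
  lctl ping 192.168.1.1@tcp0     # from a tcp0 client
  lctl ping 10.10.0.1@o2ib0      # from an o2ib0 client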

--
Mike Shuey



On Tue, Jan 25, 2011 at 8:38 AM, Michael Kluge wrote:
> Hi list,
>
> if a Lustre router is down, comes back to life and the servers do not
> actively test the routers periodically: is it possible to mark a Lustre
> router as "up"? Or to tell the servers to ping the router?
>
> Or can I enable the "router pinger" in a live system without unloading
> and loading the Lustre kernel modules?
>
>
> Regards, Michael
>
> --
>
> Michael Kluge, M.Sc.
>
> Technische Universität Dresden
> Center for Information Services and
> High Performance Computing (ZIH)
> D-01062 Dresden
> Germany
>
> Contact:
> Willersbau, Room A 208
> Phone:  (+49) 351 463-34217
> Fax:    (+49) 351 463-37773
> e-mail: michael.kl...@tu-dresden.de
> WWW:    http://www.tu-dresden.de/zih
>
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] status of lustre 2.0 on 2.6.18-194.17.1.0.1.el5 kernels

2011-01-11 Thread Temple Jason
I meant this article; I forgot to attach it:

http://feedproxy.google.com/~r/InsideHPC/~3/LI9iHNGoFZw/

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Andreas Dilger
Sent: Tuesday, 11 January 2011 21:55
To: Michael Shuey
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] status of lustre 2.0 on 2.6.18-194.17.1.0.1.el5 
kernels

The interoperability for 2.x releases has been the following for a long time:

- 1.8 clients will work with 1.8 and 2.x servers (through some version of x to 
be determined), though they may not be able to take advantage of newer features 
being added to 2.x. In some cases, the 2.x features need to be turned off until 
all of the clients have been upgraded to 2.x.
- 2.x clients will not be able to interoperate with 1.8 servers.

That means that it is necessary to upgrade the servers to 2.x before the 
clients, or at the same time. 

Also, the upgrade process of the servers from 1.8 to 2.x is "disruptive" to the 
client - the client is evicted and automatically reconnects to the 2.x server 
using the new wire protocol. Any client doing RPCs at the time of the upgrade 
will get an IO error, so running jobs on the clients need to be at least paused.
 
Cheers, Andreas

On 2011-01-11, at 13:24, Michael Shuey  wrote:

> What does that imply for sites migrating from 1.8 to 2.1?  Presumably
> some sites will have both 1.8 and 2.1 filesystems; will those sites
> need to run 2.0 on the clients to mount both FS versions concurrently?
> 
> --
> Mike Shuey
> 
> 
> 
> On Tue, Jan 11, 2011 at 3:07 PM, Andreas Dilger  wrote:
>> While 2.0 was submitted to quite heavy testing at Oracle before its
>> release, it has not been widely deployed for production at this point. All
>> of the development and maintenance effort has gone into the next release
>> (2.1), which is not released yet. I think that 2.1 will represent a much more
>> sustainable target for production usage, when it is released. Until that
>> happens, I would only recommend 2.0 for evaluation usage, and I would
>> especially recommend that sites new to Lustre stay on the tried-and-true 1.8 code base.
>> 
>> Cheers, Andreas
>> On 2011-01-11, at 12:56, Samuel Aparicio  wrote:
>> 
>> thanks for this note.
>> is lustre 2.0 regarded as stable for production?
>> Professor Samuel Aparicio BM BCh PhD FRCPath
>> Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
>> 675 West 10th, Vancouver V5Z 1L3, Canada.
>> office: +1 604 675 8200 cellphone: +1 604 762 5178: lab
>> website http://molonc.bccrc.ca
>> 
>> 
>> 
>> 
>> On Jan 7, 2011, at 5:11 PM, Colin Faber wrote:
>> 
>> Hi,
>> 
>> I've built several against 2.6.18-194.17.1.el5 kernels without problem
>> so I would think you can probably get away with 0.1 as well.
>> 
>> -cf
>> 
>> 
>> On 01/07/2011 06:05 PM, Samuel Aparicio wrote:
>> 
>> Is it known if Lustre 2.0 GA will run with 2.6.18-194.17.1.0.1.el5
>> kernels? The test matrix has only the 164 kernel as the latest tested.
>> 
>> Professor Samuel Aparicio BM BCh PhD FRCPath
>> Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
>> 675 West 10th, Vancouver V5Z 1L3, Canada.
>> office: +1 604 675 8200 cellphone: +1 604 762 5178: lab website
>> http://molonc.bccrc.ca 
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
>> 
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] status of lustre 2.0 on 2.6.18-194.17.1.0.1.el5 kernels

2011-01-11 Thread Temple Jason
And what impact do you foresee from this article?

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Andreas Dilger
Sent: Tuesday, 11 January 2011 21:55
To: Michael Shuey
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] status of lustre 2.0 on 2.6.18-194.17.1.0.1.el5 
kernels

The interoperability for 2.x releases has been the following for a long time:

- 1.8 clients will work with 1.8 and 2.x servers (through some version of x to 
be determined), though they may not be able to take advantage of newer features 
being added to 2.x. In some cases, the 2.x features need to be turned off until 
all of the clients have been upgraded to 2.x.
- 2.x clients will not be able to interoperate with 1.8 servers.

That means that it is necessary to upgrade the servers to 2.x before the 
clients, or at the same time. 

Also, the upgrade process of the servers from 1.8 to 2.x is "disruptive" to the 
client - the client is evicted and automatically reconnects to the 2.x server 
using the new wire protocol. Any client doing RPCs at the time of the upgrade 
will get an IO error, so running jobs on the clients need to be at least paused.
 
Cheers, Andreas

On 2011-01-11, at 13:24, Michael Shuey  wrote:

> What does that imply for sites migrating from 1.8 to 2.1?  Presumably
> some sites will have both 1.8 and 2.1 filesystems; will those sites
> need to run 2.0 on the clients to mount both FS versions concurrently?
> 
> --
> Mike Shuey
> 
> 
> 
> On Tue, Jan 11, 2011 at 3:07 PM, Andreas Dilger  wrote:
>> While 2.0 was submitted to quite heavy testing at Oracle before its
>> release, it has not been widely deployed for production at this point. All
>> of the development and maintenance effort has gone into the next release
>> (2.1), which is not released yet. I think that 2.1 will represent a much more
>> sustainable target for production usage, when it is released. Until that
>> happens, I would only recommend 2.0 for evaluation usage, and I would
>> especially recommend that sites new to Lustre stay on the tried-and-true 1.8 code base.
>> 
>> Cheers, Andreas
>> On 2011-01-11, at 12:56, Samuel Aparicio  wrote:
>> 
>> thanks for this note.
>> is lustre 2.0 regarded as stable for production?
>> Professor Samuel Aparicio BM BCh PhD FRCPath
>> Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
>> 675 West 10th, Vancouver V5Z 1L3, Canada.
>> office: +1 604 675 8200 cellphone: +1 604 762 5178: lab
>> website http://molonc.bccrc.ca
>> 
>> 
>> 
>> 
>> On Jan 7, 2011, at 5:11 PM, Colin Faber wrote:
>> 
>> Hi,
>> 
>> I've built several against 2.6.18-194.17.1.el5 kernels without problem
>> so I would think you can probably get away with 0.1 as well.
>> 
>> -cf
>> 
>> 
>> On 01/07/2011 06:05 PM, Samuel Aparicio wrote:
>> 
>> Is it known if Lustre 2.0 GA will run with 2.6.18-194.17.1.0.1.el5
>> kernels? The test matrix has only the 164 kernel as the latest tested.
>> 
>> Professor Samuel Aparicio BM BCh PhD FRCPath
>> Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
>> 675 West 10th, Vancouver V5Z 1L3, Canada.
>> office: +1 604 675 8200 cellphone: +1 604 762 5178: lab website
>> http://molonc.bccrc.ca 
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
>> 
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Multihome question : unable to mount lustre over tcp.

2010-12-10 Thread Temple Jason
Hi,

You need to re-run tunefs.lustre on all the servers to add the new @tcp1 NIDs.
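
Roughly, for each target (a sketch only; unmount the target first, and 
substitute your real devices and NIDs for the placeholders below -- depending 
on the existing parameters you may prefer --erase-params and re-specifying them):

  # regenerate the config logs so they carry both MGS NIDs
  tunefs.lustre --mgsnode=172.31.65.1@o2ib,172.29.2.1@tcp1 --writeconf /dev/sdX

  # repeat on every MDT and OST, then remount the MGS first, then MDT, then OSTs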

Thanks,

Jason

From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of vaibhav pol
Sent: Friday, 10 December 2010 09:36
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Multihome question : unable to mount lustre over tcp.

Dear Lustre users,
   I am using lustre file system(lustre-1.8.1)  over ib.Now i required 
lustre over ib and over Ethernet also.
I modified the modprobe.conf on client , mds(mdt),oss(ost).
I add below line in modprobe.conf.
options lnet networks=o2ib(ib0),tcp1(eth1)
I am able to  mount lustre over ib on client but not able to mount over 
Ethernet.
I am getting the following error on stdout:

mount.lustre: mount  172.29.2...@tcp1:/home at /mnt failed: No such file or 
directory
Is the MGS specification correct?
Is the file system name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

and /var/log/messages shows:

kernel: LustreError: 6943:0:(ldlm_lib.c:329:client_obd_setup()) can't add 
initial connection
kernel: LustreError: 6943:0:(obd_config.c:370:class_setup()) setup 
home-MDT-mdc-810f39c8ac00 failed (-2)
kernel: LustreError: 6943:0:(obd_config.c:1197:class_config_llog_handler()) 
Err -2 on cfg command:
kernel: Lustre:cmd=cf003 0:home-MDT-mdc  1:home-MDT_UUID  
2:172.31.65...@o2ib
kernel: LustreError: 15c-8: mgc172.29.2...@tcp1: The configuration from log 
'home-client' failed (-2). This may be the result of communication errors 
between this node and the MGS, a bad configuration, or other errors. See the 
syslog for more information.
kernel: LustreError: 6933:0:(llite_lib.c:1171:ll_fill_super()) Unable to 
process log: -2
kernel: LustreError: 6933:0:(obd_config.c:441:class_cleanup()) Device 2 not 
setup
kernel: LustreError: 6933:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Got 
rc -108 from cancel RPC: canceling anyway
kernel: LustreError: 6933:0:(ldlm_request.c:1533:ldlm_cli_cancel_list()) 
ldlm_cli_cancel_list: -108
kernel: Lustre: client 810f39c8ac00 umount complete
kernel: LustreError: 6933:0:(obd_mount.c:1997:lustre_fill_super()) Unable 
to mount  (-2)

I am able to ping the MGS from the client.

The lctl list_nids command on the MGS gives the following output:

  172.31.65...@o2ib
  172.29.2...@tcp1

I am also able to mount a Lustre client on the MGS itself over Ethernet.

The tunefs.lustre output is as follows:


1) mgs tunefs.lustre output

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target: MGS
Index:  unassigned
Lustre FS:  scratch
Mount type: ldiskfs
Flags:  0x174
  (MGS needs_index first_time update writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:


   Permanent disk data:
Target: MGS
Index:  unassigned
Lustre FS:  scratch
Mount type: ldiskfs
Flags:  0x174
  (MGS needs_index first_time update writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr

2) mds(mdt) tunefs.lustre output
 Read previous values:
Target: home-MDT
Index:  0
Lustre FS:  home
Mount type: ldiskfs
Flags:  0x1
  (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.31.65...@o2ib 
mdt.group_upcall=/usr/sbin/l_getgroups


   Permanent disk data:
Target: home-MDT
Index:  0
Lustre FS:  home
Mount type: ldiskfs
Flags:  0x1
  (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.31.65...@o2ib 
mdt.group_upcall=/usr/sbin/l_getgroups

 3) oss(ost) tunefs.lustre output
   Read previous values:
Target: home-OST0003
Index:  3
Lustre FS:  home
Mount type: ldiskfs
Flags:  0x2
  (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.31.65...@o2ib mdt.quota_type=ug


   Permanent disk data:
Target: home-OST0003
Index:  3
Lustre FS:  home
Mount type: ldiskfs
Flags:  0x2
  (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.31.65...@o2ib mdt.quota_type=ug

exiting before disk write.

I am not able to figure out what the exact problem is.


Thanks and regards
Vaibhi
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Re: [Lustre-discuss] Metadata performance question

2010-10-05 Thread Temple Jason
I believe that was the *goal* of 2.0, but unfortunately that lofty goal was not 
met; its timeline seemed to stretch from when Sun purchased Lustre until some 
time in the far future.

See here for the features available in 2.0:

http://wiki.lustre.org/index.php/Lustre_2.0_Features

-Jason Temple

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of David Noriega
Sent: Tuesday, 5 October 2010 16:54
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Metadata performance question

If I'm wrong please let me know, but my understanding of how Lustre
1.8 works is that metadata is only served from a single host, so when
there is a lot of activity the metadata server becomes a bottleneck.
But I've heard that in version 2.x we'll be able to set up multiple
machines for metadata, just as for the OSSs, and that should cut down
on the bottleneck when accessing metadata.

-- 
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] How do you monitor your lustre?

2010-09-30 Thread Temple Jason
We use ganglia with collectl.  These are the only versions I could find that 
work together this way:

Sep 30 13:35 [r...@wn125:~]# rpm -qa |grep collectl
collectl-3.4.2-5
Sep 30 13:35 [r...@wn125:~]# rpm -qa |grep ganglia
ganglia-gmond-3.1.7-1

We are quite happy with it.
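
In case it helps, the rough shape of the setup (a sketch; the 'l' subsystem 
letter comes from the collectl man page, while the gexpr export module and 
gmond port are assumptions to check against your own collectl build):

  # interactive check that collectl sees the Lustre stats
  collectl -sl

  # daemon mode, pushing samples to the local gmond every 30s (hypothetical invocation)
  collectl -sl -i 30 --export gexpr,localhost:8649 -D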

Thanks,

Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Andreas Davour
Sent: Thursday, 30 September 2010 11:47
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] How do you monitor your lustre?


I ask because the LMT project seems to be quite moribund. Is anyone else out 
there doing something?

/andreas
-- 
Systems Engineer
PDC Center for High Performance Computing
CSC School of Computer Science and Communication
KTH Royal Institute of Technology
SE-100 44 Stockholm, Sweden
Phone: 087906658
"A satellite, an earring, and a dust bunny are what made America great!"
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

2010-09-20 Thread Temple Jason
It appears that turning off statahead does indeed avoid the soft lockup bug.  
But this seems to me to be a workaround, and not a solution.
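
For reference, disabling it on a client amounts to the following (this assumes 
1.8's llite statahead_max tunable; it has to be reapplied after a remount):

  # turn statahead off for all Lustre mounts on this client
  lctl set_param llite.*.statahead_max=0

  # the same tunable is visible under /proc if you prefer:
  cat /proc/fs/lustre/llite/*/statahead_max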

Is statahead not useful for performance gains?  I am not comfortable making my 
users' jobs waste more CPU time because I have to implement a workaround 
instead of a fix.

Is there one in the works?  Nasf - does your patch solve the bug, or does it 
just disable statahead by default?

Thanks,

Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of paciu...@gmail.com
Sent: Saturday, 18 September 2010 08:13
To: rr...@whamcloud.com; peter.x.jo...@oracle.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! 
with lustre 1.8.4

No, I have disabled the statahead cache to avoid the problem.

-Original Message-
From: "Robert Read" 
Date: Sat Sep 18 04:42:18 GMT 2010
To: "Peter Jones" 
CC: "lustre-discuss@lists.lustre.org" 
Subject: Re: [Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! 
with lustre 1.8.4

Hi Peter,

Perhaps the link got mangled by your mail client? (It does have some seemingly 
unusual characters for a URL.)  My interpretation of Gabriele's reply is that 
the problem occurred even with statahead disabled, so in that case this patch 
might be worth trying. 

robert




On Sep 17, 2010, at 10:18 , Peter Jones wrote:

> The URL does not work for me, but if it is a statahead issue then 
> surely turning statahead off would be a simple workaround to avoid 
> having to apply a patch.
> 
> Fan Yong wrote:
>>  On 9/14/10 8:55 PM, Gabriele Paciucci wrote:
>> 
>>> I have the same problem; I set statahead_max to 0!!!
>>> 
>> In fact, I have made a patch for statahead related issues (including 
>> this one) against lustre-1.8, which is in inspection.
>> http://review.whamcloud.com/#change,2
>> If possible, you can try such patch.
>> 
>> Cheers,
>> --
>> Nasf
>> 
>> 
>> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] kernel: BUG: soft lockup - CPU stuck for 10s! with lustre 1.8.4

2010-09-14 Thread Temple Jason
Hello,

I have recently upgraded my lustre filesystem from 1.8.3 to 1.8.4.  The first 
day we brought our system online with the new version, we started seeing 
clients getting stuck in this soft lockup loop.  The load shoots up over 120, 
and eventually the node becomes unusable and requires a hard reset.  I've seen 
loops like this on the server side in previous lustre versions, but to have it 
happen on the client is completely new.  Here is a bit of what I see in the 
logs:

Sep 13 21:11:39 wn122 kernel: LustreError: 
27016:0:(statahead.c:289:ll_sai_entry_fini()) ASSERTION(sa_is_stopped(sai)) 
failed
Sep 13 21:11:39 wn122 kernel: LustreError: 
27016:0:(statahead.c:289:ll_sai_entry_fini()) LBUG
Sep 13 21:11:39 wn122 kernel: Pid: 27016, comm: athena.py
Sep 13 21:11:39 wn122 kernel:
Sep 13 21:11:39 wn122 kernel: Call Trace:
Sep 13 21:11:39 wn122 kernel:  [] 
libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Sep 13 21:11:39 wn122 kernel:  [] lbug_with_loc+0x7a/0xd0 
[libcfs]
Sep 13 21:11:39 wn122 kernel:  [] tracefile_init+0x0/0x110 
[libcfs]
Sep 13 21:11:39 wn122 kernel:  [] 
ll_statahead_exit+0x409/0x500 [lustre]
Sep 13 21:11:39 wn122 kernel:  [] 
default_wake_function+0x0/0xe
Sep 13 21:11:39 wn122 kernel:  [] 
ll_intent_drop_lock+0x8e/0xb0 [lustre]
Sep 13 21:11:39 wn122 kernel:  [] ll_lookup_it+0x30b/0x7c0 
[lustre]
Sep 13 21:11:39 wn122 kernel:  [] 
__ll_inode_revalidate_it+0x5bd/0x650 [lustre]
Sep 13 21:11:39 wn122 kernel:  [] 
ldlm_lock_add_to_lru+0x74/0xe0 [ptlrpc]
Sep 13 21:11:39 wn122 kernel:  [] 
ll_convert_intent+0xb1/0x170 [lustre]
Sep 13 21:11:39 wn122 kernel:  [] ll_lookup_nd+0x207/0x400 
[lustre]
Sep 13 21:11:39 wn122 kernel:  [] d_alloc+0x174/0x1a9
Sep 13 21:11:39 wn122 kernel:  [] do_lookup+0xe5/0x1e6
Sep 13 21:11:39 wn122 kernel:  [] __link_path_walk+0xa01/0xf42
Sep 13 21:11:39 wn122 kernel:  [] link_path_walk+0x5c/0xe5
Sep 13 21:11:39 wn122 kernel:  [] vfs_readdir+0x94/0xa9
Sep 13 21:11:39 wn122 kernel:  [] 
compat_sys_getdents+0xaf/0xbd
Sep 13 21:11:39 wn122 kernel:  [] do_path_lookup+0x270/0x2e8
Sep 13 21:11:39 wn122 kernel:  [] getname+0x15b/0x1c1
Sep 13 21:11:39 wn122 kernel:  [] __user_walk_fd+0x37/0x4c
Sep 13 21:11:39 wn122 kernel:  [] sys_faccessat+0xe4/0x18d
Sep 13 21:11:39 wn122 kernel:  [] vfs_readdir+0x94/0xa9
Sep 13 21:11:39 wn122 kernel:  [] 
compat_sys_getdents+0xaf/0xbd
Sep 13 21:11:39 wn122 kernel:  [] sysenter_do_call+0x1b/0x67
Sep 13 21:11:39 wn122 kernel:  [] 
dummy_inode_permission+0x0/0x3
Sep 13 21:11:39 wn122 kernel:
Sep 13 21:11:39 wn122 kernel: LustreError: dumping log to 
/tmp/lustre-log.1284405099.27016
Sep 13 21:11:44 wn122 dhclient: DHCPREQUEST on eth0 to 148.187.67.113 port 67
Sep 13 21:11:49 wn122 kernel: BUG: soft lockup - CPU#3 stuck for 10s! 
[ptlrpcd:31817]
Sep 13 21:11:49 wn122 kernel: CPU 3:
Sep 13 21:11:49 wn122 kernel: Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) 
lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) 
nfs fscache nfs_acl loc
kd sunrpc bonding(U) ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state 
iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle 
ip_tables x_tables rdma_ucm(U) ib
_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) 
ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) mlx4_vnic(U) ib_sa(U) 
mlx4_ib(U) ib_mthca(U) ib_mad(U
) ib_core(U) dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button 
battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 
i2c_core e1000e shpchp mlx4_
core(U) pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot 
dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd 
uhci_hcd ohci_hcd ehci_hcd
Sep 13 21:11:49 wn122 kernel: Pid: 31817, comm: ptlrpcd Tainted: G  
2.6.18-128.7.1.el5 #1
Sep 13 21:11:49 wn122 kernel: RIP: 0010:[]  
[] .text.lock.spinlock+0x5/0x30
Sep 13 21:11:49 wn122 kernel: RSP: 0018:8101ec177cb8  EFLAGS: 0282
Sep 13 21:11:49 wn122 kernel: RAX: 004f RBX:  RCX: 

Sep 13 21:11:49 wn122 kernel: RDX: 81035956b480 RSI: 810253c2d400 RDI: 
810552ccb500
Sep 13 21:11:49 wn122 kernel: RBP: 810192294000 R08: 5a5a5a5a5a5a5a5a R09: 
5a5a5a5a5a5a5a5a
Sep 13 21:11:49 wn122 kernel: R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: 
0038
Sep 13 21:11:49 wn122 kernel: R13: 81045b0150c0 R14: 81067fc57000 R15: 
886f5168
Sep 13 21:11:49 wn122 kernel: FS:  2b5af649d240() 
GS:81010c4c8e40() knlGS:
Sep 13 21:11:49 wn122 kernel: CS:  0010 DS:  ES:  CR0: 8005003b
Sep 13 21:11:49 wn122 kernel: CR2: 08183094 CR3: 00201000 CR4: 
06e0
Sep 13 21:11:49 wn122 kernel:
Sep 13 21:11:49 wn122 kernel: Call Trace:
Sep 13 21:11:49 wn122 kernel:  [] 
:lustre:ll_statahead_interpret+0xfc/0x5b0
Sep 13 21:11:49 wn122 kernel:  [] 
:mdc:mdc_intent_getattr_async_interpret+0x459/0x490
Sep 13 21:11:49 wn122 kernel:  [] 
:ptlrpc:ptlrp