[lustre-discuss] Lustre Client

2023-01-13 Thread Nick dan via lustre-discuss
Hi

I wanted to ask if Lustre client can be mounted as read-only client or not

Regards,
Nick Dan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[Lustre-discuss] Lustre client lockups

2008-11-04 Thread Kurt Dillen
Hello all,

We have a serious problem with lustre.  Since a few days we have
lockups on the client side.  Not all clients are having this
problem.

We are running this kernel  2.6.16-54-0.2.5_lustre.1.6.4.3smp.

The statahead disable is done on the systems.

Some more information about the environment:

- Lustre clients are all vmware virtual systems
- Lustre Farm are all vmware virtual systems

the errors I see are the following:

LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e5dca000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e519e000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e4e0a000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e86b1bc0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e79fe5c0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e70a88c0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e7081280
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc 8100e6d6d5c0
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816920, 100s ago)  [EMAIL PROTECTED] x17940/t0
o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
Lustre: lustre-OST0005-osc-8100e8551800: Connection to service
lustre-OST0005 via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0005-osc-8100e8551800: Connection restored to
service lustre-OST0005 using nid [EMAIL PROTECTED]
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816924, 100s ago)  [EMAIL PROTECTED] x19702/t0
o36->[EMAIL PROTECTED]@tcp:12 lens 1544/296 ref 1 fl
Rpc:/0/0 rc 0/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
2 previous similar messages
Lustre: lustre-MDT-mdc-8100e8551800: Connection to service
lustre-MDT via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT-mdc-8100e8551800: Connection restored to
service lustre-MDT using nid [EMAIL PROTECTED]
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816953, 100s ago)  [EMAIL PROTECTED] x20560/t0
o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
Lustre: lustre-OST0006-osc-8100e8551800: Connection to service
lustre-OST0006 via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-8100e8551800: Connection restored to
service lustre-OST0006 using nid [EMAIL PROTECTED]
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817024, 100s ago)  [EMAIL PROTECTED] x19702/t0
o36->[EMAIL PROTECTED]@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT-mdc-8100e8551800: Connection to service
lustre-MDT via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT-mdc-8100e8551800: Connection restored to
service lustre-MDT using nid [EMAIL PROTECTED]
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817053, 100s ago)  [EMAIL PROTECTED] x20724/t0
o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
Lustre: lustre-OST0006-osc-8100e8551800: Connection to service
lustre-OST0006 via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-8100e8551800: Connection restored to
service lustre-OST0006 using nid [EMAIL PROTECTED]
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817124, 100s ago)  [EMAIL PROTECTED] x19702/t0
o36->[EMAIL PROTECTED]@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT-mdc-8100e8551800: Connection to service
lustre-MDT via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT-mdc-8100e8551800: Connection restored to
service lustre-MDT using nid [EMAIL PROTECTED]
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817153, 100s ago)  [EMAIL PROTECTED] x20767/t0
o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
Lustre: lustre-OST0006-osc-8100e8551800: Connection to service
lustre-OST0006 via nid [EMAIL PROTECTED] was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-8100e8551800: Connection restored to
service lustre-OST0006 u

[Lustre-discuss] Lustre client problems

2010-04-07 Thread Lawrence Sorrillo
Has anyone seen this before?


I have a lustre client that will work well soon after reboot (giving 
300MB/sec writes over SDR infiniband to a lustre mount ) but then after 
a couple of hours the
the mount will stop working-I get hangs on files coming from particular 
OSTs. Simultaneously, other clients, built a bit differently, do not 
hang on the same OST. 

All clients with this particular build share this same malady.

This is RHEL5u3/4 with OFED 1.5 and Lustre 1.8.2.

(uname -a)
Linux host0 2.6.18-164.6.1.0.1.el5 #10 SMP Fri Mar 12 17:45:10 EST 2010 
x86_64 x86_64 x86_64 GNU/Linux


Here is what it displays (/var/log/messages ) soon after reboot and for 
initial read/writes to the lustre mount areas.

Apr  6 13:37:04 host0 kernel: Lustre: OBD class driver, 
http://www.lustre.org/
Apr  6 13:37:04 host0 kernel: Lustre: Lustre Version: 1.8.2
Apr  6 13:37:04 host0 kernel: Lustre: Build Version: 
1.8.2-20100122203014-PRISTINE-2.6.18-164.6.1.0.1.el5
Apr  6 13:37:05 host0 kernel: Lustre: Listener bound to 
ib0:172.17.3.61:987:mthca0
Apr  6 13:37:05 host0 kernel: Lustre: Register global MR array, MR size: 
0x, array size: 1
Apr  6 13:37:05 host0 kernel: Lustre: Added LNI 172.17.3...@o2ib 
[8/64/0/180]
Apr  6 13:37:05 host0 kernel: Lustre: Added LNI x.x@tcp [8/256/0/180]
Apr  6 13:37:05 host0 kernel: Lustre: Accept secure, port 988
Apr  6 13:37:06 host0 kernel: Lustre: Lustre Client File System; 
http://www.lustre.org/
Apr  6 13:37:06 host0 kernel: Lustre: mgc172.17.1...@o2ib: Reactivating 
import
Apr  6 13:37:06 host0 kernel: Lustre: Client lustre-client has started




. Everthings is fine herejust OS messages that do not pertain to lustre


Apr  6 23:45:55 host0 dhclient: DHCPACK from X.X.X.X
Apr  6 23:45:55 host0 dhclient: bound to 129.57.16.37 -- renewal in 
36986 seconds.
Apr  7 08:38:36 host0 : error getting update info: (104, 'Connection 
reset by peer')
Apr  7 09:09:30 host0 kernel: LustreError: 
5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 
9 seconds
Apr  7 09:09:30 host0 kernel: LustreError: 
5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 
172.17.1@o2ib (84)
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
12345-172.17.1@o2ib: -113
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
r...@810509419000 x1332294902650884/t0 
o400->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1 
dl 1270645802 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: Lustre: 
5312:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902650884 sent from lustre-OST0018-osc-810335e15c00 to NID 
172.17.1@o2ib 0s ago has failed due to network error (17s prior to 
deadline).
Apr  7 09:09:45 host0 kernel:   r...@810509419000 
x1332294902650884/t0 o400->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 
lens 192/384 e 0 to 1 dl 1270645802 ref 1 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: Lustre: 
lustre-OST0018-osc-810335e15c00: Connection to service 
lustre-OST0018 via nid 172.17.1@o2ib was lost; in progress 
operations using this service will wait for recovery to complete.
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
12345-172.17.1@o2ib: -113
Apr  7 09:09:45 host0 kernel: LustreError: 
5313:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
r...@8104345b2c00 x1332294902650898/t0 
o8->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 lens 368/584 e 0 to 1 dl 
1270645791 ref 2 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902650898 sent from lustre-OST0018-osc-810335e15c00 to NID 
172.17.1@o2ib 0s ago has failed due to network error (6s prior to 
deadline).
Apr  7 09:09:45 host0 kernel:   r...@8104345b2c00 
x1332294902650898/t0 o8->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 lens 
368/584 e 0 to 1 dl 1270645791 ref 1 fl Rpc:N/0/0 rc 0/0
Apr  7 09:09:45 host0 kernel: LustreError: 
5312:0:(lib-move.c:2436:LNetPut()) Skipped 1 previous similar message
Apr  7 09:09:45 host0 kernel: Lustre: 
lustre-OST0019-osc-810335e15c00: Connection to service 
lustre-OST0019 via nid 172.17.1@o2ib was lost; in progress 
operations using this service will wait for recovery to complete.
Apr  7 09:09:52 host0 kernel: Lustre: 
5314:0:(import.c:524:import_select_connection()) 
lustre-OST0018-osc-810335e15c00: tried all connections, increasing 
latency to 2s
Apr  7 09:09:59 host0 kernel: Lustre: 
5313:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
x1332294902654188 sent from lustre-OST0018-osc-810335e15c00 to NID 
172.17.1@o2ib 7s ago has timed out (7s prior to deadline).
Apr  7 09:09:59 host0 kernel:   r...@8104ff9c6c00 
x1332294902654188/t0 o8->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 

[Lustre-discuss] Lustre client bug (?)

2010-04-22 Thread Andrew Godziuk
Hi,

I'm not sure where I should report it but I couldn't find the error
text in Google so I guess it's not in bug tracker yet.

This appeared on CentOS 64-bit client under light traffic. Lustre
1.8.2 patchless client from Sun, Linux 2.6.28.10 #4 SMP, both without
custom patches. I'm not sure what more details I could supply.

mx1 kernel: LustreError:
20716:0:(statahead.c:149:ll_sai_entry_cleanup())
ASSERTION(list_empty(&entry->se_list)) failed
Message from syslogd@ at Thu Apr 22 04:31:50 2010 ...
mx1 kernel: LustreError: 20716:0:(statahead.c:149:ll_sai_entry_cleanup()) LBUG

-- 
Andrzej Godziuk
http://CloudAccess.net/
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client mount

2010-05-27 Thread Sergey Arlashin
Hi!
I have a Lustre storage named "storage" and usually I mount it with command:

mount -t lustre 192.168.15.241:/storage /mnt/lustrefs
I wonder if it is possible to mount only a particular folder from the Lustre
storage? Like nfs? I mean.. for example.. I have a folder named "folder1"
and I want to mount it on a client, so in case of nfs i do the following:
mount -t nfs nfsserver.com:/nfsstorage/folder1 /mnt/nfs
And it works!
But when I try to do this with lustre by issuing:
mount -t lustre lustreserver.com:/storage/particularfolder /mnt/lustrefs
I get the following error:

mount -t lustre lustreserver.com:/storage/particularfolder /mnt/lustrefs
failed: Invalid argument
This may have multiple causes.
Is 'storage/particularfolder' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.

I have a great deal of clients and I really don't want to attach the whole
lustre storage to each of them due to security issues.

-
WBR, Sergey Arlashin
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client error

2011-02-15 Thread Jagga Soorma
Hi Guys,

One of my clients got a hung lustre mount this morning and I saw the
following errors in my logs:

--
..snip..
Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.47@o2ib3. The ost_write operation failed with
-28
Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous
similar messages
Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.47@o2ib3. The ost_write operation failed with
-28
Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous
similar messages
Feb 15 10:16:54 reshpc116 kernel: Lustre:
6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
Feb 15 10:16:54 reshpc116 kernel: Lustre:
reshpcfs-OST0005-osc-8830175c8400: Connection to service
reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress operations
using this service will wait for recovery to complete.
Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with
-16
Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous
similar messages
Feb 15 10:16:55 reshpc116 kernel: Lustre:
6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with
-16
Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar
messages
Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with
-16
Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar
messages
Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with
-16
Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar
messages
Feb 15 10:31:43 reshpc116 kernel: Lustre:
reshpcfs-OST0005-osc-8830175c8400: Connection restored to service
reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
--

Due to disk space issues on my lustre filesystem one of the OST's were full
and I deactivated that OST this morning.  I thought that operation just puts
it in a read only state and that clients can still access the data from that
OST.  After activating this OST again the client connected again and was
okay after this.  How else would you deal with a OST that is close to 100%
full?  Is it okay to leave the OST active and the clients will know not to
write data to that OST?

Thanks,
-J
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client question

2011-05-13 Thread Zachary Beebleson

We recently had two raid rebuilds on a couple storage targets that did not go 
according to plan. The cards reported a successful rebuild in each case, but 
ldiskfs errors started showing up on the associated OSSs and the effected OSTs 
were  remounted read-only. We are planning to migrate off the data, but we've 
noticed that some clients are getting i/o errors, while others are not. As an 
example, a file that has a stripe on at least one affected OST could not be 
read on one client, i.e. I received a read-error trying to access it, while it 
was perfectly readable and apparently uncorrupted on another (I am able to 
migrate the file to healthy OSTs by copying to a new file name). The clients 
with the i/o problem see inactive devices corresponding to the read-only OSTs 
when I issue a 'lfs df', while the others without the i/o problems report the 
targets as normal. Is it just that many clients are not aware of an OST problem 
yet? I need clients with minimal I/O disruptions in order to migrate as much 
data off as possible.

A client reboot appears to awaken them to the fact that there are problems with 
the OSTs. However, I need them to be able to read the data in order to migrate 
it off. Is there a way to reconnect the clients to the problematic OSTs?

We have dd-ed copies of the OSTs to try e2fsck against them, but the results 
were not promising. The check aborted with:

--
Resize inode (re)creation failed: A block group is missing an inode 
table.Continue? yes

ext2fs_read_inode: A block group is missing an inode table while reading inode 
7 in recreate inode
e2fsck: aborted
--

Any advice would be greatly appreciated.
Zach
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client question

2011-05-13 Thread Zachary Beebleson
Kevin,

I just failed the drive and remounted. A basic 'df' hangs when it gets to 
the mount point, but /proc/fs/lustre/health_check says everything is 
healthy. 'lfs df' on a client reports the OST is active, where it was 
inactive before. However, now I'm working with a degraded volume, but it 
is raid 6. Should I try another rebuild or just proceed with the 
mirgration off of this OST asap?

Thanks,
Zach
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client question

2011-05-13 Thread Zachary Beebleson
We recently had two raid rebuilds on a couple storage targets that did not 
go according to plan. The cards reported a successful rebuild in 
each case, but ldiskfs errors started showing up on the associated OSSs 
and the effected OSTs were  remounted read-only. We are planning to 
migrate off the data, but we've noticed that some clients are getting i/o 
errors, while others are not. As an example, a file that has a stripe on 
at least one affected OST could not be read on one client, i.e. I 
received a read-error trying to access it, while it was perfectly 
readable and apparently uncorrupted on another (I am able to migrate the 
file to healthy OSTs by copying to a new file name). The clients 
with the i/o problem see inactive devices corresponding to the read-only 
OSTs when I issue a 'lfs df', while the others without the i/o problems 
report the targets as normal. Is it just that many clients are not aware 
of an OST problem yet? I need clients with minimal I/O disruptions in 
order to migrate as much data off as possible.

A client reboot appears to awaken them to the fact that there are problems 
with the OSTs. However, I need them to be able to read the data in order 
to migrate it off. Is there a way to reconnect the clients to the 
problematic OSTs?

We have dd-ed copies of the OSTs to try e2fsck against them, but the 
results were not promising. The check aborted with:

--
Resize inode (re)creation failed: A block group is missing an inode 
table.Continue? yes

ext2fs_read_inode: A block group is missing an inode table while reading 
inode 7 in recreate inode
e2fsck: aborted
--

Any advice would be greatly appreciated.
Zach
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[lustre-discuss] Lustre client modules

2020-05-26 Thread Phill Harvey-Smith

Hi all,

Can anyone tell me where to download the Lustere client modules for 
CentOS 7.8 please ?


# uname -a
Linux exec3r420 3.10.0-1127.8.2.el7.x86_64 #1 SMP Tue May 12 16:57:42 
UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)


Servers are running :

# cat /proc/fs/lustre/version
lustre: 2.8.0
kernel: patchless_client
build: 
jenkins-arch=x86_64,build_type=server,distro=el7,ib_stack=inkernel-12--PRISTINE-3.10.0-327.3.1.el7_lustre.x86_64


So will I need to also upgrade these too?

Cheers.

Phill.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client

2023-01-13 Thread Degremont, Aurelien via lustre-discuss
Did you try? :)


But the answer is yes, ‘-o ro’ is supported for client mounts

Aurélien
De : lustre-discuss  au nom de Nick 
dan via lustre-discuss 
Répondre à : Nick dan 
Date : vendredi 13 janvier 2023 à 10:48
À : "lustre-discuss@lists.lustre.org" , 
"lustre-discuss-requ...@lists.lustre.org" 
, 
"lustre-discuss-ow...@lists.lustre.org" 
Objet : [EXTERNAL] [lustre-discuss] Lustre Client


CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.


Hi

I wanted to ask if Lustre client can be mounted as read-only client or not

Regards,
Nick Dan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client

2023-01-13 Thread Nick dan via lustre-discuss
Hi

Thank you for your help
I am using postgres with lustre client, when I mount lustre client with -o
ro the postgres service is not starting
Can you help how to make the postgres client read only?

On Fri, 13 Jan 2023 at 16:34, Degremont, Aurelien 
wrote:

> Did you try? :)
>
>
>
>
>
> But the answer is yes, ‘-o ro’ is supported for client mounts
>
>
>
> Aurélien
>
> *De : *lustre-discuss  au nom de
> Nick dan via lustre-discuss 
> *Répondre à : *Nick dan 
> *Date : *vendredi 13 janvier 2023 à 10:48
> *À : *"lustre-discuss@lists.lustre.org" ,
> "lustre-discuss-requ...@lists.lustre.org" <
> lustre-discuss-requ...@lists.lustre.org>, "
> lustre-discuss-ow...@lists.lustre.org" <
> lustre-discuss-ow...@lists.lustre.org>
> *Objet : *[EXTERNAL] [lustre-discuss] Lustre Client
>
>
>
> *CAUTION*: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> Hi
>
>
>
> I wanted to ask if Lustre client can be mounted as read-only client or not
>
>
>
> Regards,
>
> Nick Dan
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client

2023-01-15 Thread Andreas Dilger via lustre-discuss
That would seem to be a Postgres problem and not Lustre?

Cheers, Andreas

On Jan 13, 2023, at 05:01, Nick dan via lustre-discuss 
 wrote:


Hi

Thank you for your help
I am using postgres with lustre client, when I mount lustre client with -o ro 
the postgres service is not starting
Can you help how to make the postgres client read only?

On Fri, 13 Jan 2023 at 16:34, Degremont, Aurelien 
mailto:degre...@amazon.fr>> wrote:
Did you try? :)


But the answer is yes, ‘-o ro’ is supported for client mounts

Aurélien
De : lustre-discuss 
mailto:lustre-discuss-boun...@lists.lustre.org>>
 au nom de Nick dan via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>>
Répondre à : Nick dan mailto:nickdan2...@gmail.com>>
Date : vendredi 13 janvier 2023 à 10:48
À : "lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>" 
mailto:lustre-discuss@lists.lustre.org>>, 
"lustre-discuss-requ...@lists.lustre.org<mailto:lustre-discuss-requ...@lists.lustre.org>"
 
mailto:lustre-discuss-requ...@lists.lustre.org>>,
 
"lustre-discuss-ow...@lists.lustre.org<mailto:lustre-discuss-ow...@lists.lustre.org>"
 
mailto:lustre-discuss-ow...@lists.lustre.org>>
Objet : [EXTERNAL] [lustre-discuss] Lustre Client


CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you can confirm the sender and know the 
content is safe.

Hi

I wanted to ask if Lustre client can be mounted as read-only client or not

Regards,
Nick Dan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lustre client server interoperability

2015-08-10 Thread Kurt Strosahl
Hello,

   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm running 
a 2.5.3 system lustre system, but have been asked by a few people about 
upgrading some of our clients to CentOS 7 (which appears to need a 2.7 or 
greater client).

w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client intel

2016-04-11 Thread Oliver Mangold
Hi,
HLRS baut ein neues Image,
gibt es was neueres als
lustre-client-2.5.41-2.6.32_573.12.1.el6.x86_64.x86_64
was man evtl einbauen sollte?
Geht um Redhat 6.7 und 7.2

Für 6.7 und 7.1 würde ich normalerweise einen 2.7er-Client empfehlen. Also, 
sofern es keine 1.8er-Server mehr gibt.

On 11.04.2016 13:20, Benedikt Schaefer wrote:

Hi,

mit 7.2 aktueller kernel hatte ich heute probleme im BM cluster
versucht hatte ich lustre-client-2.7.0-3.10.0_123.20.1.el7.x86_64.src.rpm
mit kernel 3.10.0-327.13.1.el7.x86_64 zu bauen
ging leider nicht.


Ja, ist bekannt, das geht nicht. Da braucht man einen 2.7.5x oder 2.8er oder 
auch den von IEEL3.0, der heißt 2.7.13. Erich müsste letzeren haben.


--
Dr. Oliver Mangold
System Analyst
NEC Deutschland GmbH
HPC Division
Raiffeisenstraße 14
70771 Leinfelden-Echterdingen
Germany
Phone: +49 711 78055 13
Mail: oliver.mang...@emea.nec.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lustre-client is obsoleted

2017-07-26 Thread Jon Tegner

Hi,

when trying to update clients from 2.9 to 2.10.0 (on CentOS-7) I 
received the following:


"Package lustre-client is obsoleted by lustre, trying to install 
lustre-2.10.0-1.el7.x86_64 instead"


and then the update failed (to my guessing due to the fact that 
zfs-related packages are missing on the system (at the moment I don't 
intend to use zfs) .


I managed to get past this by forcing the installation of the client, i.e.,

"yum install lustre-client-2.10.0-1.el7.x86_64.rpm"

Just curious, is lustre-client really obsoleted?

Regards,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-10 Thread Heiko Schroeter
Hello,

we have a _test_ setup for a lustre 1.6.5.1 installation with 2 Raid Systems 
(64 Bit Systems) counting for 4 OSTs with 6TB each. One combined MDS and MDT 
server (32 Bit system , for testing only).

OST lustre mkfs:
"mkfs.lustre --param="failover.mode=failout" --fsname 
scia --ost --mkfsoptions='-i 2097152 -E stride=16 -b 
4096' [EMAIL PROTECTED] /dev/sdb"
(Our files are quite large 100MB+ on the system)

Kernel: Vanilla Kernel 2.6.22.19, lustre compiled from the sources on Gentoo 
2008.0

The client mount point is /misc/testfs via automount.
The access can be done through a link from /mnt/testfs -> /misc/testfs

The following procedure hangs a client:
1) copy files to the lustre system
2) do a 'du -sh /mnt/testfs/willi' while copying
3) unmount an OST (here OST0003) while copying

The 'du' job hangs and the lustre file system cannot be acessed any longer on 
this client even from other logins. The only way to restore normal op is IMHO 
a hard reset of the machine. A reboot hangs because the filesystem is still 
active.
Other clients and there mount points are not affected as long as they do not 
access the file system with 'du' 'ls' or so.
I know that this is drastic but may happen in production by our users.

Deactivating/Reactivating or remounting the OST does not have any effect on 
the 'du' job. The 'du' job (#29665 see process list below) and the 
correpsonding lustre thread (#29694) cannot be killed manually.

This behaviour is reproducable. The OST0003 is not reactivated on the client 
side though the MDS does so. It seems that this info does not propagate to 
the client. See last lines of dmesg below.

What is the proper way (besides avoiding the use of 'du') to reactivate the 
client file system ?

Thanks and Regards
Heiko




The process list on the CLIENT:

root 29175  5026  0 08:36 ?00:00:00 sshd: laura [priv]
laura   29177 29175  0 08:36 ?00:00:01 sshd: [EMAIL PROTECTED]/0
laura   29178 29177  0 08:36 pts/000:00:00 -bash
laura   29665 29178  0 09:15 pts/000:00:03 du -sh /mnt/testfs/foo/fam/
schell   29694 2  0 09:15 ?00:00:00 [ll_sa_29665]
root 29695  4846  0 09:15 ?00:00:00 /usr/sbin/automount --timeout 
60 --pid-file /var/run/autofs.misc.pid /misc yp auto.misc


and CLIENT dmesg:
Lustre: 5361:0:(import.c:395:import_select_connection()) 
scia-OST0003-osc-8100ea24a000: tried all connections, increasing latency 
to 6s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 10 previous 
similar messages
LustreError: 11-0: an error occurred while communicating with 
[EMAIL PROTECTED] The ost_connect operation failed with -19
LustreError: Skipped 20 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) 
scia-OST0003-osc-8100ea24a000: tried all connections, increasing latency 
to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 20 previous 
similar messages
LustreError: 11-0: an error occurred while communicating with 
[EMAIL PROTECTED] The ost_connect operation failed with -19
LustreError: Skipped 24 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) 
scia-OST0003-osc-8100ea24a000: tried all connections, increasing latency 
to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 24 previous 
similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress 
operations using this service will fail.

The MDS dmesg:

Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: 
tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 10 previous 
similar messages
LustreError: 11-0: an error occurred while communicating with 
[EMAIL PROTECTED] The ost_connect operation failed with -19
LustreError: Skipped 10 previous similar messages
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: 
tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 20 previous 
similar messages
Lustre: Permanently deactivating scia-OST0003
Lustre: Setting parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 3 previous similar messages
Lustre: setting import scia-OST0003_UUID INACTIVE by administrator request
Lustre: scia-OST0003-osc.osc: set parameter active=0
Lustre: Skipped 2 previous similar messages
Lustre: scia-MDT: haven't heard from client 
9111f740-b7a7-e2ff-b672-288a66decfab (at [EMAIL PROTECTED]) in 1269 seconds. 
I think it's dead, and I am evicting it.
Lustre: Permanently reactivating scia-OST0003
Lustre: Modifying parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 1 previous similar message
Lustre: 15406:0:(import.c:395:import_select_connection()) scia-OST0003-osc: 
tried all connections, increasing latency to 51s
Lustre: 15406:0:(import.c:395:import_select_connection()) Skipped 2 previous 
similar messages
LustreError

Re: [Lustre-discuss] Lustre client lockups

2008-11-05 Thread Brian J. Murrell
On Tue, 2008-11-04 at 09:06 -0800, Kurt Dillen wrote:
> 
> Some more information about the environment:
> 
> - Lustre clients are all vmware virtual systems
> - Lustre Farm are all vmware virtual systems

Hrm.  That is a bit of a red flag right there.

> the errors I see are the following:
> 
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e5dca000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e519e000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e4e0a000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e86b1bc0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e79fe5c0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e70a88c0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e7081280
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e6d6d5c0
> LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1225816920, 100s ago)  [EMAIL PROTECTED] x17940/t0
> o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
> 0/0 rc 0/-22
> Lustre: lustre-OST0005-osc-8100e8551800: Connection to service
> lustre-OST0005 via nid [EMAIL PROTECTED] was lost; in progress
> operations using this service will wait for recovery to complete.
> Lustre: lustre-OST0005-osc-8100e8551800: Connection restored to
> service lustre-OST0005 using nid [EMAIL PROTECTED]

These are just regular timeouts with nothing really to explain them.  A
detailed log analysis of all of your server logs (not something we can
do here on lustre-discuss) might yield more but I have suspicions about
your vmware-farm set up.  Running VMs, all competing for the same host
resources makes the environment unpredictable.

I'm not sure if you are using host-only or bridged networking but my
(now quite historic) experience with running lots of vmware machines on
a single piece of hardware is that the host-only network is less than
robust and the memory rquirements of running many VMs on a single
machine are demanding.  Additionally, if you have many OSTs all sharing
the same physical disk, you will have further contention there.
Timeouts are not surprising.

I would also encourage you to try 1.6.6 now that it is out.  I would
also encourage you to get some baseline performance metrics of all of
this virtual hardware with our iokit.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client lockups

2008-11-06 Thread Andreas Dilger
On Nov 04, 2008  09:06 -0800, Kurt Dillen wrote:
> We have a serious problem with lustre.  Since a few days we have
> lockups on the client side.  Not all clients are having this
> problem.
> 
> We are running this kernel  2.6.16-54-0.2.5_lustre.1.6.4.3smp.
> 
> The statahead disable is done on the systems.
> 
> Some more information about the environment:
> 
> - Lustre clients are all vmware virtual systems
> - Lustre Farm are all vmware virtual systems
> 
> the errors I see are the following:
> 
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc 8100e5dca000
> LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1225816920, 100s ago)  [EMAIL PROTECTED] x17940/t0
> o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
> 0/0 rc 0/-22
> Lustre: lustre-OST0005-osc-8100e8551800: Connection to service
> lustre-OST0005 via nid [EMAIL PROTECTED] was lost; in progress
> operations using this service will wait for recovery to complete.

These all look like network problems.  Running production Lustre servers
inside a vmware doesn't make much sense.  We don't test clients inside
vmware, but I don't think that is nearly as bad as running the servers
in a virtual environment.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre Client Access SAN

2009-07-27 Thread Tharindu Rukshan Bamunuarachchi

this may be an stupid question. but .

 

1.   does Lustre client directly access SAN/storage (like GFS, OCFS or
Sun Cluster SVM)

2.   if client connects over network, 

a.   will TCP/IP performance directly hit on CFS

b.  cannot I keep client on same machine as OSS/MDS etc.

 

 

cheers,

__

tharindu

 



***

"The information contained in this email including in any attachment is 
confidential and is meant to be read only by the person to whom it is 
addressed. If you are not the intended recipient(s), you are prohibited from 
printing, forwarding, saving or copying this email. If you have received this 
e-mail in error, please immediately notify the sender and delete this e-mail 
and its attachments from your computer."

***___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client kernel panic

2009-09-25 Thread Nick Jennings
Hello All,

 I'm running lustre v1.6.7.2 - I've got one of each (mds, oss, client).
The client serves apache off the lustre volume.

  About an hour ago the client completely hung. Hosting co. says it was
a kernel panic. I got not useful feedback in /var/log/messages from the
client or the MDS. However from the OST I got several complaints.
(below).

 Does anyone have any insight into the problem? All help as to how I can
fix this, or avoid the problem, greatly appreciated.

Thanks,
Nick



Sep 20 04:02:03 ssn1 syslogd 1.4.1: restart.
Sep 22 18:01:29 ssn1 : error getting update info: tuple index out of
range
Sep 24 19:09:30 ssn1 auditd[10084]: Audit daemon rotating log files
Sep 25 22:31:29 ssn1 kernel: Lustre: clients-OST: haven't heard from
client eaa25af9-d0f5-8d54-1644-9cdd7f978e05 (at 10.0.0...@tcp1) in 269
seconds. I think it's dead, and I am evicting it.
Sep 25 22:31:39 ssn1 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
[ll_evictor:3714]
Sep 25 22:31:39 ssn1 kernel: CPU 2:
Sep 25 22:31:39 ssn1 kernel: Modules linked in: ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) lockd(U) obdfilter(U) fsfilt_ldiskfs(U)
ost(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) crc16(U)
autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U)
cpufreq_ondemand(U) dm_multipath(U) video(U) sbs(U) backlight(U)
i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U)
cdrom(U) pata_acpi(U) serio_raw(U) tg3(U) sg(U) pcspkr(U) dm_snapshot(U)
dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U) megaraid_sas(U)
shpchp(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U)
sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Sep 25 22:31:39 ssn1 kernel: Pid: 3714, comm: ll_evictor Tainted: G
2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1
Sep 25 22:31:39 ssn1 kernel: RIP: 0010:[]
[] _write_lock+0x7/0xf
Sep 25 22:31:39 ssn1 kernel: RSP: 0018:810213aabd88  EFLAGS:
0246
Sep 25 22:31:39 ssn1 kernel: RAX: 81012ac8a800 RBX: 2be2
RCX: 5c90
Sep 25 22:31:39 ssn1 kernel: RDX: f6ef RSI: 802f1d80
RDI: c20010a1ee2c
Sep 25 22:31:39 ssn1 kernel: RBP: 0286 R08: 810001016e60
R09: 
Sep 25 22:31:39 ssn1 kernel: R10: 81012ac85240 R11: 0150
R12: 0286
Sep 25 22:31:39 ssn1 kernel: R13: 810213aabd30 R14: 81012ac85298
R15: 81012ac85240
Sep 25 22:31:39 ssn1 kernel: FS:  2aaf1c86f220()
GS:810107b9cd40() knlGS:
Sep 25 22:31:39 ssn1 kernel: CS:  0010 DS:  ES:  CR0:
8005003b
Sep 25 22:31:39 ssn1 kernel: CR2: 1745b024 CR3: 00022275b000
CR4: 06e0
Sep 25 22:31:39 ssn1 kernel: 
Sep 25 22:31:39 ssn1 kernel: Call Trace:
Sep 25 22:31:39 ssn1 kernel:
[] :obdclass:lustre_hash_for_each_empty+0x20e/0x290
Sep 25 22:31:39 ssn1 kernel:
[] :obdclass:class_disconnect+0x378/0x400
Sep 25 22:31:39 ssn1 kernel:
[] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Sep 25 22:31:39 ssn1 kernel:
[] :obdfilter:filter_disconnect+0x36d/0x4b0
Sep 25 22:31:39 ssn1 kernel:
[] :obdclass:class_fail_export+0x384/0x4c0
Sep 25 22:31:39 ssn1 kernel:
[] :ptlrpc:ping_evictor_main+0x4f8/0x7d5
Sep 25 22:31:39 ssn1 kernel:  [] default_wake_function
+0x0/0xe
Sep 25 22:31:39 ssn1 kernel:  [] audit_syscall_exit
+0x31b/0x336
Sep 25 22:31:39 ssn1 kernel:  [] child_rip+0xa/0x11
Sep 25 22:31:39 ssn1 kernel:
[] :ptlrpc:ping_evictor_main+0x0/0x7d5
Sep 25 22:31:39 ssn1 kernel:  [] child_rip+0x0/0x11
Sep 25 22:31:39 ssn1 kernel: 
Sep 25 22:31:49 ssn1 kernel: BUG: soft lockup - CPU#2 stuck for 10s!
[ll_evictor:3714]
Sep 25 22:31:49 ssn1 kernel: CPU 2:
Sep 25 22:31:49 ssn1 kernel: Modules linked in: ipmi_devintf(U)
ipmi_si(U) ipmi_msghandler(U) lockd(U) obdfilter(U) fsfilt_ldiskfs(U)
ost(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) crc16(U)
autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U)
cpufreq_ondemand(U) dm_multipath(U) video(U) sbs(U) backlight(U)
i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U)
cdrom(U) pata_acpi(U) serio_raw(U) tg3(U) sg(U) pcspkr(U) dm_snapshot(U)
dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U) megaraid_sas(U)
shpchp(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U)
sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Sep 25 22:31:49 ssn1 kernel: Pid: 3714, comm: ll_evictor Tainted: G
2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1
Sep 25 22:31:49 ssn1 kernel: RIP: 0010:[]
[] :obdclass:lustre_hash_for_each_empty+0x20e/0x290
Sep 25 22:31:49 ssn1 kernel: RSP: 0018:810213aabd90  EFLAGS:
0246
Sep 25 22:31:49 ssn1 kernel: RAX: 81012b1a7400 RBX: 4bea
RCX: 8468
Sep 25 22:31:49 ssn1 kernel: RDX: 0368 R

Re: [Lustre-discuss] Lustre client problems

2010-04-07 Thread Lawrence Sorrillo
Also, the logs from the OST that is providing the files for which we 
have hangs are showing the following errors:

Apr  7 02:51:45 loss09 kernel: Lustre: Skipped 1 previous similar message
Apr  7 02:51:45 loss09 kernel: Lustre: lustre-OST001a: haven't heard 
from client dd7aee74-0bb9-7b4a-4c7f-d0e78fff45ef (at 172.17.0@o2ib) 
in 227 seconds. I think it's dead, and I am evicting it.
Apr  7 02:51:45 loss09 kernel: Lustre: Skipped 1 previous similar message
Apr  7 02:53:18 loss09 kernel: LustreError: 
13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) @@@ processing error 
(-107)  r...@81018021c000 x1326517357508998/t0 o400->@:0/0 lens 
192/0 e 0 to 0 dl 1270623204 ref 1 fl Interpret:H/0/0 rc -107/0
Apr  7 02:53:18 loss09 kernel: LustreError: 
13561:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 5 previous 
similar messages
Apr  7 09:12:42 loss09 kernel: Lustre: lustre-OST001a: haven't heard 
from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3...@o2ib) 
in 227 seconds. I think it's dead, and I am evicting it.
Apr  7 09:13:07 host09 kernel: Lustre: lustre-OST0018: haven't heard 
from client 6c81ad18-13bb-6455-06a2-a1f413f967e9 (at 172.17.3...@o2ib) 
in 227 seconds. I think it's dead, and I am evicting it.


172.17.3...@o2ib is the IB interface for the client experiencing the 
hang condition.

~Lawrence

Lawrence Sorrillo wrote:
> Has anyone seen this before?
>
>
> I have a lustre client that will work well soon after reboot (giving 
> 300MB/sec writes over SDR infiniband to a lustre mount ) but then after 
> a couple of hours the
> the mount will stop working-I get hangs on files coming from particular 
> OSTs. Simultaneously, other clients, built a bit differently, do not 
> hang on the same OST. 
>
> All clients with this particular build share this same malady.
>
> This is RHEL5u3/4 with OFED 1.5 and Lustre 1.8.2.
>
> (uname -a)
> Linux host0 2.6.18-164.6.1.0.1.el5 #10 SMP Fri Mar 12 17:45:10 EST 2010 
> x86_64 x86_64 x86_64 GNU/Linux
>
>
> Here is what it displays (/var/log/messages ) soon after reboot and for 
> initial read/writes to the lustre mount areas.
>
> Apr  6 13:37:04 host0 kernel: Lustre: OBD class driver, 
> http://www.lustre.org/
> Apr  6 13:37:04 host0 kernel: Lustre: Lustre Version: 1.8.2
> Apr  6 13:37:04 host0 kernel: Lustre: Build Version: 
> 1.8.2-20100122203014-PRISTINE-2.6.18-164.6.1.0.1.el5
> Apr  6 13:37:05 host0 kernel: Lustre: Listener bound to 
> ib0:172.17.3.61:987:mthca0
> Apr  6 13:37:05 host0 kernel: Lustre: Register global MR array, MR size: 
> 0x, array size: 1
> Apr  6 13:37:05 host0 kernel: Lustre: Added LNI 172.17.3...@o2ib 
> [8/64/0/180]
> Apr  6 13:37:05 host0 kernel: Lustre: Added LNI x.x@tcp [8/256/0/180]
> Apr  6 13:37:05 host0 kernel: Lustre: Accept secure, port 988
> Apr  6 13:37:06 host0 kernel: Lustre: Lustre Client File System; 
> http://www.lustre.org/
> Apr  6 13:37:06 host0 kernel: Lustre: mgc172.17.1...@o2ib: Reactivating 
> import
> Apr  6 13:37:06 host0 kernel: Lustre: Client lustre-client has started
>
>
> 
> 
> . Everthings is fine herejust OS messages that do not pertain to lustre
> 
> 
> Apr  6 23:45:55 host0 dhclient: DHCPACK from X.X.X.X
> Apr  6 23:45:55 host0 dhclient: bound to 129.57.16.37 -- renewal in 
> 36986 seconds.
> Apr  7 08:38:36 host0 : error getting update info: (104, 'Connection 
> reset by peer')
> Apr  7 09:09:30 host0 kernel: LustreError: 
> 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 
> 9 seconds
> Apr  7 09:09:30 host0 kernel: LustreError: 
> 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 
> 172.17.1@o2ib (84)
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
> 12345-172.17.1@o2ib: -113
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  
> r...@810509419000 x1332294902650884/t0 
> o400->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 lens 192/384 e 0 to 1 
> dl 1270645802 ref 2 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:45 host0 kernel: Lustre: 
> 5312:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request 
> x1332294902650884 sent from lustre-OST0018-osc-810335e15c00 to NID 
> 172.17.1@o2ib 0s ago has failed due to network error (17s prior to 
> deadline).
> Apr  7 09:09:45 host0 kernel:   r...@810509419000 
> x1332294902650884/t0 o400->lustre-ost0018_u...@172.17.1.108@o2ib:28/4 
> lens 192/384 e 0 to 1 dl 1270645802 ref 1 fl Rpc:N/0/0 rc 0/0
> Apr  7 09:09:45 host0 kernel: Lustre: 
> lustre-OST0018-osc-810335e15c00: Connection to service 
> lustre-OST0018 via nid 172.17.1@o2ib was lost; in progress 
> operations using this service will wait for recovery to complete.
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5312:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 
> 12345-172.17.1@o2ib: -113
> Apr  7 09:09:45 host0 kernel: LustreError: 
> 5313:0:(events.c:66:r

Re: [Lustre-discuss] Lustre client problems

2010-04-07 Thread Brian J. Murrell
On Wed, 2010-04-07 at 10:23 -0400, Lawrence Sorrillo wrote: 
> Apr  7 09:09:30 host0 kernel: LustreError: 
> 5270:0:(o2iblnd_cb.c:2883:kiblnd_check_txs()) Timed out tx: active_txs, 
> 9 seconds
> Apr  7 09:09:30 host0 kernel: LustreError: 
> 5270:0:(o2iblnd_cb.c:2945:kiblnd_check_conns()) Timed out RDMA with 
> 172.17.1@o2ib (84)

Your network is failing.  You need to test and fix your network.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre Client - Memory Issue

2010-04-19 Thread Jagga Soorma
Hi Guys,

My users are reporting some issues with memory on our lustre 1.8.1 clients.
It looks like when they submit a single job at a time the run time was about
4.5 minutes.  However, when they ran multiple jobs (10 or less) on a client
with 192GB of memory on a single node the run time for each job was
exceeding 3-4X the run time for the single process.  They also noticed that
the swap space kept climbing even though there was plenty of free memory on
the system.  Could this possibly be related to the lustre client?  Does it
reserve any memory that is not accessible by any other process even though
it might not be in use?

Thanks much,
-J
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client bug (?)

2010-04-22 Thread Brian J. Murrell
On Thu, 2010-04-22 at 16:36 +0200, Andrew Godziuk wrote: 
> Hi,
> 
> I'm not sure where I should report it but I couldn't find the error
> text in Google so I guess it's not in bug tracker yet.

Hrm.  I'd not be too sure that Google has indexed the entire of our
Bugzilla.  Maybe it has, but searching it directly is probably more
definitive.

> This appeared on CentOS 64-bit client under light traffic. Lustre
> 1.8.2 patchless client from Sun, Linux 2.6.28.10 #4 SMP, both without
> custom patches. I'm not sure what more details I could supply.
> 
> mx1 kernel: LustreError:
> 20716:0:(statahead.c:149:ll_sai_entry_cleanup())
> ASSERTION(list_empty(&entry->se_list)) failed
> Message from syslogd@ at Thu Apr 22 04:31:50 2010 ...
> mx1 kernel: LustreError: 20716:0:(statahead.c:149:ll_sai_entry_cleanup()) LBUG

I can't find an existing bug in our bugzilla regarding this
ASSERTION/LBUG.  ASSERTION/LBUGs are logic conditions that were
unexpected.  They are also fatal errors that need the node to rebooted
to resolve.

Can you please a file a bug in our bugzilla about this one.  Please
attach the syslog from the node that hit the LBUG.  Include a few hours
of syslog prior to the LBUG if you can.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] lustre client and kvm?

2010-05-11 Thread Janne Aho

We run into a slight misfortune, the lustrefs client kernel lacks kvm
support and it wasn't just straight to compile the kernel just by adding
the kvm patches.

Does someone have a set of patches to make kvm part of the lustre client
kernel or even know some already compiled kernels?


Thanks in advance for your replies.


-- 
Janne Aho | City Network Hosting AB
Developer
Phone: +46 455 69 00 22
Cell.: +46 733 31 27 75
EMail: ja...@citynetwork.se

ICQ: 567311547 | Skype: janne_mz | AIM: janne4cn
Gadu: 16275665 | MSN: ja...@citynetwork.se

www.citynetwork.se | www.box.se
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client mount

2010-05-27 Thread Brian J. Murrell
On Thu, 2010-05-27 at 19:18 +0400, Sergey Arlashin wrote:
> Hi!
> I have a Lustre storage named "storage" and usually I mount it with
> command: 
> mount -t lustre 192.168.15.241:/storage /mnt/lustrefs
> I wonder if it is possible to mount only a particular folder from the
> Lustre storage?

No.  This is not possible with Lustre.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client mount

2010-05-27 Thread Kevin Van Maren
Brian J. Murrell wrote:
> On Thu, 2010-05-27 at 19:18 +0400, Sergey Arlashin wrote:
>   
>> Hi!
>> I have a Lustre storage named "storage" and usually I mount it with
>> command: 
>> mount -t lustre 192.168.15.241:/storage /mnt/lustrefs
>> I wonder if it is possible to mount only a particular folder from the
>> Lustre storage?
>> 
>
> No.  This is not possible with Lustre.
>
> b.
>   

But once you have the filesystem mounted, you can use a bind mount to 
mount the folder at another location, as in:
# mount --bind /mnt/lustrefs/folder1 /mnt/folder1

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre client mount error

2010-11-19 Thread Chakravarthy N
 List,

I'm using lustre 2.0.

While i try to mount the client, i'm getting an error "File: Exists". It was
working all the time before. After starting the heartbeat service for
failover in OSS, some of the OST's are unmounted and mounted again.

After that the client mount didn't work at all I've also attached the log
below for your reference.. Please help me to sort it out...



Logs Errors as below...

0001:0008:1.0:1290171315.880469:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@8100cbbdc000x1352815423278325/t0(0) o400->
scratch-ost000c_u...@10.2.2.123@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171322 ref 1 fl Rpc:RN// rc 0/-1
0100:0008:0.0:1290171320.090304:0:8516:0:(service.c:812:ptlrpc_update_export_timer())
updating export 5230a9c8-3964-c9bf-3aea-bb55a71841b1 at 1290171320 exp
81011fa82000
0100:0008:4.0:1290171321.394052:0:8517:0:(service.c:812:ptlrpc_update_export_timer())
updating export 2495578b-e42c-fda4-b815-2794fdad8a74 at 1290171321 exp
81011db43c00
0100:0008:0.0:1290171329.382916:0:8515:0:(service.c:812:ptlrpc_update_export_timer())
updating export c83b509f-7428-2191-423f-fdf85f669911 at 1290171329 exp
81011db43000
0100:0008:4.0:1290171334.165131:0:8516:0:(service.c:812:ptlrpc_update_export_timer())
updating export 2854a4ed-5606-a530-24a4-7266bd6de2a6 at 1290171334 exp
8100d611f400
0100:0008:0.0:1290171338.342859:0:8517:0:(service.c:812:ptlrpc_update_export_timer())
updating export 61909d2c-8006-4a3c-99d7-f71edcc27acb at 1290171338 exp
8100d8013c00
0100:0008:4.0:1290171340.879687:0:8515:0:(service.c:812:ptlrpc_update_export_timer())
updating export cf565092-b218-a731-febc-ebdfaed0aed8 at 1290171340 exp
810215098000
0001:0008:1.0:1290171340.880052:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b4ex1352815423278339/t0(0) o400->
scratch-ost0002_u...@10.2.2.122@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880065:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b4e0400x1352815423278340/t0(0) o400->
scratch-ost0004_u...@10.2.2.122@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880074:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b4e0800x1352815423278341/t0(0) o400->
scratch-ost0003_u...@10.2.2.122@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880083:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b6c7800x1352815423278345/t0(0) o400->
scratch-ost0008_u...@10.2.2.124@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880092:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b6c7c00x1352815423278346/t0(0) o400->
scratch-ost000a_u...@10.2.2.124@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880100:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011eea5000x1352815423278347/t0(0) o400->
scratch-ost000b_u...@10.2.2.124@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880123:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b781000x1352815423278336/t0(0) o400->
scratch-ost_u...@10.2.2.121@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880132:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b781c00x1352815423278337/t0(0) o400->
scratch-ost0001_u...@10.2.2.121@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880141:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b781400x1352815423278338/t0(0) o400->
scratch-ost0005_u...@10.2.2.121@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1
0001:0008:1.0:1290171340.880355:0:6554:0:(ldlm_request.c:1201:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 418116)
r...@81011b4e0c00x1352815423278342/t0(0) o400->
scratch-ost0006_u...@10.2.2.123@o2ib:28/4 lens 192/192 e 0 to 0 dl
1290171347 ref 1 fl Rpc:RN// rc 0/-1

Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Bob Ball
You can deactivate it on the MDT, that will make it RO, but leave it 
alone on the clients so they can still access files from it.


bob

On 2/15/2011 1:57 PM, Jagga Soorma wrote:

Hi Guys,

One of my clients got a hung lustre mount this morning and I saw the 
following errors in my logs:


--
..snip..
Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.0.250.47@o2ib3. The ost_write operation 
failed with -28
Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 
previous similar messages
Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.0.250.47@o2ib3. The ost_write operation 
failed with -28
Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 
previous similar messages
Feb 15 10:16:54 reshpc116 kernel: Lustre: 
6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to 
NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
Feb 15 10:16:54 reshpc116 kernel: Lustre: 
reshpcfs-OST0005-osc-8830175c8400: Connection to service 
reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress 
operations using this service will wait for recovery to complete.
Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.0.250.47@o2ib3. The ost_connect operation 
failed with -16
Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 
previous similar messages
Feb 15 10:16:55 reshpc116 kernel: Lustre: 
6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to 
NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.0.250.47@o2ib3. The ost_connect operation 
failed with -16
Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous 
similar messages
Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.0.250.47@o2ib3. The ost_connect operation 
failed with -16
Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous 
similar messages
Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.0.250.47@o2ib3. The ost_connect operation 
failed with -16
Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous 
similar messages
Feb 15 10:31:43 reshpc116 kernel: Lustre: 
reshpcfs-OST0005-osc-8830175c8400: Connection restored to service 
reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.

--

Due to disk space issues on my lustre filesystem one of the OST's were 
full and I deactivated that OST this morning.  I thought that 
operation just puts it in a read only state and that clients can still 
access the data from that OST.  After activating this OST again the 
client connected again and was okay after this.  How else would you 
deal with a OST that is close to 100% full?  Is it okay to leave the 
OST active and the clients will know not to write data to that OST?


Thanks,
-J


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Cliff White
Client situation depends on where you deactivated the OST - if you
deactivate on the MDS only,
clients should be able to read.

What is best to do when an OST fills up really depends on what else you are
doing at the time, and
how much control you have over what the clients are doing and other things.
If you can solve the space issue with a quick rm -rf, best to leave it
online, likewise if all your clients are
trying to bang on it and failing, best to turn things off. YMMV

cliffw

On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma  wrote:

> Hi Guys,
>
> One of my clients got a hung lustre mount this morning and I saw the
> following errors in my logs:
>
> --
> ..snip..
> Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_write operation failed
> with -28
> Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous
> similar messages
> Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_write operation failed
> with -28
> Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous
> similar messages
> Feb 15 10:16:54 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> Feb 15 10:16:54 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-8830175c8400: Connection to service
> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
> operations using this service will wait for recovery to complete.
> Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous
> similar messages
> Feb 15 10:16:55 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar
> messages
> Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar
> messages
> Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar
> messages
> Feb 15 10:31:43 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-8830175c8400: Connection restored to service
> reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
> --
>
> Due to disk space issues on my lustre filesystem one of the OST's were full
> and I deactivated that OST this morning.  I thought that operation just puts
> it in a read only state and that clients can still access the data from that
> OST.  After activating this OST again the client connected again and was
> okay after this.  How else would you deal with a OST that is close to 100%
> full?  Is it okay to leave the OST active and the clients will know not to
> write data to that OST?
>
> Thanks,
> -J
>
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Andreas Dilger
On 2011-02-15, at 12:20, Cliff White wrote:
> Client situation depends on where you deactivated the OST - if you deactivate 
> on the MDS only, clients should be able to read. 
> 
> What is best to do when an OST fills up really depends on what else you are 
> doing at the time, and how much control you have over what the clients are 
> doing and other things.  If you can solve the space issue with a quick rm 
> -rf, best to leave it online, likewise if all your clients are trying to bang 
> on it and failing, best to turn things off. YMMV

In theory, with 1.8 the full OST should be skipped for new object allocations, 
but this is not robust in the face of e.g. a single very large file being 
written to the OST that takes it from "average" usage to being full.

> On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma  wrote:
> Hi Guys,
> 
> One of my clients got a hung lustre mount this morning and I saw the 
> following errors in my logs:
> 
> --
> ..snip..
> Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
> Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous 
> similar messages
> Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
> Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous 
> similar messages
> Feb 15 10:16:54 reshpc116 kernel: Lustre: 
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
> x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID 
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> Feb 15 10:16:54 reshpc116 kernel: Lustre: 
> reshpcfs-OST0005-osc-8830175c8400: Connection to service reshpcfs-OST0005 
> via nid 10.0.250.47@o2ib3 was lost; in progress operations using this service 
> will wait for recovery to complete.
> Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with 
> -16
> Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous 
> similar messages
> Feb 15 10:16:55 reshpc116 kernel: Lustre: 
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
> x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID 
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with 
> -16
> Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar 
> messages
> Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with 
> -16
> Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar 
> messages
> Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with 
> -16
> Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar 
> messages
> Feb 15 10:31:43 reshpc116 kernel: Lustre: 
> reshpcfs-OST0005-osc-8830175c8400: Connection restored to service 
> reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
> --
> 
> Due to disk space issues on my lustre filesystem one of the OST's were full 
> and I deactivated that OST this morning.  I thought that operation just puts 
> it in a read only state and that clients can still access the data from that 
> OST.  After activating this OST again the client connected again and was okay 
> after this.  How else would you deal with a OST that is close to 100% full?  
> Is it okay to leave the OST active and the clients will know not to write 
> data to that OST?
> 
> Thanks,
> -J
> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Jagga Soorma
I did deactivate this OST on the MDS server.  So how would I deal with a OST
filling up?  The OST's don't seem to be filling up evenly either.  How does
lustre handle a OST that is at 100%?  Would it not use this specific OST for
writes if there are other OST available with capacity?

Thanks,
-J

On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger wrote:

> On 2011-02-15, at 12:20, Cliff White wrote:
> > Client situation depends on where you deactivated the OST - if you
> deactivate on the MDS only, clients should be able to read.
> >
> > What is best to do when an OST fills up really depends on what else you
> are doing at the time, and how much control you have over what the clients
> are doing and other things.  If you can solve the space issue with a quick
> rm -rf, best to leave it online, likewise if all your clients are trying to
> bang on it and failing, best to turn things off. YMMV
>
> In theory, with 1.8 the full OST should be skipped for new object
> allocations, but this is not robust in the face of e.g. a single very large
> file being written to the OST that takes it from "average" usage to being
> full.
>
> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma 
> wrote:
> > Hi Guys,
> >
> > One of my clients got a hung lustre mount this morning and I saw the
> following errors in my logs:
> >
> > --
> > ..snip..
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_write operation failed
> with -28
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous
> similar messages
> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_write operation failed
> with -28
> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous
> similar messages
> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-8830175c8400: Connection to service
> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
> operations using this service will wait for recovery to complete.
> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous
> similar messages
> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
> similar messages
> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
> similar messages
> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
> similar messages
> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-8830175c8400: Connection restored to service
> reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
> > --
> >
> > Due to disk space issues on my lustre filesystem one of the OST's were
> full and I deactivated that OST this morning.  I thought that operation just
> puts it in a read only state and that clients can still access the data from
> that OST.  After activating this OST again the client connected again and
> was okay after this.  How else would you deal with a OST that is close to
> 100% full?  Is it okay to leave the OST active and the clients will know not
> to write data to that OST?
> >
> > Thanks,
> > -J
> >
> > ___
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> >
> > ___
> > Lustre-discuss mailing list
> > Lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.
>
>
>
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Jagga Soorma
Also, it looks like the client is reporting a different %used compared to
the oss server itself:

client:
reshpc101:~ # lfs df -h | grep -i 0007
reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84% /reshpcfs[OST:7]

oss:
/dev/mapper/mpath72.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7

Here is how the data seems to be distributed on one of the OSS's:
--
/dev/mapper/mpath52.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
/dev/mapper/mpath62.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
/dev/mapper/mpath72.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
/dev/mapper/mpath82.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
/dev/mapper/mpath92.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
--

-J

On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma  wrote:

> I did deactivate this OST on the MDS server.  So how would I deal with a
> OST filling up?  The OST's don't seem to be filling up evenly either.  How
> does lustre handle a OST that is at 100%?  Would it not use this specific
> OST for writes if there are other OST available with capacity?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger wrote:
>
>> On 2011-02-15, at 12:20, Cliff White wrote:
>> > Client situation depends on where you deactivated the OST - if you
>> deactivate on the MDS only, clients should be able to read.
>> >
>> > What is best to do when an OST fills up really depends on what else you
>> are doing at the time, and how much control you have over what the clients
>> are doing and other things.  If you can solve the space issue with a quick
>> rm -rf, best to leave it online, likewise if all your clients are trying to
>> bang on it and failing, best to turn things off. YMMV
>>
>> In theory, with 1.8 the full OST should be skipped for new object
>> allocations, but this is not robust in the face of e.g. a single very large
>> file being written to the OST that takes it from "average" usage to being
>> full.
>>
>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma 
>> wrote:
>> > Hi Guys,
>> >
>> > One of my clients got a hung lustre mount this morning and I saw the
>> following errors in my logs:
>> >
>> > --
>> > ..snip..
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
>> while communicating with 10.0.250.47@o2ib3. The ost_write operation
>> failed with -28
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous
>> similar messages
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
>> while communicating with 10.0.250.47@o2ib3. The ost_write operation
>> failed with -28
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous
>> similar messages
>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>> x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
>> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>> reshpcfs-OST0005-osc-8830175c8400: Connection to service
>> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
>> operations using this service will wait for recovery to complete.
>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>> failed with -16
>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous
>> similar messages
>> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>> x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
>> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>> failed with -16
>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
>> similar messages
>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred
>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>> failed with -16
>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
>> similar messages
>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred
>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>> failed with -16
>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
>> similar messages
>> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
>> reshpcfs-OST0005-osc-8830175c8400: Connection restored to service
>> reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
>> > --
>> >
>> > Due to disk space issues on my lustre filesystem one of the OST's were
>> full and I deactivated that OST this morning.  I thought that operation just
>> puts it in a read only state and that clients can still access the data from
>> that OST.  After activating this OST again the client connected again and

Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Jagga Soorma
I might be looking at the wrong OST.  What is the best way to map the actual
/dev/mapper/mpath[X] to what OST ID is used for that volume?

Thanks,
-J

On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma  wrote:

> Also, it looks like the client is reporting a different %used compared to
> the oss server itself:
>
> client:
> reshpc101:~ # lfs df -h | grep -i 0007
> reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84% /reshpcfs[OST:7]
>
> oss:
> /dev/mapper/mpath72.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7
>
> Here is how the data seems to be distributed on one of the OSS's:
> --
> /dev/mapper/mpath52.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
> /dev/mapper/mpath62.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
> /dev/mapper/mpath72.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
> /dev/mapper/mpath82.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
> /dev/mapper/mpath92.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
> --
>
> -J
>
>
> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma  wrote:
>
>> I did deactivate this OST on the MDS server.  So how would I deal with a
>> OST filling up?  The OST's don't seem to be filling up evenly either.  How
>> does lustre handle a OST that is at 100%?  Would it not use this specific
>> OST for writes if there are other OST available with capacity?
>>
>> Thanks,
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger 
>> wrote:
>>
>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>> > Client situation depends on where you deactivated the OST - if you
>>> deactivate on the MDS only, clients should be able to read.
>>> >
>>> > What is best to do when an OST fills up really depends on what else you
>>> are doing at the time, and how much control you have over what the clients
>>> are doing and other things.  If you can solve the space issue with a quick
>>> rm -rf, best to leave it online, likewise if all your clients are trying to
>>> bang on it and failing, best to turn things off. YMMV
>>>
>>> In theory, with 1.8 the full OST should be skipped for new object
>>> allocations, but this is not robust in the face of e.g. a single very large
>>> file being written to the OST that takes it from "average" usage to being
>>> full.
>>>
>>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma 
>>> wrote:
>>> > Hi Guys,
>>> >
>>> > One of my clients got a hung lustre mount this morning and I saw the
>>> following errors in my logs:
>>> >
>>> > --
>>> > ..snip..
>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.0.250.47@o2ib3. The ost_write operation
>>> failed with -28
>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous
>>> similar messages
>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.0.250.47@o2ib3. The ost_write operation
>>> failed with -28
>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous
>>> similar messages
>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>> x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
>>> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>> reshpcfs-OST0005-osc-8830175c8400: Connection to service
>>> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
>>> operations using this service will wait for recovery to complete.
>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>>> failed with -16
>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous
>>> similar messages
>>> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>> x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
>>> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>>> failed with -16
>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
>>> similar messages
>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>>> failed with -16
>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
>>> similar messages
>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
>>> failed with -16
>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
>>> similar messages
>>> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
>>> reshpcfs-OST0005-osc-8830175c8400: Connection restored to service
>>> reshpcfs-OST0005 using nid 10.0.250.47@

Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Jagga Soorma
This OST is 100% now with only 12GB remaining and something is actively
writing to this volume.  What would be the appropriate thing to do in this
scenario?  If I set this to read only on the mds then some of my clients
start hanging up.

Should I be running "lfs find -O OST_UID /lustre" and then move the files
out of this filesystem and re-add them back?  But then there is no gurantee
that they will not be written to this specific OST.

Any help would be greately appreciated.

Thanks,
-J

On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma  wrote:

> I might be looking at the wrong OST.  What is the best way to map the
> actual /dev/mapper/mpath[X] to what OST ID is used for that volume?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma  wrote:
>
>> Also, it looks like the client is reporting a different %used compared to
>> the oss server itself:
>>
>> client:
>> reshpc101:~ # lfs df -h | grep -i 0007
>> reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84% /reshpcfs[OST:7]
>>
>> oss:
>> /dev/mapper/mpath72.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7
>>
>> Here is how the data seems to be distributed on one of the OSS's:
>> --
>> /dev/mapper/mpath52.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
>> /dev/mapper/mpath62.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
>> /dev/mapper/mpath72.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
>> /dev/mapper/mpath82.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
>> /dev/mapper/mpath92.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
>> --
>>
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma  wrote:
>>
>>> I did deactivate this OST on the MDS server.  So how would I deal with a
>>> OST filling up?  The OST's don't seem to be filling up evenly either.  How
>>> does lustre handle a OST that is at 100%?  Would it not use this specific
>>> OST for writes if there are other OST available with capacity?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger 
>>> wrote:
>>>
 On 2011-02-15, at 12:20, Cliff White wrote:
 > Client situation depends on where you deactivated the OST - if you
 deactivate on the MDS only, clients should be able to read.
 >
 > What is best to do when an OST fills up really depends on what else
 you are doing at the time, and how much control you have over what the
 clients are doing and other things.  If you can solve the space issue with 
 a
 quick rm -rf, best to leave it online, likewise if all your clients are
 trying to bang on it and failing, best to turn things off. YMMV

 In theory, with 1.8 the full OST should be skipped for new object
 allocations, but this is not robust in the face of e.g. a single very large
 file being written to the OST that takes it from "average" usage to being
 full.

 > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma 
 wrote:
 > Hi Guys,
 >
 > One of my clients got a hung lustre mount this morning and I saw the
 following errors in my logs:
 >
 > --
 > ..snip..
 > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
 while communicating with 10.0.250.47@o2ib3. The ost_write operation
 failed with -28
 > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836
 previous similar messages
 > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
 while communicating with 10.0.250.47@o2ib3. The ost_write operation
 failed with -28
 > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141
 previous similar messages
 > Feb 15 10:16:54 reshpc116 kernel: Lustre:
 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
 x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
 > Feb 15 10:16:54 reshpc116 kernel: Lustre:
 reshpcfs-OST0005-osc-8830175c8400: Connection to service
 reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
 operations using this service will wait for recovery to complete.
 > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
 while communicating with 10.0.250.47@o2ib3. The ost_connect operation
 failed with -16
 > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779
 previous similar messages
 > Feb 15 10:16:55 reshpc116 kernel: Lustre:
 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
 x1360125198261947 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
 > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
 while communicating with 10.0.250.47@o2ib3. The ost_connect operation
 failed with -16
 > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
 similar messages
 > Feb 15 10:20:45 reshpc116

Re: [Lustre-discuss] Lustre client error

2011-02-15 Thread Cliff White
you can use lfs find or lfs getstripe to identify where files are.
If you move the files out and move them back, the QOS policy should
re-distribute them evenly, but it very much depends. If you have clients
using a stripe count of 1,
a single large file can fill up one OST.
df on the client reports space for the entire filesystem, df on the OSS
reports space for the targets
attached to that server, so yes the results will be different.
cliffw

On Tue, Feb 15, 2011 at 4:09 PM, Jagga Soorma  wrote:

> This OST is 100% now with only 12GB remaining and something is actively
> writing to this volume.  What would be the appropriate thing to do in this
> scenario?  If I set this to read only on the mds then some of my clients
> start hanging up.
>
> Should I be running "lfs find -O OST_UID /lustre" and then move the files
> out of this filesystem and re-add them back?  But then there is no gurantee
> that they will not be written to this specific OST.
>
> Any help would be greately appreciated.
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma  wrote:
>
>> I might be looking at the wrong OST.  What is the best way to map the
>> actual /dev/mapper/mpath[X] to what OST ID is used for that volume?
>>
>> Thanks,
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma  wrote:
>>
>>> Also, it looks like the client is reporting a different %used compared to
>>> the oss server itself:
>>>
>>> client:
>>> reshpc101:~ # lfs df -h | grep -i 0007
>>> reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84%
>>> /reshpcfs[OST:7]
>>>
>>> oss:
>>> /dev/mapper/mpath72.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7
>>>
>>> Here is how the data seems to be distributed on one of the OSS's:
>>> --
>>> /dev/mapper/mpath52.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
>>> /dev/mapper/mpath62.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
>>> /dev/mapper/mpath72.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
>>> /dev/mapper/mpath82.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
>>> /dev/mapper/mpath92.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
>>> --
>>>
>>> -J
>>>
>>>
>>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma  wrote:
>>>
 I did deactivate this OST on the MDS server.  So how would I deal with a
 OST filling up?  The OST's don't seem to be filling up evenly either.  How
 does lustre handle a OST that is at 100%?  Would it not use this specific
 OST for writes if there are other OST available with capacity?

 Thanks,
 -J


 On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger >>> > wrote:

> On 2011-02-15, at 12:20, Cliff White wrote:
> > Client situation depends on where you deactivated the OST - if you
> deactivate on the MDS only, clients should be able to read.
> >
> > What is best to do when an OST fills up really depends on what else
> you are doing at the time, and how much control you have over what the
> clients are doing and other things.  If you can solve the space issue 
> with a
> quick rm -rf, best to leave it online, likewise if all your clients are
> trying to bang on it and failing, best to turn things off. YMMV
>
> In theory, with 1.8 the full OST should be skipped for new object
> allocations, but this is not robust in the face of e.g. a single very 
> large
> file being written to the OST that takes it from "average" usage to being
> full.
>
> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma 
> wrote:
> > Hi Guys,
> >
> > One of my clients got a hung lustre mount this morning and I saw the
> following errors in my logs:
> >
> > --
> > ..snip..
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error
> occurred while communicating with 10.0.250.47@o2ib3. The ost_write
> operation failed with -28
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836
> previous similar messages
> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error
> occurred while communicating with 10.0.250.47@o2ib3. The ost_write
> operation failed with -28
> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141
> previous similar messages
> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261945 sent from reshpcfs-OST0005-osc-8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-8830175c8400: Connection to service
> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
> operations using this service will wait for recovery to complete.
> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error
> occurred while communicating with 10.0.250.47@o2ib3. The ost_connect
> operation failed with -16
> > Feb 15 10:16:54 reshpc116 ker

Re: [Lustre-discuss] Lustre client error

2011-02-16 Thread Jagga Soorma
Another thing that I just noticed is that after deactivating a OST on the
MDS, I am no longer able to check the quota's for users.  Here is the
message I receive:

--
Disk quotas for user testuser (uid 17229):
 Filesystem  kbytes   quota   limit   grace   files   quota   limit
grace
  /lustre [0] [0] [0] [0] [0] [0]

Some errors happened when getting quota info. Some devices may be not
working or deactivated. The data in "[]" is inaccurate.
--

Is this normal and expected?  Or am I missing something here?

Thanks for all your support.  It is much appreciated.

Regards,
-J

On Tue, Feb 15, 2011 at 4:25 PM, Cliff White  wrote:

> you can use lfs find or lfs getstripe to identify where files are.
> If you move the files out and move them back, the QOS policy should
> re-distribute them evenly, but it very much depends. If you have clients
> using a stripe count of 1,
> a single large file can fill up one OST.
> df on the client reports space for the entire filesystem, df on the OSS
> reports space for the targets
> attached to that server, so yes the results will be different.
> cliffw
>
>
> On Tue, Feb 15, 2011 at 4:09 PM, Jagga Soorma  wrote:
>
>> This OST is 100% now with only 12GB remaining and something is actively
>> writing to this volume.  What would be the appropriate thing to do in this
>> scenario?  If I set this to read only on the mds then some of my clients
>> start hanging up.
>>
>> Should I be running "lfs find -O OST_UID /lustre" and then move the files
>> out of this filesystem and re-add them back?  But then there is no gurantee
>> that they will not be written to this specific OST.
>>
>> Any help would be greately appreciated.
>>
>> Thanks,
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma  wrote:
>>
>>> I might be looking at the wrong OST.  What is the best way to map the
>>> actual /dev/mapper/mpath[X] to what OST ID is used for that volume?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma  wrote:
>>>
 Also, it looks like the client is reporting a different %used compared
 to the oss server itself:

 client:
 reshpc101:~ # lfs df -h | grep -i 0007
 reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84%
 /reshpcfs[OST:7]

 oss:
 /dev/mapper/mpath72.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7

 Here is how the data seems to be distributed on one of the OSS's:
 --
 /dev/mapper/mpath52.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
 /dev/mapper/mpath62.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
 /dev/mapper/mpath72.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
 /dev/mapper/mpath82.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
 /dev/mapper/mpath92.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
 --

 -J


 On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma wrote:

> I did deactivate this OST on the MDS server.  So how would I deal with
> a OST filling up?  The OST's don't seem to be filling up evenly either.  
> How
> does lustre handle a OST that is at 100%?  Would it not use this specific
> OST for writes if there are other OST available with capacity?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <
> adil...@whamcloud.com> wrote:
>
>> On 2011-02-15, at 12:20, Cliff White wrote:
>> > Client situation depends on where you deactivated the OST - if you
>> deactivate on the MDS only, clients should be able to read.
>> >
>> > What is best to do when an OST fills up really depends on what else
>> you are doing at the time, and how much control you have over what the
>> clients are doing and other things.  If you can solve the space issue 
>> with a
>> quick rm -rf, best to leave it online, likewise if all your clients are
>> trying to bang on it and failing, best to turn things off. YMMV
>>
>> In theory, with 1.8 the full OST should be skipped for new object
>> allocations, but this is not robust in the face of e.g. a single very 
>> large
>> file being written to the OST that takes it from "average" usage to being
>> full.
>>
>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma 
>> wrote:
>> > Hi Guys,
>> >
>> > One of my clients got a hung lustre mount this morning and I saw the
>> following errors in my logs:
>> >
>> > --
>> > ..snip..
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error
>> occurred while communicating with 10.0.250.47@o2ib3. The ost_write
>> operation failed with -28
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836
>> previous similar messages
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error
>> occurred while communicating with 10.0.250.47@o2ib3. The ost_write
>> operation failed with -28
>> > Feb 15 09:48:07 

Re: [Lustre-discuss] Lustre client error

2011-02-17 Thread Kevin Van Maren
To figure out which OST is which, use "e2label /dev/sdX" (or "e2label 
/dev/mapper/mpath7") which will print the OST index in hex.

If clients run out of space, but there is space left, see Bug 22755 
(mostly fixed in Lustre 1.8.4).

Lustre assigns the OST index at file creation time.  Lustre will avoid 
full OSTs, but once a file is created any growth must be accommodated by 
the initial OST assignment(s).  Deactivating the OST on the MDS will 
prevent new allocations, but they shouldn't be happening anyway.

You can copy/rename some large files to put them on another OST which 
will free up space on the full OST (move will not allocate new space, 
just change the directory name).

Kevin



Jagga Soorma wrote:
> This OST is 100% now with only 12GB remaining and something is 
> actively writing to this volume.  What would be the appropriate thing 
> to do in this scenario?  If I set this to read only on the mds then 
> some of my clients start hanging up.
>
> Should I be running "lfs find -O OST_UID /lustre" and then move the 
> files out of this filesystem and re-add them back?  But then there is 
> no gurantee that they will not be written to this specific OST.
>
> Any help would be greately appreciated.
>
> Thanks,
> -J
>
> On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma  > wrote:
>
> I might be looking at the wrong OST.  What is the best way to map
> the actual /dev/mapper/mpath[X] to what OST ID is used for that
> volume?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma  > wrote:
>
> Also, it looks like the client is reporting a different %used
> compared to the oss server itself:
>
> client:
> reshpc101:~ # lfs df -h | grep -i 0007
> reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84%
> /reshpcfs[OST:7]
>
> oss:
> /dev/mapper/mpath72.0T  1.9T   40G  98%
> /gnet/lustre/oss02/mpath7
>
> Here is how the data seems to be distributed on one of the OSS's:
> --
> /dev/mapper/mpath52.0T  1.2T  688G  65%
> /gnet/lustre/oss02/mpath5
> /dev/mapper/mpath62.0T  1.7T  224G  89%
> /gnet/lustre/oss02/mpath6
> /dev/mapper/mpath72.0T  1.9T   41G  98%
> /gnet/lustre/oss02/mpath7
> /dev/mapper/mpath82.0T  1.3T  671G  65%
> /gnet/lustre/oss02/mpath8
> /dev/mapper/mpath92.0T  1.3T  634G  67%
> /gnet/lustre/oss02/mpath9
> --
>
> -J
>
>
> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma
> mailto:jagg...@gmail.com>> wrote:
>
> I did deactivate this OST on the MDS server.  So how would
> I deal with a OST filling up?  The OST's don't seem to be
> filling up evenly either.  How does lustre handle a OST
> that is at 100%?  Would it not use this specific OST for
> writes if there are other OST available with capacity? 
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger
> mailto:adil...@whamcloud.com>> wrote:
>
> On 2011-02-15, at 12:20, Cliff White wrote:
> > Client situation depends on where you deactivated
> the OST - if you deactivate on the MDS only, clients
> should be able to read.
> >
> > What is best to do when an OST fills up really
> depends on what else you are doing at the time, and
> how much control you have over what the clients are
> doing and other things.  If you can solve the space
> issue with a quick rm -rf, best to leave it online,
> likewise if all your clients are trying to bang on it
> and failing, best to turn things off. YMMV
>
> In theory, with 1.8 the full OST should be skipped for
> new object allocations, but this is not robust in the
> face of e.g. a single very large file being written to
> the OST that takes it from "average" usage to being full.
>
> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma
> mailto:jagg...@gmail.com>> wrote:
> > Hi Guys,
> >
> > One of my clients got a hung lustre mount this
> morning and I saw the following errors in my logs:
> >
> > --
> > ..snip..
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0:
> an error occurred while communicating with
> 10.0.250.47@o2ib3. The ost_write operation failed with -28
> > Feb 15 09:38:07 reshpc116 kernel: LustreError:
> Skipped 4755836 previous similar messages
> > Feb 15 09:48:07 reshpc116 kerne

Re: [Lustre-discuss] Lustre client question

2011-05-13 Thread Kevin Van Maren
See bug 24264 -- certainly possible that the raid controller corrupted  
your filesystem.

If you remove the new drive and reboot, does the file system look  
cleaner?

Kevin


On May 13, 2011, at 11:39 AM, Zachary Beebleson  wrote:

>
> We recently had two raid rebuilds on a couple storage targets that  
> did not go
> according to plan. The cards reported a successful rebuild in each  
> case, but
> ldiskfs errors started showing up on the associated OSSs and the  
> effected OSTs
> were  remounted read-only. We are planning to migrate off the data,  
> but we've
> noticed that some clients are getting i/o errors, while others are  
> not. As an
> example, a file that has a stripe on at least one affected OST could  
> not be
> read on one client, i.e. I received a read-error trying to access  
> it, while it
> was perfectly readable and apparently uncorrupted on another (I am  
> able to
> migrate the file to healthy OSTs by copying to a new file name). The  
> clients
> with the i/o problem see inactive devices corresponding to the read- 
> only OSTs
> when I issue a 'lfs df', while the others without the i/o problems  
> report the
> targets as normal. Is it just that many clients are not aware of an  
> OST problem
> yet? I need clients with minimal I/O disruptions in order to migrate  
> as much
> data off as possible.
>
> A client reboot appears to awaken them to the fact that there are  
> problems with
> the OSTs. However, I need them to be able to read the data in order  
> to migrate
> it off. Is there a way to reconnect the clients to the problematic  
> OSTs?
>
> We have dd-ed copies of the OSTs to try e2fsck against them, but the  
> results
> were not promising. The check aborted with:
>
> --
> Resize inode (re)creation failed: A block group is missing an inode
> table.Continue? yes
>
> ext2fs_read_inode: A block group is missing an inode table while  
> reading inode
> 7 in recreate inode
> e2fsck: aborted
> --
>
> Any advice would be greatly appreciated.
> Zach
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client question

2011-05-13 Thread Zachary Beebleson
Kevin,

I just failed the drive and remounted. A basic 'df' hangs when it gets to
the mount point, but /proc/fs/lustre/health_check reports everything is
healthy. 'lfs df' on a client reports the OST is active, where it was
inactive before. However, now I'm working with a degraded volume, but it
is raid 6. Should I try another rebuild or just proceed with the
mirgration off of this OST asap?

Thanks,
Zach

PS. Sorry for the repeat message
On Fri, 13 May 2011, Kevin Van Maren wrote:

> See bug 24264 -- certainly possible that the raid controller corrupted your 
> filesystem.
>
> If you remove the new drive and reboot, does the file system look cleaner?
>
> Kevin
>
>
> On May 13, 2011, at 11:39 AM, Zachary Beebleson  
> wrote:
>
>> 
>> We recently had two raid rebuilds on a couple storage targets that did not 
>> go
>> according to plan. The cards reported a successful rebuild in each case, 
>> but
>> ldiskfs errors started showing up on the associated OSSs and the effected 
>> OSTs
>> were  remounted read-only. We are planning to migrate off the data, but 
>> we've
>> noticed that some clients are getting i/o errors, while others are not. As 
>> an
>> example, a file that has a stripe on at least one affected OST could not be
>> read on one client, i.e. I received a read-error trying to access it, while 
>> it
>> was perfectly readable and apparently uncorrupted on another (I am able to
>> migrate the file to healthy OSTs by copying to a new file name). The 
>> clients
>> with the i/o problem see inactive devices corresponding to the read-only 
>> OSTs
>> when I issue a 'lfs df', while the others without the i/o problems report 
>> the
>> targets as normal. Is it just that many clients are not aware of an OST 
>> problem
>> yet? I need clients with minimal I/O disruptions in order to migrate as 
>> much
>> data off as possible.
>> 
>> A client reboot appears to awaken them to the fact that there are problems 
>> with
>> the OSTs. However, I need them to be able to read the data in order to 
>> migrate
>> it off. Is there a way to reconnect the clients to the problematic OSTs?
>> 
>> We have dd-ed copies of the OSTs to try e2fsck against them, but the 
>> results
>> were not promising. The check aborted with:
>> 
>> --
>> Resize inode (re)creation failed: A block group is missing an inode
>> table.Continue? yes
>> 
>> ext2fs_read_inode: A block group is missing an inode table while reading 
>> inode
>> 7 in recreate inode
>> e2fsck: aborted
>> --
>> 
>> Any advice would be greatly appreciated.
>> Zach
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client question

2011-05-13 Thread Kevin Van Maren
It sounds like it is working better.  Did the clients recover?  I would 
have re-run fsck before mounting it again, and moving the data off may 
still be the best plan.  Since dropping the rebuilt drive reduced the 
corruption, certainly contact your raid vendor over this issue.

Kevin


Zachary Beebleson wrote:
> Kevin,
>
> I just failed the drive and remounted. A basic 'df' hangs when it gets to
> the mount point, but /proc/fs/lustre/health_check reports everything is
> healthy. 'lfs df' on a client reports the OST is active, where it was
> inactive before. However, now I'm working with a degraded volume, but it
> is raid 6. Should I try another rebuild or just proceed with the
> mirgration off of this OST asap?
>
> Thanks,
> Zach
>
> PS. Sorry for the repeat message
> On Fri, 13 May 2011, Kevin Van Maren wrote:
>
>> See bug 24264 -- certainly possible that the raid controller 
>> corrupted your filesystem.
>>
>> If you remove the new drive and reboot, does the file system look 
>> cleaner?
>>
>> Kevin
>>
>>
>> On May 13, 2011, at 11:39 AM, Zachary Beebleson 
>>  wrote:
>>
>>>
>>> We recently had two raid rebuilds on a couple storage targets that 
>>> did not go
>>> according to plan. The cards reported a successful rebuild in each 
>>> case, but
>>> ldiskfs errors started showing up on the associated OSSs and the 
>>> effected OSTs
>>> were  remounted read-only. We are planning to migrate off the data, 
>>> but we've
>>> noticed that some clients are getting i/o errors, while others are 
>>> not. As an
>>> example, a file that has a stripe on at least one affected OST could 
>>> not be
>>> read on one client, i.e. I received a read-error trying to access 
>>> it, while it
>>> was perfectly readable and apparently uncorrupted on another (I am 
>>> able to
>>> migrate the file to healthy OSTs by copying to a new file name). The 
>>> clients
>>> with the i/o problem see inactive devices corresponding to the 
>>> read-only OSTs
>>> when I issue a 'lfs df', while the others without the i/o problems 
>>> report the
>>> targets as normal. Is it just that many clients are not aware of an 
>>> OST problem
>>> yet? I need clients with minimal I/O disruptions in order to migrate 
>>> as much
>>> data off as possible.
>>>
>>> A client reboot appears to awaken them to the fact that there are 
>>> problems with
>>> the OSTs. However, I need them to be able to read the data in order 
>>> to migrate
>>> it off. Is there a way to reconnect the clients to the problematic 
>>> OSTs?
>>>
>>> We have dd-ed copies of the OSTs to try e2fsck against them, but the 
>>> results
>>> were not promising. The check aborted with:
>>>
>>> --
>>> Resize inode (re)creation failed: A block group is missing an inode
>>> table.Continue? yes
>>>
>>> ext2fs_read_inode: A block group is missing an inode table while 
>>> reading inode
>>> 7 in recreate inode
>>> e2fsck: aborted
>>> --
>>>
>>> Any advice would be greatly appreciated.
>>> Zach
>>> ___
>>> Lustre-discuss mailing list
>>> Lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client question

2011-05-13 Thread Zachary Beebleson
Yes, the clients appear to have recovered. I didn't want to risk an fsck 
until a new file level backup was completed --- this will take time given 
the size of our system.

I've done at least 5 or 6 raid rebuilds in the past without issue using 
these raid cards. We will try to isolate the cause to this problem 
further --- i.e. perhaps a bad batch of spare drives, buggy raid driver 
(I think this is a newer Lustre version), etc.

Many thanks for your help.
Zach.

> It sounds like it is working better.  Did the clients recover?  I would have 
> re-run fsck before mounting it again, and moving the data off may still be 
> the best plan.  Since dropping the rebuilt drive reduced the corruption, 
> certainly contact your raid vendor over this issue.
>
> Kevin
>
>
> Zachary Beebleson wrote:
>>  Kevin,
>>
>>  I just failed the drive and remounted. A basic 'df' hangs when it gets to
>>  the mount point, but /proc/fs/lustre/health_check reports everything is
>>  healthy. 'lfs df' on a client reports the OST is active, where it was
>>  inactive before. However, now I'm working with a degraded volume, but it
>>  is raid 6. Should I try another rebuild or just proceed with the
>>  mirgration off of this OST asap?
>>
>>  Thanks,
>>  Zach
>>
>>  PS. Sorry for the repeat message
>>  On Fri, 13 May 2011, Kevin Van Maren wrote:
>> 
>> >  See bug 24264 -- certainly possible that the raid controller corrupted 
>> >  your filesystem.
>> > 
>> >  If you remove the new drive and reboot, does the file system look 
>> >  cleaner?
>> > 
>> >  Kevin
>> > 
>> > 
>> >  On May 13, 2011, at 11:39 AM, Zachary Beebleson 
>> >   wrote:
>> > 
>> > > 
>> > >  We recently had two raid rebuilds on a couple storage targets that did 
>> > >  not go
>> > >  according to plan. The cards reported a successful rebuild in each 
>> > >  case, but
>> > >  ldiskfs errors started showing up on the associated OSSs and the 
>> > >  effected OSTs
>> > >  were  remounted read-only. We are planning to migrate off the data, 
>> > >  but we've
>> > >  noticed that some clients are getting i/o errors, while others are 
>> > >  not. As an
>> > >  example, a file that has a stripe on at least one affected OST could 
>> > >  not be
>> > >  read on one client, i.e. I received a read-error trying to access it, 
>> > >  while it
>> > >  was perfectly readable and apparently uncorrupted on another (I am 
>> > >  able to
>> > >  migrate the file to healthy OSTs by copying to a new file name). The 
>> > >  clients
>> > >  with the i/o problem see inactive devices corresponding to the 
>> > >  read-only OSTs
>> > >  when I issue a 'lfs df', while the others without the i/o problems 
>> > >  report the
>> > >  targets as normal. Is it just that many clients are not aware of an 
>> > >  OST problem
>> > >  yet? I need clients with minimal I/O disruptions in order to migrate 
>> > >  as much
>> > >  data off as possible.
>> > > 
>> > >  A client reboot appears to awaken them to the fact that there are 
>> > >  problems with
>> > >  the OSTs. However, I need them to be able to read the data in order to 
>> > >  migrate
>> > >  it off. Is there a way to reconnect the clients to the problematic 
>> > >  OSTs?
>> > > 
>> > >  We have dd-ed copies of the OSTs to try e2fsck against them, but the 
>> > >  results
>> > >  were not promising. The check aborted with:
>> > > 
>> > >  --
>> > >  Resize inode (re)creation failed: A block group is missing an inode
>> > >  table.Continue? yes
>> > > 
>> > >  ext2fs_read_inode: A block group is missing an inode table while 
>> > >  reading inode
>> > >  7 in recreate inode
>> > >  e2fsck: aborted
>> > >  --
>> > > 
>> > >  Any advice would be greatly appreciated.
>> > >  Zach
>> > >  ___
>> > >  Lustre-discuss mailing list
>> > >  Lustre-discuss@lists.lustre.org
>> > >  http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> > 
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre Client - Server Compatibility

2013-01-16 Thread Brett Worth
I am in the process of doing an upgrade of our 1.8.6 Lustre servers to 2.1.3.  
We have 
quite a few clients that all run versions of the lustre client close to 1.8.6.

Is there a compatibility matrix somewhere that will show what clients will work 
with what 
servers?  We don't want to do the upgrade all at once so would like to know 
what order to 
make the changes.

Brett
-- 

   /) _ _ _/_/ / / /  _ _//
  /_)/http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [lustre-discuss] Lustre client modules

2020-05-27 Thread Leonardo Saavedra

On 5/26/20 5:47 PM, Phill Harvey-Smith wrote:

Hi all,

Can anyone tell me where to download the Lustere client modules for 
CentOS 7.8 please ?


# uname -a
Linux exec3r420 3.10.0-1127.8.2.el7.x86_64 #1 SMP Tue May 12 16:57:42 
UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)


echo "%_topdir  $HOME/rpmbuild" >> .rpmmacros
wget -c 
https://downloads.whamcloud.com/public/lustre/lustre-2.12.4/el7/client/SRPMS/lustre-2.12.4-1.src.rpm
rpmbuild --clean  --rebuild --without servers --without lustre_tests 
lustre-2.12.4-1.src.rpm

cd $HOME/rpmbuild/RPMS/x86_64

Leo Saavedra
National Radio Astronomy Observatory
http://www.nrao.edu
+1-575-8357033

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client modules

2020-05-28 Thread Phill Harvey-Smith

On 27/05/2020 19:26, Leonardo Saavedra wrote:

On 5/26/20 5:47 PM, Phill Harvey-Smith wrote:
echo "%_topdir  $HOME/rpmbuild" >> .rpmmacros
wget -c 
https://downloads.whamcloud.com/public/lustre/lustre-2.12.4/el7/client/SRPMS/lustre-2.12.4-1.src.rpm
rpmbuild --clean  --rebuild --without servers --without lustre_tests 
lustre-2.12.4-1.src.rpm

cd $HOME/rpmbuild/RPMS/x86_64


Right that worked and I have the following rpms in
$HOME/rpmbuild/RPMS/x86_64 :

# ls
kmod-lustre-client-2.12.4-1.el7.x86_64.rpm
lustre-client-2.12.4-1.el7.x86_64.rpm
lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm
lustre-iokit-2.12.4-1.el7.x86_64.rpm

However trying to install them with yum I get :

Loaded plugins: fastestmirror, langpacks
Examining kmod-lustre-client-2.12.4-1.el7.x86_64.rpm: 
kmod-lustre-client-2.12.4-1.el7.x86_64
Marking kmod-lustre-client-2.12.4-1.el7.x86_64.rpm as an update to 
kmod-lustre-client-2.9.0-1.el7.x86_64
Examining lustre-client-2.12.4-1.el7.x86_64.rpm: 
lustre-client-2.12.4-1.el7.x86_64
Marking lustre-client-2.12.4-1.el7.x86_64.rpm as an update to 
lustre-client-2.9.0-1.el7.x86_64
Examining lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm: 
lustre-client-debuginfo-2.12.4-1.el7.x86_64

Marking lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm to be installed
Examining lustre-iokit-2.12.4-1.el7.x86_64.rpm: 
lustre-iokit-2.12.4-1.el7.x86_64
Marking lustre-iokit-2.12.4-1.el7.x86_64.rpm as an update to 
lustre-iokit-2.9.0-1.el7.x86_64

Resolving Dependencies
--> Running transaction check
---> Package kmod-lustre-client.x86_64 0:2.9.0-1.el7 will be updated
--> Processing Dependency: kmod-lustre-client = 2.9.0 for package: 
lustre-client-tests-2.9.0-1.el7.x86_64

Loading mirror speeds from cached hostfile
 * base: centos.serverspace.co.uk
 * epel: lon.mirror.rackspace.com
 * extras: centos.serverspace.co.uk
 * updates: centos.mirrors.nublue.co.uk
--> Processing Dependency: ksym(class_find_client_obd) = 0x7fc892aa for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(class_name2obd) = 0x2a2fe6c0 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(class_register_type) = 0xc4cc2c4f for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_add) = 0xc5e4acf5 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_cancel_records) = 0x72fd39ee 
for package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_close) = 0xf83a61a8 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_process) = 0x79b2c569 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_reverse_process) = 0xd7510c21 
for package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cleanup) = 0x0632eadc for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_close) = 0xa6f1cf8b for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(__llog_ctxt_put) = 0xe1c19687 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_destroy) = 0xe12c11de for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_exist) = 0xa6594d74 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_init_handle) = 0xe2107196 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_open) = 0x9ba55f56 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_open_create) = 0xd4bdcea7 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_osd_ops) = 0x034860f6 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_process) = 0x18a1b423 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_reverse_process) = 0x4b183427 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_setup) = 0x5029bcff for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_write) = 0x94fd16f4 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(lu_context_enter) = 0xffa84ad2 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(lu_context_exit) = 0x2d678501 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(lu_context_fini) = 0xf5361e15 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(lu_context_init) = 0x7f95d027 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(lu_env_fini) = 0xc6a207d4 for package: 
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing 

Re: [lustre-discuss] Lustre client modules

2020-05-28 Thread Degremont, Aurelien
Hi Phil,

There are conflicts with your already installed Lustre 2.9.0 packages.
Based on the output you provide, you should remove 'kmod-lustre-client-tests' 
first.

Actually, only kmod-lustre-client and lustre-client are the required. You 
probably don't need the other ones (lustre-iokit, lustre-client-debuginfo, ...).

Remove all the other Lustre packages except for these 2 and try again.
 

Aurélien

Le 28/05/2020 10:57, « lustre-discuss au nom de Phill Harvey-Smith » 
 a écrit :

CAUTION: This email originated from outside of the organization. Do not 
click links or open attachments unless you can confirm the sender and know the 
content is safe.



On 27/05/2020 19:26, Leonardo Saavedra wrote:
> On 5/26/20 5:47 PM, Phill Harvey-Smith wrote:
> echo "%_topdir  $HOME/rpmbuild" >> .rpmmacros
> wget -c
> 
https://downloads.whamcloud.com/public/lustre/lustre-2.12.4/el7/client/SRPMS/lustre-2.12.4-1.src.rpm
> rpmbuild --clean  --rebuild --without servers --without lustre_tests
> lustre-2.12.4-1.src.rpm
> cd $HOME/rpmbuild/RPMS/x86_64

Right that worked and I have the following rpms in
$HOME/rpmbuild/RPMS/x86_64 :

# ls
kmod-lustre-client-2.12.4-1.el7.x86_64.rpm
lustre-client-2.12.4-1.el7.x86_64.rpm
lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm
lustre-iokit-2.12.4-1.el7.x86_64.rpm

However trying to install them with yum I get :

Loaded plugins: fastestmirror, langpacks
Examining kmod-lustre-client-2.12.4-1.el7.x86_64.rpm:
kmod-lustre-client-2.12.4-1.el7.x86_64
Marking kmod-lustre-client-2.12.4-1.el7.x86_64.rpm as an update to
kmod-lustre-client-2.9.0-1.el7.x86_64
Examining lustre-client-2.12.4-1.el7.x86_64.rpm:
lustre-client-2.12.4-1.el7.x86_64
Marking lustre-client-2.12.4-1.el7.x86_64.rpm as an update to
lustre-client-2.9.0-1.el7.x86_64
Examining lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm:
lustre-client-debuginfo-2.12.4-1.el7.x86_64
Marking lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm to be installed
Examining lustre-iokit-2.12.4-1.el7.x86_64.rpm:
lustre-iokit-2.12.4-1.el7.x86_64
Marking lustre-iokit-2.12.4-1.el7.x86_64.rpm as an update to
lustre-iokit-2.9.0-1.el7.x86_64
Resolving Dependencies
--> Running transaction check
---> Package kmod-lustre-client.x86_64 0:2.9.0-1.el7 will be updated
--> Processing Dependency: kmod-lustre-client = 2.9.0 for package:
lustre-client-tests-2.9.0-1.el7.x86_64
Loading mirror speeds from cached hostfile
  * base: centos.serverspace.co.uk
  * epel: lon.mirror.rackspace.com
  * extras: centos.serverspace.co.uk
  * updates: centos.mirrors.nublue.co.uk
--> Processing Dependency: ksym(class_find_client_obd) = 0x7fc892aa for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(class_name2obd) = 0x2a2fe6c0 for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(class_register_type) = 0xc4cc2c4f for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_add) = 0xc5e4acf5 for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_cancel_records) = 0x72fd39ee
for package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_close) = 0xf83a61a8 for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_process) = 0x79b2c569 for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cat_reverse_process) = 0xd7510c21
for package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_cleanup) = 0x0632eadc for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_close) = 0xa6f1cf8b for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(__llog_ctxt_put) = 0xe1c19687 for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_destroy) = 0xe12c11de for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_exist) = 0xa6594d74 for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_init_handle) = 0xe2107196 for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_open) = 0x9ba55f56 for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_open_create) = 0xd4bdcea7 for
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_osd_ops) = 0x034860f6 for package:
kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(llog_process) = 0x18a1b423 for package:
kmod-lustre-client

Re: [lustre-discuss] Lustre client modules

2020-05-28 Thread Leonardo Saavedra

On 5/28/20 2:57 AM, Phill Harvey-Smith wrote:

Right that worked and I have the following rpms in
$HOME/rpmbuild/RPMS/x86_64 :

# ls
kmod-lustre-client-2.12.4-1.el7.x86_64.rpm
lustre-client-2.12.4-1.el7.x86_64.rpm
lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm
lustre-iokit-2.12.4-1.el7.x86_64.rpm

However trying to install them with yum I get :

Loaded plugins: fastestmirror, langpacks
Examining kmod-lustre-client-2.12.4-1.el7.x86_64.rpm: 
kmod-lustre-client-2.12.4-1.el7.x86_64
Marking kmod-lustre-client-2.12.4-1.el7.x86_64.rpm as an update to 
kmod-lustre-client-2.9.0-1.el7.x86_64
Examining lustre-client-2.12.4-1.el7.x86_64.rpm: 
lustre-client-2.12.4-1.el7.x86_64
Marking lustre-client-2.12.4-1.el7.x86_64.rpm as an update to 
lustre-client-2.9.0-1.el7.x86_64
Examining lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm: 
lustre-client-debuginfo-2.12.4-1.el7.x86_64

Marking lustre-client-debuginfo-2.12.4-1.el7.x86_64.rpm to be installed
Examining lustre-iokit-2.12.4-1.el7.x86_64.rpm: 
lustre-iokit-2.12.4-1.el7.x86_64
Marking lustre-iokit-2.12.4-1.el7.x86_64.rpm as an update to 
lustre-iokit-2.9.0-1.el7.x86_64

Resolving Dependencies
--> Running transaction check
---> Package kmod-lustre-client.x86_64 0:2.9.0-1.el7 will be updated
--> Processing Dependency: kmod-lustre-client = 2.9.0 for package: 
lustre-client-tests-2.9.0-1.el7.x86_64

Loading mirror speeds from cached hostfile
 * base: centos.serverspace.co.uk
 * epel: lon.mirror.rackspace.com
 * extras: centos.serverspace.co.uk
 * updates: centos.mirrors.nublue.co.uk
--> Processing Dependency: ksym(class_find_client_obd) = 0x7fc892aa 
for package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(class_name2obd) = 0x2a2fe6c0 for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64
--> Processing Dependency: ksym(class_register_type) = 0xc4cc2c4f for 
package: kmod-lustre-client-tests-2.9.0-1.el7.x86_64



[...]
Remove the 2.9.0 lustre packages, then install 
lustre-client-2.12.4-1.el7.x86_64.rpm and 
kmod-lustre-client-2.12.4-1.el7.x86_64.rpm



Leo Saavedra
National Radio Astronomy Observatory
http://www.nrao.edu
+1-575-8357033

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client modules

2020-05-28 Thread Phill Harvey-Smith

On 28/05/2020 17:40, Leonardo Saavedra wrote:

[...]
Remove the 2.9.0 lustre packages, then install 
lustre-client-2.12.4-1.el7.x86_64.rpm and 
kmod-lustre-client-2.12.4-1.el7.x86_64.rpm


Cheers to you and Degremont, Aurelien, who replied saying the same 
earlier, that seemed to fix it.


Phill.


--
This email has been checked for viruses by AVG.
https://www.avg.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lustre client kernel compatibility

2021-07-28 Thread Scott Wood via lustre-discuss
Hi all,

Section 8.1.1 of the current lustre documentation, "Software Requirements", 
states that "ver refers to the Linux distribution (e.g., 3.6.18-348.1.1.el5)."  
The client binaries currently available at 
https://downloads.whamcloud.com/public/lustre/latest-release/el7/client/RPMS/x86_64/
 only have "el7" in the version number.  Does that mean that the 
kmod-lustre-client-2.12.7-1.el7.x86_64.rpm  binary can be used on any RHEL7.x 
system with any 3.10.0-x.y.z.el7.x86_64 kernel, or must the client have the 
3.10.0-1160.25.1.el7.x86-64.

Cheers,
Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Client with Postgres

2023-01-16 Thread Nick dan via lustre-discuss
Hi,

We have mounted our clients with the Lustre filesystem.
We want to configure our client setup as follows:

1 client as read-write
2 clients as read only

Can you help in configuring this setup? Or send a relevant document for the
same?

Regards,
Nick Dan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre client caching question

2023-08-14 Thread John Bauer
I have an application that reads a 70GB section of a file forwards and 
backwards multiple ( 14 ) times.  This is on a 64 GB system.  Monitoring 
/proc/meminfo  shows that the memory consumed by file cache bounces 
around the 32GB value.  The forward reads go at about 3.3GB/s.  What is 
disappointing is the backwards read performance.  One would think that 
after the file is read forwards the most recently used 32GB of the file 
should be in the system buffers and the reading of that the first 32GB 
of the file going backwards should be coming out of the system buffers.  
But the backwards reads generally perform at about 500MB/s.  Generally 
the first 1GB going backwards is at 5.5GB/s, but then the remaining 
backwards read is at the 500MB/s.


The Lustre client version is 2.12.9_ddn18.  The file is striped at 4x1M.

Is this expected behavior?

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[Lustre-discuss] Lustre client in 3.11-rc1

2013-07-17 Thread Vikentsi Lapa
Hello all

I trying to find more info about state of Lustre client in upstream kernel
and  correct way to build client drivers.
I see that drivers have dependencies that prevent Lustre part appears in
configuration menu and it is easy to change. What is reason for broken
dependency? What is proper way to enable Lustre client code to build it?
And what to expect after building or this part need additional work to
implement some features?
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [lustre-discuss] lustre client server interoperability

2015-08-10 Thread Patrick Farrell
Kurt,

Yes.  It's worth noting that 2.7 is probably marginally less reliable than 2.5, 
since it has had no updates/fixes since it was released.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Kurt Strosahl [stros...@jlab.org]
Sent: Monday, August 10, 2015 2:25 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] lustre client server interoperability

Hello,

   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm running 
a 2.5.3 system lustre system, but have been asked by a few people about 
upgrading some of our clients to CentOS 7 (which appears to need a 2.7 or 
greater client).

w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre client server interoperability

2015-08-11 Thread Kurt Strosahl
So is there a stable client for centos 7 that is backwards compatible with 
2.5.3?

w/r,
Kurt

- Original Message -
From: "Patrick Farrell" 
To: "Kurt Strosahl" , lustre-discuss@lists.lustre.org
Sent: Monday, August 10, 2015 4:24:15 PM
Subject: RE: lustre client server interoperability

Kurt,

Yes.  It's worth noting that 2.7 is probably marginally less reliable than 2.5, 
since it has had no updates/fixes since it was released.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Kurt Strosahl [stros...@jlab.org]
Sent: Monday, August 10, 2015 2:25 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] lustre client server interoperability

Hello,

   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm running 
a 2.5.3 system lustre system, but have been asked by a few people about 
upgrading some of our clients to CentOS 7 (which appears to need a 2.7 or 
greater client).

w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre client server interoperability

2015-08-11 Thread Patrick Farrell
No - 2.5 is the last public stable client release from Intel.

On 8/11/15, 2:22 PM, "Kurt Strosahl"  wrote:

>So is there a stable client for centos 7 that is backwards compatible
>with 2.5.3?
>
>w/r,
>Kurt
>
>- Original Message -
>From: "Patrick Farrell" 
>To: "Kurt Strosahl" , lustre-discuss@lists.lustre.org
>Sent: Monday, August 10, 2015 4:24:15 PM
>Subject: RE: lustre client server interoperability
>
>Kurt,
>
>Yes.  It's worth noting that 2.7 is probably marginally less reliable
>than 2.5, since it has had no updates/fixes since it was released.
>
>From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
>of Kurt Strosahl [stros...@jlab.org]
>Sent: Monday, August 10, 2015 2:25 PM
>To: lustre-discuss@lists.lustre.org
>Subject: [lustre-discuss] lustre client server interoperability
>
>Hello,
>
>   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm
>running a 2.5.3 system lustre system, but have been asked by a few people
>about upgrading some of our clients to CentOS 7 (which appears to need a
>2.7 or greater client).
>
>w/r,
>Kurt J. Strosahl
>System Administrator
>Scientific Computing Group, Thomas Jefferson National Accelerator Facility
>___
>lustre-discuss mailing list
>lustre-discuss@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre client server interoperability

2015-08-12 Thread Patrick Farrell
Jon,

You've got the interop right.

Unfortunately, Intel is no longer doing public maintenance versions of Lustre, 
so 2.8 will not receive updates after release.

- Patrick

From: Jon Tegner [jon.teg...@foi.se]
Sent: Wednesday, August 12, 2015 1:16 AM
To: Patrick Farrell; Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org; Jan Pettersson
Subject: SV: lustre client server interoperability

So if I understand correctly one has the following "centos options":

1. Lustre-2.5.3 with CentOS-6 on both clients and servers.
2. Lustre-2.5.3, CentOS-6 on servers, and 2.7.0 and CentOS-7 on clients.
3. Wait a while and use Lustre-2.8.0/CentOS-7 on clients and servers.

At least on clients I would prefer to run CentOS-7, but if 2.7 (and 2.8 - will 
this version receive updates?) are less reliable that might not be a good idea?

Any thoughts on this would be greatly appreciated.

Thanks!

/jon


Från: lustre-discuss  för Patrick 
Farrell 
Skickat: den 11 augusti 2015 21:23
Till: Kurt Strosahl
Kopia: lustre-discuss@lists.lustre.org
Ämne: Re: [lustre-discuss] lustre client server interoperability

No - 2.5 is the last public stable client release from Intel.

On 8/11/15, 2:22 PM, "Kurt Strosahl"  wrote:

>So is there a stable client for centos 7 that is backwards compatible
>with 2.5.3?
>
>w/r,
>Kurt
>
>- Original Message -
>From: "Patrick Farrell" 
>To: "Kurt Strosahl" , lustre-discuss@lists.lustre.org
>Sent: Monday, August 10, 2015 4:24:15 PM
>Subject: RE: lustre client server interoperability
>
>Kurt,
>
>Yes.  It's worth noting that 2.7 is probably marginally less reliable
>than 2.5, since it has had no updates/fixes since it was released.
>
>From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
>of Kurt Strosahl [stros...@jlab.org]
>Sent: Monday, August 10, 2015 2:25 PM
>To: lustre-discuss@lists.lustre.org
>Subject: [lustre-discuss] lustre client server interoperability
>
>Hello,
>
>   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm
>running a 2.5.3 system lustre system, but have been asked by a few people
>about upgrading some of our clients to CentOS 7 (which appears to need a
>2.7 or greater client).
>
>w/r,
>Kurt J. Strosahl
>System Administrator
>Scientific Computing Group, Thomas Jefferson National Accelerator Facility
>___
>lustre-discuss mailing list
>lustre-discuss@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre client server interoperability

2015-08-12 Thread Scott Nolin
I'd just add that we've been generally OK for running a variety of 2.X 
servers vs 2.whatever clients.


For our latest project I hear we're going to try 2.7 server and 2.8 
client. The client machines for us are much more likely to need OS 
versions pushed forward.


Regarding interim patches to Lustre, my feeling is the important thing 
is to simply know what patches are critical. I believe all the patches 
are still public from Intel (and how about others people providing 
lustre patches).


There has been some discussion about sharing information on people's 
patch sets on wiki.lustre.org, but I haven't see anything come out.


Patrick, is Cray providing public maintenance releases? Or sharing 
information on important patches?


Scott


On 8/12/2015 7:32 AM, Patrick Farrell wrote:

Jon,

You've got the interop right.

Unfortunately, Intel is no longer doing public maintenance versions of Lustre, 
so 2.8 will not receive updates after release.

- Patrick

From: Jon Tegner [jon.teg...@foi.se]
Sent: Wednesday, August 12, 2015 1:16 AM
To: Patrick Farrell; Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org; Jan Pettersson
Subject: SV: lustre client server interoperability

So if I understand correctly one has the following "centos options":

1. Lustre-2.5.3 with CentOS-6 on both clients and servers.
2. Lustre-2.5.3, CentOS-6 on servers, and 2.7.0 and CentOS-7 on clients.
3. Wait a while and use Lustre-2.8.0/CentOS-7 on clients and servers.

At least on clients I would prefer to run CentOS-7, but if 2.7 (and 2.8 - will 
this version receive updates?) are less reliable that might not be a good idea?

Any thoughts on this would be greatly appreciated.

Thanks!

/jon


Från: lustre-discuss  för Patrick Farrell 

Skickat: den 11 augusti 2015 21:23
Till: Kurt Strosahl
Kopia: lustre-discuss@lists.lustre.org
Ämne: Re: [lustre-discuss] lustre client server interoperability

No - 2.5 is the last public stable client release from Intel.

On 8/11/15, 2:22 PM, "Kurt Strosahl"  wrote:


So is there a stable client for centos 7 that is backwards compatible
with 2.5.3?

w/r,
Kurt

- Original Message -
From: "Patrick Farrell" 
To: "Kurt Strosahl" , lustre-discuss@lists.lustre.org
Sent: Monday, August 10, 2015 4:24:15 PM
Subject: RE: lustre client server interoperability

Kurt,

Yes.  It's worth noting that 2.7 is probably marginally less reliable
than 2.5, since it has had no updates/fixes since it was released.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
of Kurt Strosahl [stros...@jlab.org]
Sent: Monday, August 10, 2015 2:25 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] lustre client server interoperability

Hello,

   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm
running a 2.5.3 system lustre system, but have been asked by a few people
about upgrading some of our clients to CentOS 7 (which appears to need a
2.7 or greater client).

w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org






smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Client on Fedora 23

2016-09-28 Thread Michael Watters
Has anybody been able to build the lustre client on Fedora 23 or 24?  
I'm getting a compile error when I attempt to build the SRPM for 2.8.0 
which is shown here.


http://paste.fedoraproject.org/437345/77178147/

Adding "-fPIC" to the optflags in the spec file also did not help, the 
build then fails with a different error as follows.


EXTRA_KCFLAGS: -include /root/rpmbuild/BUILD/lustre-2.8.0/undef.h -include 
/root/rpmbuild/BUILD/lustre-2.8.0/config.h  -g 
-I/root/rpmbuild/BUILD/lustre-2.8.0/libcfs/include 
-I/root/rpmbuild/BUILD/lustre-2.8.0/lnet/include 
-I/root/rpmbuild/BUILD/lustre-2.8.0/lustre/include

Type 'make' to build Lustre.

+ make -j12 -s

Making all in .

/root/rpmbuild/BUILD/lustre-2.8.0/lnet/lnet/api-ni.c:1:0: error: code model 
kernel does not support PIC mode

 /*

 ^

/root/rpmbuild/BUILD/lustre-2.8.0/lustre/fid/fid_request.c:1:0: error: code 
model kernel does not support PIC mode

 /*

 ^

/root/rpmbuild/BUILD/lustre-2.8.0/lustre/fld/fld_request.c:1:0: error: code 
model kernel does not support PIC mode

 /*

 ^

scripts/Makefile.build:289: recipe for target 
'/root/rpmbuild/BUILD/lustre-2.8.0/lnet/lnet/api-ni.o' failed

make[6]: *** [/root/rpmbuild/BUILD/lustre-2.8.0/lnet/lnet/api-ni.o] Error 1

/root/rpmbuild/BUILD/lustre-2.8.0/lustre/llite/dcache.c:1:0: error: code model 
kernel does not support PIC mode

 /*


Do I need to build the client from source or use a different lustre 
version?  My servers are all running 2.8.0 on CentOS 7 however we run 
Fedora on our workstations which doesn't have a native client package.


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lustre client 2.9.0 compile error

2016-12-13 Thread Martin BALVERS
Hi,

I am trying to compile the lustre client on a CentOS 6.5 system with 
elrepo kernel-lt (3.10.104-1.el6.elrepo.x86_64) installed.

When I do a 'rpmbuild --rebuild --without servers lustre-2.9.0-1.src.rpm' 
i run into the following error. I have also tried to compile the current 
git master branch with '--disable-server --disable-tests', and that also 
resulted in the same error.

Is there any way I can get this to compile?

Thanks,
Martin Balvers

CC:gcc
LD:/usr/bin/ld -m elf_x86_64
CPPFLAGS:  -include /root/rpmbuild/BUILD/lustre-2.9.0/undef.h -include 
/root/rpmbuild/BUILD/lustre-2.9.0/config.h 
-I/root/rpmbuild/BUILD/lustre-2.9.0/libcfs/include 
-I/root/rpmbuild/BUILD/lustre-2.9.0/lnet/include 
-I/root/rpmbuild/BUILD/lustre-2.9.0/lustre/include
CFLAGS:-g -O2 -Werror -Wall -Werror
EXTRA_KCFLAGS: -include /root/rpmbuild/BUILD/lustre-2.9.0/undef.h -include 
/root/rpmbuild/BUILD/lustre-2.9.0/config.h  -g 
-I/root/rpmbuild/BUILD/lustre-2.9.0/libcfs/include 
-I/root/rpmbuild/BUILD/lustre-2.9.0/lnet/include 
-I/root/rpmbuild/BUILD/lustre-2.9.0/lustre/include

Type 'make' to build Lustre.
+ make -j8 -s
Making all in .
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c: In function 
'll_dir_setdirstripe':
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c:459: error: unknown 
field 'len' specified in initializer
cc1: warnings being treated as errors
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c:459: error: excess 
elements in struct initializer
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c:459: error: (near 
initialization for 'dentry.d_name')
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c:460: error: unknown 
field 'hash' specified in initializer
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c:460: error: excess 
elements in struct initializer
/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.c:460: error: (near 
initialization for 'dentry.d_name')
make[6]: *** [/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite/dir.o] Error 
1
make[6]: *** Waiting for unfinished jobs
make[5]: *** [/root/rpmbuild/BUILD/lustre-2.9.0/lustre/llite] Error 2
make[4]: *** [/root/rpmbuild/BUILD/lustre-2.9.0/lustre] Error 2
make[4]: *** Waiting for unfinished jobs
make[3]: *** [_module_/root/rpmbuild/BUILD/lustre-2.9.0] Error 2
make[2]: *** [modules] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-client is obsoleted

2017-07-26 Thread Dilger, Andreas
This is a minor problem in the .spec file and has been fixed. 

The reason for the Obsoletes was to allow installing server RPMs on clients, 
but it should have only obsoleted older versions. 

Cheers, Andreas

> On Jul 26, 2017, at 10:24, Jon Tegner  wrote:
> 
> Hi,
> 
> when trying to update clients from 2.9 to 2.10.0 (on CentOS-7) I received the 
> following:
> 
> "Package lustre-client is obsoleted by lustre, trying to install 
> lustre-2.10.0-1.el7.x86_64 instead"
> 
> and then the update failed (to my guessing due to the fact that zfs-related 
> packages are missing on the system (at the moment I don't intend to use zfs) .
> 
> I managed to get past this by forcing the installation of the client, i.e.,
> 
> "yum install lustre-client-2.10.0-1.el7.x86_64.rpm"
> 
> Just curious, is lustre-client really obsoleted?
> 
> Regards,
> 
> /jon
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Client 2.10 on CentOS6.9

2017-09-12 Thread Arman Khalatyan
Hello,
Is it possible to recompile the Lustre Client 2.10 on centos 6.9?
Did you drop the support for the centos 6.x?

usually I use: rpmbuild --rebuild --without servers src.rpm

unfortunately I cannot find the src files for the version 2.10.
On Jenkins the files are from el7 not the el6
https://build.hpdd.intel.com/job/lustre-b2_10/arch=x86_64,build_type=client,distro=el6.9,ib_stack=inkernel/

Should not be el6 rpms instead of el7?

thanks,
Arman.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre client fails to boot

2017-10-16 Thread Amjad Syed
Hello,
Lustre newbie here
We are using lustre 1.8.7 on Rhel 5.4.
Due to some issues  with our lustre filesystem, the client hangs on reboot.
It hangs on screen kernel alive and mapping.
We tried using single mode , but that fails also,
Is there any way we can remove the entry in /etc/fstab  without having to
use  the rescue mode or  pushing a customized image to client?

Thanks
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Client in a container

2017-12-30 Thread David Cohen
Hi,
Is it possible to run Lustre client in a container?
The goal is to run two different client version on the same node, can it be
done?

David
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Client: Unable to mount

2018-05-05 Thread Rohan Garg
Hi,

I'm trying to set up a virtual cluster (with 4 VirtualBox VMs: 1
MGS, 1 MDS, 1 Client, and 1 OSS) using Lustre.  The VMs are running
CentOS-7.  I have built Lustre from the master branch.

The VM's have a NAT interface (eth0), and a host-only network
interface (eth1).

  Client: eth0: 10.0.2.15, eth1: 192.168.50.7, hostname: ct-client1
 OSS: eth0: 10.0.2.15, eth1: 192.168.50.5, hostname: ct-oss1
 MDS: eth0: 10.0.2.15, eth1: 192.168.50.9, hostname: ct-mds1
 MGS: eth0: 10.0.2.15, eth1: 192.168.50.11, hostname: ct-mgs1

 - All the VMs have SELinux disabled.
 - All the VMs can ping each other and can use password-less ssh among 
themselves.
 - All the 4 VM's have the following line in /etc/modprobe.d/lnet.conf:

  options lnet networks="tcp(eth1)"

I modified the cfg/local.sh file and added the following entries to make
it use the correct hostnames.

MDSCOUNT=1
mds_HOST=ct-mds1
MDSDEV1=/dev/sdb

mgs_HOST=ct-mgs1
MGSDEV=/dev/sdb

OSTCOUNT=1
ost_HOST=ct-oss1
OSTDEV1=/dev/sdb

The issue is that I can't get the llmount.sh script to mount the
filesystem on the client and run successfully. The script exits with the
following messages:

  ...
  Started lustre-OST
  Starting client: ct-client1.lfs.local:  -o user_xattr,flock ct-mgs1:/lustre 
/mnt/lustre
  CMD: ct-client1.lfs.local mkdir -p /mnt/lustre
  CMD: ct-client1.lfs.local mount -t lustre -o user_xattr,flock ct-mgs1:/lustre 
/mnt/lustre
  mount.lustre: mount ct-mgs1:/lustre at /mnt/lustre failed: No such file or 
directory
  Is the MGS specification correct?
  Is the filesystem name correct?
  If upgrading, is the copied client log valid? (see upgrade docs)

(Trying to run the last mount command manually also gives the same
 error.)

After the llmount script exits, I can check the output of "lctl list_nids"
on the 3 server VMs.

OSS: 192.168.50.5@tcp
MDS: 192.168.50.9@tcp
MGS: 192.168.50.11@tcp

Here's the dmesg output from the client:

  [311.259776] Lustre: Lustre: Build Version: 2.11.51_20_g9ac477c
  [312.792145] Lustre: 1836:0:(gss_svc_upcall.c:1185:gss_init_svc_upcall()) 
Init channel is not opened by lsvcgssd, following request might be dropped 
until lsvcgssd is active
  [312.792162] Lustre: 1836:0:(gss_mech_switch.c:71:lgss_mech_register()) 
Register gssnull mechanism
  [312.792174] Key type lgssc registered
  [312.868636] Lustre: Echo OBD driver; http://www.lustre.org/
  [325.835994] LustreError: 3302:0:(ldlm_lib.c:488:client_obd_setup()) can't 
add initial connection
  [325.836737] LustreError: 3302:0:(obd_config.c:559:class_setup()) setup 
MGC192.168.50.11@tcp failed (-2)
  [325.837248] LustreError: 3302:0:(obd_mount.c:202:lustre_start_simple()) 
MGC192.168.50.11@tcp setup error -2
  [325.837765] LustreError: 3302:0:(obd_mount.c:1583:lustre_fill_super()) 
Unable to mount  (-2)

I'm not sure if I'm missing something in my config. Any help is appreciated.

Thanks,
Rohan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [Lustre-discuss] lustre client goes wacky?

2008-02-13 Thread Eric Mei
Yes there seems some problems. I filed a bug 14881 to track this.

Ron, thanks for reporting this. In the mean time please don't use CVS 
version with your 1.6.4 server until 14881 get fixed.

--
Eric

Nathaniel Rutman wrote:
> The clients you pulled from CVS have a feature called adaptive timeouts 
> which apparently
> are having an issue with your 1.6.4.1 servers.  Eric, can you make sure 
> our interoperability is working?
> 
> Moving this thread to lustre-discuss; devel is more for 
> architecture/coding stuff.
> 
> Ron wrote:
>> Hi,
>> I don't know if this is a bug or it's it's a misconfig or something
>> else.
>>
>> What I have is:
>> server = 1.6.4.1+vanilla 2.6.18.8   (mgs+2*ost+mdt all on a single
>> server)
>>clients = cvs.20080116+2.6.23.12
>>
>> I mounted the server from several clients and several hours later
>> noticed the top display below.  dmesg show some lustre errors (also
>> below).Can someone comment on what could be going on?
>>
>> Thanks,
>> Ron
>>
>> top - 18:28:09 up 5 days,  3:36,  1 user,  load average: 12.00, 12.00,
>> 11.94
>> Tasks: 168 total,  13 running, 136 sleeping,   0 stopped,  19 zombie
>> Cpu(s):  0.0% us, 37.5% sy,  0.0% ni, 62.5% id,  0.0% wa,  0.0% hi,
>> 0.0% si
>> Mem:  16468196k total,   526828k used, 15941368k free,42996k
>> buffers
>> Swap:  4192924k total,0k used,  4192924k free,   294916k
>> cached
>>
>>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
>> COMMAND
>>  1533 root  20   0 000 R  100  0.0 308:54.05
>> ll_cfg_requeue
>> 32071 root  20   0 000 R  100  0.0 308:15.95
>> socknal_reaper
>> 32073 root  20   0 000 R  100  0.0 308:48.90
>> ptlrpcd
>> 1 root  20   0  4832  588  492 R0  0.0   0:02.48
>> init
>> 2 root  15  -5 000 S0  0.0   0:00.00
>> kthreadd
>>
>>
>> Lustre: OBD class driver, [EMAIL PROTECTED]
>> Lustre Version: 1.6.4.50
>> Build Version: b1_6-20080210103536-
>> CHANGED-.usr.src.linux-2.6.23.12-2.6.23.12
>> Lustre: Added LNI [EMAIL PROTECTED] [8/256]
>> Lustre: Accept secure, port 988
>> Lustre: Lustre Client File System; [EMAIL PROTECTED]
>> Lustre: Binding irq 17 to CPU 0 with cmd: echo 1 > /proc/irq/17/
>> smp_affinity
>> Lustre: [EMAIL PROTECTED]: Reactivating import
>> Lustre: setting import datafs-OST0002_UUID INACTIVE by administrator
>> request
>> Lustre: datafs-OST0002-osc-810241ad7800.osc: set parameter
>> active=0
>> LustreError: 32181:0:(lov_obd.c:230:lov_connect_obd()) not connecting
>> OSC datafs-OST0002_UUID; administratively disabled
>> Lustre: Client datafs-client has started
>> Lustre: Request x7684 sent from [EMAIL PROTECTED] to NID
>> [EMAIL PROTECTED] 15s ago has timed out (limit 15s).
>> LustreError: 166-1: [EMAIL PROTECTED]: Connection to service MGS
>> via nid [EMAIL PROTECTED] was lost; in progress operations using
>> this service will fail.
>> LustreError: 32073:0:(import.c:212:ptlrpc_invalidate_import()) MGS: rc
>> = -110 waiting for callback (1 != 0)
>> LustreError: 32073:0:(import.c:216:ptlrpc_invalidate_import()) @@@
>> still on sending list  [EMAIL PROTECTED] x7684/t0 o400-
>>   
>>> [EMAIL PROTECTED]@tcp:26/25 lens 128/256 e 0 to 11 dl 1202843837
>>> 
>> ref 1 fl Rpc:EXN/0/0 rc -4/0
>> Lustre: Request x7685 sent from datafs-MDT-mdc-810241ad7800 to
>> NID [EMAIL PROTECTED] 115s ago has timed out (limit 15s).
>> Lustre: datafs-MDT-mdc-810241ad7800: Connection to service
>> datafs-MDT via nid [EMAIL PROTECTED] was lost; in progress
>> operations using this service will wait for recovery to complete.
>> Lustre: [EMAIL PROTECTED]: Reactivating import
>> Lustre: [EMAIL PROTECTED]: Connection restored to service MGS
>> using nid [EMAIL PROTECTED]
>> LustreError: 32059:0:(events.c:116:reply_in_callback()) ASSERTION(ev-
>>   
>>> mlength == lustre_msg_early_size()) failed
>>> 
>> LustreError: 32059:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
>>
>> Call Trace:
>>  [] :libcfs:lbug_with_loc+0x73/0xc0
>>  [] :libcfs:libcfs_assertion_failed+0x54/0x60
>>  [] :ptlrpc:reply_in_callback+0x426/0x430
>>  [] :lnet:lnet_enq_event_locked+0xc5/0xf0
>>  [] :lnet:lnet_finalize+0x1e5/0x270
>>  [] :ksocklnd:ksocknal_process_receive+0x469/0xab0
>>  [] :ksocklnd:ksocknal_tx_done+0x80/0x1e0
>>  [] :ksocklnd:ksocknal_scheduler+0x12c/0x7e0
>>  [] autoremove_wake_function+0x0/0x30
>>  [] autoremove_wake_function+0x0/0x30
>>  [] child_rip+0xa/0x12
>>  [] :ksocklnd:ksocknal_scheduler+0x0/0x7e0
>>  [] child_rip+0x0/0x12
>>
>> LustreError: dumping log to /tmp/lustre-log.1202843942.32059
>> Lustre: Request x7707 sent from [EMAIL PROTECTED] to NID
>> [EMAIL PROTECTED] 15s ago has timed out (limit 15s).
>> Lustre: Skipped 2 previous similar messages
>>
>>
>> ___
>> Lustre-devel mailing list
>> [EMAIL PROTECTED]
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>   
> 
> ___
> Lustre-discu

Re: [Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-10 Thread Lundgren, Andrew
We are experiencing the same problem with 1.6.4.2.  We thought it was the 
statahead problems.  After turning off the statahead code, we experienced the 
same problem again.  I had hoped going to 1.6.5 would resolve the issue.  If 
you open a bug, would you mind sending the bug number to the list?  I would 
like to get on the CC list.

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Heiko Schroeter
> Sent: Thursday, July 10, 2008 2:25 AM
> To: [EMAIL PROTECTED]
> Subject: [Lustre-discuss] lustre client 1.6.5.1 hangs
>
> Hello,
>
> we have a _test_ setup for a lustre 1.6.5.1 installation with
> 2 Raid Systems
> (64 Bit Systems) counting for 4 OSTs with 6TB each. One
> combined MDS and MDT
> server (32 Bit system , for testing only).
>
> OST lustre mkfs:
> "mkfs.lustre --param="failover.mode=failout" --fsname
> scia --ost --mkfsoptions='-i 2097152 -E stride=16 -b
> 4096' [EMAIL PROTECTED] /dev/sdb"
> (Our files are quite large 100MB+ on the system)
>
> Kernel: Vanilla Kernel 2.6.22.19, lustre compiled from the
> sources on Gentoo
> 2008.0
>
> The client mount point is /misc/testfs via automount.
> The access can be done through a link from /mnt/testfs -> /misc/testfs
>
> The following procedure hangs a client:
> 1) copy files to the lustre system
> 2) do a 'du -sh /mnt/testfs/willi' while copying
> 3) unmount an OST (here OST0003) while copying
>
> The 'du' job hangs and the lustre file system cannot be
> acessed any longer on
> this client even from other logins. The only way to restore
> normal op is IMHO
> a hard reset of the machine. A reboot hangs because the
> filesystem is still
> active.
> Other clients and there mount points are not affected as long
> as they do not
> access the file system with 'du' 'ls' or so.
> I know that this is drastic but may happen in production by our users.
>
> Deactivating/Reactivating or remounting the OST does not have
> any effect on
> the 'du' job. The 'du' job (#29665 see process list below) and the
> correpsonding lustre thread (#29694) cannot be killed manually.
>
> This behaviour is reproducable. The OST0003 is not
> reactivated on the client
> side though the MDS does so. It seems that this info does not
> propagate to
> the client. See last lines of dmesg below.
>
> What is the proper way (besides avoiding the use of 'du') to
> reactivate the
> client file system ?
>
> Thanks and Regards
> Heiko
>
>
>
>
> The process list on the CLIENT:
> 
> root 29175  5026  0 08:36 ?00:00:00 sshd: laura [priv]
> laura   29177 29175  0 08:36 ?00:00:01 sshd: [EMAIL PROTECTED]/0
> laura   29178 29177  0 08:36 pts/000:00:00 -bash
> laura   29665 29178  0 09:15 pts/000:00:03 du -sh
> /mnt/testfs/foo/fam/
> schell   29694 2  0 09:15 ?00:00:00 [ll_sa_29665]
> root 29695  4846  0 09:15 ?00:00:00
> /usr/sbin/automount --timeout
> 60 --pid-file /var/run/autofs.misc.pid /misc yp auto.misc
> 
>
> and CLIENT dmesg:
> Lustre: 5361:0:(import.c:395:import_select_connection())
> scia-OST0003-osc-8100ea24a000: tried all connections,
> increasing latency
> to 6s
> Lustre: 5361:0:(import.c:395:import_select_connection())
> Skipped 10 previous
> similar messages
> LustreError: 11-0: an error occurred while communicating with
> [EMAIL PROTECTED] The ost_connect operation failed with -19
> LustreError: Skipped 20 previous similar messages
> Lustre: 5361:0:(import.c:395:import_select_connection())
> scia-OST0003-osc-8100ea24a000: tried all connections,
> increasing latency
> to 51s
> Lustre: 5361:0:(import.c:395:import_select_connection())
> Skipped 20 previous
> similar messages
> LustreError: 11-0: an error occurred while communicating with
> [EMAIL PROTECTED] The ost_connect operation failed with -19
> LustreError: Skipped 24 previous similar messages
> Lustre: 5361:0:(import.c:395:import_select_connection())
> scia-OST0003-osc-8100ea24a000: tried all connections,
> increasing latency
> to 51s
> Lustre: 5361:0:(import.c:395:import_select_connection())
> Skipped 24 previous
> similar messages
> LustreError: 167-0: This client was evicted by scia-OST0003;
> in progress
> operations using this service will fail.
>
> The MDS dmesg:
> 
> Lustre: 6108:0:(import.c:395:import_select_connection())
> scia-OST0003-osc:
> tried all connections, increasing latency to 51s
> Lustre: 6108:0:(import.c:395:import_select_connection())
> Skipped 10 previous
> similar messages
> LustreError: 11-0: an error occurred while communicating with
&g

Re: [Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-10 Thread Brian J. Murrell
On Thu, 2008-07-10 at 10:25 +0200, Heiko Schroeter wrote:
> Hello,

Hi.

> OST lustre mkfs:
> "mkfs.lustre --param="failover.mode=failout" --fsname 
   ^^^
Given this (above) parameter setting...

> scia --ost --mkfsoptions='-i 2097152 -E stride=16 -b 
> 4096' [EMAIL PROTECTED] /dev/sdb"

> The following procedure hangs a client:
> 1) copy files to the lustre system
> 2) do a 'du -sh /mnt/testfs/willi' while copying
> 3) unmount an OST (here OST0003) while copying

Do you expect that the copy and du (which are both running at the same
time while you unmount the OST, right?) should both get EIOs?

> Deactivating/Reactivating or remounting the OST does not have any effect on 
> the 'du' job. The 'du' job (#29665 see process list below) and the 
> correpsonding lustre thread (#29694) cannot be killed manually.

That latter process (ll_sa_29665) is statahead at work.

> What is the proper way (besides avoiding the use of 'du') to reactivate the 
> client file system ?

Well, in fact the du and the copy should both EIO when they get to
trying to write to the unmounted OST.

Can you get a stack trace (sysrq-t) on the client after you have
unmounted the OST and processes are hung/blocked?

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-11 Thread Brian J. Murrell
On Fri, 2008-07-11 at 08:24 +0200, Heiko Schroeter wrote:
> 
> Is 'failout' not ok ?

That's up to you.

Failout means that if an OST becomes unreachable (because it has failed
or taken off the network, or unmounted or turned off, etc.) then any I/O
to get objects from that OST will cause a client to get an EIO
(Input/Output error).

Failover means that a client that tries to do I/O to a failed OST will
continue to try (forever) until it gets an answer.  A userspace sees
nothing strange, other than an I/O that takes, potentially, a very long
time to complete.

> Actually we like to use it because we like to use the 
> lustre system as a huge expandable data archive system.

I'm not sure what using failout has to do with that.

> If one OST breaks 
> down and destroys the data on it we can restore them.

Again, failout/failover really has nothing to do with this.  It has
everything to do with what a client does when it sees an OST fail.

> Actually i do expect the client not tho hang any job that acesses the file 
> systerm in this moment. If that needs an EIO and KILL of that process this is 
> fine by me.

Well, no kill should be necessary.  An EIO should terminate an
application.  Unless it has a retry handler for EIOs written into it.
That's not very common.  EIO usually should be interpreted as fatal.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-11 Thread Brian J. Murrell
On Fri, 2008-07-11 at 10:14 +0200, Heiko Schroeter wrote:
> 
> Here is the stack trace. I hope it is the one you requested.

Hrm.

What is strange is that you have configured failout but not getting
EIOs.

Maybe you should file a bug in our bugzilla about this one.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-13 Thread Heiko Schroeter
Am Donnerstag, 10. Juli 2008 19:35:57 schrieben Sie:

Hi.
>
> > OST lustre mkfs:
> > "mkfs.lustre --param="failover.mode=failout" --fsname
>
>^^^
> Given this (above) parameter setting...

Is 'failout' not ok ? Actually we like to use it because we like to use the 
lustre system as a huge expandable data archive system. If one OST breaks 
down and destroys the data on it we can restore them.

> > scia --ost --mkfsoptions='-i 2097152 -E stride=16 -b
> > 4096' [EMAIL PROTECTED] /dev/sdb"
> >
> > The following procedure hangs a client:
> > 1) copy files to the lustre system
> > 2) do a 'du -sh /mnt/testfs/willi' while copying
> > 3) unmount an OST (here OST0003) while copying
>
> Do you expect that the copy and du (which are both running at the same
> time while you unmount the OST, right?

Right.

> ) should both get EIOs? 

Actually i do expect the client not tho hang any job that acesses the file 
systerm in this moment. If that needs an EIO and KILL of that process this is 
fine by me.

> > What is the proper way (besides avoiding the use of 'du') to reactivate
> > the client file system ?
>
> Well, in fact the du and the copy should both EIO when they get to
> trying to write to the unmounted OST.
>
> Can you get a stack trace (sysrq-t) on the client after you have
> unmounted the OST and processes are hung/blocked?

I will get this done today. If the output is very large can i zip it and 
attach it ?

Thank you.
Heiko
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client 1.6.5.1 hangs

2008-07-13 Thread Heiko Schroeter
Am Donnerstag, 10. Juli 2008 19:35:57 schrieb Brian J. Murrell:
> Well, in fact the du and the copy should both EIO when they get to
> trying to write to the unmounted OST.
>
> Can you get a stack trace (sysrq-t) on the client after you have
> unmounted the OST and processes are hung/blocked?

Here is the stack trace. I hope it is the one you requested.

Regards
Heiko


stack_trace.txt.gz
Description: GNU Zip compressed data
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] lustre client /proc cached reads

2008-07-31 Thread John Parhizgari
Simply, I would like to be able to access lustre client data statistics 
for each filesystem that excludes statistics for cached reads.

It is my understanding that for the lustre client (at least on version 
1.6.4.2) the /proc lustre stats for the client for each fs 
(/proc/fs/lustre/llite/FSNAME/stats) report the total number of bytes 
read (reported in the line read_bytes), and that this number also 
includes the total number of bytes read using client-cached data.

First of all, is my understanding correct for this version of lustre? 
And does this apply also for newer versions?

It has been suggested to subtract the lustre IO from the network IO to 
get this data, but this is only applicable if the network is dedicated 
to Lustre IO, which is not the case.

For the moment it seems only the number of cached reads are being 
reported in /proc, and not the actual sizes, so this -seems- difficult 
or impossible.

Is there another way (perhaps even in a newer version of lustre) to find 
the true read rate for the lustre client that excludes cached reads?

--
John Parhizgari
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre Client using 10GigE iWarp

2009-05-29 Thread Dennis Nelson
Hi,

I have a Lustre filesystem where all servers and clients are using DDR IB
interconnects.  Customer wants to add an additional client which has a 10
GigE interface that they want to use to connect to the filesystem.

The Lustre documentation says that iWarp interfaces are supported using OFED
drivers.  What would the syntax of the modprobe.conf.local file look like?

I have tried this:

options lnet networks=o2ib(eth2)

Is that correct?

The client is connected to the IB fabric using a Voltaire 10 GigE line card
on a IB switch.  Has anyone tested such a configuration?  Should I expect it
to work?

Thanks,

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Client Access SAN

2009-07-27 Thread Andreas Dilger
On Jul 27, 2009  13:19 +0600, Tharindu Rukshan Bamunuarachchi wrote:
> 1.   does Lustre client directly access SAN/storage (like GFS, OCFS or
> Sun Cluster SVM)

No, client only communicates with the server, and only a single server
will access storage at any one time.

> 2.   if client connects over network, 
> a.   will TCP/IP performance directly hit on CFS

Depends on what you do.  Lustre makes VERY efficient use of GigE network,
but it can also use InfiniBand natively for lower latency and higher
bandwidth (IPoIB used only for addressing the nodes).

> b.  cannot I keep client on same machine as OSS/MDS etc.

This will work, but is not a 100% supported configuration because of
a small risk of deadlock.  A patch to reduce the chance of local
deadlock has been landed for 1.8.2, but it still isn't one of our
supported configs (i.e. we don't test this heavily ourselves).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre Client network connection info.

2009-09-08 Thread Lundgren, Andrew
We have OSS machines that also have read only clients mounted.

My OSTs and clients mount and function correctly to one of my two MDS machines. 
 Only the OSTs will mount and connect to the MDS failover peer.  The clients 
won't mount.  There are timeouts logged.

I checked the IPs, so I am thinking that it is probably a network connectivity 
issue.  What does the client network wise that the OSTs do not?

Thank you!

--
Andrew
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-26 Thread Oleg Drokin
Hello!

On Sep 26, 2009, at 1:57 AM, Nick Jennings wrote:

>  About an hour ago the client completely hung. Hosting co. says it was
> a kernel panic. I got not useful feedback in /var/log/messages from  
> the
> client or the MDS. However from the OST I got several complaints.
> (below).
> Does anyone have any insight into the problem? All help as to how I  
> can
> fix this, or avoid the problem, greatly appreciated.

The traces you see is a known bug (19557), it happens when client is  
evicted
that had too many locks cached.
Unfortunately that provides us with zero insight into what happened to  
the client
and MDS.

Bye,
 Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-26 Thread Nick Jennings

On Sat, 2009-09-26 at 04:11 -0400, Oleg Drokin wrote:
> Hello!
> 
> On Sep 26, 2009, at 1:57 AM, Nick Jennings wrote:
> 
> >  About an hour ago the client completely hung. Hosting co. says it was
> > a kernel panic. I got not useful feedback in /var/log/messages from  
> > the
> > client or the MDS. However from the OST I got several complaints.
> > (below).
> > Does anyone have any insight into the problem? All help as to how I  
> > can
> > fix this, or avoid the problem, greatly appreciated.
> 
> The traces you see is a known bug (19557), it happens when client is  
> evicted
> that had too many locks cached.
> Unfortunately that provides us with zero insight into what happened to  
> the client
> and MDS.

 Hi Oleg! How ya doing? :)

 Unfortunately that was the only info I could get. The client had no
information in the logs about what happened. The MDS only had the
following entry near the time:

Sep 25 22:28:43 dbn1 kernel: Lustre: MGS: haven't heard from client
ab5e5f08-e39d-385d-f7e3-fbd1addb0fac (at 10.0.0...@tcp1) in 248 seconds.
I think it's dead, and I am evicting it.

 Is there any other info I should be gathering when something like this
happens? (Sorry, it's been a while since I've done any lustre bug
reporting) :)

Cheers,
-Nick



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-26 Thread Brian J. Murrell
On Sat, 2009-09-26 at 14:53 +0200, Nick Jennings wrote:
> 

Heya Nick,

>  Unfortunately that was the only info I could get. The client had no
> information in the logs about what happened.

They usually don't when they panic.

> The MDS only had the
> following entry near the time:
> 
> Sep 25 22:28:43 dbn1 kernel: Lustre: MGS: haven't heard from client
> ab5e5f08-e39d-385d-f7e3-fbd1addb0fac (at 10.0.0...@tcp1) in 248 seconds.
> I think it's dead, and I am evicting it.

That's because the client panic'd.

>  Is there any other info I should be gathering when something like this
> happens? (Sorry, it's been a while since I've done any lustre bug
> reporting) :)

The only/most useful info you can get from a client panic is what was on
the console when it panic'd.  Does the "Hosting co." not have their
machines hooked up to (i.e. serial) consoles?

netconsole can usually be useful as a substitute for a physically
connected serial console.  Do you know if the client kernel has
netconsole (usually bundled with netdump) available?  Can you put a new
kernel on the client that includes netdump?  You could set up a netdump
server somewhere (across the Internet even) then and see if you can get
anything useful when it panics.

Cheers,
b.




signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-26 Thread Oleg Drokin
Hello!

On Sep 26, 2009, at 9:37 AM, Brian J. Murrell wrote:

>> Unfortunately that was the only info I could get. The client had no
>> information in the logs about what happened.
> They usually don't when they panic.

Right.
RHEL configured to have panic on oops too, if you disable that (in / 
etc/sysctl.conf)
there is a bigger chance oopses would make it to the log.

> netconsole can usually be useful as a substitute for a physically
> connected serial console.  Do you know if the client kernel has
> netconsole (usually bundled with netdump) available?  Can you put a  
> new
> kernel on the client that includes netdump?  You could set up a  
> netdump
> server somewhere (across the Internet even) then and see if you can  
> get
> anything useful when it panics.

Alternatively you can try to configure kdump and dump the entire
panicked kernel image to local disk which potentially has even
more chances to contain useful information in case of panics.

Bye,
 Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-27 Thread Nick Jennings
On Sat, 2009-09-26 at 14:20 -0400, Oleg Drokin wrote:
> On Sep 26, 2009, at 9:37 AM, Brian J. Murrell wrote:
> 
> >> Unfortunately that was the only info I could get. The client had no
> >> information in the logs about what happened.
> > They usually don't when they panic.
> 
> Right.
> RHEL configured to have panic on oops too, if you disable that (in / 
> etc/sysctl.conf)
> there is a bigger chance oopses would make it to the log.
> 
> > netconsole can usually be useful as a substitute for a physically
> > connected serial console.  Do you know if the client kernel has
> > netconsole (usually bundled with netdump) available?  Can you put a  
> > new
> > kernel on the client that includes netdump?  You could set up a  
> > netdump
> > server somewhere (across the Internet even) then and see if you can  
> > get
> > anything useful when it panics.
> 
> Alternatively you can try to configure kdump and dump the entire
> panicked kernel image to local disk which potentially has even
> more chances to contain useful information in case of panics.

I can look into all of these possibilities soon, and try to get more
information the next time it happens, but as I understand it from
reading the bug, this is already a known issue, and hasn't been fixed
yet (slated for 2.0 release as of now). Correct? If so, is the data I
gather really of much use?

Thanks,
Nick




signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-27 Thread Brian J. Murrell
On Sun, 2009-09-27 at 14:58 +0200, Nick Jennings wrote:
> 
> I can look into all of these possibilities soon, and try to get more
> information the next time it happens, but as I understand it from
> reading the bug, this is already a known issue, and hasn't been fixed
> yet (slated for 2.0 release as of now). Correct?

The client panic is not likely directly related that bug and that bug
aside, we still need the crash info in order to diagnose the client
panic.

> If so, is the data I
> gather really of much use?

To diagnosing the client panic?  Absolutely.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client kernel panic

2009-09-28 Thread Nick Jennings

On Sun, 2009-09-27 at 11:41 -0400, Brian J. Murrell wrote:
> On Sun, 2009-09-27 at 14:58 +0200, Nick Jennings wrote:
> > 
> > I can look into all of these possibilities soon, and try to get more
> > information the next time it happens, but as I understand it from
> > reading the bug, this is already a known issue, and hasn't been fixed
> > yet (slated for 2.0 release as of now). Correct?
> 
> The client panic is not likely directly related that bug and that bug
> aside, we still need the crash info in order to diagnose the client
> panic.
> 
> > If so, is the data I
> > gather really of much use?
> 
> To diagnosing the client panic?  Absolutely.
> 

Aha, I understand. I'll try to set something up to get more information
if the lockup happens again.

Thanks,
-Nick


signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] lustre client network stack hanging

2009-10-22 Thread Derek Yarnell
So I have been trying to find out if someone else has reported or  
found something similar.  I would be happy to create a bug report but  
I searched bugzilla for a bit and haven't found out much.  So the  
weirdest thing is that the MDS/OSS servers are fine but the clients  
whole network stack gets screwed up.  I mean it stops pinging which is  
just very odd that Lustre is causing problems to this extent.

Anyone heard or know of anything like this attached are the syslogs  
from when the clients network stack hung and the MDS/MGS.

Note: the client cfd-mds-01 is not running any MDS/MGT services just a  
patch-less client for now.

Client (lustre-client-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1)

Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(socklnd.c: 
1661:ksocknal_destroy_conn()) Completing partial receive from  
12345-192.168.14...@tcp, ip 192.168.14.23:1022, with error
Oct 22 12:35:11 cfd-mds-01 kernel: LustreError: 4682:0:(events.c: 
189:client_bulk_callback()) event type 1, status -5, desc  
8100c7672000
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1020 ->  
192.168.14.20/988
Oct 22 12:37:59 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:41:09 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:44:20 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:47:33 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:50:51 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:54:16 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 12:57:57 cfd-mds-01 kernel: Lustre: 4679:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:02:07 cfd-mds-01 kernel: Lustre: 4680:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:06:16 cfd-mds-01 kernel: Lustre: 4678:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(linux-tcpip.c: 
688:libcfs_sock_connect()) Error -110 connecting 192.168.14.21/1021 ->  
192.168.14.20/988
Oct 22 13:10:25 cfd-mds-01 kernel: Lustre: 4681:0:(acceptor.c: 
102:lnet_connect_console_error()) Connection to 192.168.14...@tcp at  
host 192.168.14.20 on port 988 took too long: that node may be hung or  
experiencing high load.
Oct 22 13:14:34 cfd-mds-01 kernel: Lustre: 4679:0:(linux-

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-19 Thread Andreas Dilger
There is a known problem with the DLM LRU size that may be affecting  
you. It may be something else too. Please check /proc/ 
{slabinfo,meminfo} to see what is using the memory on the client.

Cheers, Andreas

On 2010-04-19, at 10:43, Jagga Soorma  wrote:

> Hi Guys,
>
> My users are reporting some issues with memory on our lustre 1.8.1  
> clients.  It looks like when they submit a single job at a time the  
> run time was about 4.5 minutes.  However, when they ran multiple  
> jobs (10 or less) on a client with 192GB of memory on a single node  
> the run time for each job was exceeding 3-4X the run time for the  
> single process.  They also noticed that the swap space kept climbing  
> even though there was plenty of free memory on the system.  Could  
> this possibly be related to the lustre client?  Does it reserve any  
> memory that is not accessible by any other process even though it  
> might not be in use?
>
> Thanks much,
> -J
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-19 Thread Jagga Soorma
Thanks for the response Andreas.

What is the known problem with the DLM LRU size?  Here is what my
slabinfo/meminfo look like on one of the clients.  I don't see anything out
of the ordinary:

(then again there are no jobs currently running on this system)

Thanks
-J

--
slabinfo:
..
slabinfo - version: 2.1
# name   
 : tunables: slabdata
  
nfs_direct_cache   0  0128   301 : tunables  120   608 :
slabdata  0  0  0
nfs_write_data36 44704   112 : tunables   54   278 :
slabdata  4  4  0
nfs_read_data 32 33704   112 : tunables   54   278 :
slabdata  3  3  0
nfs_inode_cache0  098441 : tunables   54   278 :
slabdata  0  0  0
nfs_page   0  0128   301 : tunables  120   608 :
slabdata  0  0  0
rpc_buffers8  8   204821 : tunables   24   128 :
slabdata  4  4  0
rpc_tasks  8 12320   121 : tunables   54   278 :
slabdata  1  1  0
rpc_inode_cache0  083241 : tunables   54   278 :
slabdata  0  0  0
ll_async_page 326589 328572320   121 : tunables   54   278 :
slabdata  27381  27381  0
ll_file_data   0  0192   201 : tunables  120   608 :
slabdata  0  0  0
lustre_inode_cache76977289641 : tunables   54   278
: slabdata193193  0
lov_oinfo   1322   1392320   121 : tunables   54   278 :
slabdata116116  0
osc_quota_info 0  0 32  1121 : tunables  120   608 :
slabdata  0  0  0
ll_qunit_cache 0  0112   341 : tunables  120   608 :
slabdata  0  0  0
llcd_cache 0  0   395211 : tunables   24   128 :
slabdata  0  0  0
ptlrpc_cbdatas 0  0 32  1121 : tunables  120   608 :
slabdata  0  0  0
interval_node   1166   3240128   301 : tunables  120   608 :
slabdata108108  0
ldlm_locks  2624   368851281 : tunables   54   278 :
slabdata461461  0
ldlm_resources  2002   3340384   101 : tunables   54   278 :
slabdata334334  0
ll_import_cache0  0   124831 : tunables   24   128 :
slabdata  0  0  0
ll_obdo_cache  0 452282156208   191 : tunables  120   60
8 : slabdata  0 23804324  0
ll_obd_dev_cache  13 13   567212 : tunables840 :
slabdata 13 13  0
obd_lvfs_ctxt_cache  0  0 96   401 : tunables  120   608
: slabdata  0  0  0
SDP0  0   172842 : tunables   24   128 :
slabdata  0  0  0
fib6_nodes 7118 64   591 : tunables  120   608 :
slabdata  2  2  0
ip6_dst_cache 14 36320   121 : tunables   54   278 :
slabdata  3  3  0
ndisc_cache4 30256   151 : tunables  120   608 :
slabdata  2  2  0
RAWv6 35 3696041 : tunables   54   278 :
slabdata  9  9  0
UDPLITEv6  0  096041 : tunables   54   278 :
slabdata  0  0  0
UDPv6  7 1296041 : tunables   54   278 :
slabdata  3  3  0
tw_sock_TCPv6  0  0192   201 : tunables  120   608 :
slabdata  0  0  0
request_sock_TCPv6  0  0192   201 : tunables  120   608
: slabdata  0  0  0
TCPv6  2  4   179221 : tunables   24   128 :
slabdata  2  2  0
ib_mad  2069   216044881 : tunables   54   278 :
slabdata270270  6
fuse_request   0  060861 : tunables   54   278 :
slabdata  0  0  0
fuse_inode 0  0704   112 : tunables   54   278 :
slabdata  0  0  0
kcopyd_job 0  0360   111 : tunables   54   278 :
slabdata  0  0  0
dm_uevent  0  0   260832 : tunables   24   128 :
slabdata  0  0  0
dm_clone_bio_info  0  0 16  2021 : tunables  120   608 :
slabdata  0  0  0
dm_rq_target_io0  040891 : tunables   54   278 :
slabdata  0  0  0
dm_target_io   0  0 24  1441 : tunables  120   608 :
slabdata  0  0  0
dm_io  0  0 32  1121 : tunables  120   608 :
slabdata  0  0  0
uhci_urb_priv  1 67 56   671 : tunables  120   608 :
slabdata  1  1  0
ext3_inode_cache  224598 224625

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-19 Thread Jagga Soorma
Actually this does not seem correct:

SUnreclaim:   95407476 kB

Shouldn't this be a lot smaller?

-Simran

On Mon, Apr 19, 2010 at 10:16 AM, Jagga Soorma  wrote:

> Thanks for the response Andreas.
>
> What is the known problem with the DLM LRU size?  Here is what my
> slabinfo/meminfo look like on one of the clients.  I don't see anything out
> of the ordinary:
>
> (then again there are no jobs currently running on this system)
>
> Thanks
> -J
>
> --
> slabinfo:
> ..
> slabinfo - version: 2.1
> # name   
>  : tunables: slabdata
>   
> nfs_direct_cache   0  0128   301 : tunables  120   608
> : slabdata  0  0  0
> nfs_write_data36 44704   112 : tunables   54   278
> : slabdata  4  4  0
> nfs_read_data 32 33704   112 : tunables   54   278
> : slabdata  3  3  0
> nfs_inode_cache0  098441 : tunables   54   278
> : slabdata  0  0  0
> nfs_page   0  0128   301 : tunables  120   608
> : slabdata  0  0  0
> rpc_buffers8  8   204821 : tunables   24   128
> : slabdata  4  4  0
> rpc_tasks  8 12320   121 : tunables   54   278
> : slabdata  1  1  0
> rpc_inode_cache0  083241 : tunables   54   278
> : slabdata  0  0  0
> ll_async_page 326589 328572320   121 : tunables   54   278
> : slabdata  27381  27381  0
> ll_file_data   0  0192   201 : tunables  120   608
> : slabdata  0  0  0
> lustre_inode_cache76977289641 : tunables   54   278
> : slabdata193193  0
> lov_oinfo   1322   1392320   121 : tunables   54   278
> : slabdata116116  0
> osc_quota_info 0  0 32  1121 : tunables  120   608
> : slabdata  0  0  0
> ll_qunit_cache 0  0112   341 : tunables  120   608
> : slabdata  0  0  0
> llcd_cache 0  0   395211 : tunables   24   128
> : slabdata  0  0  0
> ptlrpc_cbdatas 0  0 32  1121 : tunables  120   608
> : slabdata  0  0  0
> interval_node   1166   3240128   301 : tunables  120   608
> : slabdata108108  0
> ldlm_locks  2624   368851281 : tunables   54   278
> : slabdata461461  0
> ldlm_resources  2002   3340384   101 : tunables   54   278
> : slabdata334334  0
> ll_import_cache0  0   124831 : tunables   24   128
> : slabdata  0  0  0
> ll_obdo_cache  0 452282156208   191 : tunables  120   60
> 8 : slabdata  0 23804324  0
> ll_obd_dev_cache  13 13   567212 : tunables840
> : slabdata 13 13  0
> obd_lvfs_ctxt_cache  0  0 96   401 : tunables  120   60
> 8 : slabdata  0  0  0
> SDP0  0   172842 : tunables   24   128
> : slabdata  0  0  0
> fib6_nodes 7118 64   591 : tunables  120   608
> : slabdata  2  2  0
> ip6_dst_cache 14 36320   121 : tunables   54   278
> : slabdata  3  3  0
> ndisc_cache4 30256   151 : tunables  120   608
> : slabdata  2  2  0
> RAWv6 35 3696041 : tunables   54   278
> : slabdata  9  9  0
> UDPLITEv6  0  096041 : tunables   54   278
> : slabdata  0  0  0
> UDPv6  7 1296041 : tunables   54   278
> : slabdata  3  3  0
> tw_sock_TCPv6  0  0192   201 : tunables  120   608
> : slabdata  0  0  0
> request_sock_TCPv6  0  0192   201 : tunables  120   608
> : slabdata  0  0  0
> TCPv6  2  4   179221 : tunables   24   128
> : slabdata  2  2  0
> ib_mad  2069   216044881 : tunables   54   278
> : slabdata270270  6
> fuse_request   0  060861 : tunables   54   278
> : slabdata  0  0  0
> fuse_inode 0  0704   112 : tunables   54   278
> : slabdata  0  0  0
> kcopyd_job 0  0360   111 : tunables   54   278
> : slabdata  0  0  0
> dm_uevent  0  0   260832 : tunables   24   128
> : slabdata  0  0  0
> dm_clone_bio_info  0  0 16  2021 : tunables  120   608
> : slabdata  0  0  0
> dm_rq_target_io0  040891 : tunables   54   278
> : slabdata  0  0  0

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-19 Thread Andreas Dilger

On 2010-04-19, at 11:16, Jagga Soorma wrote:

What is the known problem with the DLM LRU size?


It is mostly a problem on the server, actually.

  Here is what my slabinfo/meminfo look like on one of the clients.   
I don't see anything out of the ordinary:


(then again there are no jobs currently running on this system)

slabinfo - version: 2.1
# name 
 : tunables:  
slabdata   


ll_async_page 326589 328572320   121 : tunables   54
278 : slabdata  27381  27381  0


This shows you have 326589 pages in the lustre filesystem cache, or  
about 1275MB of data.  That shouldn't be too much for a system with  
192GB of RAM...


lustre_inode_cache76977289641 : tunables   54
278 : slabdata193193  0
ldlm_locks  2624   368851281 : tunables   54
278 : slabdata461461  0
ldlm_resources  2002   3340384   101 : tunables   54
278 : slabdata334334  0


Only about 2600 locks on 770 files is fine (this is what the DLM LRU  
size would affect, if it were out of control, which it isn't).


ll_obdo_cache  0 452282156208   191 : tunables   
120   608 : slabdata  0 23804324  0


This is really out of whack.  The "obdo" struct should normally only  
be allocated for a short time and then freed again, but here you have  
452M of them using over 90GB of RAM.  It looks like a leak of some  
kind, which is a bit surprising since we have fairly tight checking  
for memory leaks in the Lustre code.


Are you running some unusual workload that is maybe walking an unusual  
code path?  What you can do to track down memory leaks is enable  
Lustre memory tracing, increase the size of the debug buffer to catch  
enough tracing to be useful, and then run your job to see what is  
causing the leak, dump the kernel debug log, and then run leak- 
finder.pl (attached, and also in Lustre sources):


client# lctl set_param debug=+malloc
client# lctl set_param debug_mb=256
client$ {run job}
client# sync
client# lctl dk /tmp/debug
client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
client# lctl set_param debug=-malloc
client# lctl set_param debug_mb=32

Since this is a running system, it will report spurious leaks for some  
kinds of allocations that remain in memory for some time (e.g. cached  
pages, inodes, etc), but with the exception of uncommitted RPCs (of  
which there should be none after the sync) there should not be any  
leaked obdo.



On 2010-04-19, at 10:43, Jagga Soorma  wrote:
My users are reporting some issues with memory on our lustre 1.8.1  
clients.  It looks like when they submit a single job at a time the  
run time was about 4.5 minutes.  However, when they ran multiple  
jobs (10 or less) on a client with 192GB of memory on a single node  
the run time for each job was exceeding 3-4X the run time for the  
single process.  They also noticed that the swap space kept  
climbing even though there was plenty of free memory on the  
system.  Could this possibly be related to the lustre client?  Does  
it reserve any memory that is not accessible by any other process  
even though it might not be in use?




Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
#!/usr/bin/perl -w

use IO::Handle;

STDOUT->autoflush(1);
STDERR->autoflush(1);

my ($line, $memory);
my $debug_line = 0;

my $total = 0;
my $max = 0;

while ($line = <>) {
$debug_line++;
my ($file, $func, $lno, $name, $size, $addr, $type);
if ($line =~ m/^.*(\.).*\((.*):(\d+):(.*)\(\)\) (k|v|slab-)(.*) '(.*)': (\d+) at ([\da-f]+)/){
$file = $2;
$lno  = $3;
$func = $4;
$type = $6;
$name = $7;
$size = $8;
$addr = $9;

	# we can't dump the log after portals has exited, so skip "leaks"
	# from memory freed in the portals module unloading.
	if ($func eq 'portals_handle_init') {
	next;
	}
printf("%8s %6d bytes at %s called %s (%s:%s:%d)\n", $type, $size,
   $addr, $name, $file, $func, $lno);
} else {
next;
}

if (index($type, 'alloced') >= 0) {
if (defined($memory->{$addr})) {
print STDERR "*** Two allocs with the same address ($size bytes at $addr, $file:$func:$lno)\n";
print STDERR "first malloc at $memory->{$addr}->{file}:$memory->{$addr}->{func}:$memory->{$addr}->{lno}, second at $file:$func:$lno\n";
next;
}

$memory->{$addr}->{name} = $name;
$memory->{$addr}->{size} = $size;
$memory->{$addr}->{file} = $file;
$memory->{$addr}->{func} = $func;
$memory->{$addr}->{lno} = $lno;
$memory->{$addr}->{debug_line} = $debug_line;

$total += $size;
if ($total > $max) {
$max = $total;
}
} else {
if (!defined($memory->{$addr})) {
print STDERR "*** Free without malloc ($name=$size

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-20 Thread Jagga Soorma
Hi Andreas,

Thanks for your response.  I will try to run the leak-finder script and
hopefully it will point us in the right direction.  This only seems to be
happening on some of my clients:

--
client112: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client108: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client110: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client107: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client111: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client109: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client102: ll_obdo_cache  5 38208   191 : tunables
120   608 : slabdata  2  2  1
client114: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client105: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client103: ll_obdo_cache  0  0208   191 : tunables
120   608 : slabdata  0  0  0
client104: ll_obdo_cache  0 433506280208   191 : tunables
120   608 : slabdata  0 22816120  0
client116: ll_obdo_cache  0 457366746208   191 : tunables
120   608 : slabdata  0 24071934  0
client113: ll_obdo_cache  0 456778867208   191 : tunables
120   608 : slabdata  0 24040993  0
client106: ll_obdo_cache  0 456372267208   191 : tunables
120   608 : slabdata  0 24019593  0
client115: ll_obdo_cache  0 449929310208   191 : tunables
120   608 : slabdata  0 23680490  0
client101: ll_obdo_cache  0 454318101208   191 : tunables
120   608 : slabdata  0 23911479  0
--

Hopefully this should help.  Not sure which application might be causing the
leaks.  Currently R is the only app that users seem to be using heavily on
these clients.  Will let you know what I find.

Thanks again,
-J

On Mon, Apr 19, 2010 at 9:04 PM, Andreas Dilger
wrote:

> On 2010-04-19, at 11:16, Jagga Soorma wrote:
>
>> What is the known problem with the DLM LRU size?
>>
>
> It is mostly a problem on the server, actually.
>
>   Here is what my slabinfo/meminfo look like on one of the clients.  I
>> don't see anything out of the ordinary:
>>
>> (then again there are no jobs currently running on this system)
>>
>> slabinfo - version: 2.1
>> # name   
>>  : tunables: slabdata
>>   
>>
>
>  ll_async_page 326589 328572320   121 : tunables   54   278
>> : slabdata  27381  27381  0
>>
>
> This shows you have 326589 pages in the lustre filesystem cache, or about
> 1275MB of data.  That shouldn't be too much for a system with 192GB of
> RAM...
>
>  lustre_inode_cache76977289641 : tunables   54   27
>>  8 : slabdata193193  0
>> ldlm_locks  2624   368851281 : tunables   54   278
>> : slabdata461461  0
>> ldlm_resources  2002   3340384   101 : tunables   54   278
>> : slabdata334334  0
>>
>
> Only about 2600 locks on 770 files is fine (this is what the DLM LRU size
> would affect, if it were out of control, which it isn't).
>
>  ll_obdo_cache  0 452282156208   191 : tunables  120   60
>>  8 : slabdata  0 23804324  0
>>
>
> This is really out of whack.  The "obdo" struct should normally only be
> allocated for a short time and then freed again, but here you have 452M of
> them using over 90GB of RAM.  It looks like a leak of some kind, which is a
> bit surprising since we have fairly tight checking for memory leaks in the
> Lustre code.
>
> Are you running some unusual workload that is maybe walking an unusual code
> path?  What you can do to track down memory leaks is enable Lustre memory
> tracing, increase the size of the debug buffer to catch enough tracing to be
> useful, and then run your job to see what is causing the leak, dump the
> kernel debug log, and then run leak-finder.pl (attached, and also in
> Lustre sources):
>
> client# lctl set_param debug=+malloc
> client# lctl set_param debug_mb=256
> client$ {run job}
> client# sync
> client# lctl dk /tmp/debug
> client# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
> client# lctl set_param debug=-malloc
> client# lctl set_param debug_mb=32
>
> Since this is a running system, it will report spurious leaks for some
> kinds of allocations that remain in memory for some time (e.g. cached pages,
> inodes, etc), but with the exception of uncommitted RPCs (of which there
> should be none after the sync) there should not be any leaked obdo.

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-26 Thread Tommi T
Hi

Here is leak from our system, not sure if this is the same issue. Application 
is molpro which is causing this leak. (pristine lustre 1.8.2)

 total   used   free shared    buffers cached
Mem:  32962040   32869120  92920  0    632    4201332
-/+ buffers/cache:   28667156    4294884
Swap:  1020116    1020116  0


# perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
*** Leak: ((oa))=208 bytes allocated at 8106c7e39be0 
(osc_request.c:osc_build_req:2278, debug file line 2834)
*** Leak: ((oa))=208 bytes allocated at 8102bbd96220 
(osc_request.c:osc_build_req:2278, debug file line 7844)
*** Leak: ((oa))=208 bytes allocated at 81039196e970 
(osc_request.c:osc_build_req:2278, debug file line 12308)
*** Leak: ((oa))=208 bytes allocated at 8103fc5833c0 
(osc_request.c:osc_build_req:2278, debug file line 14137)
*** Leak: ((oa))=208 bytes allocated at 8103fc583560 
(osc_request.c:osc_build_req:2278, debug file line 18273)
*** Leak: ((oa))=208 bytes allocated at 810196049080 
(osc_request.c:osc_build_req:2278, debug file line 22003)
*** Leak: ((oa))=208 bytes allocated at 8103fc583080 
(osc_request.c:osc_build_req:2278, debug file line 50346)
*** Leak: ((oa))=208 bytes allocated at 8102bbd967d0 
(osc_request.c:osc_build_req:2278, debug file line 52119)
*** Leak: ((oa))=208 bytes allocated at 8103fc583220 
(osc_request.c:osc_build_req:2278, debug file line 53941)
*** Leak: ((oa))=208 bytes allocated at 8102fe6b8700 
(osc_request.c:osc_build_req:2278, debug file line 56284)
*** Leak: ((oa))=208 bytes allocated at 8106c7e39080 
(osc_request.c:osc_build_req:2278, debug file line 62691)
*** Leak: ((oa))=208 bytes allocated at 8106bad438a0 
(osc_request.c:osc_build_req:2278, debug file line 64219)
*** Leak: ((oa))=208 bytes allocated at 8106bad432f0 
(osc_request.c:osc_build_req:2278, debug file line 64555)
*** Leak: ((oa))=208 bytes allocated at 8106c7e39630 
(osc_request.c:osc_build_req:2278, debug file line 66742)
*** Leak: ((oa))=208 bytes allocated at 8102bbd968a0 
(osc_request.c:osc_build_req:2278, debug file line 67819)
*** Leak: ((oa))=208 bytes allocated at 8102fe6b8e50 
(osc_request.c:osc_build_req:2278, debug file line 79362)
*** Leak: ((oa))=208 bytes allocated at 8106c7e39cb0 
(osc_request.c:osc_build_req:2278, debug file line 80852)
*** Leak: ((oa))=208 bytes allocated at 8106bad437d0 
(osc_request.c:osc_build_req:2278, debug file line 81220)
*** Leak: ((oa))=208 bytes allocated at 8106bad43490 
(osc_request.c:osc_build_req:2278, debug file line 81228)
*** Leak: ((oa))=208 bytes allocated at 8106c7e39d80 
(osc_request.c:osc_build_req:2278, debug file line 82818)
*** Leak: ((oa))=208 bytes allocated at 810196049be0 
(osc_request.c:osc_build_req:2278, debug file line 83205)
*** Leak: ((oa))=208 bytes allocated at 8106c7e39220 
(osc_request.c:osc_build_req:2278, debug file line 84102)
*** Leak: ((oa))=208 bytes allocated at 8103fc583700 
(osc_request.c:osc_build_req:2278, debug file line 86491)
*** Leak: ((oa))=208 bytes allocated at 8106c7e39490 
(osc_request.c:osc_build_req:2278, debug file line 87882)
*** Leak: ((oa))=208 bytes allocated at 81034fbd1e50 
(osc_request.c:osc_build_req:2278, debug file line 88296)
*** Leak: ((oa))=208 bytes allocated at 8101960498a0 
(osc_request.c:osc_build_req:2278, debug file line 97721)
*** Leak: ((oa))=208 bytes allocated at 8103fc583630 
(osc_request.c:osc_build_req:2278, debug file line 112202)
*** Leak: ((oa))=208 bytes allocated at 8102bbd96be0 
(osc_request.c:osc_build_req:2278, debug file line 131547)
*** Leak: ((oa))=208 bytes allocated at 8102fe6b8cb0 
(osc_request.c:osc_build_req:2278, debug file line 132286)
*** Leak: ((oa))=208 bytes allocated at 81034fbd1a40 
(osc_request.c:osc_build_req:2278, debug file line 136195)
*** Leak: ((oa))=208 bytes allocated at 81034fbd13c0 
(osc_request.c:osc_build_req:2278, debug file line 181450)
*** Leak: ((oa))=208 bytes allocated at 810630625970 
(osc_request.c:osc_build_req:2278, debug file line 199146)
*** Leak: ((oa))=208 bytes allocated at 810630625be0 
(osc_request.c:osc_build_req:2278, debug file line 199428)
*** Leak: ((oa))=208 bytes allocated at 810630625080 
(osc_request.c:osc_build_req:2278, debug file line 200627)
*** Leak: ((oa))=208 bytes allocated at 81034fbd17d0 
(osc_request.c:osc_build_req:2278, debug file line 206924)
*** Leak: ((oa))=208 bytes allocated at 81034fbd1150 
(osc_request.c:osc_build_req:2278, debug file line 207207)
*** Leak: ((oa))=208 bytes allocated at 810630625a40 
(osc_request.c:osc_build_req:2278, debug file line 210587)
*** Leak: ((oa))=208 bytes allocated at 8107402c7630 
(osc_request.c:osc_build_req:2278, debug file line 211246)
*** Leak: ((oa))=208 bytes allocated at 8107402c72f0 
(osc_request.c:osc_build_req:2278, debug file line 211639)
*** Leak: ((oa))=208 bytes a

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-26 Thread Johann Lombardi
Hi,

On Mon, Apr 26, 2010 at 01:16:37AM -0700, Tommi T wrote:
> Here is leak from our system, not sure if this is the same issue.
> Application is molpro which is causing this leak. (pristine lustre 1.8.2)
> 
>   total   used   free sharedbuffers cached
> Mem:  32962040   32869120  92920  06324201332
> -/+ buffers/cache:   286671564294884
> Swap:  10201161020116  0
> 
># perl leak-finder.pl < /tmp/debug 2>&1 | grep "Leak.*oa"
>*** Leak: ((oa))=208 bytes allocated at 8106c7e39be0
>(osc_request.c:osc_build_req:2278, debug file line 2834)
[..]

interesting. could you please collect a debug log with a debug mask
set to -1 and attach it to a bugzilla ticket?

Thanks in advance.

Cheers,
Johann
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-27 Thread Johann Lombardi
Hi,

On Tue, Apr 20, 2010 at 09:08:25AM -0700, Jagga Soorma wrote:
> Thanks for your response.* I will try to run the leak-finder script and
> hopefully it will point us in the right direction.* This only seems to be
> happening on some of my clients:

Could you please tell us what kernel you use on the client side?

>client104: ll_obdo_cache* 0 433506280*** 208** 19*** 1 : tunables*
>120** 60*** 8 : slabdata* 0 22816120* 0
>client116: ll_obdo_cache* 0 457366746*** 208** 19*** 1 : tunables*
>120** 60*** 8 : slabdata* 0 24071934* 0
>client113: ll_obdo_cache* 0 456778867*** 208** 19*** 1 : tunables*
>120** 60*** 8 : slabdata* 0 24040993* 0
>client106: ll_obdo_cache* 0 456372267*** 208** 19*** 1 : tunables*
>120** 60*** 8 : slabdata* 0 24019593* 0
>client115: ll_obdo_cache* 0 449929310*** 208** 19*** 1 : tunables*
>120** 60*** 8 : slabdata* 0 23680490* 0
>client101: ll_obdo_cache* 0 454318101*** 208** 19*** 1 : tunables*
>120** 60*** 8 : slabdata* 0 23911479* 0
>--
> 
>Hopefully this should help.* Not sure which application might be causing
>the leaks.* Currently R is the only app that users seem to be using
>heavily on these clients.* Will let you know what I find.

Tommi Tervo has filed a bugzilla ticket for this issue, see
https://bugzilla.lustre.org/show_bug.cgi?id=22701

Could you please add a comment to this ticket to describe the
behavior of the application "R" (fork many threads, write to
many files, use direct i/o, ...)?

Cheers,
Johann
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-04-28 Thread Jagga Soorma
Hi Johann,

I am actually using 1.8.1 and not 1.8.2:

# rpm -qa | grep -i lustre
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default

My kernel version on the SLES 11 clients is:
# uname -r
2.6.27.29-0.1-default

My kernel version on the RHEL 5.3 mds/oss servers is:
# uname -r
2.6.18-128.7.1.el5_lustre.1.8.1.1

Please let me know if you need any further information.  I am still trying
to get the user to help me run his app so that I can run the leak finder
script to capture more information.

Regards,
-Simran

On Tue, Apr 27, 2010 at 7:20 AM, Johann Lombardi  wrote:

> Hi,
>
> On Tue, Apr 20, 2010 at 09:08:25AM -0700, Jagga Soorma wrote:
> > Thanks for your response.* I will try to run the leak-finder script and
> > hopefully it will point us in the right direction.* This only seems to be
> > happening on some of my clients:
>
> Could you please tell us what kernel you use on the client side?
>
> >client104: ll_obdo_cache* 0 433506280*** 208** 19*** 1 :
> tunables*
> >120** 60*** 8 : slabdata* 0 22816120* 0
> >client116: ll_obdo_cache* 0 457366746*** 208** 19*** 1 :
> tunables*
> >120** 60*** 8 : slabdata* 0 24071934* 0
> >client113: ll_obdo_cache* 0 456778867*** 208** 19*** 1 :
> tunables*
> >120** 60*** 8 : slabdata* 0 24040993* 0
> >client106: ll_obdo_cache* 0 456372267*** 208** 19*** 1 :
> tunables*
> >120** 60*** 8 : slabdata* 0 24019593* 0
> >client115: ll_obdo_cache* 0 449929310*** 208** 19*** 1 :
> tunables*
> >120** 60*** 8 : slabdata* 0 23680490* 0
> >client101: ll_obdo_cache* 0 454318101*** 208** 19*** 1 :
> tunables*
> >120** 60*** 8 : slabdata* 0 23911479* 0
> >--
> >
> >Hopefully this should help.* Not sure which application might be
> causing
> >the leaks.* Currently R is the only app that users seem to be using
> >heavily on these clients.* Will let you know what I find.
>
> Tommi Tervo has filed a bugzilla ticket for this issue, see
> https://bugzilla.lustre.org/show_bug.cgi?id=22701
>
> Could you please add a comment to this ticket to describe the
> behavior of the application "R" (fork many threads, write to
> many files, use direct i/o, ...)?
>
> Cheers,
> Johann
>
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client and kvm?

2010-05-11 Thread Brian J. Murrell
On Tue, 2010-05-11 at 19:56 +0200, Janne Aho wrote: 
> We run into a slight misfortune, the lustrefs client kernel lacks kvm
> support and it wasn't just straight to compile the kernel just by adding
> the kvm patches.

Are you not using the patchless client?  The patchless client works with
the vendor supplied kernel, so your kernel has whatever your vendor
normally builds into it.

> Does someone have a set of patches to make kvm part of the lustre client
> kernel or even know some already compiled kernels?

I don't know that we "remove" kvm from our patched kernel.  It might be
that we disable the feature when we build it and it could be that you
simply need to enable it and rebuild the kernel.  But if we disable it,
it's entirely possible that it's for good reason.

My suggestion would be to use the patchless client and your choice of
vendor kernel.

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client and kvm?

2010-05-11 Thread Janne Aho
Brian wrote:
> On Tue, 2010-05-11 at 19:56 +0200, Janne Aho wrote: 
>> We run into a slight misfortune, the lustrefs client kernel lacks kvm
>> support and it wasn't just straight to compile the kernel just by adding
>> the kvm patches.
> 
> Are you not using the patchless client?  The patchless client works with
> the vendor supplied kernel, so your kernel has whatever your vendor
> normally builds into it.

I'll tell my tester and see if he gets it working.
We gone with what we found at the Sun, sorry Oracle download site.



>> Does someone have a set of patches to make kvm part of the lustre client
>> kernel or even know some already compiled kernels?
> 
> I don't know that we "remove" kvm from our patched kernel.  It might be
> that we disable the feature when we build it and it could be that you
> simply need to enable it and rebuild the kernel.  But if we disable it,
> it's entirely possible that it's for good reason.

You haven't done anything to remove it, neither anything to add it, as
it didn't become part of the vanilla kernel until 2.6.20 and RHEL/CentOS
are using 2.6.18 with kvm patches.


But we see how it goes with the the patchless client. Thanks for the advice.


-- 
Janne Aho | City Network Hosting AB
Developer
Phone: +46 455 69 00 22
Cell.: +46 733 31 27 75
EMail: ja...@citynetwork.se

ICQ: 567311547 | Skype: janne_mz | AIM: janne4cn
Gadu: 16275665 | MSN: ja...@citynetwork.se

www.citynetwork.se | www.box.se
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre client and kvm?

2010-05-11 Thread Brian J. Murrell
On Tue, 2010-05-11 at 22:59 +0200, Janne Aho wrote: 
> 
> You haven't done anything to remove it, neither anything to add it, as
> it didn't become part of the vanilla kernel until 2.6.20 and RHEL/CentOS
> are using 2.6.18 with kvm patches.

So given that, if it's in the RHEL kernel then it should be in our
patched kernel, barring any specific reason to disable the feature.  We
don't remove code from the upstream kernels other than the patching that
we do to add performance-with-lustre enhancments but we typically don't
do any wholesale remove of entire features.

> But we see how it goes with the the patchless client. Thanks for the advice.

OK.  NP.

b.




signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-05-19 Thread Dmitry Zogin

Hello Jagga,

I checked the data, and indeed this does not look like a lustre memory 
leak, rather than a slab fragmentation, which assumes there might be a 
kernel issue here. From the slabinfo (I only keep three first columns here):


name 
ll_obdo_cache  0 452282156208

means that there are no active objects, but the memory pages are not 
released back from slab allocator to the free pool (the num value is 
huge). That looks like a slab fragmentation - you can get more 
description at

http://kerneltrap.org/Linux/Slab_Defragmentation

Checking your mails, I wonder if this only happens on clients which 
have  SLES11 installed? As the RAM size is around 192Gb, I assume they 
are NUMA systems?

If so, SLES11 has defrag_ratio tunables in /sys/kernel/slab/xxx
From the source of get_any_partial()

#ifdef CONFIG_NUMA

   /*
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations. A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes.
*
* If the defrag_ratio is set to 0 then kmalloc() always
* returns node local objects. If the ratio is higher then kmalloc()
* may return off node objects because partial slabs are obtained
* from other nodes and filled up.
*
* If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
* defrag_ratio = 1000) then every (well almost) allocation will
* first attempt to defrag slab caches on other nodes. This means
* scanning over all nodes to look for partial slabs which may be
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/

Could you please verify that your clients have defrag_ratio tunable and 
try to use various values?
It looks like the value of 100 should be the best, unless there is a 
bug, then may be even 0 gets the desired result?


Best regards,
Dmitry


Jagga Soorma wrote:

Hi Johann,

I am actually using 1.8.1 and not 1.8.2:

# rpm -qa | grep -i lustre
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default

My kernel version on the SLES 11 clients is:
# uname -r
2.6.27.29-0.1-default

My kernel version on the RHEL 5.3 mds/oss servers is:
# uname -r
2.6.18-128.7.1.el5_lustre.1.8.1.1

Please let me know if you need any further information.  I am still 
trying to get the user to help me run his app so that I can run the 
leak finder script to capture more information.


Regards,
-Simran

On Tue, Apr 27, 2010 at 7:20 AM, Johann Lombardi > wrote:


Hi,

On Tue, Apr 20, 2010 at 09:08:25AM -0700, Jagga Soorma wrote:
> Thanks for your response.* I will try to run the leak-finder
script and
> hopefully it will point us in the right direction.* This only
seems to be
> happening on some of my clients:

Could you please tell us what kernel you use on the client side?

>client104: ll_obdo_cache* 0 433506280*** 208** 19***
1 : tunables*
>120** 60*** 8 : slabdata* 0 22816120* 0
>client116: ll_obdo_cache* 0 457366746*** 208** 19***
1 : tunables*
>120** 60*** 8 : slabdata* 0 24071934* 0
>client113: ll_obdo_cache* 0 456778867*** 208** 19***
1 : tunables*
>120** 60*** 8 : slabdata* 0 24040993* 0
>client106: ll_obdo_cache* 0 456372267*** 208** 19***
1 : tunables*
>120** 60*** 8 : slabdata* 0 24019593* 0
>client115: ll_obdo_cache* 0 449929310*** 208** 19***
1 : tunables*
>120** 60*** 8 : slabdata* 0 23680490* 0
>client101: ll_obdo_cache* 0 454318101*** 208** 19***
1 : tunables*
>120** 60*** 8 : slabdata* 0 23911479* 0
>--
>
>Hopefully this should help.* Not sure which application might
be causing
>the leaks.* Currently R is the only app that users seem to be
using
>heavily on these clients.* Will let you know what I find.

Tommi Tervo has filed a bugzilla ticket for this issue, see
https://bugzilla.lustre.org/show_bug.cgi?id=22701

Could you please add a comment to this ticket to describe the
behavior of the application "R" (fork many threads, write to
many files, use direct i/o, ...)?

Cheers,
Johann




___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
  


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-08-26 Thread Jagga Soorma
Hi Dmitry,

I am still running into this issue on some nodes:

client109: ll_obdo_cache  0 152914489208   191 : tunables
 120   608 : slabdata  0 8048131  0
client102: ll_obdo_cache  0 308526883208   191 : tunables
 120   608 : slabdata  0 16238257  0

How can I calculate how much memory this is holding on to.  My system shows
a lot of memory that is being used up but none of the jobs are using that
much memory.  Also, these clients are running a smp sles 11 kernel but I
can't find any /sys/kernel/slab directory.

Linux client102 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200
x86_64 x86_64 x86_64 GNU/Linux

What makes you say that this does not look like a lustre memory leak?  I
thought all the ll_* objects in slabinfo are lustre related?  To me it looks
like lustre is holding on to this memory but I don't know much about lustre
internals.

Also, memused on these systems are:

client102: 2353666940
client109: 2421645924

Any help would be greatly appreciated.

Thanks,
-J

On Wed, May 19, 2010 at 8:08 AM, Dmitry Zogin wrote:

>  Hello Jagga,
>
> I checked the data, and indeed this does not look like a lustre memory
> leak, rather than a slab fragmentation, which assumes there might be a
> kernel issue here. From the slabinfo (I only keep three first columns here):
>
>
> name 
> ll_obdo_cache  0 452282156208
>
> means that there are no active objects, but the memory pages are not
> released back from slab allocator to the free pool (the num value is huge).
> That looks like a slab fragmentation - you can get more description at
> http://kerneltrap.org/Linux/Slab_Defragmentation
>
> Checking your mails, I wonder if this only happens on clients which have
> SLES11 installed? As the RAM size is around 192Gb, I assume they are NUMA
> systems?
> If so, SLES11 has defrag_ratio tunables in /sys/kernel/slab/xxx
> From the source of get_any_partial()
>
> #ifdef CONFIG_NUMA
>
> /*
>  * The defrag ratio allows a configuration of the tradeoffs between
>  * inter node defragmentation and node local allocations. A lower
>  * defrag_ratio increases the tendency to do local allocations
>  * instead of attempting to obtain partial slabs from other nodes.
>  *
>  * If the defrag_ratio is set to 0 then kmalloc() always
>  * returns node local objects. If the ratio is higher then
> kmalloc()
>  * may return off node objects because partial slabs are obtained
>  * from other nodes and filled up.
>  *
>  * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
>  * defrag_ratio = 1000) then every (well almost) allocation will
>  * first attempt to defrag slab caches on other nodes. This means
>  * scanning over all nodes to look for partial slabs which may be
>  * expensive if we do it every time we are trying to find a slab
>  * with available objects.
>  */
>
> Could you please verify that your clients have defrag_ratio tunable and try
> to use various values?
> It looks like the value of 100 should be the best, unless there is a bug,
> then may be even 0 gets the desired result?
>
> Best regards,
> Dmitry
>
>
> Jagga Soorma wrote:
>
> Hi Johann,
>
> I am actually using 1.8.1 and not 1.8.2:
>
> # rpm -qa | grep -i lustre
> lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
> lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
>
> My kernel version on the SLES 11 clients is:
> # uname -r
> 2.6.27.29-0.1-default
>
> My kernel version on the RHEL 5.3 mds/oss servers is:
> # uname -r
> 2.6.18-128.7.1.el5_lustre.1.8.1.1
>
> Please let me know if you need any further information.  I am still trying
> to get the user to help me run his app so that I can run the leak finder
> script to capture more information.
>
> Regards,
> -Simran
>
> On Tue, Apr 27, 2010 at 7:20 AM, Johann Lombardi  wrote:
>
>> Hi,
>>
>> On Tue, Apr 20, 2010 at 09:08:25AM -0700, Jagga Soorma wrote:
>>  > Thanks for your response.* I will try to run the leak-finder script
>> and
>> > hopefully it will point us in the right direction.* This only seems to
>> be
>> > happening on some of my clients:
>>
>>  Could you please tell us what kernel you use on the client side?
>>
>>  >client104: ll_obdo_cache* 0 433506280*** 208** 19*** 1 :
>> tunables*
>> >120** 60*** 8 : slabdata* 0 22816120* 0
>> >client116: ll_obdo_cache* 0 457366746*** 208** 19*** 1 :
>> tunables*
>> >120** 60*** 8 : slabdata* 0 24071934* 0
>> >client113: ll_obdo_cache* 0 456778867*** 208** 19*** 1 :
>> tunables*
>> >120** 60*** 8 : slabdata* 0 24040993* 0
>> >client106: ll_obdo_cache* 0 456372267*** 208** 19*** 1 :
>> tunables*
>> >120** 60*** 8 : slabdata* 0 24019593* 0
>> >client115: ll_obdo_cache* 0 449929310*** 208** 19*** 1 :
>> tunab

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-08-26 Thread Andreas Dilger
On 2010-08-26, at 18:42, Jagga Soorma wrote:
> I am still running into this issue on some nodes:
> 
> client109: ll_obdo_cache  0 152914489208   191 : tunables  
> 120   608 : slabdata  0 8048131  0
> client102: ll_obdo_cache  0 308526883208   191 : tunables  
> 120   608 : slabdata  0 16238257  0
> 
> How can I calculate how much memory this is holding on to.

If you do "head -1 /proc/slabinfo" it reports the column descriptions.

The "slabdata" will section reports numslabs=16238257, and pagesperslab=1, so 
tis is 16238257 pages of memory, or about 64GB of RAM on client102.  Ouch.

>  My system shows a lot of memory that is being used up but none of the jobs 
> are using that much memory.  Also, these clients are running a smp sles 11 
> kernel but I can't find any /sys/kernel/slab directory.  
> 
> Linux client102 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> What makes you say that this does not look like a lustre memory leak?  I 
> thought all the ll_* objects in slabinfo are lustre related?

It's true that the ll_obdo_cache objects are allocated by Lustre, but the above 
data shows 0 of those objects in use, so the kernel _should_ be freeing the 
unused slab objects.  This particular data type (obdo) is only ever in use 
temporarily during system calls on the client, and should never be allocated 
for a long time.

For some reason the kernel is not freeing the empty slab pages.  That is the 
responsibility of the kernel, and not Lustre.

>  To me it looks like lustre is holding on to this memory but I don't know 
> much about lustre internals.
> 
> Also, memused on these systems are:
> 
> client102: 2353666940
> client109: 2421645924

This shows that Lustre is actively using about 2.4GB of memory allocations.  It 
is not tracking the 64GB of memory in the obdo_cache slab, because it has freed 
that memory (even though the kernel has not freed those pages).

> Any help would be greatly appreciated.

The only suggestion I have is that if you unmount Lustre and unload the modules 
(lustre_rmmod) it will free up this memory.  Otherwise, searching for problems 
with the slab cache on this kernel may turn up something.

> On Wed, May 19, 2010 at 8:08 AM, Dmitry Zogin  
> wrote:
> Hello Jagga,
> 
> I checked the data, and indeed this does not look like a lustre memory leak, 
> rather than a slab fragmentation, which assumes there might be a kernel issue 
> here. From the slabinfo (I only keep three first columns here):
> 
> 
> name 
> ll_obdo_cache  0 452282156208
> 
> means that there are no active objects, but the memory pages are not released 
> back from slab allocator to the free pool (the num value is huge). That looks 
> like a slab fragmentation - you can get more description at 
> http://kerneltrap.org/Linux/Slab_Defragmentation
> 
> Checking your mails, I wonder if this only happens on clients which have  
> SLES11 installed? As the RAM size is around 192Gb, I assume they are NUMA 
> systems?
> If so, SLES11 has defrag_ratio tunables in /sys/kernel/slab/xxx
> From the source of get_any_partial()
> 
> #ifdef CONFIG_NUMA
> 
> /*
>  * The defrag ratio allows a configuration of the tradeoffs between
>  * inter node defragmentation and node local allocations. A lower
>  * defrag_ratio increases the tendency to do local allocations
>  * instead of attempting to obtain partial slabs from other nodes.
>  *
>  * If the defrag_ratio is set to 0 then kmalloc() always
>  * returns node local objects. If the ratio is higher then kmalloc()
>  * may return off node objects because partial slabs are obtained
>  * from other nodes and filled up.
>  *
>  * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
>  * defrag_ratio = 1000) then every (well almost) allocation will
>  * first attempt to defrag slab caches on other nodes. This means
>  * scanning over all nodes to look for partial slabs which may be
>  * expensive if we do it every time we are trying to find a slab
>  * with available objects.
>  */
> 
> Could you please verify that your clients have defrag_ratio tunable and try 
> to use various values?
> It looks like the value of 100 should be the best, unless there is a bug, 
> then may be even 0 gets the desired result?
> 
> Best regards,
> Dmitry
> 
> 
> Jagga Soorma wrote:
>> Hi Johann,
>> 
>> I am actually using 1.8.1 and not 1.8.2:
>> 
>> # rpm -qa | grep -i lustre
>> lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
>> lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
>> 
>> My kernel version on the SLES 11 clients is:
>> # uname -r
>> 2.6.27.29-0.1-default
>> 
>> My kernel version on the RHEL 5.3 mds/oss servers is:
>> # uname -r
>> 2.6.18-128.7.1.el5_lustre.1.8.1.1
>> 
>> Please let me know if

Re: [Lustre-discuss] Lustre Client - Memory Issue

2010-08-30 Thread Dmitry Zogin
Actually there was a bug fixed in 1.8.4 when obdo structures can be 
allocated and freed outside of OBDO_ALLOC/OBDO_FREE macros. That could 
lead to the slab fragmentation and pseudo-leak.

The patch is in the attachment 30664 for bz 21980

Dmitry


Andreas Dilger wrote:

On 2010-08-26, at 18:42, Jagga Soorma wrote:
  

I am still running into this issue on some nodes:

client109: ll_obdo_cache  0 152914489208   191 : tunables  120  
 608 : slabdata  0 8048131  0
client102: ll_obdo_cache  0 308526883208   191 : tunables  120  
 608 : slabdata  0 16238257  0

How can I calculate how much memory this is holding on to.



If you do "head -1 /proc/slabinfo" it reports the column descriptions.

The "slabdata" will section reports numslabs=16238257, and pagesperslab=1, so 
tis is 16238257 pages of memory, or about 64GB of RAM on client102.  Ouch.

  
 My system shows a lot of memory that is being used up but none of the jobs are using that much memory.  Also, these clients are running a smp sles 11 kernel but I can't find any /sys/kernel/slab directory.  


Linux client102 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 
x86_64 x86_64 GNU/Linux

What makes you say that this does not look like a lustre memory leak?  I 
thought all the ll_* objects in slabinfo are lustre related?



It's true that the ll_obdo_cache objects are allocated by Lustre, but the above 
data shows 0 of those objects in use, so the kernel _should_ be freeing the 
unused slab objects.  This particular data type (obdo) is only ever in use 
temporarily during system calls on the client, and should never be allocated 
for a long time.

For some reason the kernel is not freeing the empty slab pages.  That is the 
responsibility of the kernel, and not Lustre.

  

 To me it looks like lustre is holding on to this memory but I don't know much 
about lustre internals.

Also, memused on these systems are:

client102: 2353666940
client109: 2421645924



This shows that Lustre is actively using about 2.4GB of memory allocations.  It 
is not tracking the 64GB of memory in the obdo_cache slab, because it has freed 
that memory (even though the kernel has not freed those pages).

  

Any help would be greatly appreciated.



The only suggestion I have is that if you unmount Lustre and unload the modules 
(lustre_rmmod) it will free up this memory.  Otherwise, searching for problems 
with the slab cache on this kernel may turn up something.

  

On Wed, May 19, 2010 at 8:08 AM, Dmitry Zogin  wrote:
Hello Jagga,

I checked the data, and indeed this does not look like a lustre memory leak, 
rather than a slab fragmentation, which assumes there might be a kernel issue 
here. From the slabinfo (I only keep three first columns here):


name 
ll_obdo_cache  0 452282156208

means that there are no active objects, but the memory pages are not released back from slab allocator to the free pool (the num value is huge). That looks like a slab fragmentation - you can get more description at 
http://kerneltrap.org/Linux/Slab_Defragmentation


Checking your mails, I wonder if this only happens on clients which have  
SLES11 installed? As the RAM size is around 192Gb, I assume they are NUMA 
systems?
If so, SLES11 has defrag_ratio tunables in /sys/kernel/slab/xxx
From the source of get_any_partial()

#ifdef CONFIG_NUMA

/*
 * The defrag ratio allows a configuration of the tradeoffs between
 * inter node defragmentation and node local allocations. A lower
 * defrag_ratio increases the tendency to do local allocations
 * instead of attempting to obtain partial slabs from other nodes.
 *
 * If the defrag_ratio is set to 0 then kmalloc() always
 * returns node local objects. If the ratio is higher then kmalloc()
 * may return off node objects because partial slabs are obtained
 * from other nodes and filled up.
 *
 * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
 * defrag_ratio = 1000) then every (well almost) allocation will
 * first attempt to defrag slab caches on other nodes. This means
 * scanning over all nodes to look for partial slabs which may be
 * expensive if we do it every time we are trying to find a slab
 * with available objects.
 */

Could you please verify that your clients have defrag_ratio tunable and try to 
use various values?
It looks like the value of 100 should be the best, unless there is a bug, then 
may be even 0 gets the desired result?

Best regards,
Dmitry


Jagga Soorma wrote:


Hi Johann,

I am actually using 1.8.1 and not 1.8.2:

# rpm -qa | grep -i lustre
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default

My kernel version on the SLES 11 clients is:
# uname -r
2.6.27.29-0.1-default

My kernel v

  1   2   3   >