Hi,
this is a bit strange, at least to me!
The problem occurs even when the clocks are all synchronized,
but what misled me is that, under identical conditions, the
problem does not always occur.
The test is always the same: I launch a copy process and then
reboot the OSS server concerned.
These are the logs from a run where the recovery process succeeds:
MGS/MDS
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
r...@ffff81011b6dd400 x1336357074007206/t0 o400->lustre01-ost0000_u...@172.16.100.121
@tcp:28/4 lens 192/384 e 0 to 1 dl 1274688078 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 10:01:43 mdt01prdpom kernel: Lustre: 3690:0:(ldlm_lib.c:
575:target_handle_reconnect()) MGS: 52138191-d920-7519-563b-
ab022b922751 reconnecting
May 24 10:02:17 mdt01prdpom kernel: Lustre: 3559:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 10:02:17 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100....@tcp.
May 24 10:02:17 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
OSS
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 10:01:19 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834049 sent
from mgc172.16.100....@tcp to NID 172.16.100....@tcp 5s ago has timed
out (5s prior to deadline).
May 24 10:01:19 oss1prdpom kernel: r...@ffff81012ec78c00
x1336607320834049/t0 o250->m...@mgc172.16.100.111@tcp_0:26/25 lens
368/584 e 0 to 1 dl 1274688079 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834051 sent
from mgc172.16.100....@tcp to NID 0...@lo 5s ago has timed out (5s prior
to deadline).
May 24 10:01:44 oss1prdpom kernel: r...@ffff81013ff1cc00
x1336607320834051/t0 o250->m...@mgc172.16.100.111@tcp_1:26/25 lens
368/584 e 0 to 1 dl 1274688104 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: LustreError: 3557:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81012d15e400
x1336607320834052/t0 o101->m...@mgc172.16.100.111@tcp_1:26/25 lens
296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: Filtering OBD driver;
http://www.lustre.org/
May 24 10:01:44 oss1prdpom kernel: Lustre: 4036:0:(filter.c:
990:filter_init_server_data()) RECOVERY: service lustre01-OST0000, 2
recoverable clients, 0 delayed clients, last_rcvd 103079215728
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Will be
in recovery for at least 5:00, or until 2 clients reconnect
May 24 10:01:44 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 10:01:44 oss1prdpom kernel: NET: Registered protocol family 31
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 10:01:44 oss1prdpom hidd[4106]: Bluetooth HID daemon
May 24 10:01:45 oss1prdpom kernel: Lustre: mgc172.16.100....@tcp:
Reactivating import
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd[4179]: Red Hat Network Services
Daemon starting up.
May 24 10:02:12 oss1prdpom kernel: Lustre: 3761:0:(ldlm_lib.c:
1788:target_queue_last_replay_reply()) lustre01-OST0000: 1 recoverable
clients remain
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: Recovery
period over after 0:06, of 2 clients 2 recovered and 0 were evicted.
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: sending
delayed replies to recovered clients
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100....@tcp
CLIENT
May 24 09:59:13 mdt02prdpom kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.100....@tcp. The ost_write
operation failed with -107
May 24 09:59:13 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100....@tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 10:00:13 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 10:02:17 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100....@tcp.
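(Aside: the negative status codes in these messages are ordinary Linux errno values, so they are easy to decode. A minimal sketch using Python's standard errno/os modules; the three codes that appear in these logs are -107, -110, and -113:)

```python
import errno
import os

# Decode the negative status codes seen in the Lustre/LNET messages above.
# On Linux: 107 = ENOTCONN, 110 = ETIMEDOUT, 113 = EHOSTUNREACH.
for code in (107, 110, 113):
    name = errno.errorcode.get(code, "?")
    print(f"-{code}: {name} ({os.strerror(code)})")
```

So the -107 on the client is a dropped connection, -110 a timeout, and -113 "no route to host" while the OSS is down.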
These are the logs from a run where the recovery process fails:
MGS/MDS
May 24 09:52:01 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336357074007136 sent
from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed
due to network error (25s prior to deadline).
May 24 09:52:51 mdt01prdpom kernel: r...@ffff81012d814c00
x1336357074007136/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687596 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 21s
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) Skipped 4 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:53:43 mdt01prdpom kernel: Lustre: 5071:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 09:53:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100....@tcp.
May 24 09:53:43 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
May 24 09:54:35 mdt01prdpom kernel: Lustre: MGS: haven't heard from
client b6c79384-6c45-6d9c-ab9b-f12969d74da0 (at 172.16.100....@tcp) in
235 seconds. I think it's dead, and I am evicting it.
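(The eviction message above reflects a simple liveness rule: the server evicts a client it has not heard from within a timeout window. A hypothetical sketch of that bookkeeping; the 235 s figure and the names are illustrative only, not Lustre's actual implementation, where the threshold is derived from obd_timeout and the ping interval:)

```python
# Illustrative ping-evictor-style liveness check (not Lustre source code).
SILENCE_LIMIT = 235.0  # seconds; hypothetical, matching the figure in the log above

def should_evict(last_heard: float, now: float) -> bool:
    """True if the client has been silent longer than the limit."""
    return (now - last_heard) > SILENCE_LIMIT

print(should_evict(last_heard=0.0, now=236.0))  # silent too long: evicted
print(should_evict(last_heard=0.0, now=100.0))  # seen recently: kept
```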
OSS
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth0: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth0 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth1: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth1 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: Lustre: OBD class driver,
http://www.lustre.org/
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:03 oss1prdpom kernel: Lustre: Added LNI
172.16.100....@tcp [8/256/0/180]
May 24 09:53:03 oss1prdpom kernel: Lustre: Accept secure, port 988
May 24 09:53:03 oss1prdpom kernel: Lustre: Lustre Client File System;
http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: init dynlocks cache
May 24 09:53:03 oss1prdpom kernel: ldiskfs created from ext3-2.6-rhel5
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom multipathd: dm-0: umount map (uevent)
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 09:53:03 oss1prdpom kernel: Lustre: mgc172.16.100....@tcp:
Reactivating import
May 24 09:53:03 oss1prdpom kernel: Lustre: Filtering OBD driver;
http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 09:53:04 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 09:53:04 oss1prdpom kernel: NET: Registered protocol family 31
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 09:53:04 oss1prdpom hidd[4108]: Bluetooth HID daemon
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd[4181]: Red Hat Network Services
Daemon starting up.
May 24 09:53:44 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100....@tcp
CLIENT
May 24 09:51:28 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100....@tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 09:51:38 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336351691692671 sent
from lustre01-OST0000-osc-ffff8101337bbc00 to NID 172.16.100....@tcp
0s ago has failed due to network error (27s prior to deadline).
May 24 09:52:22 mdt02prdpom kernel: r...@ffff810109b92c00
x1336351691692671/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687569 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc-ffff8101337bbc00:
tried all connections, increasing latency to 23s
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) Skipped 5 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Skipped 4 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
r...@ffff810023fcfc00 x1336351691692679/t0 o8->lustre01-ost0000_u...@172.16.100.121
@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687626 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) Skipped 4 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81012fa9f000
x1336351691692691/t0 o4->lustre01-ost0000_u...@172.16.100.121@tcp:6/4
lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) Skipped 18 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100....@tcp.
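(Since clock skew came up earlier in this thread, one quick cross-check is to compare the timestamps of correlated events in the two hosts' logs, e.g. the MDS's "Connection restored" line against the OSS's corresponding recovery line. A minimal sketch, assuming plain syslog timestamps and using shortened sample lines from the successful run above:)

```python
from datetime import datetime

def syslog_time(line: str, year: int = 2010) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (syslog omits the year)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

# Correlated events from the successful-recovery logs:
mds_line = "May 24 10:02:17 mdt01prdpom kernel: Lustre: Connection restored"
oss_line = "May 24 10:02:18 oss1prdpom kernel: Lustre: Recovery period over"

offset = (syslog_time(oss_line) - syslog_time(mds_line)).total_seconds()
print(f"apparent clock offset: {offset:.0f}s")  # large values here would point at skew
```

A one-second apparent offset like this is just logging latency; the hours-long differences mentioned earlier in the thread would stand out immediately.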
Thanks!!
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:stefano.elm...@sociale.it
"Under the Italian personal data protection law (Legislative Decree
no. 196/2003), this e-mail is intended solely for the persons named
above, and the information it contains is to be considered strictly
confidential. Reading, copying, using, or disclosing the content of
this e-mail without authorization is prohibited. If you have received
this message in error, please return it to the sender. Thank you."
On 21 May 2010, at 16:56, Andreas Dilger wrote:
On 2010-05-21, at 5:49, Stefano Elmopi <stefano.elm...@sociale.it>
wrote:
I realized that the server time differed considerably across machines;
there were at least a few hours of difference.
I have been running tests and had not been paying attention to time
synchronization, but now I have aligned the time on all servers and
configured the ntpd service, and the problem no longer occurs.
I can only imagine that the cause of the problem was the time
misalignment.
The client and server clocks should have nothing to do with the
functioning of Lustre, so it is surprising that this would be the cause.
On 20 May 2010, at 13:28, Johann Lombardi wrote:
On Thu, May 20, 2010 at 12:29:41PM +0200, Stefano Elmopi wrote:
Hi Andreas
My Lustre version is 1.8.3.
Sorry for my bad English; I used the wrong word, "crash" is not the
right word.
Let me explain better: I start copying a large file onto the file
system, and while the copy is still running I reboot the OSS server;
the copy process then enters the "- stalled -" state.
I expected that once the server was back online, the copy process
would resume normally and complete the copy of the file; instead,
the copy process fails.
Apart from the copy process that fails, Lustre continues to perform
well.
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This
client was
evicted by lustre01-OST0000; in progress operations using this
service
will fail.
The cp process failed because the client got evicted by the OSS.
We need to look at the OSS logs to figure out the root cause of
the eviction.
Johann
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss