Hi,

so it is a bit strange, at least for me!
The problem occurs even when the clocks are all synchronized,
but what I had missed is that, under the same conditions, the problem does not always occur. The test is always the same: launch a copy process and then reboot the OSS server concerned.
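For reference, the test can be sketched as a small shell helper that checks whether recovery finished on the rebooted OSS. This is only an illustrative sketch: the target name `lustre01-OST0000` comes from the logs below, but the paths, the `parse_status` helper and the sample `recovery_status` text are assumptions, not output captured from this system.

```shell
#!/bin/sh
# Sketch of the test (names and paths are placeholders):
#
#   client$ cp /path/to/bigfile /mnt/lustre/ &    # 1. start a large copy
#   oss$    reboot                                # 2. reboot the OSS mid-copy
#   oss$    lctl get_param -n obdfilter.lustre01-OST0000.recovery_status
#
# Helper: extract the "status:" field from recovery_status output.
parse_status() {
  awk '/^status:/ { print $2 }'
}

# Demo against a hand-written sample of what recovery_status can look like
# (on a real OSS you would pipe the lctl output through parse_status instead):
sample='status: COMPLETE
recovery_start: 1274688104
recovery_duration: 6
completed_clients: 2/2
evicted_clients: 0'
printf '%s\n' "$sample" | parse_status   # prints COMPLETE
```

In the failing case one would expect `evicted_clients` to be non-zero, matching the "This client was evicted" messages in the logs below.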
These are the logs when the recovery process is successful:

MGS/MDS
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:2441:LNetPut()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@ffff81011b6dd400 x1336357074007206/t0 o400->lustre01-ost0000_u...@172.16.100.121@tcp:28/4 lens 192/384 e 0 to 1 dl 1274688078 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:66:request_out_callback()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 10:01:43 mdt01prdpom kernel: Lustre: 3690:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: 52138191-d920-7519-563b-ab022b922751 reconnecting
May 24 10:02:17 mdt01prdpom kernel: Lustre: 3559:0:(quota_master.c:1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota recovery
May 24 10:02:17 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100....@tcp.
May 24 10:02:17 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans

OSS
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 10:01:19 oss1prdpom kernel: Lustre: 3640:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834049 sent from mgc172.16.100....@tcp to NID 172.16.100....@tcp 5s ago has timed out (5s prior to deadline).
May 24 10:01:19 oss1prdpom kernel: r...@ffff81012ec78c00 x1336607320834049/t0 o250->m...@mgc172.16.100.111@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1274688079 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: 3640:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834051 sent from mgc172.16.100....@tcp to NID 0...@lo 5s ago has timed out (5s prior to deadline).
May 24 10:01:44 oss1prdpom kernel: r...@ffff81013ff1cc00 x1336607320834051/t0 o250->m...@mgc172.16.100.111@tcp_1:26/25 lens 368/584 e 0 to 1 dl 1274688104 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: LustreError: 3557:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81012d15e400 x1336607320834052/t0 o101->m...@mgc172.16.100.111@tcp_1:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 10:01:44 oss1prdpom kernel: Lustre: 4036:0:(filter.c:990:filter_init_server_data()) RECOVERY: service lustre01-OST0000, 2 recoverable clients, 0 delayed clients, last_rcvd 103079215728
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Now serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect
May 24 10:01:44 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 10:01:44 oss1prdpom kernel: NET: Registered protocol family 31
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI device and connection manager initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI socket layer initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP socket layer initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HIDP (Human Interface Emulation) ver 1.1
May 24 10:01:44 oss1prdpom hidd[4106]: Bluetooth HID daemon
May 24 10:01:45 oss1prdpom kernel: Lustre: mgc172.16.100....@tcp: Reactivating import
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd[4179]: Red Hat Network Services Daemon starting up.
May 24 10:02:12 oss1prdpom kernel: Lustre: 3761:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre01-OST0000: 1 recoverable clients remain
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: Recovery period over after 0:06, of 2 clients 2 recovered and 0 were evicted.
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: sending delayed replies to recovered clients
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: received MDS connection from 172.16.100....@tcp


CLIENT
May 24 09:59:13 mdt02prdpom kernel: LustreError: 11-0: an error occurred while communicating with 172.16.100....@tcp. The ost_write operation failed with -107
May 24 09:59:13 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 10:00:13 mdt02prdpom kernel: Lustre: There was an unexpected network error while writing to 172.16.100.121: -110.
May 24 10:02:17 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection restored to service lustre01-OST0000 using nid 172.16.100....@tcp.


These are the logs when the recovery process goes wrong:

MGS/MDS
May 24 09:52:01 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336357074007136 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed due to network error (25s prior to deadline).
May 24 09:52:51 mdt01prdpom kernel: r...@ffff81012d814c00 x1336357074007136/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687596 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 21s
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 24 09:53:43 mdt01prdpom kernel: Lustre: 5071:0:(quota_master.c:1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota recovery
May 24 09:53:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100....@tcp.
May 24 09:53:43 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
May 24 09:54:35 mdt01prdpom kernel: Lustre: MGS: haven't heard from client b6c79384-6c45-6d9c-ab9b-f12969d74da0 (at 172.16.100....@tcp) in 235 seconds. I think it's dead, and I am evicting it.

OSS
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
May 24 09:53:02 oss1prdpom kernel: Lustre: OBD class driver, http://www.lustre.org/
May 24 09:53:02 oss1prdpom kernel: Lustre:     Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre:     Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:03 oss1prdpom kernel: Lustre: Added LNI 172.16.100....@tcp [8/256/0/180]
May 24 09:53:03 oss1prdpom kernel: Lustre: Accept secure, port 988
May 24 09:53:03 oss1prdpom kernel: Lustre: Lustre Client File System; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: init dynlocks cache
May 24 09:53:03 oss1prdpom kernel: ldiskfs created from ext3-2.6-rhel5
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
May 24 09:53:03 oss1prdpom multipathd: dm-0: umount map (uevent)
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 09:53:03 oss1prdpom kernel: Lustre: mgc172.16.100....@tcp: Reactivating import
May 24 09:53:03 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: Lustre: lustre01-OST0000: Now serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 09:53:04 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 09:53:04 oss1prdpom kernel: NET: Registered protocol family 31
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI device and connection manager initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI socket layer initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP socket layer initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HIDP (Human Interface Emulation) ver 1.1
May 24 09:53:04 oss1prdpom hidd[4108]: Bluetooth HID daemon
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd[4181]: Red Hat Network Services Daemon starting up.
May 24 09:53:44 oss1prdpom kernel: Lustre: lustre01-OST0000: received MDS connection from 172.16.100....@tcp


CLIENT
May 24 09:51:28 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 09:51:38 mdt02prdpom kernel: Lustre: There was an unexpected network error while writing to 172.16.100.121: -110.
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336351691692671 sent from lustre01-OST0000-osc-ffff8101337bbc00 to NID 172.16.100....@tcp 0s ago has failed due to network error (27s prior to deadline).
May 24 09:52:22 mdt02prdpom kernel: r...@ffff810109b92c00 x1336351691692671/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687569 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc-ffff8101337bbc00: tried all connections, increasing latency to 23s
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:517:import_select_connection()) Skipped 5 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:2441:LNetPut()) Skipped 4 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@ffff810023fcfc00 x1336351691692679/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687626 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:66:request_out_callback()) Skipped 4 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81012fa9f000 x1336351691692691/t0 o4->lustre01-ost0000_u...@172.16.100.121@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 18 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection restored to service lustre01-OST0000 using nid 172.16.100....@tcp.


Thanks !!



Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma

cell. 3466147165
tel.  0657060500
email:stefano.elm...@sociale.it

"Pursuant to the personal data protection law (Legislative Decree no. 196/2003), this e-mail is intended solely for the persons indicated above, and the information it contains is to be considered strictly confidential. Reading, copying, using or disclosing the content of this e-mail without authorization is prohibited. If you have received this message in
error, please return it to the sender. Thank you."

On 21 May 2010, at 16:56, Andreas Dilger wrote:


On 2010-05-21, at 5:49, Stefano Elmopi <stefano.elm...@sociale.it> wrote:

I realized that the time differed significantly between the servers;
there were at least a few hours of difference.
I had been running the tests without paying attention to time synchronization, but now I have aligned the time on all the servers and configured the ntpd service,
and the problem no longer occurs.
I can only imagine that the cause of the problem was the time misalignment.

The client and server clocks should have nothing to do with the functioning of Lustre, so it is surprising that this would be the cause.


On 20 May 2010, at 13:28, Johann Lombardi wrote:

On Thu, May 20, 2010 at 12:29:41PM +0200, Stefano Elmopi wrote:
Hi Andreas
My version of Lustre is 1.8.3.
Sorry for my bad English, but I used the wrong word: "crash" is not the
right word.
Let me explain better: I start copying a large file onto the file system
and, while the copy is in progress, I reboot the OSS server,
and the copy process enters the "- stalled -" state.
I expected that once the server was back online the copy process would
resume normally and complete the copy of the file; instead, the copy process fails.
Apart from the copy process that goes wrong, Lustre continues to work
fine.

May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.

The cp process failed because the client got evicted by the OSS.
We need to look at the OSS logs to figure out the root cause of
the eviction.

Johann


_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
