Hi,
this is a bit strange, at least to me!
The problem occurs even when the clocks are all synchronized,
but what misled me is that, under identical conditions, the
problem does not always occur.
The test is always the same: I launch a copy process and then
reboot the OSS server concerned.
These are the logs from a run where the recovery process succeeds:
MGS/MDS
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
r...@ffff81011b6dd400 x1336357074007206/t0 o400->lustre01-ost0000_u...@172.16.100.121
@tcp:28/4 lens 192/384 e 0 to 1 dl 1274688078 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 10:01:43 mdt01prdpom kernel: Lustre: 3690:0:(ldlm_lib.c:
575:target_handle_reconnect()) MGS: 52138191-d920-7519-563b-
ab022b922751 reconnecting
May 24 10:02:17 mdt01prdpom kernel: Lustre: 3559:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 10:02:17 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100....@tcp.
May 24 10:02:17 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
OSS
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 10:01:19 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834049 sent
from mgc172.16.100....@tcp to NID 172.16.100....@tcp 5s ago has timed
out (5s prior to deadline).
May 24 10:01:19 oss1prdpom kernel: r...@ffff81012ec78c00
x1336607320834049/t0 o250->m...@mgc172.16.100.111@tcp_0:26/25 lens
368/584 e 0 to 1 dl 1274688079 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834051 sent
from mgc172.16.100....@tcp to NID 0...@lo 5s ago has timed out (5s prior
to deadline).
May 24 10:01:44 oss1prdpom kernel: r...@ffff81013ff1cc00
x1336607320834051/t0 o250->m...@mgc172.16.100.111@tcp_1:26/25 lens
368/584 e 0 to 1 dl 1274688104 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: LustreError: 3557:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81012d15e400
x1336607320834052/t0 o101->m...@mgc172.16.100.111@tcp_1:26/25 lens
296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: Filtering OBD driver;
http://www.lustre.org/
May 24 10:01:44 oss1prdpom kernel: Lustre: 4036:0:(filter.c:
990:filter_init_server_data()) RECOVERY: service lustre01-OST0000, 2
recoverable clients, 0 delayed clients, last_rcvd 103079215728
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Will be
in recovery for at least 5:00, or until 2 clients reconnect
May 24 10:01:44 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 10:01:44 oss1prdpom kernel: NET: Registered protocol family 31
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 10:01:44 oss1prdpom hidd[4106]: Bluetooth HID daemon
May 24 10:01:45 oss1prdpom kernel: Lustre: mgc172.16.100....@tcp:
Reactivating import
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd[4179]: Red Hat Network Services
Daemon starting up.
May 24 10:02:12 oss1prdpom kernel: Lustre: 3761:0:(ldlm_lib.c:
1788:target_queue_last_replay_reply()) lustre01-OST0000: 1 recoverable
clients remain
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: Recovery
period over after 0:06, of 2 clients 2 recovered and 0 were evicted.
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: sending
delayed replies to recovered clients
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100....@tcp
CLIENT
May 24 09:59:13 mdt02prdpom kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.100....@tcp. The ost_write
operation failed with -107
May 24 09:59:13 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100....@tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 10:00:13 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 10:02:17 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100....@tcp.
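(Aside: the negative status codes in these messages are ordinary Linux errno values, so they are easy to decode. A minimal sketch using Python's standard errno/os modules; the three codes that appear in these logs are -107, -110, and -113:)

```python
import errno
import os

# Decode the negative status codes seen in the Lustre/LNET messages above.
# On Linux: 107 = ENOTCONN, 110 = ETIMEDOUT, 113 = EHOSTUNREACH.
for code in (107, 110, 113):
    name = errno.errorcode.get(code, "?")
    print(f"-{code}: {name} ({os.strerror(code)})")
```

So the -107 on the client is a dropped connection, -110 a timeout, and -113 "no route to host" while the OSS is down.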
These are the logs from a run where the recovery process fails:
MGS/MDS
May 24 09:52:01 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336357074007136 sent
from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed
due to network error (25s prior to deadline).
May 24 09:52:51 mdt01prdpom kernel: r...@ffff81012d814c00
x1336357074007136/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687596 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 21s
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) Skipped 4 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:53:43 mdt01prdpom kernel: Lustre: 5071:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 09:53:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100....@tcp.
May 24 09:53:43 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
May 24 09:54:35 mdt01prdpom kernel: Lustre: MGS: haven't heard from
client b6c79384-6c45-6d9c-ab9b-f12969d74da0 (at 172.16.100....@tcp) in
235 seconds. I think it's dead, and I am evicting it.
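(The eviction message above reflects a simple liveness rule: the server evicts a client it has not heard from within a timeout window. A hypothetical sketch of that bookkeeping; the 235 s figure and the names are illustrative only, not Lustre's actual implementation, where the threshold is derived from obd_timeout and the ping interval:)

```python
# Illustrative ping-evictor-style liveness check (not Lustre source code).
SILENCE_LIMIT = 235.0  # seconds; hypothetical, matching the figure in the log above

def should_evict(last_heard: float, now: float) -> bool:
    """True if the client has been silent longer than the limit."""
    return (now - last_heard) > SILENCE_LIMIT

print(should_evict(last_heard=0.0, now=236.0))  # silent too long: evicted
print(should_evict(last_heard=0.0, now=100.0))  # seen recently: kept
```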
OSS
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth0: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth0 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth1: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth1 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: Lustre: OBD class driver,
http://www.lustre.org/
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:03 oss1prdpom kernel: Lustre: Added LNI
172.16.100....@tcp [8/256/0/180]
May 24 09:53:03 oss1prdpom kernel: Lustre: Accept secure, port 988
May 24 09:53:03 oss1prdpom kernel: Lustre: Lustre Client File System;
http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: init dynlocks cache
May 24 09:53:03 oss1prdpom kernel: ldiskfs created from ext3-2.6-rhel5
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom multipathd: dm-0: umount map (uevent)
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 09:53:03 oss1prdpom kernel: Lustre: mgc172.16.100....@tcp:
Reactivating import
May 24 09:53:03 oss1prdpom kernel: Lustre: Filtering OBD driver;
http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 09:53:04 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 09:53:04 oss1prdpom kernel: NET: Registered protocol family 31
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 09:53:04 oss1prdpom hidd[4108]: Bluetooth HID daemon
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd[4181]: Red Hat Network Services
Daemon starting up.
May 24 09:53:44 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100....@tcp
CLIENT
May 24 09:51:28 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100....@tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 09:51:38 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336351691692671 sent
from lustre01-OST0000-osc-ffff8101337bbc00 to NID 172.16.100....@tcp
0s ago has failed due to network error (27s prior to deadline).
May 24 09:52:22 mdt02prdpom kernel: r...@ffff810109b92c00
x1336351691692671/t0 o8->lustre01-ost0000_u...@172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687569 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc-ffff8101337bbc00:
tried all connections, increasing latency to 23s
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) Skipped 5 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Skipped 4 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
r...@ffff810023fcfc00 x1336351691692679/t0 o8->lustre01-ost0000_u...@172.16.100.121
@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687626 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) Skipped 4 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81012fa9f000
x1336351691692691/t0 o4->lustre01-ost0000_u...@172.16.100.121@tcp:6/4
lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) Skipped 18 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100....@tcp.
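(Since clock skew came up earlier in this thread, one quick cross-check is to compare the timestamps of correlated events in the two hosts' logs, e.g. the MDS's "Connection restored" line against the OSS's corresponding recovery line. A minimal sketch, assuming plain syslog timestamps and using shortened sample lines from the successful run above:)

```python
from datetime import datetime

def syslog_time(line: str, year: int = 2010) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (syslog omits the year)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

# Correlated events from the successful-recovery logs:
mds_line = "May 24 10:02:17 mdt01prdpom kernel: Lustre: Connection restored"
oss_line = "May 24 10:02:18 oss1prdpom kernel: Lustre: Recovery period over"

offset = (syslog_time(oss_line) - syslog_time(mds_line)).total_seconds()
print(f"apparent clock offset: {offset:.0f}s")  # large values here would point at skew
```

A one-second apparent offset like this is just logging latency; the hours-long differences mentioned earlier in the thread would stand out immediately.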
Thanks!!
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:stefano.elm...@sociale.it
"Under the Italian personal data protection law (Legislative Decree
no. 196/2003), this e-mail is intended solely for the persons named
above, and the information it contains is to be considered strictly
confidential. Reading, copying, using, or disclosing the content of
this e-mail without authorization is prohibited. If you have received
this message in error, please return it to the sender. Thank you."
On 21 May 2010, at 16:56, Andreas Dilger wrote:
On 2010-05-21, at 5:49, Stefano Elmopi <stefano.elm...@sociale.it>
wrote:
I realized that the server time differed considerably across machines;
there were at least a few hours of difference.
I have been running tests and had not been paying attention to time
synchronization, but now I have aligned the time on all servers and
configured the ntpd service, and the problem no longer occurs.
I can only imagine that the cause of the problem was the time
misalignment.
The client and server clocks should have nothing to do with the
functioning of Lustre, so it is surprising that this would be the cause.
On 20 May 2010, at 13:28, Johann Lombardi wrote:
On Thu, May 20, 2010 at 12:29:41PM +0200, Stefano Elmopi wrote:
Hi Andreas
My Lustre version is 1.8.3.
Sorry for my bad English; I used the wrong word, "crash" is not the
right word.
Let me explain better: I start copying a large file onto the file
system, and while the copy is still running I reboot the OSS server;
the copy process then enters the "- stalled -" state.
I expected that once the server was back online, the copy process
would resume normally and complete the copy of the file; instead,
the copy process fails.
Apart from the copy process that fails, Lustre continues to perform
well.
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This
client was
evicted by lustre01-OST0000; in progress operations using this
service
will fail.
The cp process failed because the client got evicted by the OSS.
We need to look at the OSS logs to figure out the root cause of
the eviction.
Johann
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss