On 08.01.2014 16:35, Drokin, Oleg wrote:
> On Jan 8, 2014, at 10:15 AM, Oliver Mangold wrote:
>>> Is this some sort of kerberos enabled deployment?
>> No, nothing like that. Even group upcall is disabled.
> There's all this sptlrpc chatter coming out from somewhere that should not be
> there.
> I suspect there might be a bad entry in your MGS config somewhere that throws
> things off.
> You might need to mount the MGS filesystem as ldiskfs in order to remove it;
> it is going to be in the CONFIGS dir.
>
> Hm, in fact I don't think we even test downgrading a 2.5-formatted fs to 2.1,
> so I am not sure if that'll even work.
> My 2.5 configs certainly don't seem to have any sptlrpc entries
> according to my logs, so I am not sure how you got those.
Me neither. Apparently it is the other way round: they are supposed to
be present for 2.1.6, but not for 2.5.0. Checking on an MGS formatted
with 2.1.6, I find 4 empty files in CONFIGS/:

-rw-r--r-- 1 root root     0 Jan  8 17:03 lnec-sptlrpc
-rw-r--r-- 1 root root     0 Jan  8 17:03 lustre-params
-rw-r--r-- 1 root root     0 Jan  8 17:03 lustre-sptlrpc
-rw-r--r-- 1 root root     0 Jan  8 17:03 _mgs-sptlrpc

but creating them also on the 2.5.0 MGS doesn't help. I get the same errors:
> LDISKFS-fs (sdd): mounted filesystem with ordered data mode. Opts:
> LDISKFS-fs (sdd): mounted filesystem with ordered data mode. Opts:
> Lustre: MGS MGS started
> LustreError: 3566:0:(mgc_request.c:76:mgc_name2resid()) missing name: -sptlrpc
> Lustre: 3620:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from a9306de7-1b82-e8f4-3b22-c41158102e5b@0@lo t0 exp (null) cur 1389197742 last 0
> Lustre: MGC10.188.20.31@o2ib: Reactivating import
> LustreError: 3566:0:(mgc_request.c:286:config_log_add()) can't create sptlrpc log: -sptlrpc
> LustreError: 15b-f: MGC10.188.20.31@o2ib: The configuration from log '-params'failed from the MGS (-22).  Make sure this client and the MGS are running compatible versions of Lustre.
> LustreError: 15c-8: MGC10.188.20.31@o2ib: The configuration from log '-params' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> BUG: unable to handle kernel paging request at 00000000deadbeef
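
For reference, the log names in the errors above ('-sptlrpc', '-params') look like the names in CONFIGS/ with the filesystem name missing. A minimal Python sketch of that observation, assuming the names follow a "<fsname>-<suffix>" pattern as in the listing above. split_log_name is a hypothetical helper for illustration only, not a Lustre function, and the pattern is inferred from the listing, not from the Lustre source:

```python
# Hypothetical helper: split a config log name on its last '-' and return
# (fsname, suffix). The "<fsname>-<suffix>" pattern is an assumption inferred
# from the CONFIGS/ listing, not taken from the Lustre source.
def split_log_name(log_name):
    fsname, sep, suffix = log_name.rpartition("-")
    if not sep:
        # No '-' at all: treat the whole name as the suffix.
        return "", log_name
    return fsname, suffix


# Names from the CONFIGS/ listing, followed by the two names from the errors.
names = [
    "lnec-sptlrpc",
    "lustre-params",
    "lustre-sptlrpc",
    "_mgs-sptlrpc",
    "-sptlrpc",
    "-params",
]

for name in names:
    fsname, suffix = split_log_name(name)
    status = "ok" if fsname else "EMPTY fsname"
    print(f"{name!r:18} fsname={fsname!r:10} suffix={suffix!r:10} {status}")
```

The four names from CONFIGS/ split into a non-empty fsname and a suffix, while the two names from the error messages have an empty fsname part, which would be consistent with the "missing name" complaint.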

> Also, rereading your email, can you please elaborate on the 2.5 stability
> issues?
Oh well, running it against 2.5.0 clients, we get more client crashes
than we can expect our customer to tolerate. It seems to be a problem
similar to

https://jira.hpdd.intel.com/browse/LU-3889

but applying this fix

http://review.whamcloud.com/#/c/8405/

didn't help. Running the system against 1.8.9 clients causes server
hangs and crashes (but I understand 1.8.x clients are not supported
anyway). The kernel log of a crashed 2.5 client looks, for example,
like this:

> Lustre: lnec-OST0002-osc-ffff880624d55000: Connection restored to lnec-OST0002 (at 10.188.20.42@o2ib)
> Lustre: Skipped 4 previous similar messages
> Lustre: 1720:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1387297472/real 1387297472] req@ffff88057ae0d000 x1454596248373276/t0(0) o8->lnec-OST0005-osc-ffff880624d55000@10.188.20.41@o2ib:28/4 lens 400/544 e 0 to 1 dl 1387297478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Lustre: 1720:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
> Lustre: 1720:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1387297497/real 1387297497] req@ffff8805ca6d6c00 x1454596248373328/t0(0) o8->lnec-OST0000-osc-ffff880624d55000@10.188.20.41@o2ib:28/4 lens 400/544 e 0 to 1 dl 1387297503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Lustre: 1720:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
> Lustre: lnec-OST0001-osc-ffff880624d55000: Connection restored to lnec-OST0001 (at 10.188.20.42@o2ib)
> Lustre: Skipped 1 previous similar message
> LustreError: 11-0: MGC10.188.20.31@o2ib: Communicating with 10.188.20.32@o2ib, operation obd_ping failed with -107.
> LustreError: 166-1: MGC10.188.20.31@o2ib: Connection to MGS (at 10.188.20.32@o2ib) was lost; in progress operations using this service will fail
> Lustre: Evicted from MGS (at MGC10.188.20.31@o2ib_1) after server handle changed from 0xf5db8c187bc6a8cf to 0xf5db8c187bc90529
> LustreError: 6875:0:(ldlm_resource.c:804:ldlm_resource_complain()) MGC10.188.20.31@o2ib: namespace resource [0x63656e6c:0x2:0x0].0 (ffff8802621a3940) refcount nonzero (1) after lock cleanup; forcing cleanup.
> LustreError: 6875:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x63656e6c:0x2:0x0].0 (ffff8802621a3940) refcount = 2
> LustreError: 6875:0:(ldlm_resource.c:1436:ldlm_resource_dump()) Waiting locks:
> LustreError: 6875:0:(ldlm_resource.c:1438:ldlm_resource_dump()) ### ### ns: MGC10.188.20.31@o2ib lock: ffff8802408e7000/0x25d6bc1ad90ff5ad lrc: 4/1,0 mode: --/CR res: [0x63656e6c:0x2:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0xf5db8c187bc87984 expref: -99 pid: 1761 timeout: 0 lvb_type: 0
> Lustre: MGC10.188.20.31@o2ib: Connection restored to MGS (at 10.188.20.32@o2ib)
> Lustre: Skipped 5 previous similar messages
> LustreError: 11-0: lnec-OST0002-osc-ffff880624d55000: Communicating with 10.188.20.42@o2ib, operation obd_ping failed with -107.
> Lustre: lnec-OST0003-osc-ffff880624d55000: Connection to lnec-OST0003 (at 10.188.20.42@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Lustre: Skipped 7 previous similar messages
> LustreError: Skipped 5 previous similar messages
> Lustre: 1744:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1387300622/real 1387300622] req@ffff8805de67e000 x1454596248395324/t0(0) o400->lnec-MDT0000-mdc-ffff880624d55000@10.188.20.32@o2ib:12/10 lens 224/224 e 0 to 1 dl 1387300629 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Lustre: lnec-MDT0000-mdc-ffff880624d55000: Connection to lnec-MDT0000 (at 10.188.20.32@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Lustre: Skipped 6 previous similar messages
> LustreError: 11-0: lnec-OST000b-osc-ffff880624d55000: Communicating with 10.188.20.44@o2ib, operation obd_ping failed with -107.
> LustreError: Skipped 1 previous similar message
> Lustre: 1720:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1387300629/real 1387300629] req@ffff8805ba627400 x1454596248395424/t0(0) o38->lnec-MDT0000-mdc-ffff880624d55000@10.188.20.32@o2ib:12/10 lens 400/544 e 0 to 1 dl 1387300635 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Lustre: lnec-OST0002-osc-ffff880624d55000: Connection restored to lnec-OST0002 (at 10.188.20.41@o2ib)
> Lustre: lnec-MDT0000-mdc-ffff880624d55000: Connection restored to lnec-MDT0000 (at 10.188.20.31@o2ib)
> Lustre: Skipped 6 previous similar messages
> Lustre: lnec-OST0001-osc-ffff880624d55000: Connection restored to lnec-OST0001 (at 10.188.20.41@o2ib)
> LustreError: 1929:0:(cl_lock.c:1964:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed:
> LustreError: 1929:0:(cl_lock.c:1964:discard_cb()) LBUG
> Pid: 1929, comm: discus
>
> Call Trace:
>  [<ffffffffa0370895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
>  [<ffffffffa0370e97>] lbug_with_loc+0x47/0xb0 [libcfs]
>  [<ffffffffa04c4d64>] discard_cb+0x1c4/0x1d0 [obdclass]
>  [<ffffffffa04c20e4>] cl_page_gang_lookup+0x1d4/0x3d0 [obdclass]
>  [<ffffffffa04c4ba0>] ? discard_cb+0x0/0x1d0 [obdclass]
>  [<ffffffffa04c4ba0>] ? discard_cb+0x0/0x1d0 [obdclass]
>  [<ffffffffa04c4a6e>] cl_lock_discard_pages+0x11e/0x1f0 [obdclass]
>  [<ffffffffa084657f>] osc_lock_flush+0xff/0x280 [osc]
>  [<ffffffffa08467e7>] osc_lock_cancel+0xe7/0x1c0 [osc]
>  [<ffffffffa04c2905>] cl_lock_cancel0+0x75/0x160 [obdclass]
>  [<ffffffffa04c34ab>] cl_lock_cancel+0x13b/0x140 [obdclass]
>  [<ffffffffa0847aba>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
>  [<ffffffffa05c9f7c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
>  [<ffffffffa05d8c7a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
>  [<ffffffffa05dbf6e>] ldlm_cli_cancel_list_local+0xee/0x290 [ptlrpc]
>  [<ffffffffa05d7d50>] ? ldlm_cancel_aged_policy+0x0/0x30 [ptlrpc]
>  [<ffffffffa05dce25>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
>  [<ffffffffa05de29c>] ldlm_prep_elc_req+0x3ec/0x4b0 [ptlrpc]
>  [<ffffffffa05de388>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
>  [<ffffffffa05ed883>] ? ptlrpc_request_alloc+0x13/0x20 [ptlrpc]
>  [<ffffffffa082b623>] osc_enqueue_base+0x113/0x590 [osc]
>  [<ffffffffa04c4f12>] ? cl_lock_mutex_try+0x112/0x120 [obdclass]
>  [<ffffffffa04c73e0>] ? cl_lock_enclosure+0x1d0/0x210 [obdclass]
>  [<ffffffffa084718b>] osc_lock_enqueue+0x1eb/0x870 [osc]
>  [<ffffffffa0848d60>] ? osc_lock_upcall+0x0/0x5e0 [osc]
>  [<ffffffffa04c6a2c>] cl_enqueue_try+0xfc/0x300 [obdclass]
>  [<ffffffffa08d90aa>] lov_lock_enqueue+0x22a/0x850 [lov]
>  [<ffffffffa04c6a2c>] cl_enqueue_try+0xfc/0x300 [obdclass]
>  [<ffffffffa04c7e1f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
>  [<ffffffffa04c8a8e>] cl_lock_request+0x7e/0x270 [obdclass]
>  [<ffffffffa09a47b0>] cl_glimpse_lock+0x180/0x490 [lustre]
>  [<ffffffffa09a5025>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
>  [<ffffffffa0958528>] ll_inode_revalidate_it+0x198/0x1c0 [lustre]
>  [<ffffffff8122fe4b>] ? dentry_has_perm+0x5b/0x80
>  [<ffffffffa0958599>] ll_getattr_it+0x49/0x170 [lustre]
>  [<ffffffffa09586f7>] ll_getattr+0x37/0x40 [lustre]
>  [<ffffffff812274e3>] ? security_inode_getattr+0x23/0x30
>  [<ffffffff8118e981>] vfs_getattr+0x51/0x80
>  [<ffffffff8118ec5f>] vfs_fstat+0x3f/0x60
>  [<ffffffff8118eca4>] sys_newfstat+0x24/0x40
>  [<ffffffff8119e271>] ? sys_ioctl+0x81/0xa0
>  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>
> Kernel panic - not syncing: LBUG
> Pid: 1929, comm: discus Not tainted 2.6.32-431.1.2.0.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff81527213>] ? panic+0xa7/0x16f
>  [<ffffffffa0370eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
>  [<ffffffffa04c4d64>] ? discard_cb+0x1c4/0x1d0 [obdclass]
>  [<ffffffffa04c20e4>] ? cl_page_gang_lookup+0x1d4/0x3d0 [obdclass]
>  [<ffffffffa04c4ba0>] ? discard_cb+0x0/0x1d0 [obdclass]
>  [<ffffffffa04c4ba0>] ? discard_cb+0x0/0x1d0 [obdclass]
>  [<ffffffffa04c4a6e>] ? cl_lock_discard_pages+0x11e/0x1f0 [obdclass]
>  [<ffffffffa084657f>] ? osc_lock_flush+0xff/0x280 [osc]
>  [<ffffffffa08467e7>] ? osc_lock_cancel+0xe7/0x1c0 [osc]
>  [<ffffffffa04c2905>] ? cl_lock_cancel0+0x75/0x160 [obdclass]
>  [<ffffffffa04c34ab>] ? cl_lock_cancel+0x13b/0x140 [obdclass]
>  [<ffffffffa0847aba>] ? osc_ldlm_blocking_ast+0x13a/0x350 [osc]
>  [<ffffffffa05c9f7c>] ? ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
>  [<ffffffffa05d8c7a>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
>  [<ffffffffa05dbf6e>] ? ldlm_cli_cancel_list_local+0xee/0x290 [ptlrpc]
>  [<ffffffffa05d7d50>] ? ldlm_cancel_aged_policy+0x0/0x30 [ptlrpc]
>  [<ffffffffa05dce25>] ? ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
>  [<ffffffffa05de29c>] ? ldlm_prep_elc_req+0x3ec/0x4b0 [ptlrpc]
>  [<ffffffffa05de388>] ? ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
>  [<ffffffffa05ed883>] ? ptlrpc_request_alloc+0x13/0x20 [ptlrpc]
>  [<ffffffffa082b623>] ? osc_enqueue_base+0x113/0x590 [osc]
>  [<ffffffffa04c4f12>] ? cl_lock_mutex_try+0x112/0x120 [obdclass]
>  [<ffffffffa04c73e0>] ? cl_lock_enclosure+0x1d0/0x210 [obdclass]
>  [<ffffffffa084718b>] ? osc_lock_enqueue+0x1eb/0x870 [osc]
>  [<ffffffffa0848d60>] ? osc_lock_upcall+0x0/0x5e0 [osc]
>  [<ffffffffa04c6a2c>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
>  [<ffffffffa08d90aa>] ? lov_lock_enqueue+0x22a/0x850 [lov]
>  [<ffffffffa04c6a2c>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
>  [<ffffffffa04c7e1f>] ? cl_enqueue_locked+0x6f/0x1f0 [obdclass]
>  [<ffffffffa04c8a8e>] ? cl_lock_request+0x7e/0x270 [obdclass]
>  [<ffffffffa09a47b0>] ? cl_glimpse_lock+0x180/0x490 [lustre]
>  [<ffffffffa09a5025>] ? cl_glimpse_size0+0x1a5/0x1d0 [lustre]
>  [<ffffffffa0958528>] ? ll_inode_revalidate_it+0x198/0x1c0 [lustre]
>  [<ffffffff8122fe4b>] ? dentry_has_perm+0x5b/0x80
>  [<ffffffffa0958599>] ? ll_getattr_it+0x49/0x170 [lustre]
>  [<ffffffffa09586f7>] ? ll_getattr+0x37/0x40 [lustre]
>  [<ffffffff812274e3>] ? security_inode_getattr+0x23/0x30
>  [<ffffffff8118e981>] ? vfs_getattr+0x51/0x80
>  [<ffffffff8118ec5f>] ? vfs_fstat+0x3f/0x60
>  [<ffffffff8118eca4>] ? sys_newfstat+0x24/0x40
>  [<ffffffff8119e271>] ? sys_ioctl+0x81/0xa0
>  [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b 
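
In case it helps with reading the client log above, here is a small Python sketch that decodes three kinds of fields from it: the first word of the resource ID [0x63656e6c:0x2:0x0], the negative return codes, and the "sent" timestamps. The function names are made up for this example. Reading the resource word as little-endian ASCII is my interpretation of the value, and the errno names assume the usual Linux numbering:

```python
import errno
from datetime import datetime, timezone


def decode_resource_name(word):
    """Read a 64-bit resource word as little-endian ASCII, dropping NUL padding."""
    return word.to_bytes(8, "little").rstrip(b"\x00").decode("ascii")


def errno_name(rc):
    """Map a negative kernel-style return code to its symbolic errno name."""
    return errno.errorcode.get(abs(rc), "unknown")


def epoch_to_utc(seconds):
    """Format a Unix timestamp (seconds) as a UTC date and time."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


# First word of the resource ID from the ldlm_resource_dump() lines.
print(decode_resource_name(0x63656E6C))

# Return codes seen in the logs: obd_ping failed with -107, config log with -22.
print(errno_name(-107), errno_name(-22))

# "sent" time of the first timed-out request.
print(epoch_to_utc(1387297472))
```

The resource word decodes to "lnec", the same string as the filesystem name in the target names (lnec-OST0002 and so on). On Linux, -107 is ENOTCONN and -22 is EINVAL, and the first timed-out request was sent at 2013-12-17 16:24:32 UTC.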

-- 
Dr. Oliver Mangold
System Analyst
NEC Deutschland GmbH
HPC Division
Hessbrühlstraße 21b
70565 Stuttgart
Germany
Phone: +49 711 78055 13
Mail: oliver.mang...@emea.nec.com
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss