Good afternoon,

I think you've got to unwind this a bit.  You've got a massive number of 
communication errors - I'd start there and try to analyze those.  You've also 
got nodes trying to reach the failover partners of some of your OSTs - Are the 
OSSes dying?  (That could cause the communication errors.)  Or is it simply 
because the clients can't reliably communicate with them?

It's extremely likely that everything flows from the communication errors or 
their immediate cause.  For example, they're likely causing the evictions.

I'd start with those and concentrate on them.  There should be a bit more info,
either from the clients reporting the errors or from the nodes they're trying
to connect to.
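
If it helps, a quick-and-dirty sketch like the one below (assuming syslog-style
logs in /var/log/messages; the patterns are just the ones quoted in the message
below, so adjust as needed) will tally the error types and show which peers
turn up most often -- usually a decent way to see where the communication
problems are concentrated.

import re
from collections import Counter

# Message fragments taken from the log excerpts quoted below.
PATTERNS = {
    "rpc_timeout":  re.compile(r"ptlrpc_expire_one_request"),
    "cant_resolve": re.compile(r"Can't resolve addr for ([^\s:]+)"),
    "no_target":    re.compile(r"not available for connect from (\S+)"),
    "eviction":     re.compile(r"haven't heard from client"),
}

counts = Counter()
peers = Counter()

with open("/var/log/messages") as log:   # assumed log location
    for line in log:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if match.groups():
                    peers[match.group(1)] += 1

print("errors by type:", dict(counts))
print("busiest peers:", peers.most_common(10))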

- Patrick

________________________________
From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Exec Unerd [execun...@gmail.com]
Sent: Friday, October 16, 2015 4:23 PM
To: Lustre discussion
Subject: [lustre-discuss] OSS Panics in ptlrpc_prep_bulk_page

We have a smallish cluster -- a few thousand cores on the client side; four 
OSSs on the Lustre server side.

Under otherwise normal operations, some of the clients will stop being able to 
find some of the OSTs.

When this happens, the OSSs start seeing an escalating error count. As more 
clients hit this condition, we start seeing tens of thousands of errors of the 
following sort on the OSS, eventually resulting in a kernel panic on the OSS 
with what look like LNet/ptlrpc messages.

We have tried this with clients running v2.5.3 and v2.7.55. The OSSs are 
running v2.7.55. The kernel on the OSS side is based on RHEL's 
2.6.32-504.23.4.el6.x86_64, with the 2.7.55 server patches of course.

OSS panic message:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( 
pageoffset + len <= ((1UL) << 12) ) failed:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
Kernel panic - not syncing: LBUG
Pid: 31929, comm: ll_ost_io00_050 Tainted: P
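
(As I read it, the assertion is just enforcing that each fragment added to the
bulk descriptor fits within a single 4 KiB page -- ((1UL) << 12) is 4096. In
Python terms it's roughly the toy check below, with made-up fragment values:)

PAGE_SIZE = 1 << 12  # 4096, the ((1UL) << 12) from the assertion above

def prep_bulk_fragment(pageoffset, length):
    # Rough rendering of the check in __ptlrpc_prep_bulk_page: a bulk I/O
    # fragment must end inside the 4 KiB page it starts in.
    assert pageoffset + length <= PAGE_SIZE, \
        "pageoffset + len = %d > %d" % (pageoffset + length, PAGE_SIZE)

prep_bulk_fragment(0, 4096)     # fine: exactly one full page
prep_bulk_fragment(1024, 4096)  # made-up values; fails like the LBUG above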

We think this is because the clients (randomly) are unable to find the OSTs. 
The clients show messages like the following:
Oct 15 23:08:29 client00 kernel: Lustre: 
60196:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1444964873/real 1444964873] req@ffff8801026d1400 
x1514474144942020/t0(0) 
o8->fs00-OST012d-osc-ffff88041640a000@172.18.83.180@o2ib:28/4 lens 400/544 e 0 
to 1 dl 1444964909 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 23:12:53 client00 kernel: LNetError: 
60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Can't resolve addr for 
172.16.10.12@o2ib: -19
Oct 15 23:12:53 client00 kernel: LNetError: 
60196:0:(o2iblnd_cb.c:1322:kiblnd_connect_peer()) Skipped 385 previous similar 
messages

It says "Can't resolve addr", but they can resolve the address of the OSS via 
DNS, so I don't know what "resolve" means in this context
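
(The only clue I have: -19 is the kernel errno ENODEV, "No such device", which
makes me suspect "resolve" in kiblnd_connect_peer means RDMA/IB address
resolution on the o2ib fabric rather than DNS -- but I'm not sure. The errno
mapping itself is easy to check:)

import errno, os

code = 19                     # the -19 from the LNetError above
print(errno.errorcode[code])  # 'ENODEV'
print(os.strerror(code))      # 'No such device'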

The OSTs are always actually available on the OSSs, and the vast majority (say, 
99%) of the clients can always talk to them, even while a few clients are 
showing the above errors.

It's just that, inexplicably, some of the clients sometimes won't connect to 
some of the OSTs even though everybody else can.

We see a ton of the following throughout the day on the OSSs, even when the 
OSSs are all up and seem to be serving data without issue:
Oct 15 05:12:23 OSS02 kernel: LustreError: 137-5: fs00-OST00c9_UUID: not 
available for connect from [IP]@o2ib (no target). If you are running an HA pair 
check that the target is mounted on the other server.
Oct 15 05:12:23 OSS02 kernel: LustreError: Skipped 700 previous similar messages

This appears to show lots of clients trying to reach "fs00-OST00c9" via an OSS 
that (a) is a valid HA service node for that OST, but (b) isn't actually the 
one serving it at the moment. So we'd expect the client to move on to the next 
service node and find the OST there, which is what 99% of the clients actually 
do. But randomly, some of the clients just keep cycling through the available 
service nodes and never find the OSTs.
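
(If it's useful, the stuck client's own view of its import for that OST should
show the connection state and which NID it is currently trying. Roughly the
sketch below -- assuming "lctl get_param osc.*.import" is available on these
client versions and that I'm remembering the field names correctly:)

import subprocess

# Assumed parameter and field names; prints each OSC import's state, its
# failover NIDs, and the connection it is currently trying.
out = subprocess.run(
    ["lctl", "get_param", "osc.*.import"],
    capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    stripped = line.strip()
    if stripped.startswith(("osc.", "state:", "failover_nids:",
                            "current_connection:")):
        print(stripped)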

We also see a lot of eviction notices throughout the day on all servers (MDS 
and OSS).
Oct 15 23:23:45 MDS00 kernel: Lustre: fs00-MDT0000: haven't heard from client 
ac1445c9-2178-b3c9-c701-d6ff83e13210 (at [IP]@o2ib) in 227 seconds. I think 
it's dead, and I am evicting it. exp ffff882028bd6800, cur 1444965825 expire 
1444965675 last 1444965598
Oct 15 23:23:45 MDS00 kernel: Lustre: Skipped 10 previous similar messages

We're pretty sure the above is a totally unrelated issue, but it is putting 
additional pressure on the OSSs. Add it all up, and the storage cluster could 
be getting >10k errors in a given second.

Eventually, the glut of invalid client attempts of each type results in a 
kernel panic on the OSS, usually referencing ptlrpc_prep_bulk_page like the 
one below.

LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) ASSERTION( 
pageoffset + len <= ((1UL) << 12) ) failed:
LustreError: 31929:0:(client.c:210:__ptlrpc_prep_bulk_page()) LBUG
Kernel panic - not syncing: LBUG
Pid: 31929, comm: ll_ost_io00_050 Tainted: P           ---------------    
2.6.32-504.23.4.el6.x86_64 #1
Call Trace:
 [<ffffffff8152931c>] ? panic+0xa7/0x16f
 [<ffffffffa097becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0cae5e8>] ? __ptlrpc_prep_bulk_page+0x118/0x1e0 [ptlrpc]
 [<ffffffffa0cae6c1>] ? ptlrpc_prep_bulk_page_nopin+0x11/0x20 [ptlrpc]
 [<ffffffffa0d2c162>] ? tgt_brw_read+0xa92/0x11d0 [ptlrpc]
 [<ffffffffa0cbfa0b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [<ffffffffa0cbfb46>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
 [<ffffffffa098968a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffffa0d2994c>] ? tgt_request_handle+0xa4c/0x1290 [ptlrpc]
 [<ffffffffa0cd15b1>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
 [<ffffffff81529a1e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0cd0770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
 [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is b0)
dmar: DRHD: handling fault status reg 2
dmar: INTR-REMAP: Request device [[82:00.0] fault index 48
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: INTR-REMAP: Request device [[82:00.0] fault index 4a
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: DRHD: handling fault status reg 200

I've been trying to find information on this sort of thing, but it's not 
exactly a common problem. :-( Thanks for your time and assistance.
