Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11472



see also:  bug#9423

Heavier loads on our alc/ltest Lnet router nodes are producing
many messages of the type:

2006-12-20 20:36:22 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send()) Can't
get txd for msg type 1 for [EMAIL PROTECTED]
2006-12-20 20:36:22 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send())
Skipped 53 previous similar messages
2006-12-20 20:36:49 LustreError: 24521:0:(qswlnd_cb.c:890:kqswnal_rdma()) Can't
get txd for RDMA with [EMAIL PROTECTED]
2006-12-20 20:36:49 LustreError: 24521:0:(qswlnd_cb.c:890:kqswnal_rdma())
Skipped 2 previous similar messages
2006-12-20 20:42:38 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send()) Can't
get txd for msg type 2 for [EMAIL PROTECTED]
2006-12-20 20:42:38 LustreError: 24509:0:(qswlnd_cb.c:1023:kqswnal_send())
Skipped 145 previous similar messages
2006-12-20 20:42:38 LustreError: 956:0:(qswlnd_cb.c:890:kqswnal_rdma()) Can't
get txd for RDMA with [EMAIL PROTECTED]
2006-12-20 20:42:38 LustreError: 956:0:(qswlnd_cb.c:890:kqswnal_rdma()) Skipped
4 previous similar messages


Eric Barton commented:
The "Can't get txd for..." qswlnd messages mean that you need to increase
"ntxmsgs".  But before that, it would help to see all the LNET/LND module
parameters everywhere.  "ntxmsgs" is a finite parameter for the qswlnd since
it can consume all of elan VM (each tx reserves 1 Mbyte of elan VM in case
it needs to map a data buffer), but pressure can be ameliorated by playing
with "credits" and "peer_credits".

Maybe it's worth opening a bugzilla bug until this is understood.













sample Lnet router:
# alc14 /root > cat /etc/modprobe.conf
###############################################################################
# $URL: svn+ssh://tyche.llnl.gov/var/svn/dist/kernel/modprobe.conf.alc.io.ltest 
$
# $Author: marc $
# $Date: 2006-12-15 15:11:15 -0800 (Fri, 15 Dec 2006) $
# $Rev: 4324 $
###############################################################################
#

alias scsi_hostadapter0 mptscsih
alias scsi_hostadapter1 mptscsih

alias eth0 e1000
alias eth1 e1000
alias eth2 e1000
alias eth3 e1000


options elan3 DownLinks=0x44644
alias eip0 eip

# Turn off IPv6
alias net-pf-10 off
alias ipv6 off

# Lustre modules for clients
options lnet networks="tcp0(eth1,eth0),elan0" \
             forwarding="enabled" \
             tiny_router_buffers=2048 \
             small_router_buffers=2048 \
             large_router_buffers=512 \
             hosts_allow="172.16.[70-71].[147-163]; 172.16.[2,10].[101-123,231]"
options e1000 RxDescriptors=4096,4096,256,256
options ksocklnd keepalive_intvl=5 keepalive_count=100 keepalive_idle=30


sample ALC client:
# alc200 /root > cat /etc/modprobe.conf
###############################################################################
# $URL:
svn+ssh://tyche.llnl.gov/var/svn/dist/kernel/modprobe.conf.alc.compute.ltest $
# $Author: wartens2 $
# $Date: 2006-10-18 15:32:37 -0700 (Wed, 18 Oct 2006) $
# $Rev: 3652 $
###############################################################################
#

alias scsi_hostadapter mptscsih

alias eth0 tg3
alias eth1 tg3

options elan3 DownLinks=0x44644
alias eip0 eip

# Turn off IPv6
alias net-pf-10 off
alias ipv6 off

# Lustre modules for clients
alias lustre llite
options lnet networks="elan0" \
             routes="tcp0 [EMAIL PROTECTED]; tcp1 [EMAIL PROTECTED]; tcp2 
[EMAIL PROTECTED]" \
             live_router_check_interval=60 \
             dead_router_check_interval=60 \
             check_routers_before_use=1
#install kptlrouter /sbin/modprobe portals; /sbin/modprobe --ignore-install
kptlrouter
#install kqswnal /sbin/modprobe kptlrouter; /sbin/modprobe --ignore-install 
kqswnal
#install ksocknal /sbin/modprobe kqswnal; /sbin/modprobe --ignore-install 
ksocknal
#install ptlrpc /sbin/modprobe ksocknal; /sbin/modprobe --ignore-install ptlrpc
#install llite /sbin/modprobe lov osc; /sbin/modprobe --ignore-install llite


Here's a console log snippet from the client that hung:

2006-12-20 21:43:25 LustreError: 3888:0:(client.c:576:ptlrpc_check_status()) @@@
type == PTL_RPC_MSG_ERR, err == -107 [EMAIL PROTECTED] x10094820/t0
o4->[EMAIL PROTECTED]:28 lens 328/288 ref 2 fl Rpc:R/0/0 r
c 0/-107
2006-12-20 21:43:25 LustreError:
OSC_igs-mds1_OST_igs4_MNT_crater_client-cc685200: Connection to service OST_igs4
via nid [EMAIL PROTECTED] was lost; in progress operations using this service
will wait for rec
overy to complete.
2006-12-20 21:43:25 LustreError: Skipped 4 previous similar messages
2006-12-20 21:43:25 LustreError: This client was evicted by OST_igs4; in
progress operations using this service will fail.
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3340
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3360
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad3380
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33a0
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33c0
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c1ad33e0
failed: -5
2006-12-20 21:43:25 LustreError:
26451:0:(file.c:632:ll_remove_page_from_cache()) writepage of page c181b800
failed: -5
2006-12-20 21:43:25 Lustre: OSC_igs-mds1_OST_igs4_MNT_crater_client-cc685200:
Connection restored to service OST_igs4 using nid [EMAIL PROTECTED]
2006-12-20 21:43:25 Lustre: Skipped 4 previous similar messages
2006-12-20 21:47:28 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166679748, 300s ago) [EMAIL PROTECTED] x10094752/t0
o4->[EMAIL PROTECTED]:28 lens 328/288 ref 2 fl
 Rpc:/0/0 rc 0/0
2006-12-20 21:47:28 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 4 previous similar
messages
2006-12-20 21:47:28 LustreError:
OSC_igs-mds1_OST_igs14_3_MNT_crater_client-cc685200: Connection to service
OST_igs14_3 via nid [EMAIL PROTECTED] was lost; in progress operations using 
this
service will wait f
or recovery to complete.
2006-12-20 21:47:28 Lustre: OSC_igs-mds1_OST_igs14_3_MNT_crater_client-cc685200:
Connection restored to service OST_igs14_3 using nid [EMAIL PROTECTED]
2006-12-20 21:47:29 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166679749, 300s ago) [EMAIL PROTECTED] x10094758/t0 o4->[EMAIL PROTECTED]:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:47:30 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 1 previous similar 
message
2006-12-20 21:47:30 LustreError:
OSC_igs-mds1_OST_igs19_MNT_crater_client-cc685200: Connection to service
OST_igs19 via nid [EMAIL PROTECTED] was lost; in progress operations using this
service will wait for r
ecovery to complete.
2006-12-20 21:47:30 LustreError: Skipped 1 previous similar message
2006-12-20 21:47:30 Lustre: OSC_igs-mds1_OST_igs19_MNT_crater_client-cc685200:
Connection restored to service OST_igs19 using nid [EMAIL PROTECTED]
2006-12-20 21:47:30 Lustre: Skipped 1 previous similar message
2006-12-20 21:48:35 LustreError: 26964:0:(qswlnd_cb.c:575:kqswnal_txhandler())
[EMAIL PROTECTED] RPC status ffffffb9
2006-12-20 21:48:35 LustreError: 3489:0:(events.c:129:client_bulk_callback())
event type 0, status -71, desc daefb000
2006-12-20 21:49:49 LustreError: 0:0:(qswlnd_cb.c:575:kqswnal_txhandler())
[EMAIL PROTECTED] RPC status ffffffb9
2006-12-20 21:49:49 LustreError: 3489:0:(events.c:129:client_bulk_callback())
event type 0, status -71, desc f2b71000
2006-12-20 21:53:34 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680113, 301s ago) [EMAIL PROTECTED] x10095074/t0 o4->[EMAIL PROTECTED]:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:53:34 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 3 previous similar
messages
2006-12-20 21:53:34 LustreError:
OSC_igs-mds1_OST_igs1_3_MNT_crater_client-cc685200: Connection to service
OST_igs1_3 via nid [EMAIL PROTECTED] was lost; in progress operations using this
service will wait for
 recovery to complete.
2006-12-20 21:53:34 LustreError: Skipped 3 previous similar messages
2006-12-20 21:54:46 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680186, 300s ago) [EMAIL PROTECTED] x10095680/t0 o4->[EMAIL PROTECTED]:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:54:46 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 10 previous similar
messages
2006-12-20 21:54:52 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680192, 300s ago) [EMAIL PROTECTED] x10095716/t0
o4->[EMAIL PROTECTED]:28 lens 328/288 ref 2 fl
 Rpc:/0/0 rc 0/0
2006-12-20 21:54:52 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 4 previous similar
messages
2006-12-20 21:55:18 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680218, 300s ago) [EMAIL PROTECTED] x10095910/t0 o4->[EMAIL PROTECTED]:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:56:05 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680264, 301s ago) [EMAIL PROTECTED] x10096157/t0 o4->[EMAIL PROTECTED]:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0
2006-12-20 21:58:33 LustreError:
3888:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at
1166680413, 300s ago) [EMAIL PROTECTED] x10097165/t0 o4->[EMAIL PROTECTED]:28
lens 328/288 ref 2 fl R
pc:/0/0 rc 0/0

[hung]

If useful, I can attach the &Bt backtraces from this node, in its
hung state.

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel

Reply via email to