Hello,

On one of our Lustre clusters, we are seeing many softlockups that appear 
to come from contention on the LNET spinlock LNET_LOCK (the_lnet.ln_lock).

Several Lustre daemon threads are waiting for this LNET_LOCK spinlock 
while one thread is executing the lnet_match_md() function with the 
spinlock held. lnet_match_md() seems to have trouble managing a list of 
packets that contains 90000 elements.
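
To make the pattern concrete, here is a minimal sketch in kernel-style C 
of what we think is happening (our simplification, not the actual LNET 
source: match_md_sketch, me_sketch and its fields are made-up names that 
only mimic lnet_match_md() and the match-entry test):

    /*
     * Simplified sketch, NOT the actual LNET source: one global
     * spinlock serializes a linear walk of the whole list for
     * every incoming message.
     */
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    static DEFINE_SPINLOCK(ln_lock);   /* stands in for the_lnet.ln_lock */
    static LIST_HEAD(match_list);      /* the ~90000-element list */

    struct me_sketch {                 /* made-up name, mimics an LNET ME */
            struct list_head me_list;
            __u64            me_match_bits;
            __u64            me_ignore_bits;
    };

    static struct me_sketch *match_md_sketch(__u64 match_bits)
    {
            struct me_sketch *me;

            spin_lock(&ln_lock);       /* LNET_LOCK() in the real code */
            list_for_each_entry(me, &match_list, me_list) {
                    /* O(n) scan: at 90000 entries the lock may be held
                     * for milliseconds per message while the other
                     * kiblnd_sd threads spin on it in lnet_parse(). */
                    if (((match_bits ^ me->me_match_bits) &
                         ~me->me_ignore_bits) == 0) {
                            spin_unlock(&ln_lock);
                            return me;
                    }
            }
            spin_unlock(&ln_lock);     /* LNET_UNLOCK() */
            return NULL;
    }

As a rough estimate (assuming on the order of 100 ns per entry examined, 
which is an assumption, not a measurement), a 90000-entry walk would hold 
the lock for roughly 9 ms per message, during which the other kiblnd_sd 
threads can only spin in lnet_parse(), which matches the backtraces below.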

Is there a known limitation on managing such a big list? Do you have any 
ideas or information that could help me make progress on this problem?

Lustre: 1.4.8.1
Kernel: 2.6.18

More traces below:

0xe0000004619e0000      0  30567      1      0  0x400040   -  kiblnd_sd_02
 #1 [BSP:e0000004619e12d8] lnet_match_md at a0000002052c9000
 #2 [BSP:e0000004619e1150] lnet_parse at a0000002052dbc10
 #3 [BSP:e0000004619e10f8] kiblnd_handle_rx at a000000205825ba0
 #4 [BSP:e0000004619e10a0] kiblnd_rx_complete at a000000205827890
 #5 [BSP:e0000004619e1080] kiblnd_complete at a0000002058348c0
 #6 [BSP:e0000004619e0fc8] kiblnd_scheduler at a000000205835b00
 #7 [BSP:e0000004619e0fa0] kernel_thread_helper at a000000100014810
 #8 [BSP:e0000004619e0fa0] start_kernel_thread at a0000001000090c0
crash>
PID: 30572  TASK: e000000461a30000  CPU: 1   COMMAND: "kiblnd_sd_07"
 #1 [BSP:e000000461a318d8] serial_in at a000000100380cf0
 #2 [BSP:e000000461a31890] serial8250_console_putchar at a000000100386ac0
 #3 [BSP:e000000461a31850] uart_console_write at a00000010037f1f0
 #4 [BSP:e000000461a317e0] serial8250_console_write at a000000100386fa0
 #5 [BSP:e000000461a31798] __call_console_drivers at a00000010007c5a0
 #6 [BSP:e000000461a31768] _call_console_drivers at a00000010007c700
 #7 [BSP:e000000461a316f8] release_console_sem at a00000010007ce10
 #8 [BSP:e000000461a31628] vprintk at a00000010007d5c0
 #9 [BSP:e000000461a315c0] printk at a00000010007d8a0
#10 [BSP:e000000461a31558] ia64_dump_bs at a000000100012220
#11 [BSP:e000000461a31508] ia64_do_show_stack at a000000100012330
#12 [BSP:e000000461a314e0] unw_init_running at a00000010000cb90
#13 [BSP:e000000461a314c0] show_stack at a000000100012400
#14 [BSP:e000000461a314a8] dump_stack at a000000100012450
#15 [BSP:e000000461a31460] softlockup_tick at a0000001000dbbc0
#16 [BSP:e000000461a31448] run_local_timers at a000000100097470
#17 [BSP:e000000461a31418] update_process_times at a000000100097590
#18 [BSP:e000000461a313b8] timer_interrupt at a000000100039720
#19 [BSP:e000000461a31378] handle_IRQ_event at a0000001000dcfb0
#20 [BSP:e000000461a31318] __do_IRQ at a0000001000dd200
#21 [BSP:e000000461a312e0] ia64_handle_irq at a000000100011580
#22 [BSP:e000000461a312e0] ia64_leave_kernel at a00000010000c4e0
#23 [BSP:e000000461a312e0] ia64_spinlock_contention at a000000100009150
#24 [BSP:e000000461a312d8] _spin_lock at a00000010053fd00
#25 [BSP:e000000461a31150] lnet_parse at a0000002052dbb00
#26 [BSP:e000000461a310f8] kiblnd_handle_rx at a000000205825ba0
#27 [BSP:e000000461a310a0] kiblnd_rx_complete at a000000205827890
#28 [BSP:e000000461a31080] kiblnd_complete at a0000002058348c0
#29 [BSP:e000000461a30fc8] kiblnd_scheduler at a000000205835b00
#30 [BSP:e000000461a30fa0] kernel_thread_helper at a000000100014810
#31 [BSP:e000000461a30fa0] start_kernel_thread at a0000001000090c0


Thanks for any information,
Cédric Lambert
