Re: [B.A.T.M.A.N.] Batman gateway lock ups
Ok, I got the /proc/modules file now. The current situation is as follows: it crashes inside the batman module at position 0x0aa4.

     a60: 3c020000  lui    v0,0x0
     a64: 8c500024  lw     s0,36(v0)
     a68: 24420024  addiu  v0,v0,36
     a6c: 12020014  beq    s0,v0,ac0 <cleanup_module+0x610>
     a70: 3c040000  lui    a0,0x0
     a74: 3c050000  lui    a1,0x0
     a78: 3c020000  lui    v0,0x0
     a7c: 24840000  addiu  a0,a0,0
     a80: 24a50088  addiu  a1,a1,136
     a84: 24420000  addiu  v0,v0,0
     a88: 0040f809  jalr   v0
     a8c: 24060283  li     a2,643
     a90: 8e040004  lw     a0,4(s0)
     a94: 8e030000  lw     v1,0(s0)
     a98: 3c020010  lui    v0,0x10
     a9c: 34420100  ori    v0,v0,0x100
     aa0: 8e110008  lw     s1,8(s0)
     aa4: ac830000  sw     v1,0(a0)
     aa8: ae020000  sw     v0,0(s0)
     aac: 3c020020  lui    v0,0x20
     ab0: 34420200  ori    v0,v0,0x200
     ab4: ac640004  sw     a0,4(v1)

This is part of the compiled version of packet_recv_thread. Due to the optimizations, I cannot say exactly where the problem lies. I think the code of get_ip_addr() got inlined into packet_recv_thread and we need to search for the crash inside of it at list_del(entry->list);

I would also say that the real crash is inside __list_del, where prev and next are set. To check it, look at LIST_POISON1 and LIST_POISON2 in poison.h of the current Linux kernel. You will notice that the values are 0x00100100 and 0x00200200, and the latter is the address of the failed paging request. The list poisoning is done in list_del after calling __list_del (it is the sequence lui, ori, sw in the asm snippet).

So could it be that we have a poisoned entry inside the list? This could happen, for example, when we get scheduled (please notice that the optimizer reordered many instructions) while another part of the program is deleting entries. I haven't checked the rest of the code to see whether that really could happen, but that is my current idea.

So, for better readability, the call stack:
 - packet_recv_thread
 - get_ip_addr from gateway.c:401
 - list_del from gateway.c:645
 - __list_del

Best regards
    Sven Eckelmann
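To make the poison argument concrete, here is a minimal userspace sketch of what list_del() does and why deleting an already deleted entry would fault at 0x00200200. The struct and the two list functions are simplified stand-ins modelled on the kernel's list.h, using the 32-bit poison values quoted above; unlink_store_address() is a hypothetical helper added only for illustration, and none of this is the actual batgat code.

```c
#include <stddef.h>
#include <stdint.h>

/* Poison values as quoted in the mail (assumed 32-bit kernel values). */
#define LIST_POISON1 ((uintptr_t)0x00100100)
#define LIST_POISON2 ((uintptr_t)0x00200200)

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points to itself. */
static inline void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert item directly after head. */
static inline void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/* Unlink the entry (the __list_del part), then poison its pointers. */
static inline void list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = (struct list_head *)LIST_POISON1;
    entry->prev = (struct list_head *)LIST_POISON2;
}

/* Hypothetical helper: the address that the store "prev->next = next"
 * would write to if list_del() were called on this entry. It only
 * computes the address and never dereferences it. */
static inline uintptr_t unlink_store_address(const struct list_head *entry)
{
    return (uintptr_t)entry->prev + offsetof(struct list_head, next);
}
```

After one list_del(), entry->prev holds LIST_POISON2, so a second list_del() on the same entry would execute "prev->next = next" as a store to 0x00200200. That would match the "sw v1,0(a0)" at aa4 if a0 already contained the poison value.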
Re: [B.A.T.M.A.N.] Batman gateway lock ups
On Monday 08 September 2008 23:45:55 Sven Eckelmann wrote:
> I got the System.map right now. So we can convert the kernel oops into
> something more readable. It doesn't help that much now, but... just for
> the sake of completeness

Sorry, I forgot the second oops with the interesting address of the paging failure.

Best regards
    Sven Eckelmann

CPU 0 Unable to handle kernel paging request at virtual address 00200200, epc == c00c8aa4, ra == c00c8a90
Cpu 0
$ 0 : 10009c00 00100100 00100100
$ 4 : 00200200 0001
$ 8 : 8071aa28 000b 127a3980
$12 : 000b ebc2 045d 67350e80
$16 : 80ac1600 c00c9a28 0064
$20 : c00d 0006ab6e 8071d93d
$24 : 8071d730 8000
$28 : 8071c000 8071d890 c00c8a90
Hi : 0140
Lo : 68fdd3c0
epc : c00c8aa4 Tainted: P
Cause : 308c
    8071d9a0 05dc 8071d93d 0054 000210d2 05a82b6e 0002 c00505f1
    8026dd80 8071db60 000210d2 801ca5e8 8071a9f8 8007d41c 8071d8fc 8071d8fc
    8071d8c0 0010 8071d8b0 0001 4040 8071d8d0 0010 8071d8b8
    0001
Call Trace:[801ca5e8][8007d41c][801c018c][801ba0d0][801bbf74][8020f0d4]
    [8020f95c][8020f95c][c015b2f4][c0161e80][8008bfe0][8008dfac]
    [801ca110][800431e8][800437a4][c0106840][8005][c01549f0]
    [c015f7b0][c015f7f0][c0161e80][80079f5c][8006f8d0][8008bfe0]
    [8006b778][8006b1e0][8006b2c4][c015f694][800437a4][80279960]
    [8005fa64][800ba4d4][8005e8d4][8005cd38][8005d578][800b6d24]
    [800b6d1c][802276d8][8022656c][8006704c][80067044][800691d4]
    [80072f64][80072e54][80069290][80073a00][8005e5f0][80046aa0]
    [8005e5f0][8005cd04][8005e8d4][8005e0e4][8005cd38][8005d578]
    [8007d0b0][8022656c][c00c8650][8007d108][8007d0e8][80045698]
    [80045688]
Code: 3c020010 34420100 8e110008 ac830000 ae020000 3c020020 34420200 ac640004 16200011

???; c00c8aa4 END_OF_CODE+3fe1d474/ =
Trace; 801ca5e8 ip_local_deliver_finish+0/2c0
Trace; 8007d41c autoremove_wake_function+0/44
Trace; 801c018c udp_packet+f0/114
Trace; 801ba0d0 nf_conntrack_find_get+c8/dc
Trace; 801bbf74 nf_conntrack_in+4ac/6f8
Trace; 8020f0d4 ipt_do_table+50c/588
Trace; 8020f95c nf_nat_fn+20c/244
Trace; 8020f95c nf_nat_fn+20c/244
Trace; c015b2f4 END_OF_CODE+3feafcc4/
Trace; c0161e80 END_OF_CODE+3feb6850/
Trace; 8008bfe0 handle_IRQ_event+64/d4
Trace; 8008dfac handle_level_irq+c0/114
Trace; 801ca110 ip_rcv_finish+0/4d8
Trace; 800431e8 ar5315_irq_dispatch+26c/2a4
Trace; 800437a4 ret_from_irq+0/4
Trace; c0106840 END_OF_CODE+3fe5b210/
Trace; 8005 blast_icache64_page_indexed+0/e4
Trace; c01549f0 END_OF_CODE+3fea93c0/
Trace; c015f7b0 END_OF_CODE+3feb4180/
Trace; c015f7f0 END_OF_CODE+3feb41c0/
Trace; c0161e80 END_OF_CODE+3feb6850/
Trace; 80079f5c rcu_process_callbacks+1c/38
Trace; 8006f8d0 run_timer_softirq+20/1fc
Trace; 8008bfe0 handle_IRQ_event+64/d4
Trace; 8006b778 tasklet_action+118/198
Trace; 8006b1e0 __do_softirq+78/100
Trace; 8006b2c4 do_softirq+5c/94
Trace; c015f694 END_OF_CODE+3feb4064/
Trace; 800437a4 ret_from_irq+0/4
Trace; 80279960 cpu_probe+584/994
Trace; 8005fa64 __wake_up_sync+3c/74
Trace; 800ba4d4 __fput+188/1cc
Trace; 8005e8d4 dequeue_entity+98/d8
Trace; 8005cd38 dequeue_task+1c/30
Trace; 8005d578 pick_next_task_fair+38/78
Trace; 800b6d24 filp_close+74/90
Trace; 800b6d1c filp_close+6c/90
Trace; 802276d8 cond_resched+44/5c
Trace; 8022656c schedule+1e0/7d4
Trace; 8006704c put_files_struct+188/208
Trace; 80067044 put_files_struct+180/208
Trace; 800691d4 do_exit+960/96c
Trace; 80072f64 dequeue_signal+13c/17c
Trace; 80072e54 dequeue_signal+2c/17c
Trace; 80069290 sys_exit_group+0/c
Trace; 80073a00 get_signal_to_deliver+444/498
Trace; 8005e5f0 enqueue_entity+2fc/33c
Trace; 80046aa0 do_notify_resume+64/3ec
Trace; 8005e5f0 enqueue_entity+2fc/33c
Trace; 8005cd04 enqueue_task+1c/34
Trace; 8005e8d4 dequeue_entity+98/d8
Trace; 8005e0e4 try_to_wake_up+84/d8
Trace; 8005cd38 dequeue_task+1c/30
Trace; 8005d578 pick_next_task_fair+38/78
Trace; 8007d0b0 kthread+0/b0
Trace; 8022656c schedule+1e0/7d4
Trace; c00c8650 END_OF_CODE+3fe1d020/
Trace; 8007d108 kthread+58/b0
Trace; 8007d0e8 kthread+38/b0
Trace; 80045698 kernel_thread_helper+10/18
Trace; 80045688 kernel_thread_helper+0/18
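The "Code:" words of the oops can be cross-checked by hand against the poison values. As a rough sketch, the hypothetical helpers below pull the 16-bit immediates out of a MIPS lui/ori pair and combine them into the 32-bit constant the pair loads. The opcode numbers (lui = 0x0f, ori = 0x0d) are the standard MIPS32 encodings; the function names are made up for this example, and the sketch does not verify that both instructions use the same register.

```c
#include <stdint.h>

/* Primary opcode of a MIPS instruction word: the top 6 bits. */
static inline uint32_t mips_opcode(uint32_t insn)
{
    return insn >> 26;
}

/* Immediate field of an I-type instruction: the low 16 bits. */
static inline uint32_t mips_imm16(uint32_t insn)
{
    return insn & 0xffffu;
}

/* Constant built by "lui rt,hi" followed by "ori rt,rt,lo".
 * Returns 0 if the two words are not a lui/ori pair. */
static inline uint32_t lui_ori_constant(uint32_t lui, uint32_t ori)
{
    if (mips_opcode(lui) != 0x0f || mips_opcode(ori) != 0x0d)
        return 0;
    return (mips_imm16(lui) << 16) | mips_imm16(ori);
}
```

Fed with the pairs from the Code line, lui_ori_constant(0x3c020010, 0x34420100) gives 0x00100100 and lui_ori_constant(0x3c020020, 0x34420200) gives 0x00200200, the two list poison values.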
Re: [B.A.T.M.A.N.] Batman gateway lock ups
Hey Sven,

thanks for your analysis!!

On Mon, Sep 08, 2008 at 11:18:42PM +0200, Sven Eckelmann wrote:
> Ok, I got the /proc/modules file now. The current situation is as follows:
> it crashes inside the batman module at position 0x0aa4.
>
>      a60: 3c020000  lui    v0,0x0
>      a64: 8c500024  lw     s0,36(v0)
>      a68: 24420024  addiu  v0,v0,36
>      a6c: 12020014  beq    s0,v0,ac0 <cleanup_module+0x610>
>      a70: 3c040000  lui    a0,0x0
>      a74: 3c050000  lui    a1,0x0
>      a78: 3c020000  lui    v0,0x0
>      a7c: 24840000  addiu  a0,a0,0
>      a80: 24a50088  addiu  a1,a1,136
>      a84: 24420000  addiu  v0,v0,0
>      a88: 0040f809  jalr   v0
>      a8c: 24060283  li     a2,643
>      a90: 8e040004  lw     a0,4(s0)
>      a94: 8e030000  lw     v1,0(s0)
>      a98: 3c020010  lui    v0,0x10
>      a9c: 34420100  ori    v0,v0,0x100
>      aa0: 8e110008  lw     s1,8(s0)
>      aa4: ac830000  sw     v1,0(a0)
>      aa8: ae020000  sw     v0,0(s0)
>      aac: 3c020020  lui    v0,0x20
>      ab0: 34420200  ori    v0,v0,0x200
>      ab4: ac640004  sw     a0,4(v1)
>
> This is part of the compiled version of packet_recv_thread. Due to the
> optimizations, I cannot say exactly where the problem lies. I think the
> code of get_ip_addr() got inlined into packet_recv_thread and we need to
> search for the crash inside of it at list_del(entry->list);
>
> I would also say that the real crash is inside __list_del, where prev and
> next are set. To check it, look at LIST_POISON1 and LIST_POISON2 in
> poison.h of the current Linux kernel. You will notice that the values are
> 0x00100100 and 0x00200200, and the latter is the address of the failed
> paging request. The list poisoning is done in list_del after calling
> __list_del (it is the sequence lui, ori, sw in the asm snippet).
>
> So could it be that we have a poisoned entry inside the list? This could
> happen, for example, when we get scheduled (please notice that the
> optimizer reordered many instructions) while another part of the program
> is deleting entries. I haven't checked the rest of the code to see whether
> that really could happen, but that is my current idea.
Mhm, as far as I have looked into the issue, these are the points where free_client_list is accessed:

init_module() - INIT_LIST_HEAD():
  * called on startup
get_ip_addr() - list_del():
  * secured with a hash_lock spinlock
cleanup_module() - list_del():
  * only called when unloading the module
batgat_ioctl() - list_del():
  * from IOCREMDEV. This is called when batman shuts down.
packet_recv_thread - list_add():
  * also secured with a hash_lock spinlock

So it seems there should be no concurrency without user interaction (module or batman shutdown). But I don't have a good idea yet where the problem comes from ... :/

best regards,
Simon
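One possible way to narrow this down would be to make the delete path refuse to touch an entry that already carries the poison values, and report it instead of faulting. The following is a userspace sketch of that idea under stated assumptions: the struct mirrors the kernel's list_head, the poison constants are the 32-bit values from the earlier mail, and list_entry_poisoned() and list_del_checked() are hypothetical names, not existing batgat or kernel functions. The kernel's CONFIG_DEBUG_LIST option performs sanity checks in a similar spirit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Poison values as quoted earlier in the thread (assumed 32-bit values). */
#define LIST_POISON1 ((uintptr_t)0x00100100)
#define LIST_POISON2 ((uintptr_t)0x00200200)

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* True if the entry looks like it was already removed with list_del(). */
static inline bool list_entry_poisoned(const struct list_head *entry)
{
    return (uintptr_t)entry->next == LIST_POISON1 ||
           (uintptr_t)entry->prev == LIST_POISON2;
}

/* Hypothetical guarded delete: unlinks and poisons the entry and returns
 * true, or returns false without touching memory if the entry was
 * already deleted. A caller could log the false case to find the
 * second deleter. */
static inline bool list_del_checked(struct list_head *entry)
{
    if (list_entry_poisoned(entry))
        return false;

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = (struct list_head *)LIST_POISON1;
    entry->prev = (struct list_head *)LIST_POISON2;
    return true;
}
```

Such a check does not remove a race by itself; the list would still have to be modified only while holding the lock. It would only turn the oops into a message that points at the call site performing the second delete.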