Re: [B.A.T.M.A.N.] Batman gateway lock ups

2008-09-09 Thread Sven Eckelmann
OK, I have the /proc/modules file now. The current situation is as follows: it 
crashes inside the batman module at position 0x0aa4:

 a60:   3c02      lui     v0,0x0
 a64:   8c500024  lw      s0,36(v0)
 a68:   24420024  addiu   v0,v0,36
 a6c:   12020014  beq     s0,v0,ac0 <cleanup_module+0x610>
 a70:   3c04      lui     a0,0x0
 a74:   3c05      lui     a1,0x0
 a78:   3c02      lui     v0,0x0
 a7c:   2484      addiu   a0,a0,0
 a80:   24a50088  addiu   a1,a1,136
 a84:   2442      addiu   v0,v0,0
 a88:   0040f809  jalr    v0
 a8c:   24060283  li      a2,643
 a90:   8e040004  lw      a0,4(s0)
 a94:   8e03      lw      v1,0(s0)
 a98:   3c020010  lui     v0,0x10
 a9c:   34420100  ori     v0,v0,0x100
 aa0:   8e110008  lw      s1,8(s0)
 aa4:   ac83      sw      v1,0(a0)
 aa8:   ae02      sw      v0,0(s0)
 aac:   3c020020  lui     v0,0x20
 ab0:   34420200  ori     v0,v0,0x200
 ab4:   ac640004  sw      a0,4(v1)

This is part of the compiled version of packet_recv_thread. Due to the 
optimizations, I cannot say exactly where the problem lies.

I think the code of get_ip_addr() got inlined into packet_recv_thread, and we 
need to search for the crash inside it at list_del(entry->list);
I would also say that the real crash is inside __list_del, where prev and 
next are set. To check this, look at LIST_POISON1 and LIST_POISON2 in 
poison.h of the current Linux kernel. You will notice that the values are 
0x00100100 and 0x00200200 == address of the failed paging request. The list 
poisoning is done in list_del after calling __list_del (it is the 
sequence lui, ori, sw in the asm snippet). So could it be that we have a 
poisoned entry inside the list?
This could, for example, happen when we get scheduled (please note that the 
optimizer reordered many instructions) while another part of the program is 
deleting entries. I haven't checked whether the rest of the code allows that 
to happen, but that is my current idea.
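To make the relation between the oops and the offsets in the listing explicit, here is a minimal sketch. It assumes the module is loaded at 0xc00c8000; that base is inferred from the epc value (c00c8aa4) in the oops posted in this thread and the offset 0x0aa4, and is not quoted from /proc/modules. The helper name module_offset() is made up for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert an address reported in an oops into an
 * offset inside a loaded module, so that it can be matched against the
 * addresses printed by objdump for the module object.
 *
 * module_base is the load address of the module. The value used in the
 * note below (0xc00c8000) is an assumption derived from epc and the
 * crash offset, not a value taken from /proc/modules.
 */
uint32_t module_offset(uint32_t addr, uint32_t module_base)
{
    return addr - module_base;
}
```

With that base, epc == c00c8aa4 maps to offset 0xaa4 (the sw v1,0(a0) line), and ra == c00c8a90 maps to 0xa90, which is the instruction following the jalr at a88 and its delay slot at a8c.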
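The reasoning above can be sketched in userspace C. This is only an illustration modelled on the list helpers of 2.6-era kernels, not the batman or kernel source itself: list_del_unlink() stands in for the kernel's __list_del(), list_entry_is_poisoned() is a hypothetical helper, and the mapping of stores to asm addresses in the comments is my reading of the listing.

```c
#include <assert.h>
#include <stdint.h>

struct list_head {
    struct list_head *next, *prev;   /* next at offset 0, prev after it */
};

/* 32-bit poison values as named in the mail (see poison.h). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/* Stand-in for __list_del(): make the two neighbours point at each other. */
void list_del_unlink(struct list_head *prev, struct list_head *next)
{
    next->prev = prev;   /* appears to be ab4: sw a0,4(v1) */
    prev->next = next;   /* appears to be aa4: sw v1,0(a0), the faulting store */
}

void list_del(struct list_head *entry)
{
    list_del_unlink(entry->prev, entry->next);
    entry->next = LIST_POISON1;   /* lui/ori/sw with 0x00100100 */
    entry->prev = LIST_POISON2;   /* lui/ori/sw with 0x00200200 */
}

/* Hypothetical helper: has list_del() already been run on this entry? */
int list_entry_is_poisoned(const struct list_head *entry)
{
    return entry->next == LIST_POISON1 || entry->prev == LIST_POISON2;
}
```

A second list_del() on the same entry would pass LIST_POISON2 as prev and then store through it, which is a write to address 0x00200200. That matches the faulting address in the oops, and the register dump there shows $4 (a0) == 00200200 and $3 (v1) == 00100100 at the time of the fault.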

So, for better readability, the call stack:
- packet_recv_thread
- get_ip_addr from gateway.c:401
- list_del from gateway.c:645
- __list_del

Best regards
Sven Eckelmann




Re: [B.A.T.M.A.N.] Batman gateway lock ups

2008-09-09 Thread Sven Eckelmann
On Monday 08 September 2008 23:45:55 Sven Eckelmann wrote:
> I got the System.map right now. So we can convert the kernel oops into
> something more readable. It doesn't help that much now, but... just for
> the sake of completeness
Sorry, I forgot the second oops, the one with the interesting address of the 
paging failure.
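For reference, converting a raw address into symbol+offset with a System.map boils down to finding the last symbol whose address is not greater than the one looked up. A minimal sketch, assuming a sorted table; the entries are the few symbols that appear with offset +0 in the decoded trace of this mail, and ksym_lookup() is a made-up name, not the tool used here:

```c
#include <stddef.h>
#include <stdint.h>

struct ksym {
    uint32_t addr;
    const char *name;
};

/* Symbols listed with offset +0 in the decoded trace, sorted by address. */
const struct ksym ksyms[] = {
    { 0x800437a4u, "ret_from_irq" },
    { 0x80045688u, "kernel_thread_helper" },
    { 0x80069290u, "sys_exit_group" },
    { 0x8007d0b0u, "kthread" },
    { 0x8007d41cu, "autoremove_wake_function" },
    { 0x801ca110u, "ip_rcv_finish" },
    { 0x801ca5e8u, "ip_local_deliver_finish" },
};
const size_t ksyms_len = sizeof(ksyms) / sizeof(ksyms[0]);

/*
 * Binary search for the symbol with the greatest address <= addr.
 * Returns NULL if addr lies below the first symbol; otherwise stores
 * the distance from the symbol start in *offset.
 */
const struct ksym *ksym_lookup(uint32_t addr, uint32_t *offset)
{
    size_t lo = 0, hi = ksyms_len;   /* candidates are in [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ksyms[mid].addr <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;
    *offset = addr - ksyms[lo - 1].addr;
    return &ksyms[lo - 1];
}
```

Looking up 8007d108 this way gives kthread+0x58, as in the trace. Note that this sketch does not check symbol sizes, so any address above the last table entry is attributed to that entry; module addresses such as c00c8aa4 are not covered by the kernel's System.map, which is presumably why they show up as END_OF_CODE+... in the decoded output.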

Best regards
Sven Eckelmann

CPU 0 Unable to handle kernel paging request at virtual address 00200200, epc == c00c8aa4, ra == c00c8a90
Cpu 0
$ 0   :  10009c00 00100100 00100100
$ 4   : 00200200 0001  
$ 8   :  8071aa28 000b 127a3980
$12   : 000b ebc2 045d 67350e80
$16   : 80ac1600  c00c9a28 0064
$20   : c00d  0006ab6e 8071d93d
$24   : 8071d730 8000
$28   : 8071c000 8071d890  c00c8a90
Hi: 0140
Lo: 68fdd3c0
epc   : c00c8aa4 Tainted: P
Cause : 308c
8071d9a0 05dc 8071d93d 0054 000210d2 05a82b6e  
0002 c00505f1 8026dd80 8071db60 000210d2   801ca5e8
 8071a9f8 8007d41c 8071d8fc 8071d8fc 8071d8c0 0010 8071d8b0
0001   4040 8071d8d0 0010 8071d8b8 0001
Call Trace:[801ca5e8][8007d41c][801c018c][801ba0d0][801bbf74][8020f0d4][8020f95c][8020f95c][c015b2f4][c0161e80][8008bfe0][8008dfac][801ca110][800431e8][800437a4][c0106840][8005][c01549f0][c015f7b0][c015f7f0][c0161e80][80079f5c][8006f8d0][8008bfe0][8006b778][8006b1e0][8006b2c4][c015f694][800437a4][80279960][8005fa64][800ba4d4][8005e8d4][8005cd38][8005d578][800b6d24][800b6d1c][802276d8][8022656c][8006704c][80067044][800691d4][80072f64][80072e54][80069290][80073a00][8005e5f0][80046aa0][8005e5f0][8005cd04][8005e8d4][8005e0e4][8005cd38][8005d578][8007d0b0][8022656c][c00c8650][8007d108][8007d0e8][80045698][80045688]
Code: 3c020010 34420100 8e110008 ac83 ae02 3c020020 34420200 ac640004 16200011


???; c00c8aa4 END_OF_CODE+3fe1d474/   =

Trace; 801ca5e8 ip_local_deliver_finish+0/2c0
Trace; 8007d41c autoremove_wake_function+0/44
Trace; 801c018c udp_packet+f0/114
Trace; 801ba0d0 nf_conntrack_find_get+c8/dc
Trace; 801bbf74 nf_conntrack_in+4ac/6f8
Trace; 8020f0d4 ipt_do_table+50c/588
Trace; 8020f95c nf_nat_fn+20c/244
Trace; 8020f95c nf_nat_fn+20c/244
Trace; c015b2f4 END_OF_CODE+3feafcc4/
Trace; c0161e80 END_OF_CODE+3feb6850/
Trace; 8008bfe0 handle_IRQ_event+64/d4
Trace; 8008dfac handle_level_irq+c0/114
Trace; 801ca110 ip_rcv_finish+0/4d8
Trace; 800431e8 ar5315_irq_dispatch+26c/2a4
Trace; 800437a4 ret_from_irq+0/4
Trace; c0106840 END_OF_CODE+3fe5b210/
Trace; 8005 blast_icache64_page_indexed+0/e4
Trace; c01549f0 END_OF_CODE+3fea93c0/
Trace; c015f7b0 END_OF_CODE+3feb4180/
Trace; c015f7f0 END_OF_CODE+3feb41c0/
Trace; c0161e80 END_OF_CODE+3feb6850/
Trace; 80079f5c rcu_process_callbacks+1c/38
Trace; 8006f8d0 run_timer_softirq+20/1fc
Trace; 8008bfe0 handle_IRQ_event+64/d4
Trace; 8006b778 tasklet_action+118/198
Trace; 8006b1e0 __do_softirq+78/100
Trace; 8006b2c4 do_softirq+5c/94
Trace; c015f694 END_OF_CODE+3feb4064/
Trace; 800437a4 ret_from_irq+0/4
Trace; 80279960 cpu_probe+584/994
Trace; 8005fa64 __wake_up_sync+3c/74
Trace; 800ba4d4 __fput+188/1cc
Trace; 8005e8d4 dequeue_entity+98/d8
Trace; 8005cd38 dequeue_task+1c/30
Trace; 8005d578 pick_next_task_fair+38/78
Trace; 800b6d24 filp_close+74/90
Trace; 800b6d1c filp_close+6c/90
Trace; 802276d8 cond_resched+44/5c
Trace; 8022656c schedule+1e0/7d4
Trace; 8006704c put_files_struct+188/208
Trace; 80067044 put_files_struct+180/208
Trace; 800691d4 do_exit+960/96c
Trace; 80072f64 dequeue_signal+13c/17c
Trace; 80072e54 dequeue_signal+2c/17c
Trace; 80069290 sys_exit_group+0/c
Trace; 80073a00 get_signal_to_deliver+444/498
Trace; 8005e5f0 enqueue_entity+2fc/33c
Trace; 80046aa0 do_notify_resume+64/3ec
Trace; 8005e5f0 enqueue_entity+2fc/33c
Trace; 8005cd04 enqueue_task+1c/34
Trace; 8005e8d4 dequeue_entity+98/d8
Trace; 8005e0e4 try_to_wake_up+84/d8
Trace; 8005cd38 dequeue_task+1c/30
Trace; 8005d578 pick_next_task_fair+38/78
Trace; 8007d0b0 kthread+0/b0
Trace; 8022656c schedule+1e0/7d4
Trace; c00c8650 END_OF_CODE+3fe1d020/
Trace; 8007d108 kthread+58/b0
Trace; 8007d0e8 kthread+38/b0
Trace; 80045698 kernel_thread_helper+10/18
Trace; 80045688 kernel_thread_helper+0/18





Re: [B.A.T.M.A.N.] Batman gateway lock ups

2008-09-09 Thread Simon Wunderlich
Hey Sven,

thanks for your analysis!

On Mon, Sep 08, 2008 at 11:18:42PM +0200, Sven Eckelmann wrote:
> [...]
> I think the code of get_ip_addr() got inlined into packet_recv_thread, and we 
> need to search for the crash inside it at list_del(entry->list);
> I would also say that the real crash is inside __list_del, where prev and 
> next are set. To check this, look at LIST_POISON1 and LIST_POISON2 in 
> poison.h of the current Linux kernel. You will notice that the values are 
> 0x00100100 and 0x00200200 == address of the failed paging request. The list 
> poisoning is done in list_del after calling __list_del (it is the 
> sequence lui, ori, sw in the asm snippet). So could it be that we have a 
> poisoned entry inside the list?
> This could, for example, happen when we get scheduled (please note that the 
> optimizer reordered many instructions) while another part of the program is 
> deleting entries. I haven't checked whether the rest of the code allows that 
> to happen, but that is my current idea.

Hmm, as far as I have looked into the issue, free_client_list is accessed at 
the following points:

init_module() - INIT_LIST_HEAD():
* called on startup

get_ip_addr() - list_del():
* secured with the hash_lock spinlock

cleanup_module() - list_del():
* only called when unloading the module

batgat_ioctl() - list_del():
* from IOCREMDEV. This is called when batman shuts down.

packet_recv_thread() - list_add():
* also secured with the hash_lock spinlock

So it seems there should be no concurrency without user interaction 
(module or batman shutdown).
But I don't have a good idea yet where the problem comes from ... :/
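One way to double-check the list above would be to route every change of free_client_list through a wrapper that records whether hash_lock is held. The following is only a userspace sketch of that idea: fake_lock, the checked_* wrappers and both *_path functions are made up for illustration, and the unlocked path is hypothetical (I am not claiming that any of the functions above actually skips the lock).

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

/* Stand-in for the kernel spinlock: it only records whether it is held. */
struct fake_lock { int held; };

struct fake_lock hash_lock;
struct list_head free_client_list = { &free_client_list, &free_client_list };
int unlocked_accesses;   /* list changes made without hash_lock held */

void fake_lock_acquire(struct fake_lock *l) { assert(!l->held); l->held = 1; }
void fake_lock_release(struct fake_lock *l) { assert(l->held);  l->held = 0; }

/* All list changes go through these wrappers, which count violations. */
void checked_list_add(struct list_head *item)
{
    if (!hash_lock.held)
        unlocked_accesses++;
    item->next = free_client_list.next;
    item->prev = &free_client_list;
    free_client_list.next->prev = item;
    free_client_list.next = item;
}

void checked_list_del(struct list_head *item)
{
    if (!hash_lock.held)
        unlocked_accesses++;
    item->next->prev = item->prev;
    item->prev->next = item->next;
}

/* Models a path that takes the lock around its list change. */
void locked_add_path(struct list_head *item)
{
    fake_lock_acquire(&hash_lock);
    checked_list_add(item);
    fake_lock_release(&hash_lock);
}

/* Hypothetical path that changes the list without taking the lock. */
void unlocked_del_path(struct list_head *item)
{
    checked_list_del(item);
}
```

After a run that exercises all paths, a non-zero unlocked_accesses would point at a path that touches the list without the lock. In the module itself the same check would have to be built on the real spinlock rather than on this stand-in.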

best regards,
Simon

