Hi Eddie, Thanks for your work and I am very happy to help click project even though it's just a little bit.
I have a quick question about your work. I think it could fix the problem between block_tasks() and others. However still it seems to have a problem between driver_lock_tasks() and driver_lock_tasks(). What I'm concerning is like this: access clickfs multiple _task_blocker_waitings became 1 schedule multiple routerthreads call driver_lock_tasks() at the same time dead lock at code : while (!_task_blocker.compare_and_swap(0, -1)) How do you think? Please correct me if I'm wrong and I'm sorry that I can't be help a lot. Thanks, Joonwoo On Mon, Nov 3, 2008 at 6:32 PM, Eddie Kohler <[EMAIL PROTECTED]> wrote: > Hi Joonwoo, > > I appreciate all your work. Thanks for the time you have spent! > > After some poking around and a bunch of rebooting, I have a different > analysis of the problem, and have checked in a patch. It is here: > > http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=7312a95decddc7c4f5043d29d622dc9efb99a547 > > Does this make sense? And if and when you get a chance, does it work for > you? > > Eddie > > > Joonwoo Park wrote: >> >> Hello Eddie, >> >> Thank you for your reviewing. I cannot take a look at the code, I'll >> check my patch soon again as soon as I have a chance. >> I am not using click for work nowadays. So it's pretty hard to spend >> enough time for it. >> >> Anyhow, I have been turning on the kassert. However I couldn't see >> assertion failure (both before & after patching) >> It it make sense? >> >> Thanks, >> Joonwoo >> >> On Mon, Nov 3, 2008 at 2:59 PM, Eddie Kohler <[EMAIL PROTECTED]> wrote: >>> >>> Joonwoo, >>> >>> I don't think this patch has any affect on the correctness of the code. >>> It >>> just slows things down. >>> >>> There are also bugs in the patch, including setting _task_blocker_owner >>> in >>> RouterThread::attempt_lock_tasks but not resetting it if the attempt >>> fails. >>> >>> Have you run after having configured with --enable-kassert? If so, do >>> you >>> see any assertions? If not, could you please? >>> >>> I'd like to track this down, but this patch is not the way. >>> Eddie >>> >>> >>> Joonwoo Park wrote: >>>> >>>> Hello Eddie, >>>> >>>> I tried to fix task blocker to support nested locking and attached a >>>> patch. >>>> Can you please take a look at this? I've tested minimally. >>>> >>>> Thanks! >>>> Joonwoo >>>> >>>> On Tue, Sep 16, 2008 at 9:26 AM, Joonwoo Park <[EMAIL PROTECTED]> >>>> wrote: >>>>> >>>>> I am folking 3 threads. >>>>> >>>>> Joonwoo >>>>> >>>>> 2008/9/16 Eddie Kohler <[EMAIL PROTECTED]>: >>>>>> >>>>>> And how many threads? >>>>>> >>>>>> Eddie >>>>>> >>>>>> >>>>>> Joonwoo Park wrote: >>>>>>> >>>>>>> Hi Eddie, >>>>>>> >>>>>>> I guess so that you intended to they are recursive. :-) >>>>>>> Here is the config can cause lock up without device elements. >>>>>>> >>>>>>> ---- >>>>>>> s0::RatedSource(DATASIZE 128) -> EtherEncap(0x0800, >>>>>>> FF:FF:FF:FF:FF:FF, >>>>>>> FF:FF:FF:FF:FF:FF) -> Discard >>>>>>> s1::InfiniteSource(DATASIZE 128) -> EtherEncap(0x0800, >>>>>>> FF:FF:FF:FF:FF:FF, FF:FF:FF:FF:FF:FF) -> Discard >>>>>>> >>>>>>> sched::BalancedThreadSched(100); >>>>>>> ---- >>>>>>> >>>>>>> Thanks! >>>>>>> Joonwoo >>>>>>> >>>>>>> 2008/9/16 Eddie Kohler <[EMAIL PROTECTED]>: >>>>>>>> >>>>>>>> Hi Joonwoo, >>>>>>>> >>>>>>>> I intended block_tasks() and driver_lock_tasks() to be recursive. I >>>>>>>> could >>>>>>>> certainly have failed! Can you tell me more about the configuration >>>>>>>> you're >>>>>>>> running? Can you cause a soft lockup even without device elements >>>>>>>> (such >>>>>>>> as >>>>>>>> with InfiniteSources)? >>>>>>>> >>>>>>>> Eddie >>>>>>>> >>>>>>>> >>>>>>>> Joonwoo Park wrote: >>>>>>>>> >>>>>>>>> Hi Eddie, >>>>>>>>> >>>>>>>>> I agree with your blocking task execution as a solution. >>>>>>>>> However I got a following soft lock up problem with your patch. >>>>>>>>> With a quick review, it's seems to block_tasks() and driver_tasks() >>>>>>>>> doesn't support recursive lock. (please correct me if I am wrong) >>>>>>>>> So when BalancedThreadSched's run_timer try to lock the tasks, it >>>>>>>>> looks like goes hang. >>>>>>>>> >>>>>>>>> Here is my oops message and gdb output. I used my 2.6.24 patched >>>>>>>>> kernel. I'm sorry for that. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Joonwoo >>>>>>>>> >>>>>>>>> [EMAIL PROTECTED]:~/SRC5/click/linuxmodule$ BUG: soft >>>>>>>>> lockup - CPU#0 stuck for 11s! [kclick:3116] >>>>>>>>> SysRq : Changing Loglevel >>>>>>>>> Loglevel set to 9 >>>>>>>>> BUG: soft lockup - CPU#0 stuck for 11s! [kclick:3116] >>>>>>>>> CPU 0: >>>>>>>>> Modules linked in: click proclikefs e1000 iptable_filter ip_tables >>>>>>>>> x_tables parport_pc lp parport ipv6 floppy pcspkr forcedeth ext3 >>>>>>>>> jbd >>>>>>>>> Pid: 3116, comm: kclick Not tainted 2.6.24.7-joonwpark #3 >>>>>>>>> RIP: 0010:[<ffffffff881f818a>] [<ffffffff881f818a>] >>>>>>>>> :click:_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a/0x630 >>>>>>>>> RSP: 0018:ffff8100370d7d30 EFLAGS: 00000286 >>>>>>>>> RAX: ffff8100370d4000 RBX: ffff8100370d7dc0 RCX: ffff810037892430 >>>>>>>>> RDX: 00000000ffffffff RSI: ffff81003792fcd0 RDI: ffff81003792fc60 >>>>>>>>> RBP: ffffffff806b7b10 R08: 0000000000000000 R09: 0000000000000000 >>>>>>>>> R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000001 >>>>>>>>> R13: ffff810080643000 R14: ffff8100370d6000 R15: 0000000000000001 >>>>>>>>> FS: 00002acdb07f76e0(0000) GS:ffffffff806ae000(0000) >>>>>>>>> knlGS:0000000000000000 >>>>>>>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >>>>>>>>> CR2: 00000000007ad008 CR3: 000000006bdf2000 CR4: 00000000000006e0 >>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>>>>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >>>>>>>>> >>>>>>>>> Call Trace: >>>>>>>>> [<ffffffff88166803>] :click:_Z12element_hookP5TimerPv+0x13/0x20 >>>>>>>>> [<ffffffff8818ebc8>] :click:_ZN6Master10run_timersEv+0x178/0x320 >>>>>>>>> [<ffffffff88183349>] :click:_ZN12RouterThread6driverEv+0x5b9/0x6f0 >>>>>>>>> [<ffffffff881f9ffe>] :click:_Z11click_schedPv+0xfe/0x260 >>>>>>>>> [<ffffffff804e4fef>] _spin_unlock_irq+0x2b/0x30 >>>>>>>>> [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94 >>>>>>>>> [<ffffffff8020cfe8>] child_rip+0xa/0x12 >>>>>>>>> [<ffffffff8022e0b6>] finish_task_switch+0x57/0x94 >>>>>>>>> [<ffffffff8020c6ff>] restore_args+0x0/0x30 >>>>>>>>> [<ffffffff881f9f00>] :click:_Z11click_schedPv+0x0/0x260 >>>>>>>>> [<ffffffff8020cfde>] child_rip+0x0/0x12 >>>>>>>>> >>>>>>>>> >>>>>>>>> [EMAIL PROTECTED]:~/SRC5/click/linuxmodule$ gdb >>>>>>>>> click.ko >>>>>>>>> GNU gdb 6.8-debian >>>>>>>>> Copyright (C) 2008 Free Software Foundation, Inc. >>>>>>>>> License GPLv3+: GNU GPL version 3 or later >>>>>>>>> <http://gnu.org/licenses/gpl.html> >>>>>>>>> This is free software: you are free to change and redistribute it. >>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show >>>>>>>>> copying" >>>>>>>>> and "show warranty" for details. >>>>>>>>> This GDB was configured as "x86_64-linux-gnu"... >>>>>>>>> info line *(gdb) info line >>>>>>>>> *_ZN19BalancedThreadSched9run_timerEP5Timer+0x58a >>>>>>>>> Line 311 of >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh" >>>>>>>>> starts at address 0x9c1ba >>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1418> >>>>>>>>> and ends at 0x9c1be >>>>>>>>> <_ZN19BalancedThreadSched9run_timerEP5Timer+1422>. >>>>>>>>> (gdb) l >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> "/home/joonwpark/SRC5/click/linuxmodule/../include/click/routerthread.hh:311 >>>>>>>>> 306 assert(!current_thread_is_running()); >>>>>>>>> 307 if (!scheduled) >>>>>>>>> 308 ++_task_blocker_waiting; >>>>>>>>> 309 while (1) { >>>>>>>>> 310 int32_t blocker = _task_blocker.value(); >>>>>>>>> 311 if (blocker >= 0 >>>>>>>>> 312 && _task_blocker.compare_and_swap(blocker, >>>>>>>>> blocker + >>>>>>>>> 1)) >>>>>>>>> 313 break; >>>>>>>>> 314 if (nice) { >>>>>>>>> 315 #if CLICK_LINUXMODULE >>>>>>>>> (gdb) >>>>>>>>> >>>>>>>>> >>>>>>>>> 2008/9/15 Eddie Kohler <[EMAIL PROTECTED]>: >>>>>>>>>> >>>>>>>>>> Joonwoo, >>>>>>>>>> >>>>>>>>>>> I took look into this lock up issue and I think I found >>>>>>>>>>> something. >>>>>>>>>>> >>>>>>>>>>> RoutherThread::driver() calls run_tasks() with locked tasks. >>>>>>>>>>> But after calling run_tasks(), current processor can be changed >>>>>>>>>>> since >>>>>>>>>>> schedule() might be called (eg. ScheduleLinux element) >>>>>>>>>>> So I think that's problem. How do you think? >>>>>>>>>> >>>>>>>>>> I totally agree that this could be a problem. >>>>>>>>>> >>>>>>>>>> It looks like EXCLUSIVE handlers never really worked before. :( >>>>>>>>>> >>>>>>>>>> So my current analysis is this. It is not appropriate for a >>>>>>>>>> thread >>>>>>>>>> to >>>>>>>>>> call >>>>>>>>>> blocking functions and/or schedule() when that thread has >>>>>>>>>> prevented >>>>>>>>>> preemption via get_cpu(). My prior patches prevented preemption. >>>>>>>>>> >>>>>>>>>> The solution is to separate "locking the task list" from "blocking >>>>>>>>>> task >>>>>>>>>> execution." Clickfs, when executing an exclusive handler, "blocks >>>>>>>>>> task >>>>>>>>>> execution." A thread that wants to examine the task list "locks" >>>>>>>>>> the >>>>>>>>>> list. >>>>>>>>>> >>>>>>>>>> This commit: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> http://www.read.cs.ucla.edu/gitweb?p=click;a=commit;h=ede0c6b0a1cface05e8d8e2e3496ee7fcd5ee143 >>>>>>>>>> introduces separate APIs for locking the list and blocking task >>>>>>>>>> execution. >>>>>>>>>> Exclusive handlers block task execution, but do not lock the task >>>>>>>>>> list. >>>>>>>>>> I >>>>>>>>>> believe that task execution, in this patch, does not prevent >>>>>>>>>> preemption. >>>>>>>>>> I >>>>>>>>>> believe the locking works out too. User-level multithreading >>>>>>>>>> tests >>>>>>>>>> appear >>>>>>>>>> OK. >>>>>>>>>> >>>>>>>>>> Any willing stresstesters? Pretty please? :) >>>>>>>>>> >>>>>>>>>> Eddie >>>>>>>>>> > _______________________________________________ click mailing list click@amsterdam.lcs.mit.edu https://amsterdam.lcs.mit.edu/mailman/listinfo/click