On 10 February 2011 14:28, Beyers Cronje <bcro...@gmail.com> wrote:
> Hi Eddie,
>
> I've reverted to the pre-coreperformance code on the "semi-production"
> server for now, and will see if I can replicate the issue on my dev server.
> Hopefully I can use Cliff's test case to replicate and troubleshoot. Will
> keep you posted. In all honesty, though, I haven't wrapped my head around
> memory barriers and the like, so I'm not sure how far I will get on my own :(
>
> Cliff, care to host an online advanced multithreading course? :)
Cliff, if you are hosting, mind if I join ;-) ?

> On Thu, Feb 10, 2011 at 4:15 PM, Eddie Kohler <koh...@cs.ucla.edu> wrote:
>
>> I'm very sorry about this regression... Unfortunately, due to travel, it's
>> going to be hard to look at this until the weekend. I would rather not yet
>> revert coreperformance. Can you handle this situation?
>>
>> Eddie
>>
>>
>> On 02/09/2011 05:22 AM, Beyers Cronje wrote:
>>
>>> Hi Eddie,
>>>
>>> Some info that I'm sure will help debugging. I configured Click with
>>> --enable-schedule-debugging=extra and also enabled NOTIFIERQUEUE_DEBUG.
>>>
>>> It seems Unqueue gets stuck in the pending list. See the output below:
>>>
>>> read q.length
>>> 1000
>>>
>>> read q.notifier_state
>>> notifier on
>>> task 0x19387b0 [uq :: Unqueue] scheduled
>>>
>>> read uq.scheduled
>>> true /* but pending */
>>>
>>> read uq.notifier
>>> empty.0/1:1*
>>>
>>> Unqueue stays in this scheduled-but-pending state for an undetermined
>>> period of time, sometimes up to minutes.
>>>
>>> Any idea where I can start on fixing this bug?
>>>
>>> Beyers
>>>
>>>
>>> On Wed, Feb 9, 2011 at 2:52 AM, Beyers Cronje <bcro...@gmail.com> wrote:
>>>
>>> Update - Strange, after typing my previous email I checked again, and all
>>> of a sudden Unqueue was pulling packets again. Not sure if me breaking
>>> into Click with gdb kick-started it again :) or if it's an intermittent
>>> issue.
>>>
>>>
>>> On Wed, Feb 9, 2011 at 2:41 AM, Beyers Cronje <bcro...@gmail.com> wrote:
>>>
>>> Hi Eddie,
>>>
>>> Since running this merge I've been experiencing issues with usermode
>>> multithreading. I'm using commit 9419098acbdc20837e37f3033c40661809431f8d.
>>> I do believe the issues are related to the changes in this merge, as I was
>>> running the same config on pre-coreperformance merge code without any
>>> issues.
>>>
>>> *Simplified config used:*
>>>
>>> fd::FromDAG(/dev/dag0) -> cl1::Classifier(12/0800) -> MarkIPHeader(14)
>>>     -> ipc1::IPClassifier(udp port 1646);
>>> sfp::SFP();
>>> q::ThreadSafeQueue;
>>> log::Logger(sfp, 1800);
>>>
>>> ipc1 -> RadAccounting -> q;
>>> q -> uq::Unqueue -> aupcc::Counter -> sfp;
>>>
>>> StaticThreadSched(fd 0, uq 1, log 2);
>>>
>>> *Problem description:*
>>> The config runs 3 threads: FromDAG and Unqueue do what their names
>>> indicate, and Logger has a timer that schedules its task once every 30
>>> minutes. Everything runs fine initially, and then somewhere along the line
>>> Unqueue stops pulling packets from the queue, which leads to a constant
>>> queue overflow. When this happens I can see FromDAG is still working, as
>>> q.drops increases constantly, and Logger also fires away every 30 minutes.
>>> Note that the push rate from FromDAG is quite high, but what RadAccounting
>>> pushes into the queue is very low, ~100 pps, which means the queue is
>>> empty most of the time and Unqueue is not scheduled.
>>>
>>> Unfortunately I didn't configure debug scheduling, which would've helped.
>>> I did notice Unqueue.scheduled always returned true, even though it never
>>> actually ran.
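The "true /* but pending */" state above is easier to picture with a rough
sketch of a two-stage, cross-thread task hand-off: a task is flagged as
scheduled and parked on a per-thread pending list, and it only does work once
the owning thread drains that list into its run list. The sketch below is a
generic C++ illustration of that pattern, not Click's actual Task or
RouterThread code; the names MiniTask, MiniThread, and driver_iteration are
invented for this example.

    #include <atomic>
    #include <mutex>
    #include <vector>

    struct MiniTask {
        std::atomic<bool> scheduled{false};   // what a "scheduled" handler would report
        std::atomic<bool> pending{false};     // set while parked on the pending list
        void run() { /* pull packets, etc. */ }
    };

    struct MiniThread {
        std::mutex pending_lock;
        std::vector<MiniTask*> pending_list;   // filled by other threads
        std::vector<MiniTask*> run_list;       // touched only by the owning thread
        std::atomic<bool> any_pending{false};  // cheap wakeup hint for the driver loop

        // Called from another thread (e.g. by a queue notifier) to wake a task.
        void schedule_from_elsewhere(MiniTask* t) {
            if (!t->scheduled.exchange(true)) {            // avoid double insertion
                std::lock_guard<std::mutex> g(pending_lock);
                t->pending.store(true);
                pending_list.push_back(t);
                any_pending.store(true, std::memory_order_release);
            }
        }

        // Run by the owning thread on each driver iteration.
        // (A real scheduler would also unschedule and remove tasks; omitted here.)
        void driver_iteration() {
            if (any_pending.load(std::memory_order_acquire)) {  // pairs with the release above
                std::lock_guard<std::mutex> g(pending_lock);
                any_pending.store(false);
                for (MiniTask* t : pending_list) {
                    t->pending.store(false);
                    run_list.push_back(t);                      // now it can actually fire
                }
                pending_list.clear();
            }
            for (MiniTask* t : run_list)
                t->run();
        }
    };

In a scheme like this, the task reports scheduled == true the moment the flag
is set, but it does nothing until the owning thread notices any_pending and
drains the list; if that notice is lost (wrong memory ordering on the hint, or
a wakeup consumed before the list insert becomes visible), the task sits in
exactly the scheduled-but-pending state shown above until something else kicks
the thread.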
>>> Not sure if the following will help; gdb shows the following:
>>>
>>> (gdb) info threads
>>>   3 Thread 0x7fa422559700 (LWP 27205)  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
>>>   2 Thread 0x7fa421d58700 (LWP 27206)  0x000000392cedcee3 in select () from /lib64/libc.so.6
>>> * 1 Thread 0x7fa432cfb740 (LWP 27197)  FromDAG::process_packet (this=0x1afc570, erf_record=0x7fa423258bf0, rlen=<value optimized out>) at ../elements/local/fromdag.cc:193
>>> (gdb) thread 2
>>> [Switching to thread 2 (Thread 0x7fa421d58700 (LWP 27206))]#0  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
>>> #1  0x000000000058bc36 in Master::run_selects_poll (this=0x1afacc0, thread=0x1afb8b0, more_tasks=false) at ../lib/master.cc:862
>>> #2  0x000000000058c4ec in Master::run_selects (this=0x1afacc0, thread=0x1afb8b0) at ../lib/master.cc:1050
>>> #3  0x000000000057ddf1 in run_os (this=0x1afb8b0) at ../lib/routerthread.cc:447
>>> #4  RouterThread::driver (this=0x1afb8b0) at ../lib/routerthread.cc:568
>>> #5  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>) at click.cc:414
>>> #6  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
>>> #7  0x000000392cee4aad in clone () from /lib64/libc.so.6
>>> (gdb) thread 3
>>> [Switching to thread 3 (Thread 0x7fa422559700 (LWP 27205))]#0  0x000000392cedcee3 in select () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x000000392cedcee3 in select () from /lib64/libc.so.6
>>> #1  0x000000000058c4bf in Master::run_selects (this=0x1afacc0, thread=0x1afb7e0) at ../lib/master.cc:1015
>>> #2  0x000000000057ddf1 in run_os (this=0x1afb7e0) at ../lib/routerthread.cc:447
>>> #3  RouterThread::driver (this=0x1afb7e0) at ../lib/routerthread.cc:568
>>> #4  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>) at click.cc:414
>>> #5  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
>>> #6  0x000000392cee4aad in clone () from /lib64/libc.so.6
>>> (gdb) thread 1
>>> [Switching to thread 1 (Thread 0x7fa432cfb740 (LWP 27197))]#0  0x000000392d20ebfd in nanosleep () from /lib64/libpthread.so.0
>>> (gdb) bt
>>> #0  FromDAG::process_packet (this=0x1afc570, erf_record=0x7fa423258bf0, rlen=<value optimized out>) at ../elements/local/fromdag.cc:193
>>> #1  0x00000000004d6402 in FromDAG::run_task (this=0x1afc570) at ../elements/local/fromdag.cc:150
>>> #2  0x000000000057dbe6 in fire (this=0x1afb710) at ../include/click/task.hh:612
>>> #3  run_tasks (this=0x1afb710) at ../lib/routerthread.cc:410
>>> #4  RouterThread::driver (this=0x1afb710) at ../lib/routerthread.cc:600
>>> #5  0x0000000000558513 in main (argc=<value optimized out>, argv=<value optimized out>) at click.cc:639
>>>
>>> One thing to note: for various reasons I'm doing something very
>>> un-Click-like with FromDAG, where I allow it to block for up to 10 ms. For
>>> my specific requirements this is not a problem, but it's worth noting in
>>> case it affects the way the new task handling operates.
>>>
>>> Beyers
>>>
>>> On Sun, Feb 6, 2011 at 1:23 AM, Eddie Kohler <koh...@cs.ucla.edu> wrote:
>>>
>>> Hi all,
>>>
>>> This is just a note to say that I've merged the "coreperformance" branch
>>> with master.
>>> There are several changes that may speed up particularly simple
>>> configurations, and that (more importantly) may make it easier to
>>> experiment with different multithreading setups. (For instance, I believe
>>> switching a task from one thread to another is faster now.) Let me know if
>>> you experience any problems.
>>>
>>> Eddie

_______________________________________________
click mailing list
click@amsterdam.lcs.mit.edu
https://amsterdam.lcs.mit.edu/mailman/listinfo/click
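Since the thread keeps coming back to memory barriers and to scheduling state
that one thread updates but another never seems to notice, a minimal generic
C++11 illustration of the ordering discipline involved may help. This is
ordinary atomics usage, not Click code; the names producer, consumer, and
payload are invented for the example.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    static int payload = 0;                   // the "work" being handed over
    static std::atomic<bool> ready{false};    // the hand-off flag

    void producer() {
        payload = 42;                                   // write the work first...
        ready.store(true, std::memory_order_release);   // ...then publish the flag
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // pairs with the release above
            std::this_thread::yield();
        std::printf("payload = %d\n", payload);         // guaranteed to print 42
    }

    int main() {
        std::thread c(consumer), p(producer);
        p.join();
        c.join();
    }

With a plain non-atomic flag, or with the store and the data write reordered,
the consumer can observe the flag without the data, or keep polling a stale
value for a long time; that is the same family of problem as a task that looks
scheduled yet never leaves the pending list.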