https://issues.apache.org/bugzilla/show_bug.cgi?id=55153
--- Comment #1 from Harald Oehlmann <[email protected]> --- I asked on clt for help: https://groups.google.com/forum/#!topic/comp.lang.tcl/Ja01kv75tVk Thread name: "Async sockets hang under Rivet but work in tclsh 8.6" Start Date: 27.6.2013 Here is the essence: - CentOS 6.2 64 bit - /usr/sbin/httpd -V Server version: Apache/2.2.15 (Unix) Server built: May 13 2013 22:11:16 Server's Module Magic Number: 20051115:25 Server loaded: APR 1.3.9, APR-Util 1.3.9 Compiled using: APR 1.3.9, APR-Util 1.3.9 Architecture: 64-bit Server MPM: Prefork threaded: no forked: yes (variable process count) - Apache was configured to only have one thread (/etc/httpd/conf/httpd.conf: <IfModule prefork.c> StartServers 1 MinSpareServers 1 MaxSpareServers 20 ServerLimit 256 MaxClients 256 MaxRequestsPerChild 4000 </IfModule> ) - Test file "test_file_async.tcl" in "/var/www/html". File contents as in the original bug. - Start http: service httpd start - start strace: ps awux | grep sbin/httpd | awk '{print"-p " $2}' | xargs strace -tt -f -o straceall.txt - the resulting trace file is attached to the bug. Here are the comments by Alexandre Ferieux to the strace log from clt: OK here is what I can see: (1) open the file, and wake up the notifier (by writing a single byte to the pipe) thread so that it takes the new fd into account for its select() mask: > 29969 12:04:52.368364 open("/usr/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC, > 0666) = 21 > ... > 29969 12:04:52.368806 write(17, "\0", 1) = 1 (2) In the notifier: detect that activity on the control pipe... > 29965 12:04:52.368843 <... select resumed> ) = 1 (in [16]) > 29965 12:04:52.368889 read(16, <unfinished ...> > 29965 12:04:52.368922 <... read resumed> "\0", 1) = 1 (3) ... and normally restart a select with the new fd (21) in the write mask, but here, FAIL TO DO SO: > 29965 12:04:52.368963 select(17, [16], [], [], NULL <unfinished ...> (4) So, sleep stupidly (only monitoring the control pipe) until the timeout expires: > 29962 12:04:53.218193 <... select resumed> ) = 0 (Timeout) I ran your script in vanilla Tcl (which works, as we already know), and verified that the difference is exactly this (ignore the difference in fds): 2297 17:31:15.800767 open("/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 5 2298 17:31:15.803090 <... read resumed> "\0", 1) = 1 2298 17:31:15.803165 select(6, [3], [5], [], NULL) = 1 (out [5]) So here, somehow the communication with the notifier is incomplete: we do get the control signal through ("hey, new mask to recompute"), but the recomputation miserably fails. Next step, I guess, is to look in detail at the data structures that are shared between the caller and notifier. Perhaps the maintainer knows how the Rivet environment is different in that respect ? --- Comment 2 --- > 29965 12:04:52.368963 select(17, [16], [], [], NULL <unfinished ...> > ... > 29962 12:04:53.218193 <... select resumed> ) = 0 (Timeout) Hmmm, just noticed the different thread IDs: here we have a dance of 3 threads ! - 29969 is your main thread - 29965 is the notifier - 29962 is a mysterious other thread that does a 1-second periodic task Apart from the (remote) similarity between this 1-second period and your [after 1000], this does not really match what we see in the script. Definitely something to be explained by a Rivet-savvy person... -Alex -- You are receiving this mail because: You are the assignee for the bug. --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
