On Thursday 11 of April 2013, Arkadiusz Miśkiewicz wrote:
> Another strace including children
> http://ixion.pld-linux.org/~arekm/apache1.txt
>
> Graceful restart and first OPTIONS write at
>
> 12594 15:10:01.397356 write(7, "OPTIONS * HTTP/1.0\r\nUser-Agent:
> Apache/2.4.4 (Unix) (internal dummy connection)\r\n\r\n", 83 <unfinished
> ...>
Another observation.
When this happens (when dummy_connection() is called, my " XXX dummy 2
processing X" are printed) then in that time all children are already
zombie/defunct, so they cannot answer/do anything. The only living process is
master process.
>
> but first read is long after write
>
> 13279 15:10:52.636685 <... read resumed> "OPTIONS * HTTP/1.0\r\nUser-Agent:
> Apache/2.4.4 (Unix) (internal dummy connection)\r\n\r\n", 8000) = 83
> <0.000014>
And that's read by new generation I think
Old generation childs exit in:
else if (retained->my_generation !=
ap_scoreboard_image->global->running_generation) { /*
restart? */
/* yeah, this could be non-graceful restart, in which case the
* parent will kill us soon enough, but why bother checking?
*/
die_now = 1;
}
So looks like master process is trying to send data (in dummy_connection()) to
childrens that already exit()ed leading to whole slowdown. This is how it
looks http://ixion.pld-linux.org/~arekm/apache3.txt
Opinions?
13:38:03.383798 connect(7, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("192.168.1.121")}, 16) = -1 EINPROGRESS (Operation now in
progress) <0.000024>
13:38:03.383841 poll([{fd=7, events=POLLOUT}], 1, 3000) = 1 ([{fd=7,
revents=POLLOUT}]) <0.000006>
13:38:03.383872 getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000006>
13:38:03.383897 write(7, "OPTIONS * HTTP/1.0\r\nUser-Agent: Apache/2.4.4
(Unix) (internal dummy connection)\r\n\r\n", 83) = 83 <0.000022>
4 in SO_ERROR is a "Interrupted system call", so could support my suspicion
>
> Huh?
>
> That's even after resuming operations mesage which was a bit earlier:
>
> 12594 15:10:52.606253 write(2, "[Thu Apr 11 15:10:52.606236 2013]
> [mpm_prefork:notice] [pid 12594] AH00163: Apache/2.4.4 (Unix) configured --
> resuming normal operations\n", 137) = 137 <0.000009>
>
> Compare that to case where ab wasn't running:
> http://ixion.pld-linux.org/~arekm/apache-no-ab.txt
> write OPTIONS in main process and read in childrens are interleaving as
> expected.
>
> Now why these aren't interleaving in apache1.txt where ab was running?
>
> > As the accept call
> > on httpd side only returns when the first data is send on the socket,
> > IMHO the time the poll takes does take place in the kernel and not in
> > the httpd children code.
>
> Well, I think kernel is not the one to blame.
>
> > Have you checked your messages file if the kernel reports
> > something when this happens? How does your run queue and CPU load look
> > like when this happens (top)?
>
> No kernel messages, no cpu spikes - looks normal.
>
> > Regards
> > Rüdiger
--
Arkadiusz Miśkiewicz, arekm / maven.pl