This is good engineering work. I raise my hat to that.

Congrats,

Maurizio

 

 

From: Gustaf Neumann [mailto:neum...@wu.ac.at] 
Sent: 29 November 2012 19:51
To: naviserver-devel@lists.sourceforge.net
Subject: Re: [naviserver-devel] naviserver with connection thread queue

 

Dear all,

one more update. Here is the data of a full day with the Async-Writer thread. The
average queueing time dropped from 0.0547 s to 0.0009 s, and the standard deviation
from 0.7481 to 0.0211. This is a noticeable improvement.

        accept  queue   run      total
avg     0.0000  0.0009  0.1304   0.1313
stdev   0.0000  0.0211  0.9022   0.9027
min     0.0000  0.0000  0.0032   0.0033
max     0.0000  0.9772  28.1699  28.1700
median  0.0000  0.0001  0.0208   0.0209
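
For reference, here is a small Tcl sketch of how such a summary can be computed. It
assumes the three partial times have already been extracted from the access log into
a plain file times.txt with one "accept queue run" triple per line (the exact
logpartialtimes field layout is not shown here, so the extraction step is omitted):

    # summary statistics (avg, stdev, min, max, median) for a list of numbers
    proc stats {values} {
        set n [llength $values]
        set sum 0.0
        foreach v $values { set sum [expr {$sum + $v}] }
        set avg [expr {$sum / $n}]
        set ss 0.0
        foreach v $values { set ss [expr {$ss + ($v - $avg) * ($v - $avg)}] }
        set sorted [lsort -real $values]
        list avg $avg stdev [expr {sqrt($ss / $n)}] \
            min [lindex $sorted 0] max [lindex $sorted end] \
            median [lindex $sorted [expr {$n / 2}]]
    }

    # collect the columns; total is the sum of the three partial times
    set f [open times.txt]
    set accept {}; set queue {}; set run {}; set total {}
    while {[gets $f line] >= 0} {
        lassign $line a q r
        lappend accept $a; lappend queue $q; lappend run $r
        lappend total [expr {$a + $q + $r}]
    }
    close $f

    foreach col {accept queue run total} {
        puts "$col: [stats [set $col]]"
    }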
 

But still, the occasional large values are worth investigating. It turned out that
the large queueing times came from requests from Taipei, which produced several 404
errors. The 404 response is 727 bytes and therefore below the writersize, which was
configured as 1000. Delivering the error message to this site takes more than a
second. Funny enough, delivering the error message blocked the connection thread
longer than delivering the image would have, since the image is above the writersize
and is handed to the writer thread.

I will reduce the writersize further, but even then a slow delivery can slow down the
delivery of the headers, which still happens in the connection thread. Most likely I
won't address this now; however, I will check the effects of the fastpath cache,
which has been deactivated so far...
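
For the record, switching the fastpath cache on would look roughly like the sketch
below (assuming, as usual, that $server holds the server name in the config file);
the parameter names are the usual fastpath settings as I remember them, and the
values are just examples, so check them against the documentation:

    ns_section ns/server/${server}/fastpath
    ns_param   cache          true       ;# cache static files in memory
    ns_param   cachemaxsize   10485760   ;# total cache size in bytes (example value)
    ns_param   cachemaxentry  16384      ;# largest entry to cache, in bytes (example value)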

best regards
-gustaf neumann

On 28.11.12 11:38, Gustaf Neumann wrote:

Dear all,

here is a short update on the findings and changes since last week.

One can now activate "logpartialtimes" for nslog, which adds an entry to the access
log containing the partial request times (accept time, queueing time, and run time).
The sample site now runs with minthreads == 5 and maxthreads == 10.
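
Roughly, the corresponding configuration looks like the following sketch (again
assuming $server holds the server name); logpartialtimes is the new parameter, the
thread settings are the standard ones:

    ns_section ns/server/${server}/module/nslog
    ns_param   logpartialtimes true   ;# add accept/queue/run times to each access-log entry

    ns_section ns/server/${server}
    ns_param   minthreads 5
    ns_param   maxthreads 10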

If one analyzes one day of data from our sample site, one can see that in this case
the accept time is always 0 (this is caused by nsssl, which is a sync driver; header
parsing happens solely in the connection thread). The queue time is on average 54 ms
(milliseconds), which is huge, while the median is 0.1 ms. The reason is a maximum
queue time of 21 secs! As a consequence, the standard deviation is huge as well.
Looking more closely at the log file, one can see that at times when the queueing
time is huge, the runtime is often huge as well, leading to cascading effects as
described below. Cause and effect are hard to determine, since we saw large runtimes
even when delivering rather small files.

        accept  queue    run      total
avg     0.0000  0.0547   0.2656   0.3203
stdev   0.0000  0.7481   0.9740   1.4681
min     0.0000  0.0000   0.0134   0.0135
max     0.0000  21.9594  16.6905  29.1668
median  0.0000  0.0001   0.0329   0.0330
 

The "random" slow cases have at least two sources: slow delivery (some clients are
slow in retrieving data; especially some bots seem to be bad) and system-specific
latencies (e.g. a backup running, other servers becoming active, etc.). To address
this, I moved more file deliveries to the writer thread (by reducing writersize) and
activated TCP_DEFER_ACCEPT. One sees significant improvements; the average queueing
time is now 4.6 ms, but the standard deviation is still huge, caused by some erratic
large times.

        accept  queue   run      total
avg     0.0000  0.0046  0.1538   0.1583
stdev   0.0000  0.0934  0.6868   0.6967
min     0.0000  0.0000  0.0137   0.0138
max     0.0000  4.6041  22.4101  22.4102
median  0.0000  0.0001  0.0216   0.0217
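
As a sketch, the two changes correspond to the following driver settings (shown here
for nssock; deferaccept, writerthreads and writersize are the parameter names as I
understand them, so treat this as an illustration rather than a recipe):

    ns_section ns/server/${server}/module/nssock
    ns_param   deferaccept   true   ;# use TCP_DEFER_ACCEPT, queue the conn only when data arrives
    ns_param   writerthreads 1      ;# deliver large replies via a writer thread
    ns_param   writersize    1000   ;# hand replies above this size (bytes) to the writer thread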
 

There are still some huge values, and the queueing time was "lucky" not to be
influenced by the still-large runtimes of some requests. Aside from the top values,
one sometimes sees unexplained delays of a few seconds, which seem to block
everything.

A large source of latencies is the file system. On our production system we have
already invested some effort in tuning the file system, so we don't see effects as
large as on this sample site. Since NaviServer should also work well on untuned file
systems, I implemented an async writer thread that decouples writing to files
(access.log and error.log) from the connection threads. The interface is the same as
unix write(), but it does not block.

This has been running on the sample site since yesterday evening, and so far it seems
to reduce the random latencies significantly. I'll wait with posting data until I
have a similar time range to compare; maybe the good times I am seeing are
accidental.

I have not yet looked into the details of handling the async writer during log
rotation and shutdown.

On another front, I was experimenting with jemalloc and tcmalloc. Previously we had
the following figures, using Tcl as distributed (i.e. with zippy malloc):

  with minthreads=2, maxthreads=2
      requests 10307 spools 1280 queued 2704 connthreads 11 rss 425

When changing minthreads to 5 and maxthreads to 10, but using jemalloc, one sees the
following figures:



   requests 3933 spools 188 queued 67 connthreads 7 rss 488
   requests 6325 spools 256 queued 90 connthreads 9 rss 528
   requests 10021 spools 378 queued 114 connthreads 14 rss 530

It is interesting to see that with 5 connection threads always running and using
jemalloc, the RSS consumption is only slightly larger than with plain Tcl and zippy
malloc at maxthreads == 2, while fewer requests are queued.

Similarly, with tcmalloc and minthreads == 5, maxthreads == 10, we see

   requests 2062 spools 49 queued 3 connthreads 6 rss 376
   requests 7743 spools 429 queued 359 connthreads 11 rss 466
   requests 8389 spools 451 queued 366 connthreads 12 rss 466

which is even better. 

For more information on malloc tests see
https://next-scripting.org/xowiki/docs/misc/thread-mallocs
or the tcl-core mailing list.

That's all for now

-gustaf neumann

On 20.11.12 20:07, Gustaf Neumann wrote:

Dear all,

The idea of controlling the number of running threads via queueing latency is
interesting, but I have to look into the details before I can comment on it.

Before one can consider controlling the number of running threads via queueing
latency, one has to improve NaviServer's awareness of the various phases of a
request's lifetime.

In the experimental version, the following time stamps are now recorded:

-  acceptTime (the time the socket was accepted)
-  requestQueueTime (the time the request was queued; formerly startTime)
-  requestDequeueTime (the time the request was dequeued)

The difference between requestQueueTime and acceptTime is the setup cost and depends
on the amount of work the driver does. For instance, NaviServer's nssock performs
read-ahead, while nsssl does not and passes the connection on right away. So the
previously used startTime (which is actually the time the request was queued) was not
correct for drivers with read-ahead. In the experimental version, [ns_conn start] now
always returns the accept time.

The next paragraph uses the term endTime, which is the time when a connection thread
is done with a request (either the content was delivered, or it was handed over to a
writer thread).

The difference between requestDequeueTime and requestQueueTime is the time spent in
the queue. The difference between endTime and requestDequeueTime is the pure runtime,
and the difference between endTime and acceptTime is the totalTime. As a rough
approximation, the time between requestDequeueTime and acceptTime is mostly
influenced by the server setup, and the runtime by the application. I say
"approximation" since the runtime of other requests influences the queueing time, as
we see in the following:

Consider a server with two running connection threads receiving 6 requests, where
requests 2-5 arrive within a very short time. The first three requests are directly
assigned to connection threads and have fromQueue == 0. These have queueing times
between 88 and 110 microseconds, which includes signal sending/receiving, the thread
change, and the initial setup in the connection thread. The runtimes for these
requests are pretty bad, in the range of 0.24 to 3.8 seconds elapsed time.

[1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 accept 0.000000 queue 0.000110 run 0.637781 total 0.637891
[2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 accept 0.000000 queue 0.000090 run 0.245030 total 0.245120
[3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 accept 0.000000 queue 0.000088 run 0.432421 total 0.432509
[4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 accept 0.000000 queue 0.244246 run 0.249208 total 0.493454
[5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 accept 0.000000 queue 0.431545 run 3.713331 total 4.144876
[6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 accept 0.000000 queue 0.480382 run 3.799818 total 4.280200
 

Requests [4, 5, 6] were queued and have queueing times between 0.2 and 0.5 seconds.
Their queueing times are pretty much the runtimes of [2, 3, 4]; the runtime of
earlier requests determines the queueing time of later ones. For example, the
totalTime of request [4] was 0.493454 secs, half of which was spent waiting in the
queue. Request [4] can consider itself lucky that it was not scheduled after [5] or
[6], where its totalTime would likely have been in the range of 4 secs (10 times
slower). Low waiting times are essential for good performance.
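
To make the relation between the partial times concrete, here is a tiny snippet that
recomputes the totalTime of request [4] from its logged parts:

    # totalTime = accept + queue + run, all in seconds (values from line [4] above)
    set accept 0.000000
    set queue  0.244246
    set run    0.249208
    puts [format %.6f [expr {$accept + $queue + $run}]]   ;# prints 0.493454, the logged total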

This example shows pretty well the importance of async delivery mechanisms like the
writer thread or bgdelivery in OpenACS. A file being delivered by the connection
thread over a slow internet connection can block later requests substantially (as in
the cases above). This is even more important for today's web sites, where a single
view might entail 60+ embedded requests for js, css, images, etc., and where it is
not feasible to define hundreds of connection threads.

Before going further into detail, I'll add further introspection mechanisms to the
experimental version:

 - [ns_server stats] ... adding the total waiting time
 - [ns_conn queuetime] ... time spent in the queue
 - [ns_conn dequeue] ... time stamp when the request actually starts to run (similar
   to [ns_conn start])

The first can be used for server monitoring, the other two for introspection of a
single connection. The queuetime can be useful for better awareness and for optional
output in the access log, and the dequeue time stamp for application-level profiling,
as the base for a difference against the current time.
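
A hypothetical usage sketch for the two connection-level commands; since they are
still experimental, I assume here that both return floating-point seconds, which may
not match the final implementation:

    # how long did this request sit in the queue?
    set queued [ns_conn queuetime]

    # application-level profiling: elapsed time since the connection thread took over
    set elapsed [expr {[clock microseconds] / 1000000.0 - [ns_conn dequeue]}]

    ns_log notice "queued ${queued}s, running for ${elapsed}s so far"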

Further wishes, suggestions, comments?

-gustaf neumann

 
