Dear all,
here is a short update of the findings and changes since last week.
One can now activate for nslog "logpartialtimes", which adds an entry
to the access log containing the partial request times (accept time,
queuing time, and run time). The sample site runs now with minthreads
== 5, maxthreads 10.
If one analyzes the data of one day of our sample site, one can see
that in this case, the accept time is always 0 (this is caused by
nsssl, which is a sync driver, header parsing happens solely in the
connection thread), the queue time is on average 54 ms
(milliseconds), which is huge, while the median is 0.1 ms. The reason
is that we can see a maximum queue time of 21 secs! As a consequence,
the standard deviation is as well huge. When one looks more closely
to the log file, one can see that at times, when the queuing time is
huge, as well the runtime is often huge, leading to cascading effects
as described below. cause and reason are hard to determine, since we
saw large runtimes even on delivery of rather small times
accept queue run total
avg 0,0000 0,0547 0,2656 0,3203
stdev 0,0000 0,7481 0,9740 1,4681
min 0,0000 0,0000 0,0134 0,0135
max 0,0000 21,9594 16,6905 29,1668
median 0,0000 0,0001 0,0329 0,0330
The causes for the "random" slow cases are at least from two sources:
slow delivery (some clients are slow in retrieving data, especially
some bots seems to be bad) and system specific latencies (e.g. backup
running, other servers becoming active, etc.). To address this
problem, i moved more file deliveries to the writer thread (reducing
writersize) and activated TCP_DEFER_ACCEPT. One sees significant
improvements, the average queuing time is 4.6 ms, but still the
standard deviation is huge, caused by some "erratic" some large times.
accept queue run total
avg 0,0000 0,0046 0,1538 0,1583
stdev 0,0000 0,0934 0,6868 0,6967
min 0,0000 0,0000 0,0137 0,0138
max 0,0000 4,6041 22,4101 22,4102
median 0,0000 0,0001 0,0216 0,0217
There are still some huge values, and the queuing time was "lucky"
not to be influenced by the still large runtimes of requests. Aside
of the top values, one sees sometimes unexplained delays of a few
seconds. which seem to block everything.
A large source for latencies is the file system. On our production
system we already invested some efforts to tune the file system in
the past, such we don't see effects as large as on this sample site.
Since NaviServer should work well also on non-tuned file systems, i
implemented an async writer thread, that decouples writing to files
(access.log and error.log) from the connection threads. The interface
is the same as unix write(), but does not block.
I have this running since yesterday evening on the sample site, and
do far, it seems to help to reduce the random latencies
significantly. i'll wait with posting data until i have a similar
time range to compare, maybe the good times i am seeing are accidentally.
I have not yet looked into the details for handling the async writer
withing logrotate and shutdown.
On another front, i was experimenting with jemalloc and tcmalloc.
Previously we had the following figures, using tcl as distributed
with zippy malloc
with minthreads=2, maxthreads=2
requests 10307 spools 1280 queued 2704 connthreads 11 rss 425
When changing minthreads to 5, maxthreads 10, but using jemalloc,
one sees the following figures
requests 3933 spools 188 queued 67 connthreads 7 rss 488
requests 6325 spools 256 queued 90 connthreads 9 rss 528
requests 10021 spools 378 queued 114 connthreads 14 rss 530
It is interesting to see, that with always 5 connections threads
running and using jemalloc, we see a rss consumption only slightly
larger than with plain tcl and zippy malloc having maxthreads == 2,
having less requests queued.
Similarly, with tcmalloc we see with minthreads to 5, maxthreads 10
requests 2062 spools 49 queued 3 connthreads 6 rss 376
requests 7743 spools 429 queued 359 connthreads 11 rss 466
requests 8389 spools 451 queued 366 connthreads 12 rss 466
which is even better.
For more information on malloc tests see
https://next-scripting.org/xowiki/docs/misc/thread-mallocs
or the tcl-core mailing list.
That's all for now
-gustaf neumann
Am 20.11.12 20:07, schrieb Gustaf Neumann:
Dear all,
The idea of controlling the number of running threads via queuing
latency is interesting, but i have to look into the details before
i can comment on this.
before one can consider controlling the number of running threads
via queuing latency, one has to improve the awareness in NaviServer
about the various phases in the requests lifetime.
In the experimental version, we have now the following time stamps
recorded:
- acceptTime (the time, a socket was accepted)
- requestQueueTime (the time the request was queued; was startTime)
- requestDequeueTime (the time the request was dequeued)
The difference between requestQueueTime and acceptTime is the setup
cost and depends on the amount of work, the driver does. For
instance, nssock of naviserver performs read-ahead, while nsssl does
not and passes connection right away. So, the previously used
startTime (which is acctually the time the request was queued) was
for drivers with read ahead not correct. In the experimental
version, [ns_conn start] returns now always the accept time.
The next paragraph uses the term endTime, which is the time, when a
connection thread is done with a request (either the content was
delivered, or the content was handed over to a writer thread).
The difference between requestDequeueTime and requestQueueTime is
the time spent in the queue. The difference between endTime and
requestDequeueTime is the pure runtime, the difference between
endTime and acceptTime is the totalTime. As a rough approximation
the time between requestDequeueTime and acceptTime is pretty much
influenced by the server setup, and the runtime by the application.
I used the term "approximation" since the runtime of certain other
requests influences the queuing time, as we see in the following:
Consider a server with two running connection threads receiving 6
requests, where requests 2-5 are received in a very short time. The
first three requests are directly assign to connection threads, have
fromQueue == 0. These have queuing times between 88 and 110 micro
seconds, which includes signal sending/receiving, thread change, and
initial setup in the connection thread. The runtimes for these
requests are pretty bad, in the range of 0.24 to 3.8 seconds elapsed
time.
[1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 accept 0.000000 queue
0.000110 run 0.637781 total 0.637891
[2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 accept 0.000000 queue
0.000090 run 0.245030 total 0.245120
[3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 accept 0.000000 queue
0.000088 run 0.432421 total 0.432509
[4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 accept 0.000000 queue
0.244246 run 0.249208 total 0.493454
[5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 accept 0.000000 queue
0.431545 run 3.713331 total 4.144876
[6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 accept 0.000000 queue
0.480382 run 3.799818 total 4.280200
Requests [4, 5, 6] are queued, and have queuing times between 0.2
and 0.5 seconds. The queuing times are pretty much the runtimes of
[2, 3, 4], therefore the runtime determines the queuing time. For
example, the totalTime of request [4] was 0.493454 secs, half of the
time it was waiting in the queue. Request [4] can consider itself
happy that it was not scheduled after [5] or [6], where its
totalTime would have been likely in the range of 4 secs (10 times
slower). Low waiting times are essential for good performance.
This example shows pretty well the importance of aync delivery
mechanisms like the writer thread or bgdelivery in OpenACS. A file
being delivered by the connection thread over a slow internet
connection might block later requests substantially (as in the cases
above). This is even more important for todays web-sites, where a
single view might entail 60+ embedded requests for js, css, images,
.... where it is not feasible to defined hundreds of connection threads.
Before going further into detail i'll provide further introspection
mechanism to the experimental version.
- [ns_server stats] adding total waiting time
- [ns_conn queuetime] ... time spent in queue
- [ns_conn dequeue] ... time stamp when the req starts to actually
run (similar to [ns_conn start])
The first can be used for server monitoring, the next two for single
connection introspection. The queuetime can be useful for better
awareness and for optional output in the access log, and the dequeue
time-stamp for application level profiling as base for a difference
with current time.
Further wishes, suggestions, comments?
-gustaf neumann