On 29.11.12 19:51, Gustaf Neumann wrote:
> However, I am still in the process of cleaning up and addressing
> some strange interactions (e.g. for nsssl, some socket-closing
> interactions between driver and connection threads seem too
> complex to me), so I will still be busy with that for a while.

The problem behind the bad interaction was that the driver poll reported that there should be data to read from the socket, while the nsssl receive said there is none, and vice versa. Since nsssl was defined as sync, the socket processing happens in the connection threads via NsGetRequest(), which frequently returned empty, meaning a useless wakeup, ConnRun, etc. in the connection thread. The reason for this was that nsssl sets the socket to non-blocking without defining the driver as ASYNC, so it was possible that a receive returned 0 bytes. Simply setting the driver to ASYNC did not help either, since the driver used its own event loop via Ns_SockTimedWait(), which interacts badly with the driver's poll loop (on Linux it did not notice that a socket becomes readable, while this worked on Mac OS X).

So I changed the receive operation to simply report the read state back to the main driver, which now apparently works OK (tested on Mac OS X and Linux). To be on the safe side, I have tagged the previous version of nsssl on bitbucket and incremented the version number. This change also led to a simplification in the main driver.
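To illustrate the changed receive operation, here is a minimal sketch (this is not the actual nsssl code; the function name and the state values are made up): the receive performs a single non-blocking read and merely reports its state back to the driver, leaving all waiting to the driver's poll loop.

    /*
     * Illustrative sketch only: a non-blocking receive that reports the
     * read state back to the main driver instead of waiting in its own
     * event loop (names and state values are invented for this example).
     */
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    typedef enum { RECV_DATA, RECV_AGAIN, RECV_CLOSED, RECV_ERROR } RecvState;

    static RecvState
    DriverRecv(int sock, char *buf, size_t bufSize, ssize_t *nReadPtr)
    {
        ssize_t n = recv(sock, buf, bufSize, 0);

        *nReadPtr = (n > 0) ? n : 0;
        if (n > 0) {
            return RECV_DATA;      /* data was read; the caller can parse it */
        }
        if (n == 0) {
            return RECV_CLOSED;    /* peer closed the connection */
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return RECV_AGAIN;     /* no data yet; let the driver's poll() wait */
        }
        return RECV_ERROR;         /* a real error occurred */
    }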

The new version has been running since Saturday on next-scripting.org and shows even better results. The average queue time is again lower by a factor of 4, and the standard deviation is much better as well.

        accept  queue   run     total      (all times in seconds)
avg     0,0599  0,0002  0,0183  0,0196
stdev   0,3382  0,0039  0,0413  0,0530
min     0,0000  0,0001  0,0011  0,0012
max     12,1002 0,2324  0,8958  2,0112
median  0,0088  0,0001  0,0118  0,0120

The accept times now show up since nsssl is async, but they are often not meaningful for performance (due to the closewait handling, the accept might stem from a previous accept operation). All performance measurements now happen relative to the queue time (similar to NaviServer in the main tip).

Note that this data is not based on a synthetic test but on a real-world web site with potentially different traffic patterns from day to day, so don't draw too many conclusions from it. By moving from sync to async, some potentially time-consuming processing (waiting for more data) is moved from the connection thread to the driver's event loop, so the time span in the connection thread is reduced (for nsssl in the same way as it was before for nssock). Therefore the "runtime" has less randomness and is shorter, leading to the improvements.

The behavior of the async log writer is now configurable (turned off by default); log rolling and shutdown should now work reliably with it as well.

All changes are on bitbucket (nsssl and naviserver-connthreadqueue). The server now also measures the filter time separately from the runtime, but I am still in the process of collecting data on that...

-gustaf neumann


One more update. Here is the data of a full day with the async writer thread. The average queueing time dropped from 0,0547 to 0,0009, and the standard deviation from 0,7481 to 0,0211. This is a noticeable improvement.
        accept  queue   run     total
avg     0,0000  0,0009  0,1304  0,1313
stdev   0,0000  0,0211  0,9022  0,9027
min     0,0000  0,0000  0,0032  0,0033
max     0,0000  0,9772  28,1699 28,1700
median  0,0000  0,0001  0,0208  0,0209

But still, the sometimes large values are worth investigating. It turned out that the large queueing times came from requests from Taipei, which contained several 404 errors. The size of the 404 response is 727 bytes, and therefore under the writersize, which was configured as 1000. The delivery of an error message to this site takes more than a second. Funnily enough, the delivery of the error message blocked the connection thread longer than the delivery of an image when it is above the writersize.
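For clarity, the writersize acts as a simple size threshold; here is a tiny hypothetical sketch (not the actual server code) of the decision:

    /*
     * Hypothetical sketch of the writersize decision (not the actual
     * server code): only responses at or above the configured threshold
     * are handed to the writer thread; smaller ones, such as the
     * 727-byte 404 page mentioned above, stay in the connection thread
     * and can block it on a slow client.
     */
    #include <stdbool.h>
    #include <stddef.h>

    static bool
    UseWriterThread(size_t contentLength, size_t writersize)
    {
        return contentLength >= writersize;   /* e.g. 727 >= 1000 is false */
    }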

I will reduce the writersize further, but even so, a slow delivery can still slow down the delivery of the headers, which happens in the connection thread. Most likely, I won't address this now. However, I will check the effects of the fastpath cache, which has been deactivated so far...

best regards
-gustaf neumann

On 28.11.12 11:38, Gustaf Neumann wrote:
Dear all,

here is a short update on the findings and changes since last week.

One can now activate "logpartialtimes" for nslog, which adds an entry to the access log containing the partial request times (accept time, queuing time, and run time). The sample site now runs with minthreads == 5 and maxthreads == 10.

If one analyzes the data of one day of our sample site, one can see that in this case the accept time is always 0 (this is caused by nsssl, which is a sync driver; header parsing happens solely in the connection thread). The queue time is on average 54 ms (milliseconds), which is huge, while the median is 0.1 ms. The reason is that we see a maximum queue time of 21 secs! As a consequence, the standard deviation is huge as well. When one looks more closely at the log file, one can see that at times when the queuing time is huge, the runtime is often huge as well, leading to cascading effects as described below. Cause and effect are hard to determine, since we saw large runtimes even on delivery of rather small content.
        accept  queue   run     total
avg     0,0000  0,0547  0,2656  0,3203
stdev   0,0000  0,7481  0,9740  1,4681
min     0,0000  0,0000  0,0134  0,0135
max     0,0000  21,9594 16,6905 29,1668
median  0,0000  0,0001  0,0329  0,0330

The "random" slow cases have at least two sources: slow delivery (some clients are slow in retrieving data; especially some bots seem to be bad) and system-specific latencies (e.g. a backup running, other servers becoming active, etc.). To address this problem, I moved more file deliveries to the writer thread (by reducing writersize) and activated TCP_DEFER_ACCEPT (a minimal sketch of setting this socket option follows below). One sees significant improvements: the average queuing time is 4.6 ms, but the standard deviation is still huge, caused by some "erratic" large times.
        accept  queue   run     total
avg     0,0000  0,0046  0,1538  0,1583
stdev   0,0000  0,0934  0,6868  0,6967
min     0,0000  0,0000  0,0137  0,0138
max     0,0000  4,6041  22,4101 22,4102
median  0,0000  0,0001  0,0216  0,0217

There are still some huge values, and the queuing time was "lucky" not to be influenced by the still large runtimes of requests. Aside from the top values, one sometimes sees unexplained delays of a few seconds, which seem to block everything.
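As mentioned above, one of the measures was activating TCP_DEFER_ACCEPT. Here is a minimal sketch of how this socket option is typically set on a Linux listen socket; the function name is made up for illustration, and the actual driver code differs.

    /*
     * Illustrative sketch: enable TCP_DEFER_ACCEPT on a listen socket.
     * With this option the kernel hands the connection to the server
     * only once request data has arrived, avoiding wakeups for idle
     * connections.  Available on Linux; a no-op elsewhere.
     */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int
    SetDeferAccept(int listenSock, int timeoutSecs)
    {
    #ifdef TCP_DEFER_ACCEPT
        return setsockopt(listenSock, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                          &timeoutSecs, sizeof(timeoutSecs));
    #else
        (void)listenSock;
        (void)timeoutSecs;
        return 0;
    #endif
    }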

A large source of latencies is the file system. On our production system we have already invested some effort in tuning the file system, so we don't see effects as large as on this sample site. Since NaviServer should also work well on non-tuned file systems, I implemented an async writer thread that decouples writing to files (access.log and error.log) from the connection threads. The interface is the same as unix write(), but it does not block.
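To make the idea concrete, here is a minimal sketch of such a decoupling (this is not the actual NaviServer implementation; the names and the data structure are made up): callers enqueue a copy of the buffer and return immediately, while a dedicated writer thread drains the queue and performs the potentially blocking write().

    /*
     * Illustrative sketch of an asynchronous log writer (not the actual
     * NaviServer code).  AsyncWrite() has the same calling convention as
     * write(), but never blocks the caller; the blocking write() happens
     * only in WriterThread().
     */
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    typedef struct LogEntry {
        struct LogEntry *next;
        int              fd;
        size_t           len;
        char             data[1];                 /* buffer grows with the allocation */
    } LogEntry;

    static LogEntry       *queueHead = NULL, *queueTail = NULL;
    static pthread_mutex_t queueLock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queueCond = PTHREAD_COND_INITIALIZER;

    ssize_t
    AsyncWrite(int fd, const void *buf, size_t len)
    {
        LogEntry *entry = malloc(sizeof(LogEntry) + len);

        if (entry == NULL) {
            return -1;
        }
        entry->next = NULL;
        entry->fd   = fd;
        entry->len  = len;
        memcpy(entry->data, buf, len);

        pthread_mutex_lock(&queueLock);
        if (queueTail != NULL) {
            queueTail->next = entry;
        } else {
            queueHead = entry;
        }
        queueTail = entry;
        pthread_cond_signal(&queueCond);
        pthread_mutex_unlock(&queueLock);
        return (ssize_t)len;                      /* caller returns immediately */
    }

    void *
    WriterThread(void *arg)
    {
        (void)arg;
        for (;;) {
            LogEntry *entry;

            pthread_mutex_lock(&queueLock);
            while (queueHead == NULL) {
                pthread_cond_wait(&queueCond, &queueLock);
            }
            entry = queueHead;
            queueHead = entry->next;
            if (queueHead == NULL) {
                queueTail = NULL;
            }
            pthread_mutex_unlock(&queueLock);

            (void)write(entry->fd, entry->data, entry->len);   /* may block, but only here */
            free(entry);
        }
        return NULL;
    }

A real implementation also has to coordinate with log rotation and shutdown (e.g. drain the queue before closing the file descriptor).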

This has been running since yesterday evening on the sample site, and so far it seems to reduce the random latencies significantly. I'll wait with posting data until I have a similar time range to compare; maybe the good times I am seeing are accidental.

I have not yet looked into the details of handling the async writer within log rotation and shutdown.

On another front, I was experimenting with jemalloc and tcmalloc.
Previously we had the following figures, using Tcl as distributed (with zippy malloc):

  with minthreads=2, maxthreads=2
      requests 10307 spools 1280 queued 2704 connthreads 11 rss 425

When changing minthreads to 5 and maxthreads to 10, but using jemalloc,
one sees the following figures:
    requests 3933 spools 188 queued 67 connthreads 7 rss 488
    requests 6325 spools 256 queued 90 connthreads 9 rss 528
    requests 10021 spools 378 queued 114 connthreads 14 rss 530
It is interesting to see that with always 5 connection threads running and using jemalloc, the RSS consumption is only slightly larger than with plain Tcl and zippy malloc with maxthreads == 2, while fewer requests are queued.

Similarly, with tcmalloc and minthreads == 5, maxthreads == 10 we see
    requests 2062 spools 49 queued 3 connthreads 6 rss 376
    requests 7743 spools 429 queued 359 connthreads 11 rss 466
    requests 8389 spools 451 queued 366 connthreads 12 rss 466
which is even better.

For more information on malloc tests see
https://next-scripting.org/xowiki/docs/misc/thread-mallocs
or the tcl-core mailing list.

That's all for now

-gustaf neumann

On 20.11.12 20:07, Gustaf Neumann wrote:
Dear all,

The idea of controlling the number of running threads via queuing latency is interesting, but I have to look into the details before I can comment on it. Before one can consider this, one has to improve the awareness in NaviServer of the various phases in a request's lifetime.

In the experimental version, the following time stamps are now recorded:

-  acceptTime (the time a socket was accepted)
-  requestQueueTime (the time the request was queued; previously startTime)
-  requestDequeueTime (the time the request was dequeued)

The difference between requestQueueTime and acceptTime is the setup cost and depends on the amount of work the driver does. For instance, nssock of NaviServer performs read-ahead, while nsssl does not and passes the connection on right away. So the previously used startTime (which is actually the time the request was queued) was not correct for drivers with read-ahead. In the experimental version, [ns_conn start] now always returns the accept time.

The next paragraph uses the term endTime, which is the time when a connection thread is done with a request (either the content was delivered, or it was handed over to a writer thread).

The difference between requestDequeueTime and requestQueueTime is the time spent in the queue. The difference between endTime and requestDequeueTime is the pure runtime, and the difference between endTime and acceptTime is the totalTime. As a rough approximation, the time between requestDequeueTime and acceptTime is mostly influenced by the server setup, and the runtime by the application. I use the term "approximation" since the runtime of other requests influences the queuing time. The small sketch below summarizes how these times derive from the time stamps; the example following it shows the effect in practice:
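(This is only an illustrative sketch; the struct and the DiffSecs() helper are made up, while the time-stamp names follow the description above.)

    /*
     * Illustrative sketch of how the partial times derive from the four
     * recorded time stamps (the struct and helper are invented for this
     * example; only the time-stamp names follow the text above).
     */
    #include <sys/time.h>

    typedef struct {
        struct timeval acceptTime;          /* socket was accepted */
        struct timeval requestQueueTime;    /* request was queued */
        struct timeval requestDequeueTime;  /* request was dequeued */
        struct timeval endTime;             /* connection thread done with the request */
    } ConnTimes;

    static double
    DiffSecs(struct timeval from, struct timeval to)
    {
        return (double)(to.tv_sec - from.tv_sec)
               + (double)(to.tv_usec - from.tv_usec) / 1e6;
    }

    static void
    PartialTimes(const ConnTimes *t, double *acceptPtr, double *queuePtr,
                 double *runPtr, double *totalPtr)
    {
        *acceptPtr = DiffSecs(t->acceptTime, t->requestQueueTime);         /* driver setup cost */
        *queuePtr  = DiffSecs(t->requestQueueTime, t->requestDequeueTime); /* time in the queue */
        *runPtr    = DiffSecs(t->requestDequeueTime, t->endTime);          /* pure runtime */
        *totalPtr  = DiffSecs(t->acceptTime, t->endTime);                  /* totalTime */
    }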

Consider a server with two running connection threads receiving 6 requests, where requests 2-5 arrive within a very short time. The first three requests are directly assigned to connection threads and have fromQueue == 0. These have queuing times between 88 and 110 microseconds, which include signal sending/receiving, the thread change, and the initial setup in the connection thread. The runtimes for these requests are pretty bad, in the range of 0.24 to 3.8 seconds elapsed time.
[1] waiting 0 current 2 idle 1 ncons 999 fromQueue 0 accept 0.000000 queue 0.000110 run 0.637781 total 0.637891
[2] waiting 3 current 2 idle 0 ncons 998 fromQueue 0 accept 0.000000 queue 0.000090 run 0.245030 total 0.245120
[3] waiting 2 current 2 idle 0 ncons 987 fromQueue 0 accept 0.000000 queue 0.000088 run 0.432421 total 0.432509
[4] waiting 1 current 2 idle 0 ncons 997 fromQueue 1 accept 0.000000 queue 0.244246 run 0.249208 total 0.493454
[5] waiting 0 current 2 idle 0 ncons 986 fromQueue 1 accept 0.000000 queue 0.431545 run 3.713331 total 4.144876
[6] waiting 0 current 2 idle 0 ncons 996 fromQueue 1 accept 0.000000 queue 0.480382 run 3.799818 total 4.280200

Requests [4, 5, 6] are queued and have queuing times between 0.2 and 0.5 seconds. These queuing times are pretty much the runtimes of [2, 3, 4]; therefore the runtime determines the queuing time. For example, the totalTime of request [4] was 0.493454 secs, half of which it spent waiting in the queue. Request [4] can consider itself lucky that it was not scheduled after [5] or [6], where its totalTime would likely have been in the range of 4 secs (10 times slower). Low waiting times are essential for good performance.

This example shows pretty well the importance of async delivery mechanisms like the writer thread or bgdelivery in OpenACS. A file being delivered by the connection thread over a slow internet connection might block later requests substantially (as in the cases above). This is even more important for today's web sites, where a single view might entail 60+ embedded requests for js, css, images, etc., and where it is not feasible to define hundreds of connection threads.

Before going further into detail, I'll add further introspection mechanisms to the experimental version:

 - [ns_server stats] ... adding the total waiting time
 - [ns_conn queuetime] ... time spent in the queue
 - [ns_conn dequeue] ... time stamp when the request actually starts to run (similar to [ns_conn start])

The first can be used for server monitoring, the next two for single-connection introspection. The queuetime can be useful for better awareness and for optional output in the access log, and the dequeue time stamp for application-level profiling as the base for a difference with the current time.

Further wishes, suggestions, comments?

-gustaf neumann

