On 5/08/2013 11:14 p.m., babajaga wrote:
> Sorry, Amos, not to waste too much time here for an off-topic issue, but
> interesting matter anyways:

Okay. I am running out of time and this is slightly old info I'm basing all this on, so shall we finish up? Measurements and testing are really what is required to go further and demonstrate anything.


Disclaimer: some of what I "know" and say below may be complete FUD with modern disks. I have not done any testing since 10-20GB was a widely available storage device size and SSD layers on drives had not even been invented. Shop-talk with people doing testing more recently tells me that the basics are probably still completely valid, even if the tricks added to solve the problems are changing rapidly. The key take-away should be that Squid's disk I/O pattern for small objects blows most of those new tricks into uselessness.

> I ACK your remarks regarding disk controller activity. But, AFAIK, squid
> does NOT directly access the disk controller for raw disk I/O, the FS is
> always in-between instead. And that means that a (lot of) buffering can
> occur before real disk I/O is done.

This depends on two factors:
1) There is RAM available for the buffering required.
-> The higher the traffic load, the less memory is available to the system for this.

2) The OS has a chance of buffering in advance (read-ahead).
-> Objects up to 64KB (often 4KB or 8KB) can be completely loaded into Squid I/O buffers in a single read(), and there is no way for the OS to identify which of the surrounding sectors/blocks hold objects related to the one just loaded (if it guesses and gets it wrong, things go even worse than not guessing at all).
-> Also, remember AUFS is preferred for large (over-32KB) objects - the ones which will require multiple read()s - and Rock is best for small (under-32KB) objects. This OS buffering prediction is a significant part of the reason why.
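
To make that size split concrete, a minimal cache_dir sketch along those lines could look like the following (paths and sizes are invented for illustration; tune them to your own hardware):

  # small objects (up to 32KB) go to the rock store
  cache_dir rock /var/spool/squid/rock 8000 max-size=32768
  # everything larger goes to an AUFS store
  cache_dir aufs /var/spool/squid/aufs 32000 16 256 min-size=32768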

> Which might even lead to spurious high
> response times, when all of a sudden the OS decides to really flush large
> disk buffers to disk.

Note that this will result in a bursty disk I/O traffic pattern, with waves of alternating high and low disk access speeds. The aim with high performance is to flatten the low-speed troughs out as much as possible, raising them up towards a constant peak rate of I/O.
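
If you want to watch that pattern for yourself, the standard Linux tools are enough (assuming a Linux box; the device name below is only an example):

  # extended per-device statistics, refreshed every second
  iostat -x 1
  # or trace the individual requests hitting the cache disk (assumed here to be /dev/sdb)
  blktrace -d /dev/sdb -o - | blkparse -i -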

> In a good file system (or disk controller,
> downstream), request-reordering should happen, to allow elevator-style head
> movements. Or merging file accesses, referencing the same disk blocks.

Exactly. And this is where Squid being partially *network* I/O event driven comes into play, affecting the disk I/O pattern. Squid is managing N concurrent connections, each of which is potentially servicing a distinct *unique* client file fetch (well, mostly; when collapsed forwarding is ready for Squid-3 it will be unique). Every I/O loop Squid cycles through all N in order and schedules a cross-sectional slice for any which need a disk read/write. So each I/O cycle Squid delivers at most one read (HIT/MISS send to client) and one write (MISS received from server) for any given file, with up to N possibly vastly separate files on disk being accessed.

The logic doing that elevator calculation is therefore *not* faced with a single set of file operations in one area, but with a cross-sectional read/write over potentially the entire disk. At most it can reorder those into an elevator up/down sweep across the disk. But in passing those completion events back to Squid it triggers another I/O cycle for Squid over the network sockets, and thus another sweep over the entire disk space. Worst-case (and best) the spindle heads are sweeping the platter from end to end, reading everything needed 1-cycle:1-sweep.

That is with _one_ cache_dir sitting on the spindle.

Now if you pay close attention to the elevator sweep, there is a lot of time spent scanning between areas of the disk and not so much doing I/O. To optimize around this effect and allow even more concurrent file reads, Squid load-balances between cache_dirs when it places files. AFAIK the theory is that one head can be seeking while another is doing its I/O, for the overall effect of a more steady flow of bytes back to Squid after the FS software abstraction layer, raising those troughs again to a smooth flow. That said, "theory" is not practice. Place both cache_dirs on the one disk and the FS logic will of course reorder and interleave the I/O for each cache_dir such that the disk behaviour is a single sweep, as for one cache_dir. BUT, as a result, the seek lag and bursty nature of the read() bytes returned is fully exposed to Squid - by the very mechanisms supposedly minimizing it. In turn this reflects in the network I/O, as bytes are relayed directly there by Squid and TCP gets a bursty peak/trough pattern appearing.

Additionally, and probably more importantly, that reordering of 2 cache_dirs on one disk spindle down to the behaviour of 1 cache_dir caps the I/O limit for *both* of those cache_dirs at the single disk's I/O threshold (after optimization). Whereas having them on separate spindles would allow each to have that full capacity and effectively double the disk I/O threshold (after optimization).
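
So the shape to aim for is one cache_dir per physical spindle, roughly like this (mount points invented for illustration; Squid balances placement between them, and store_dir_select_algorithm picks the policy):

  # each cache_dir on its own physical disk
  cache_dir aufs /cache1/squid 32000 16 256
  cache_dir aufs /cache2/squid 32000 16 256
  # least-load is the default placement policy; round-robin is the alternative
  store_dir_select_algorithm least-load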

Why we say Rock can share with UFS/AUFS/diskd is that the I/O block size being requested is larger, so there are fewer disk-sweep movements even if many files/blocks are being loaded concurrently. Loading a few hundred objects in one Rock block of I/O, most of which will then get memory-HIT speeds, is just as efficient as loading _one_ more file out of the UFS/AUFS/diskd cache_dir.

> And all this should happen after Squid's activities are completed, but before
> the real disk driver/controller starts its work.
> BTW, I did some private measurements, not regarding response times because
> of various types of cache_dirs, but regarding response times/disk throughput
> because of various FS and options thereof. And found that a "crippled" ext4
> works best for me. Default journaling etc. in ext4 has a definite hit on
> disk I/O. Giving up some safety features has a drastic positive influence.
> Should be valid for all types of cache_dirs, though.

I hazard a guess that if you go through them, those "some" will all be features which involve doing some form of additional read/write to the disk for each chunk of written bytes. Yes? Things such as file access timestamping, journal recording, checksum writing, checksum validation post-write, dedup block registration, RAID protection, etc.

The logic behind that guess:
As mentioned above, the I/O presented by Squid will already be sliced across the network I/O streams and just needs reordering for the "elevator sweep" of quite a large number of base operations. Adding a second sweep to perform all the follow-up operations, OR causing the elevator to jump slightly forward/back to do them mid-sweep (I *hope* no disks do this anymore), will only harm the presented I/O sweep and delay the point at which its completion can be notified to Squid. Worst-case it halves the I/O limit the disk can provide to the FS layer, let alone Squid. I imagine that worst case is rare, but some "drastic" amount of difference is fully expected.
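
For the record, the usual ext4 "crippling" along those lines looks roughly like the fstab entry below. The device, mount point and exact option mix are just an example of reducing per-write metadata traffic; every one of these trades away some crash safety:

  # hypothetical cache partition; noatime/nodiratime drop access-time writes,
  # data=writeback relaxes data journaling, barrier=0 drops write barriers
  /dev/sdb1  /cache1  ext4  noatime,nodiratime,data=writeback,barrier=0  0 0

  # going further: remove the journal entirely (run on the unmounted filesystem)
  tune2fs -O ^has_journal /dev/sdb1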


Amos
