Hello Andres,

there would be as much as was written since the last sleep, about 100
ms ago, which is not likely to be gigabytes?

In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot.

I think that "being behind" is not a problem as such; it is really the way the scheduler has been designed and works, keeping pace with time & WAL progress through little bursts of writes. If you reduce the sleep time a lot you end up with writes interleaved with small sleeps, but this would be bad for performance as the OS would lose the ability to write much data sequentially to the disk.

It does not mean that the default 100 ms is a good figure, but the "being behind" is a feature, not an issue as such.
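
For context, the pacing pattern I am describing is roughly the following; this is a simplified sketch with made-up names (on_schedule, checkpoint_write_loop, ...), not the actual checkpointer code:

#include <stdbool.h>
#include <unistd.h>

/* Made-up state: how many buffers this checkpoint must write. */
typedef struct PacingState { int written; int to_write; } PacingState;

static bool
on_schedule(const PacingState *st, double target_progress)
{
    /* Are we at least as far along as the time/WAL target says? */
    return (double) st->written / st->to_write >= target_progress;
}

static void
checkpoint_write_loop(PacingState *st,
                      double (*current_target)(void),
                      void (*write_one_buffer)(PacingState *))
{
    while (st->written < st->to_write)
    {
        /* Burst: write until we have overtaken the schedule. */
        while (st->written < st->to_write &&
               !on_schedule(st, current_target()))
            write_one_buffer(st);

        if (st->written >= st->to_write)
            break;

        /* Ahead of schedule: sleep, then catch up with another burst. */
        usleep(100 * 1000);     /* 100 ms */
    }
}

The point is that the sleeps separate reasonably large groups of sorted writes, rather than being interleaved with individual ones.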

And even so, it's pretty easy to get into checkpoint scenarios with ~500 mbyte/s as a writeout rate.

Hmmmm. Not with my hardware:-)

Only issuing a sync_file_range() 10 times for that is obviously problematic.

Hmmm. Then it should depend on the expected write capacity of the underlying disks...

The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that is
the point of the call before sleeping.

I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?

On my simple test hardware the flushes are not large, I think, so the problem does not arise. Maybe I should check.

My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.

Hmmm. 32 buffers means 256 KB, which is quite small.

Why?

Because the point of sorting is to generate sequential writes so that the
HDD has a lot of contiguous data to write without moving its heads, and 32
is rather small for that.

A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue.

Yes.
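
For reference, the kind of call being discussed presumably looks like the sketch below; flush_block_range is a made-up name, not the patch's FileAsynchronousFlush(), and BLCKSZ is the usual 8 kB block size (so 32 blocks is the 256 kB mentioned above):

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Ask the kernel to start writing back "nblocks" blocks of file "fd",
 * starting at block "firstblock".  With SYNC_FILE_RANGE_WRITE the pages
 * are only queued for writeback; the call does not wait for them to
 * reach disk, unless the request queue is already full.
 */
static int
flush_block_range(int fd, off_t firstblock, int nblocks)
{
    return sync_file_range(fd,
                           firstblock * BLCKSZ,
                           (off_t) nblocks * BLCKSZ,
                           SYNC_FILE_RANGE_WRITE);
}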

You can have merging between IOs from either side. But more importantly you can't merge that many requests together anyway.

Probably.

The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small.

That is an argument. How small, though? The request queue seems to be 128
entries by default, so I'd rather have 128? Also, it can be changed, so
maybe it should really be a GUC?

I couldn't see any benefits above (and below) 32 on a 20 drive system,

So that is one kind of (big) hardware. Assuming that pages are contiguous, how much is written on each disk depends on the RAID type and the stripe size, and when it is really written depends on the various caches (in the RAID HW card if any, on the disks, ...), so whether 32 at the OS level is the right size is pretty unclear to me. I would have said the larger the better, but indeed you should avoid blocking.

so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...

Sure. Now in my tests, with my (old & small) hardware, it seemed quite smooth. What I'm driving at is that what is good may be relative and depend on the underlying hardware, which makes choosing the right parameter not obvious.

If you flush much more sync_file_range starts to do work in the
foreground.

Argh, too bad. I would have hoped that the OS would just deal with it in an
asynchronous way.

It's even in the man page:
"Note  that  even  this  may  block if you attempt to write more than
request queue size."

Hmmm. What about choosing "request queue size * 0.5", then?

Because it should be in shared buffers where pg needs it?

Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
instead of sync_file_range().
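
Spelled out with the real signatures, that suggestion presumably amounts to something like the sketch below; flush_range_msync is a made-up name, the offset must be page-aligned, and the file must be open read-write for the PROT_WRITE mapping:

#include <sys/mman.h>
#include <sys/types.h>

/*
 * Portable alternative to sync_file_range(): map the dirty range, ask
 * the kernel to start writing it back asynchronously (MS_ASYNC does not
 * wait), then drop the mapping again.
 */
static int
flush_range_msync(int fd, off_t offset, size_t nbytes)
{
    void   *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, offset);

    if (p == MAP_FAILED)
        return -1;

    if (msync(p, nbytes, MS_ASYNC) != 0)
    {
        (void) munmap(p, nbytes);
        return -1;
    }
    return munmap(p, nbytes);
}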

I think that I do not really understand how it may work, but possibly it could.

ISTM that it is rather an argument for taking the tablespace into the
sorting, not necessarily for a binary heap.

I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?

You have not yet posted the updated version of the patch.

The complexity of the round-robin scan on the array is O(1) and very few instructions, plus a stop condition which is mostly true, I think, if the writes are balanced between tablespaces; there is no dynamic allocation in the data structure (it is an array). The binary heap is O(log(n)); there are probably dynamic allocations and frees when extracting/inserting something, function calls to rebalance the tree, and so on. Ok, "n" is expected to be small.

So basically, for me it is not obviously superior to the previous version. Now I'm also tired, so if it works reasonably I'll be fine with it.
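
To make the comparison concrete, the round-robin side boils down to something like the sketch below (made-up names and structure, not the actual NextBufferToWrite() code): each call just advances a cursor over a fixed array, skipping exhausted entries.

typedef struct TablespaceState
{
    int     next_buffer;    /* index of next sorted buffer to write */
    int     num_buffers;    /* buffers belonging to this tablespace */
} TablespaceState;

/*
 * Return the index of the next tablespace that still has buffers to
 * write, advancing a round-robin cursor, or -1 when all are done.
 * At most nts steps; effectively O(1) per call when the writes are
 * balanced across tablespaces.  No allocation, no rebalancing.
 */
static int
next_tablespace(TablespaceState *ts, int nts, int *cursor)
{
    for (int tried = 0; tried < nts; tried++)
    {
        int     i = *cursor;

        *cursor = (*cursor + 1) % nts;
        if (ts[i].next_buffer < ts[i].num_buffers)
            return i;
    }
    return -1;
}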

[... code extract ...]

I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.

It sounds reasonable, but what would be the criterion?

What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
  (computed via max_wal_size these days). I.e. is the percentage of
  used up WAL bigger than the percentage of written out buffers.

2) Are we on schedule based on checkpoint_timeout. I.e. is the
  percentage of checkpoint_timeout already passed bigger than the
  percentage of buffers written out.

So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.
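
In code, the adaptive sleep described above presumably reduces to this minimal sketch (made-up names):

#include <stdint.h>

/*
 * Spread the remaining writes evenly over the time left until the
 * checkpoint deadline: sleep this long after each write instead of a
 * fixed 100 ms.  Returns 0 when we are behind schedule or done.
 */
static int64_t
sleep_per_buffer_usec(int64_t usec_until_deadline, int buffers_remaining)
{
    if (usec_until_deadline <= 0 || buffers_remaining <= 0)
        return 0;
    return usec_until_deadline / buffers_remaining;
}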

See the discussion above. ISTM that the "bursts" are a useful feature of the checkpoint scheduler, especially with sorted buffers & flushes. You want to provide grouped writes that will be easily written to disk together. You do not want page writes issued one by one and interleaved with small sleeps.

It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.

The scheduler stops writing as soon as it has overtaken the target progress, so that should be a very small time; but if you do that you end up writing pages one by one, which is not desirable at all.

I'm running benchmarks right now, they'll take a bit to run to
completion.

Good.

I'm looking forward to having a look at the updated version of the patch.

--
Fabien.

