(2013/07/19 22:48), Greg Smith wrote:
On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:
Recently, users who think system availability is important use synchronous
replication clusters.

If your argument for why it's OK to ignore bounding crash recovery on the master
is that it's possible to failover to a standby, I don't think that is
acceptable.  PostgreSQL users certainly won't like it.
OK. I will also test recovery time. However, I am now considering a better approach, and I will test it with a new patch.

I would especially like you to look at lines 631, 651, and 656.
MAX_WRITEBACK_PAGES is 1024 (1024 * 4096 bytes).

You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm to
realize everything you're telling me about the writeback code and its congestion
logic I knew back in 2007.  The situation is even worse than you describe,
because this section of Linux has gone through multiple, major revisions since
then.  You can't just say "here is the writeback source code"; you have to
reference each of the commonly deployed versions of the writeback feature to tell
how this is going to play out if released.  There are four major ones I pay
attention to.  The old kernel style as seen in RHEL5/2.6.18--that's what my 2007
paper discussed--the similar code but with very different defaults in 2.6.22, the
writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then there are newer
kernels.  (The newer ones separate out into a few branches too; I haven't mapped
those as carefully yet.)
The part of the writeback source code I pointed to is almost the same as in the community kernel (2.6.32.61). I also read Linux kernel 3.9.7, and this part is almost the same there. I think fs-writeback.c is easier to follow than xlog.c; it is only 1309 lines. I think the Linux distributions differ only in tuning parameters, not in the program logic. Do you think I need to read the Debian kernel source code too? I will read that part of the code, since it is only a few dozen lines at most.
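
To put MAX_WRITEBACK_PAGES in perspective, here is some rough Python arithmetic
(my own illustration, not taken from the kernel source; the amount of dirty data
is a hypothetical example):

MAX_WRITEBACK_PAGES = 1024                      # pages per writeback chunk
PAGE_SIZE = 4096                                # bytes per page
chunk_bytes = MAX_WRITEBACK_PAGES * PAGE_SIZE   # 4 MiB written per pass

dirty_bytes = 512 * 1024 * 1024                 # hypothetical dirty data at checkpoint time
passes = dirty_bytes // chunk_bytes
print("one writeback chunk =", chunk_bytes // (1024 * 1024), "MiB")
print("512 MiB of dirty data needs about", passes, "writeback passes")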

 There are some examples of what really bad checkpoints look
like in
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
if you want to see some of them.  That's the talk I did around the same time I
was trying out spreading the database fsync calls out over a longer period.
Does this happen on ext3 or ext4 file systems? I think this is a bug in XFS. If
an fsync call does not return, it means the WAL cannot be written and commits
cannot return. That is a serious problem.

My fsync patch only sleeps after an fsync returns successfully, and the maximum sleep time is set to 10 seconds. It does not make this problem worse.

When I did that, checkpoints became even less predictable, and that was a major
reason behind why I rejected the approach.  I think your suggestion will have the
same problem.  You just aren't generating test cases with really large write
workloads yet to see it.  You also don't seem afraid of how exceeding the
checkpoint timeout is a very bad thing yet.
I think it is important to understand why this problem was caused. We should try to find out which program has the bug or problem.

In addition, you have said that performance is improved if you set a large
value for checkpoint_timeout or checkpoint_completion_target, but is that true
in all cases?

The timeout, yes.  Throughput is always improved by increasing
checkpoint_timeout.  Fewer checkpoints per unit of time increases efficiency.
Fewer writes of the most heavily accessed buffers happen per transaction.  It is
faster because you are doing less work, which on average is always faster than
doing more work.  And doing less work usually beats doing more work, but doing it
smarter.

If you want to see how much work per transaction a test is doing, track the
numbers of buffers written at the beginning/end of your test via
pg_stat_bgwriter.  Tests that delay checkpoints will show a lower total number of
writes per transaction.  That seems more efficient, but it's efficiency mainly
gained by ignoring checkpoint_timeout.
OK. I will try it in the next test.
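
For example, a rough Python sketch like this could take the before/after
snapshots (it assumes psycopg2 and a test database named "pgbench", which are my
assumptions, and it reads the counters from pg_stat_bgwriter and
pg_stat_database):

# A rough sketch, not part of the actual test kit: snapshot the counters before
# and after a run and report buffers written per committed transaction.
import psycopg2

SNAPSHOT_SQL = """
SELECT (SELECT buffers_checkpoint + buffers_clean + buffers_backend
          FROM pg_stat_bgwriter)                        AS buffers_written,
       (SELECT sum(xact_commit) FROM pg_stat_database)  AS commits
"""

def snapshot(conn):
    cur = conn.cursor()
    cur.execute(SNAPSHOT_SQL)
    row = cur.fetchone()
    cur.close()
    return row

conn = psycopg2.connect(dbname="pgbench")
before = snapshot(conn)
input("run the benchmark, then press Enter...")   # raw_input() on Python 2
after = snapshot(conn)

buffers = after[0] - before[0]
commits = after[1] - before[1]
print("%d buffers written / %d commits = %.2f buffers per transaction"
      % (buffers, commits, float(buffers) / float(commits)))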

When the checkpoint completion target is actually enlarged, performance may
fall in some cases. I think this is because the last fsync becomes heavy owing
to the writes going in slowly.

I think you're confusing throughput and latency here.  Increasing the checkpoint
timeout, or to a lesser extent the completion target, on average increases
throughput.  It results in less work, and the more/less work amount is much more
important than worrying about scheduler details.  No matter how efficient a
given write is, whether you've sorted it across elevator horizon boundary A or
boundary B, it's better not to do it at all.
I think the fsync that takes the longest time, or a long run of fsyncs, blocks other transactions. My patch not only improves throughput but also achieves a stable response time during the fsync phase of a checkpoint.

By the way:  if you have a theory like "the last fsync having become heavy" for
why something is happening, measure it.  Set log_min_messages to debug2 and
you'll get details about every single fsync in your logs.  I did that for all my
tests that led me to conclude fsync delaying on its own didn't help that
problem.  I was measuring my theories as directly as possible.
OK. That is important. I will set more detailed debug logging for this phase.
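
As a starting point for analyzing those logs, I am thinking of a rough Python
sketch like this one (it assumes the log contains lines of the form
"checkpoint sync: number=N file=PATH time=T msec"; the pattern will need
adjusting to the actual log format and log_line_prefix):

# A rough sketch for summarizing checkpoint fsync latency from the server log.
# Adjust the regular expression to match the actual log output.
import re
import sys

pattern = re.compile(r"checkpoint sync: number=\d+ file=(\S+) time=([0-9.]+) msec")

times = []
with open(sys.argv[1]) as log:
    for line in log:
        m = pattern.search(line)
        if m:
            times.append((float(m.group(2)), m.group(1)))

times.sort(reverse=True)
total = sum(t for t, _ in times)
print("%d sync calls, %.0f msec total" % (len(times), total))
print("slowest ten:")
for msec, path in times[:10]:
    print("  %10.1f msec  %s" % (msec, path))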

I'm willing to consider an optional, sloppy checkpoint approach that uses heavy
load to adjust how often checkpoints happen.  But if we're going to do that, it
has to be extremely clear that the reason for the gain is the checkpoint
spacing--and there is going to be a crash recovery time penalty paid for it.  And
this patch is not how I would do that.
That's right. We should show that there is more profit than penalty.

It's not really clear yet where the gains you're seeing are really coming from.
If you re-ran all your tests with pg_stat_bgwriter before/after snapshots, logged
every fsync call, and then built some tools to analyze the fsync call latency,
then you'll have enough data to talk about this usefully.  That's what I consider
the bare minimum evidence to consider changing something here.  I have all of
those features in pgbench-tools with checkpoint logging turned way up, but
they're not all in the dbt2 toolset yet as far as I know.
OK. I will also capture /proc/meminfo with each snapshot. I think OS background
writeback only writes every 5 seconds, starting 30 seconds after the data was
dirtied, because the dirty buffers in the OS do not exceed dirty_background_ratio
during a checkpoint in DBT-2. So I am considering a new method that sorts and
batches the writes in the write phase, with a longer sleep time (5 sec) between
them.

I also surveyed the ext3 file system. My system's block size is 4096, but 8192 or larger seems better. It would decrease the number of inodes and provide larger sequential disk areas. The inode block group would go from 128MB to 256MB. If you have test results, please tell us.

Best regards,
--
Mitsumasa KONDO
NTT Open Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
