Re: [HACKERS] Sorted writes in checkpoint

2008-03-11 Thread Bruce Momjian

Added to TODO:

* Consider sorting writes during checkpoint

  http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php


---

ITAGAKI Takahiro wrote:
 Greg Smith [EMAIL PROTECTED] wrote:
 
  On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
   If the kernel can treat sequential writes better than random writes, is 
   it worth sorting dirty buffers in block order per file at the start of 
   checkpoints?
 
 I wrote and tested the attached sorted-writes patch based on Heikki's
 ldc-justwrites-1.patch. There was an obvious performance win on OLTP workloads.
 
   tests                     | pgbench | DBT-2 response time (avg/90%/max)
  ---------------------------+---------+-----------------------------------
   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
 
 (*) Don't write buffers that were dirtied after starting the checkpoint.
 
 machine : 2GB-ram, SCSI*4 RAID-5
 pgbench : -s400 -t4 -c10  (about 5GB of database)
 DBT-2   : 60WH (about 6GB of database)
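For readers skimming the thread: the (*) optimisation amounts to roughly the
sketch below. The BM_CHECKPOINT_NEEDED flag name comes from the mail; the loop,
locking, and helper names are illustrative assumptions, not the patch's actual
code.

    /* Sketch: at checkpoint start, tag every buffer that is dirty right
     * now.  Buffers dirtied later get BM_DIRTY but not this flag. */
    static void
    mark_buffers_for_checkpoint(void)
    {
        int         i;

        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *buf = &BufferDescriptors[i];

            LockBufHdr(buf);
            if (buf->flags & BM_DIRTY)
                buf->flags |= BM_CHECKPOINT_NEEDED;
            UnlockBufHdr(buf);
        }
    }

    /* Sketch: in the write phase, skip buffers that were dirtied only
     * after the checkpoint began; the next checkpoint or the bgwriter
     * will pick them up.  Assumes the patch's surrounding write loop:
     *
     *     if (!(buf->flags & BM_CHECKPOINT_NEEDED))
     *         continue;
     */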
 
 
  I think it has the potential to improve things.  There are three obvious 
  and one subtle argument against it I can think of:
  
  1) Extra complexity for something that may not help.  This would need some 
  good, robust benchmarking improvements to justify its use.
 
 Exactly. I think we need a discussion board for I/O performance issues.
 Can I use the Developers Wiki for this purpose? Since performance graphs and
 result tables are important for the discussion, it might be a better fit
 than text-based mailing lists.
 
 
  2) Block number ordering may not reflect actual order on disk.  While 
  true, it's got to be better correlated with it than writing at random.
  3) The OS disk elevator should be dealing with this issue, particularly 
  because it may really know the actual disk ordering.
 
 Yes, both are true. However, I think there is a pretty high correlation
 between those orderings. In addition, we should use the filesystem to ensure
 that the orderings correspond to each other. For example, pre-allocation
 of files might help us, as has often been discussed.
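
Concretely, the sort step could look something like the minimal sketch below:
snapshot each dirty buffer's file and block coordinates at checkpoint start,
qsort by relation file and then block number, and issue the writes in that
order. The struct and function names here are invented for illustration; they
are not the identifiers used in the attached patch.

    #include "postgres.h"
    #include "storage/block.h"          /* BlockNumber */
    #include "storage/relfilenode.h"    /* RelFileNode */

    /* Hypothetical per-buffer write ticket, captured at checkpoint start. */
    typedef struct BufferToWrite
    {
        RelFileNode rnode;      /* which relation file the block lives in */
        BlockNumber blockNum;   /* offset of the block within that file */
        int         buf_id;     /* index of the buffer in shared memory */
    } BufferToWrite;

    /* Group writes by relation file, then order by block number, so each
     * file is written front-to-back instead of in LRU order. */
    static int
    buffer_write_cmp(const void *a, const void *b)
    {
        const BufferToWrite *ba = (const BufferToWrite *) a;
        const BufferToWrite *bb = (const BufferToWrite *) b;
        int         cmp;

        /* memcmp yields an arbitrary but consistent file order; all that
         * matters is that blocks of the same file end up adjacent. */
        cmp = memcmp(&ba->rnode, &bb->rnode, sizeof(RelFileNode));
        if (cmp != 0)
            return cmp;
        if (ba->blockNum < bb->blockNum)
            return -1;
        if (ba->blockNum > bb->blockNum)
            return 1;
        return 0;
    }

    static void
    sort_checkpoint_writes(BufferToWrite *todo, int ntodo)
    {
        qsort(todo, ntodo, sizeof(BufferToWrite), buffer_write_cmp);
        /* ...then hand 'todo' to the existing checkpoint write loop. */
    }

Even if block numbers don't map perfectly to the disk layout, this turns a
random write pattern into per-file ascending runs that the OS elevator can
merge.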
 
 
  Here's the subtle thing:  by writing in the same order the LRU scan occurs 
  in, you are writing dirty buffers in the optimal fashion to eliminate 
  client backend writes during BufferAlloc.  This makes the checkpoint a 
  really effective LRU clearing mechanism.  Writing in block order will 
  change that.
 
 The issue will probably go away after we have LDC, because it writes LRU
 buffers during checkpoints.
 
 Regards,
 ---
 ITAGAKI Takahiro
 NTT Open Source Software Center
 

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Sorted writes in checkpoint

2007-06-15 Thread Zeugswetter Andreas ADI SD

   tests                     | pgbench | DBT-2 response time (avg/90%/max)
  ---------------------------+---------+-----------------------------------
   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
  
  (*) Don't write buffers that were dirtied after starting the checkpoint.
  
  machine : 2GB-ram, SCSI*4 RAID-5
  pgbench : -s400 -t4 -c10  (about 5GB of database)
  DBT-2   : 60WH (about 6GB of database)
 
 I'm very surprised by the BM_CHECKPOINT_NEEDED results. What 
 percentage of writes has been saved by doing that? We would 
 expect a small percentage of blocks only and so that 
 shouldn't make a significant difference. I thought we 

Wouldn't pages that are dirtied during the checkpoint also usually be
rather hot? And if we lock one of those for writing, aren't the chances
high that a client will need to wait for the lock? A write() call to the
OS should usually be very fast, but when the I/O gets bottlenecked it can
easily become slower.

Probably the recent result that it saves ~53% of the writes is
sufficient explanation, though.

Very nice results :-) Looks like we want all of it, including the sort.

Andreas



Re: [HACKERS] Sorted writes in checkpoint

2007-06-15 Thread ITAGAKI Takahiro

Simon Riggs [EMAIL PROTECTED] wrote:

  tests                     | pgbench | DBT-2 response time (avg/90%/max)
 ---------------------------+---------+-----------------------------------
  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
 
 I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
 of writes has been saved by doing that?
 How long was the write phase of the checkpoint, how long
 between checkpoints?

 I can see the sorted writes having an effect because the OS may not
 receive blocks within a sufficient time window to fully optimise them.
 That effect would grow with increasing sizes of shared_buffers and
 decrease with size of controller cache. How big was the shared buffers
 setting? What OS scheduler are you using? The effect would be greatest
 when using Deadline.

I didn't tune OS parameters; I used the default values.
In terms of cache sizes, the postgres buffers were larger than the kernel
write pool and the controller cache. That's why the OS could not optimise
the writes enough during checkpoints, I think.

  - 200MB - RAM * dirty_background_ratio (2GB * 10%)
  - 128MB - Controller cache
  - 2GB   - postgres shared_buffers

I forgot to gather detailed I/O information in the tests.
I'll retry and report later.

RAM  2GB
Controller cache 128MB
shared_buffers   1GB
checkpoint_timeout   = 15min
checkpoint_write_percent = 50.0

RHEL4 (Linux 2.6.9-42.0.2.EL)
vm.dirty_background_ratio= 10
vm.dirty_ratio   = 40
vm.dirty_expire_centisecs= 3000
vm.dirty_writeback_centisecs = 500
Using cfq io scheduler

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Sorted writes in checkpoint

2007-06-15 Thread Simon Riggs
On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
 Simon Riggs [EMAIL PROTECTED] wrote:
 
   tests                     | pgbench | DBT-2 response time (avg/90%/max)
  ---------------------------+---------+-----------------------------------
   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
  
  I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
  of writes has been saved by doing that?
  How long was the write phase of the checkpoint, how long
  between checkpoints?
 
  I can see the sorted writes having an effect because the OS may not
  receive blocks within a sufficient time window to fully optimise them.
  That effect would grow with increasing sizes of shared_buffers and
  decrease with size of controller cache. How big was the shared buffers
  setting? What OS scheduler are you using? The effect would be greatest
  when using Deadline.
 
 I didn't tune OS parameters; I used the default values.
 In terms of cache sizes, the postgres buffers were larger than the kernel
 write pool and the controller cache. That's why the OS could not optimise
 the writes enough during checkpoints, I think.
 
   - 200MB - RAM * dirty_background_ratio (2GB * 10%)
   - 128MB - Controller cache
   - 2GB   - postgres shared_buffers
 
 I forgot to gather detailed I/O information in the tests.
 I'll retry and report later.
 
 RAM  2GB
 Controller cache 128MB
 shared_buffers   1GB
 checkpoint_timeout   = 15min
 checkpoint_write_percent = 50.0
 
 RHEL4 (Linux 2.6.9-42.0.2.EL)
 vm.dirty_background_ratio= 10
 vm.dirty_ratio   = 40
 vm.dirty_expire_centisecs= 3000
 vm.dirty_writeback_centisecs = 500
 Using cfq io scheduler

Sounds like sorting the buffers before checkpoint is going to be a win
once we go above ~128MB of shared_buffers. We can do a simple test on
NBuffers, rather than have a sort_blocks_at_checkpoint (!) GUC.

But it does seem there is a win for larger settings of shared_buffers.

Does performance go up in the non-sorted case if we make shared_buffers
smaller? Sounds like it might. We should check that first.
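
A sketch of that test, reusing the comparator from the sketch earlier in the
thread; the threshold name and value are assumptions derived from the ~128MB
break-even figure above (16384 buffers at the default 8kB block size), not
anything in the patch:

    /* Hypothetical: only sort when shared_buffers is large enough for
     * sorting to pay off; 16384 buffers * 8kB = 128MB. */
    #define SORT_WRITE_MIN_BUFFERS  16384

    if (ntodo > 1 && NBuffers >= SORT_WRITE_MIN_BUFFERS)
        qsort(todo, ntodo, sizeof(BufferToWrite), buffer_write_cmp);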

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[HACKERS] Sorted writes in checkpoint

2007-06-14 Thread ITAGAKI Takahiro
Greg Smith [EMAIL PROTECTED] wrote:

 On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
  If the kernel can treat sequential writes better than random writes, is 
  it worth sorting dirty buffers in block order per file at the start of 
  checkpoints?

I wrote and tested the attached sorted-writes patch based on Heikki's
ldc-justwrites-1.patch. There was an obvious performance win on OLTP workloads.

  tests                     | pgbench | DBT-2 response time (avg/90%/max)
 ---------------------------+---------+-----------------------------------
  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

(*) Don't write buffers that were dirtied after starting the checkpoint.

machine : 2GB-ram, SCSI*4 RAID-5
pgbench : -s400 -t4 -c10  (about 5GB of database)
DBT-2   : 60WH (about 6GB of database)


 I think it has the potential to improve things.  There are three obvious 
 and one subtle argument against it I can think of:
 
 1) Extra complexity for something that may not help.  This would need some 
 good, robust benchmarking improvements to justify its use.

Exactly. I think we need a discussion board for I/O performance issues.
Can I use the Developers Wiki for this purpose? Since performance graphs and
result tables are important for the discussion, it might be a better fit
than text-based mailing lists.


 2) Block number ordering may not reflect actual order on disk.  While 
 true, it's got to be better correlated with it than writing at random.
 3) The OS disk elevator should be dealing with this issue, particularly 
 because it may really know the actual disk ordering.

Yes, both are true. However, I think there is a pretty high correlation
between those orderings. In addition, we should use the filesystem to ensure
that the orderings correspond to each other. For example, pre-allocation
of files might help us, as has often been discussed.


 Here's the subtle thing:  by writing in the same order the LRU scan occurs 
 in, you are writing dirty buffers in the optimal fashion to eliminate 
 client backend writes during BufferAlloc.  This makes the checkpoint a 
 really effective LRU clearing mechanism.  Writing in block order will 
 change that.

The issue will probably go away after we have LDC, because it writes LRU
buffers during checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



sorted-ckpt.patch
Description: Binary data



Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Gregory Stark

ITAGAKI Takahiro [EMAIL PROTECTED] writes:

 Exactly. I think we need a discussion board for I/O performance issues.
 Can I use the Developers Wiki for this purpose? Since performance graphs and
 result tables are important for the discussion, it might be a better fit
 than text-based mailing lists.

I would suggest keeping the discussion on the mailing list and including links
to the charts and tables in the wiki.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

Greg Smith [EMAIL PROTECTED] wrote:

On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
If the kernel can treat sequential writes better than random writes, is 
it worth sorting dirty buffers in block order per file at the start of 
checkpoints?


I wrote and tested the attached sorted-writes patch based on Heikki's
ldc-justwrites-1.patch. There was an obvious performance win on OLTP workloads.

  tests                     | pgbench | DBT-2 response time (avg/90%/max)
 ---------------------------+---------+-----------------------------------
  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

(*) Don't write buffers that were dirtied after starting the checkpoint.

machine : 2GB-ram, SCSI*4 RAID-5
pgbench : -s400 -t4 -c10  (about 5GB of database)
DBT-2   : 60WH (about 6GB of database)


Wow, I didn't expect that much gain from the sorted writes. How was LDC 
configured?


3) The OS disk elevator should be dealing with this issue, particularly 
because it may really know the actual disk ordering.


Yeah, but we don't give the OS that much chance to coalesce writes when 
we spread them out.
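
The spreading in question is LDC pacing its writes over the checkpoint
interval; here is a toy sketch (all names hypothetical, assuming the todo
array from the earlier sketch) of why that starves the elevator of merge
opportunities unless the buffers are pre-sorted:

    /* Toy sketch of LDC-style paced writes: the nap between batches
     * spreads the I/O load, but also separates logically adjacent blocks
     * in time, so the elevator may have already flushed one block before
     * its neighbour arrives.  Sorting first makes each batch internally
     * sequential, which survives the pacing. */
    int     i;

    for (i = 0; i < ntodo; i++)
    {
        write_one_buffer(&todo[i]);     /* hypothetical helper */
        if ((i + 1) % batch_size == 0)
            pg_usleep(nap_usecs);       /* pace to checkpoint_write_percent */
    }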


Here's the subtle thing:  by writing in the same order the LRU scan occurs 
in, you are writing dirty buffers in the optimal fashion to eliminate 
client backend writes during BufferAlloc.  This makes the checkpoint a 
really effective LRU clearing mechanism.  Writing in block order will 
change that.


The issue will probably go away after we have LDC, because it writes LRU
buffers during checkpoints.


I think so too.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Greg Smith

On Thu, 14 Jun 2007, ITAGAKI Takahiro wrote:

I think we need a discussion board for I/O performance issues. Can I use
the Developers Wiki for this purpose? Since performance graphs and result
tables are important for the discussion, it might be a better fit than
text-based mailing lists.


I started pushing some of my stuff over there recently to make it easier
to edit and to let other people expand it with their expertise.
http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW
is what I've done so far on this particular topic.


What I would like to see on the Wiki first are pages devoted to how to run 
the common benchmarks people use for useful performance testing.  A recent 
thread on one of the lists reminded me how easy it is to get worthless 
results out of DBT2 if you don't have any guidance on that.  I've already 
got a stack of documentation about how to wrestle with pgbench and am 
generating more.


The problem with using the Wiki as the main focus is that when you get to 
the point that you want to upload detailed test results, that interface 
really isn't appropriate for it.  For example, in the last day I've 
collected up data from about 400 short tests runs that generated 800 
graphs.  It's all organized as HTML so you can drill down into the 
specific tests that executed oddly.  Heikki's DBT2 results are similar; not 
as many files, because he's running longer tests, but the navigation is 
even more complicated.


There is no way to easily put that type and level of information into the 
Wiki page.  You really just need a web server to copy the results onto. 
Then the main problem you have to be concerned about is a repeat of the 
OSDL situation, where all the results just disappear if their hosting 
sponsor goes away.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Simon Riggs
On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
 Greg Smith [EMAIL PROTECTED] wrote:
 
  On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
   If the kernel can treat sequential writes better than random writes, is 
   it worth sorting dirty buffers in block order per file at the start of 
   checkpoints?
 
 I wrote and tested the attached sorted-writes patch based on Heikki's
 ldc-justwrites-1.patch. There was an obvious performance win on OLTP workloads.
 
   tests                     | pgbench | DBT-2 response time (avg/90%/max)
  ---------------------------+---------+-----------------------------------
   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
 
 (*) Don't write buffers that were dirtied after starting the checkpoint.
 
 machine : 2GB-ram, SCSI*4 RAID-5
 pgbench : -s400 -t4 -c10  (about 5GB of database)
 DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
of writes has been saved by doing that? We would expect a small
percentage of blocks only and so that shouldn't make a significant
difference. I thought we discussed this before, about a year ago. It
would be easy to get that wrong and to avoid writing a block that had
been re-dirtied after the start of checkpoint, but was already dirty
beforehand. How long was the write phase of the checkpoint, how long
between checkpoints?

I can see the sorted writes having an effect because the OS may not
receive blocks within a sufficient time window to fully optimise them.
That effect would grow with increasing sizes of shared_buffers and
decrease with size of controller cache. How big was the shared buffers
setting? What OS scheduler are you using? The effect would be greatest
when using Deadline.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Gregory Maxwell

On 6/14/07, Simon Riggs [EMAIL PROTECTED] wrote:

On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
 Greg Smith [EMAIL PROTECTED] wrote:

  On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
   If the kernel can treat sequential writes better than random writes, is
   it worth sorting dirty buffers in block order per file at the start of
   checkpoints?

 I wrote and tested the attached sorted-writes patch based on Heikki's
 ldc-justwrites-1.patch. There was an obvious performance win on OLTP workloads.

   tests                     | pgbench | DBT-2 response time (avg/90%/max)
  ---------------------------+---------+-----------------------------------
   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

 (*) Don't write buffers that were dirtied after starting the checkpoint.

 machine : 2GB-ram, SCSI*4 RAID-5
 pgbench : -s400 -t4 -c10  (about 5GB of database)
 DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
of writes has been saved by doing that? We would expect a small
percentage of blocks only and so that shouldn't make a significant
difference. I thought we discussed this before, about a year ago. It
would be easy to get that wrong and to avoid writing a block that had
been re-dirtied after the start of checkpoint, but was already dirty
beforehand. How long was the write phase of the checkpoint, how long
between checkpoints?

I can see the sorted writes having an effect because the OS may not
receive blocks within a sufficient time window to fully optimise them.
That effect would grow with increasing sizes of shared_buffers and
decrease with size of controller cache. How big was the shared buffers
setting? What OS scheduler are you using? The effect would be greatest
when using Deadline.


Linux has some instrumentation that might be useful for this testing:

echo 1 > /proc/sys/vm/block_dump

will have the kernel log all physical I/O (disable syslog writing to
disk before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough that we wouldn't
see that much of an improvement. Perhaps frequent fsync behavior is having
an unintended interaction with the elevator?  ... It might be worthwhile
to contact some Linux kernel developers and see if there is some
misunderstanding.
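
For anyone trying this, the whole procedure is just the sysctl flip plus
reading the kernel log; the output lines below are an approximation, as the
exact message format varies by kernel version:

    # echo 1 > /proc/sys/vm/block_dump    (quiet syslog-to-disk first, as above)
    # dmesg | tail
    postgres(4321): WRITE block 7410088 on sda1
    postgres(4321): dirtied inode 131074 (...) on sda1
    # echo 0 > /proc/sys/vm/block_dump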



Re: [HACKERS] Sorted writes in checkpoint

2007-06-14 Thread Greg Smith

On Thu, 14 Jun 2007, Gregory Maxwell wrote:


Linux has some instrumentation that might be useful for this testing:
echo 1 > /proc/sys/vm/block_dump


That bit was developed for tracking down who was spinning the hard drive
up out of power-saving mode, and I was under the impression that this very
rough feature isn't useful at all here.  I just tried to track down again
where I got that impression from, and I think it was this thread:


http://linux.slashdot.org/comments.pl?sid=231817&cid=18832379

This mentions general issues with figuring out who was responsible for a
write, and specifically mentions how you'll have to reconcile two different
paths if fsync is mixed in.  Not saying it won't work; it's just obvious
that using the block_dump output isn't a simple job.


(For anyone who would like an intro to this feature, try 
http://www.linuxjournal.com/node/7539/print and 
http://toadstool.se/journal/2006/05/27/monitoring-filesystem-activity-under-linux-with-block_dump 
)


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
