Re: [HACKERS] Load distributed checkpoint V4

2007-04-23 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

Heikki Linnakangas [EMAIL PROTECTED] wrote:
We might want to call GetCheckpointProgress something 
else, though. It doesn't return the amount of progress made, but rather 
the amount of progress we should've made up to that point or we're in 
danger of not completing the checkpoint in time.


GetCheckpointProgress might be a bad name; it returns the progress we should
have made by that time, not the progress actually made. How about
GetCheckpointTargetProgress?


Better. A bit long though. Not that I have any better suggestions ;-)

In the sync phase, we sleep between each fsync until enough 
time/segments have passed, assuming that the time to fsync is 
proportional to the file length. I'm not sure that's a very good 
assumption. We might have one huge file with only very little changed 
data, for example a logging table that is just occasionally appended to. 
If we begin by fsyncing that, it'll take a very short time to finish, 
and we'll then sleep for a long time. If we then have another large file 
to fsync, but that one has all pages dirty, we risk running out of time 
because of the unnecessarily long sleep. The segmentation of relations 
limits the risk of that, though, by limiting the max. file size, and I 
don't really have any better suggestions.


It is difficult to estimate fsync costs; we need additional statistics to
do it. For example, if we recorded the number of write() calls for each
segment, we could use that value as the number of dirty pages in the
segment. We don't have per-file write statistics now, but if we had that
information, we could use it to control checkpoints more cleverly.


It's probably not worth it to be too clever with that. Even if we 
recorded the number of writes we made, we still wouldn't know how many 
of them haven't been flushed to disk yet.


I guess we're fine if we just avoid excessive waiting per the 
discussion in the next paragraph, and use a reasonable safety margin in 
the default values.


Should we try doing something similar for the sync phase? If there's 
only 2 small files to fsync, there's no point sleeping for 5 minutes 
between them just to use up the checkpoint_sync_percent budget.


Hmmm... if we add a new parameter like kernel_write_throughput [kB/s] and
clamp the maximum sleep to size-of-segment / kernel_write_throughput (*1),
we can avoid unnecessary sleeping in the fsync phase. Do we want to have
such a new parameter? I think we already have too many GUC variables as it
is.


How about using the same parameter that controls the minimum write speed 
of the write-phase (the patch used bgwriter_all_maxpages, but I 
suggested renaming it)?



I don't want to add new parameters any more if possible...


Agreed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Load distributed checkpoint V4

2007-04-22 Thread ITAGAKI Takahiro
Heikki Linnakangas [EMAIL PROTECTED] wrote:

Thanks for making my patch clearly understandable!

 We might want to call GetCheckpointProgress something 
 else, though. It doesn't return the amount of progress made, but rather 
 the amount of progress we should've made up to that point or we're in 
 danger of not completing the checkpoint in time.

GetCheckpointProgress might be a bad name; it returns the progress we should
have made by that time, not the progress actually made. How about
GetCheckpointTargetProgress?
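As a rough illustration of what such a function computes, the target progress could be sketched like this (a hypothetical helper, not the patch's actual code: it takes the more urgent of the time-based and WAL-based limits, so the checkpoint finishes before either limit triggers the next one):

```c
#include <assert.h>

/*
 * Sketch of a GetCheckpointTargetProgress-style calculation
 * (hypothetical helper, not the patch's actual code).  Returns the
 * fraction of the checkpoint work (0.0 - 1.0) that *should* be
 * finished by now: the larger of the elapsed-time fraction and the
 * consumed-WAL-segments fraction.
 */
double
checkpoint_target_progress(double elapsed_secs, double timeout_secs,
                           double segments_used, double max_segments)
{
    double time_frac = elapsed_secs / timeout_secs;
    double xlog_frac = segments_used / max_segments;

    return (time_frac > xlog_frac) ? time_frac : xlog_frac;
}
```

If the actual progress made falls behind this value, the checkpoint is in danger of not completing in time and should skip its sleeps.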

 However, if we're 
 already past checkpoint_write_percent at the beginning of the nap, I 
 think we should clamp the nap time so that we don't run out of time 
 until the next checkpoint because of sleeping.

Yeah, I'm thinking the nap time should be clamped to (100.0 -
ckpt_progress_at_nap_start - checkpoint_sync_percent). I think exceeding
checkpoint_write_percent is not so important here, so I only care about the
end of the checkpoint.
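The clamp described above could be sketched as follows (a hypothetical helper, not the patch's actual code; all values are percentages of checkpoint_timeout):

```c
#include <assert.h>

/*
 * Sketch of the nap-time clamp described above (hypothetical helper,
 * not the patch's actual code).  The nap may only use whatever share
 * of the checkpoint interval is left after the write phase actually
 * finished and the sync phase's budget has been reserved, so
 * sleeping can never push the checkpoint past its end.
 */
double
clamped_nap_percent(double nap_percent,
                    double progress_at_nap_start,
                    double sync_percent)
{
    double budget = 100.0 - progress_at_nap_start - sync_percent;

    if (budget < 0.0)
        budget = 0.0;           /* write phase already overran */
    return (nap_percent < budget) ? nap_percent : budget;
}
```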

 In the sync phase, we sleep between each fsync until enough 
 time/segments have passed, assuming that the time to fsync is 
 proportional to the file length. I'm not sure that's a very good 
 assumption. We might have one huge files with only very little changed 
 data, for example a logging table that is just occasionaly appended to. 
 If we begin by fsyncing that, it'll take a very short time to finish, 
 and we'll then sleep for a long time. If we then have another large file 
 to fsync, but that one has all pages dirty, we risk running out of time 
 because of the unnecessarily long sleep. The segmentation of relations 
 limits the risk of that, though, by limiting the max. file size, and I 
 don't really have any better suggestions.

It is difficult to estimate fsync costs; we need additional statistics to
do it. For example, if we recorded the number of write() calls for each
segment, we could use that value as the number of dirty pages in the
segment. We don't have per-file write statistics now, but if we had that
information, we could use it to control checkpoints more cleverly.

 Should we try doing something similar for the sync phase? If there's 
 only 2 small files to fsync, there's no point sleeping for 5 minutes 
 between them just to use up the checkpoint_sync_percent budget.

Hmmm... if we add a new parameter like kernel_write_throughput [kB/s] and
clamp the maximum sleep to size-of-segment / kernel_write_throughput (*1),
we can avoid unnecessary sleeping in the fsync phase. Do we want to have
such a new parameter? I think we already have too many GUC variables as it
is.
I don't want to add new parameters any more if possible...

(*1) dirty-area-in-segment / kernel_write_throughput is better.
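The proposed clamp amounts to this (both the parameter name and the helper are hypothetical, as discussed above):

```c
#include <assert.h>

/*
 * Sketch of the sleep clamp proposed above for the fsync phase
 * (hypothetical helper and parameter).  The sleep after each fsync()
 * is bounded by the time the kernel would plausibly need to flush
 * the segment anyway, estimated as segment-size divided by
 * kernel_write_throughput.  Per the (*1) note, the dirty area in the
 * segment would be a better numerator than the full segment size.
 */
double
clamped_fsync_sleep(double budget_sleep_secs,
                    double segment_size_kb,
                    double kernel_write_throughput_kb_per_s)
{
    double bound = segment_size_kb / kernel_write_throughput_kb_per_s;

    return (budget_sleep_secs < bound) ? budget_sleep_secs : bound;
}
```

So with only two small files to fsync, the size-based bound stays tiny and the five-minute budget sleep never happens.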

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [PATCHES] [HACKERS] Load distributed checkpoint

2007-02-27 Thread ITAGAKI Takahiro

Inaam Rana [EMAIL PROTECTED] wrote:

 One of the issues we had during testing with the original patch was db stop
 not working properly. I think you coded something to do a stop checkpoint
 immediately, but if a checkpoint is already in progress at that time, it
 would take its own time to complete.
 Does this patch resolve that issue?

Yes, I fixed the problem. If a checkpoint requested by user SQL or by shutdown
is waiting during an automatic checkpoint, the running checkpoint will finish
without any delays. In the worst case, you have to wait for two checkpoints
(the running automatic checkpoint plus your explicit request), but no more
than that.

 Also, is it based on pg82stable or HEAD?

It's based on HEAD.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2007-02-26 Thread Inaam Rana

On 12/19/06, ITAGAKI Takahiro [EMAIL PROTECTED] wrote:


Takayuki Tsunakawa [EMAIL PROTECTED] wrote:

 I performed some simple tests, and I'll show the results below.

 (1) The default case
 235  80  226 77  240
 (2) No write case
 242  250  244  253  280
 (3) No checkpoint case
 229  252  256  292  276
 (4) No fsync() case
 236  112  215  216  221
 (5) No write by PostgreSQL, but fsync() by another program case
 9  223  260  283  292
 (6) case (5) + O_SYNC by write_fsync
 97  114  126  112  125
 (7) O_SYNC case
 182  103  41  50  74

I posted a patch to PATCHES. Please try it out.
It does write() smoothly, but fsync() in a burst.
I suppose the result will be between (3) and (5).




Itagaki,

Did you have a chance to look into this any further? We, at EnterpriseDB,
have done some testing on this patch (dbt2 runs) and it looks like we are
getting the desired results, particularly when we spread out both the sync
and write phases.

--
Inaam Rana
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Load distributed checkpoint

2007-02-26 Thread ITAGAKI Takahiro
Inaam Rana [EMAIL PROTECTED] wrote:

 Did you have a chance to look into this any further? We, at EnterpriseDB,
 have done some testing on this patch (dbt2 runs) and it looks like we are
 getting the desired results, particularly when we spread out both the sync
 and write phases.

Thank you for testing! Yes, I'm cleaning up the patch. I changed the
configuration parameters that delay each phase of a checkpoint from
absolute times (checkpoint_xxx_duration) to values relative to
checkpoint_timeout (checkpoint_xxx_percent). Delay factors strongly depend
on the total checkpoint length, so I think the relative method is better.


I tested the patch on my machine, too. Below are the results of pgbench
with and without the patch. As you see, the patch is not a complete
solution -- a 12s maximum response time is not satisfactory yet -- but it
was better than the other settings we could choose now.


pgbench -s100 -c16 -t10
  on the machine with 1GB ram and one SCSI drive

                           |   A    |   B    |   C    |   D    |
 --------------------------+--------+--------+--------+--------+
  bgwriter_all_maxpages    |      5 |      5 |     60 |    120 |
  checkpoint_write_percent |   50.0 |      0 |      0 |      0 |
  checkpoint_nap_percent   |   10.0 |      0 |      0 |      0 |
  checkpoint_sync_percent  |   20.0 |      0 |      0 |      0 |
 --------------------------+--------+--------+--------+--------+
  pgbench tps              | 612.23 | 517.64 | 488.90 | 378.39 |
  response average         | 2.50ms | 2.89ms | 3.12ms | 3.99ms |
  response maximum         | 12.23s |123.66s | 55.09s | 36.72s |
 --------------------------+--------+--------+--------+--------+

A: Proposed configurations with patch
B: Default configurations
C: Make bgwriter aggressive
D: Make bgwriter more aggressive

Other configurations:
- full_page_writes = off
- wal_buffers = 4MB
- checkpoint_segments = 32
- checkpoint_timeout = 15min

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2007-02-26 Thread Josh Berkus
Itagaki,

 Thank you for testing! Yes, I'm cleaning up the patch. I changed the
 configuration parameters that delay each phase of a checkpoint from
 absolute times (checkpoint_xxx_duration) to values relative to
 checkpoint_timeout (checkpoint_xxx_percent). Delay factors strongly
 depend on the total checkpoint length, so I think the relative method
 is better.

Can I have a copy of the patch to add to the Sun testing queue?

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Load distributed checkpoint

2007-02-26 Thread ITAGAKI Takahiro
Josh Berkus josh@agliodbs.com wrote:

 Can I have a copy of the patch to add to the Sun testing queue?

This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.

Also, I added checking of archive_timeout during checkpoints and some
error-handling routines.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



checkpoint_02-27.patch.gz
Description: Binary data



Re: [PATCHES] [HACKERS] Load distributed checkpoint

2007-02-26 Thread Inaam Rana

On 2/26/07, ITAGAKI Takahiro [EMAIL PROTECTED] wrote:


Josh Berkus josh@agliodbs.com wrote:

 Can I have a copy of the patch to add to the Sun testing queue?

This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.

Also, checking of archive_timeout during checkpoints and some error
handling routines were added.




One of the issues we had during testing with the original patch was db stop
not working properly. I think you coded something to do a stop checkpoint
immediately, but if a checkpoint is already in progress at that time, it
would take its own time to complete.
Does this patch resolve that issue? Also, is it based on pg82stable or HEAD?

regards,
inaam







--
Inaam Rana
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Load distributed checkpoint

2007-02-02 Thread Bruce Momjian

Thread added to TODO list:

* Reduce checkpoint performance degradation by forcing data to disk
  more evenly

  http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php
  http://archives.postgresql.org/pgsql-hackers/2007-01/msg00079.php


---

ITAGAKI Takahiro wrote:
 This is a proposal for load distributed checkpoint.
 (It is presented on postgresql anniversary summit in last summer.)
 
 
 We often encounter a performance gap during checkpoints. The reason is write
 bursts: storage devices are so overworked during a checkpoint that they
 cannot service normal transaction processing.
 
 Checkpoint consists of the following four steps, and the major performance
 problem is the 2nd step, in which all dirty buffers are written without any
 interval.
 
  1. Query information (REDO pointer, next XID etc.)
  2. Write dirty pages in buffer pool
  3. Flush all modified files
  4. Update control file
 
 I suggested writing pages with sleeps in the 2nd step, using the normal
 activity of the background writer. It is something like cost-based vacuum
 delay. The background writer has two pointers, 'ALL' and 'LRU', indicating
 where to write out in the buffer pool. We can wait for the ALL clock-hand to
 go all the way around to guarantee that all pages have been written.
 
 Here is pseudo-code for the proposed method. The internal loop is just the
 same as bgwriter's activity.
 
   PrepareCheckPoint();  -- do step 1
   Reset num_of_scanned_pages by ALL activity;
   do {
   BgBufferSync();   -- do a part of step 2
   sleep(bgwriter_delay);
   } while (num_of_scanned_pages < shared_buffers);
   CreateCheckPoint();   -- do step 3 and 4
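A toy model of that loop, with stand-in numbers rather than tuned values (BgBufferSync() is reduced here to "scan a fixed batch of pages"):

```c
#include <assert.h>

/*
 * Toy model of the proposed write phase (stand-in numbers, not tuned
 * values).  Each BgBufferSync() round is reduced to "scan a fixed
 * batch of pages", followed by a sleep of bgwriter_delay
 * milliseconds.  Returns the total nap time needed for the ALL
 * clock-hand to sweep the whole buffer pool once, i.e. how long the
 * spread-out step 2 takes.
 */
long
write_phase_naptime_ms(long shared_buffers, long pages_per_round,
                       long bgwriter_delay_ms)
{
    long scanned = 0;
    long elapsed_ms = 0;

    do
    {
        scanned += pages_per_round;      /* one BgBufferSync() round */
        elapsed_ms += bgwriter_delay_ms; /* sleep(bgwriter_delay)    */
    } while (scanned < shared_buffers);

    return elapsed_ms;
}
```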
 
 
 We could accelerate the background writer to reduce the work done at
 checkpoint time instead, but that introduces another performance problem:
 extra pressure is always put on the storage devices to keep the number of
 dirty pages low.
 
 
 I'm working on automatically adjusting the progress of the checkpoint to
 checkpoint_timeout and the WAL segments limitation, to avoid overlap of two
 checkpoints. I'll post a patch sometime soon.
 
 Comments and suggestions welcome.
 
 Regards,
 ---
 ITAGAKI Takahiro
 NTT Open Source Software Center
 
 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2007-01-11 Thread Inaam Rana


No, I've not tried yet.  Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them.  I'll find information
about them (how to change the scheduler settings) and try the same
test.



I am sorry, your response just slipped by me. The docs for RHEL (I believe
you are running RHEL, which has a 2.6.9 kernel) say that it does support a
selectable I/O scheduler.

http://www.redhat.com/rhel/details/limits/

I am not sure where else to look for scheduler apart from /sys

regards,
inaam


Re: [HACKERS] Load distributed checkpoint

2007-01-09 Thread ITAGAKI Takahiro
I wrote:
 I'm thinking about generalizing your idea: adding three parameters
 to control the sleeps in each stage.

I sent a patch to -patches that adds 3+1 GUC parameters for checkpoints.
We can use three of them to control sleeps in each stage during checkpoints.
The last is an experimental approach to replace fsync() for fine control.

1. checkpoint_write_duration (default=0, in seconds)
Sets the duration of write() phase in checkpoints.
2. checkpoint_nap_duration (default=0, in seconds)
Sets the duration between write() and fsync() phases in checkpoints.
3. checkpoint_sync_duration (default=0, in seconds)
Sets the duration of fsync() phase in checkpoints.

The 1st parameter spreads out write(). If you set checkpoint_write_duration
to 90% of checkpoint_timeout, it's just the same as the patch I sent before.

The 2nd is the naptime between the write() and fsync() phases. The kernel's
writer might do much of the work if you set it to around 30-60s, which might
be useful for some traditional UNIXes, as you say. In contrast, the 1st was
the most variable on my machine somehow (Windows and Linux).

The 3rd spreads out fsync(). This parameter only works when you have several
tables or a very large table (consisting of several 1GB segment files),
because fsync() works on a per-file basis.
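In postgresql.conf terms, the experimental settings described above would look something like this (illustrative values only, not recommendations; the parameter names are from the patch under discussion, where 0 means current behavior):

```
# Experimental checkpoint smoothing (all default to 0 = current behavior)
checkpoint_write_duration = 270   # spread write() over 270s (~90% of a 5min checkpoint_timeout)
checkpoint_nap_duration   = 30    # nap between write() and fsync() phases
checkpoint_sync_duration  = 10    # spread fsync() calls over 10s
```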


Bruce Momjian [EMAIL PROTECTED] wrote:
 To summarize, if we could have fsync() only write the dirty buffers that
 happened as part of the checkpoint, we could delay the write() for the
 entire time between checkpoints, but we can't do that, so we have to
 make it user-tunable.

The 3rd has the above limitation, so I added another parameter.

4. checkpoint_sync_size (default=0, in KB)
Sets the synchronization unit of data files in checkpoints.

It uses sync_file_range or mmap/msync/munmap to divide file synchronization
into units of the specified granularity. I think 16-64MB fits machines whose
performance is restricted by fsync() during checkpoints.

The feature is incomplete. For example, sync_file_range does not actually
flush file metadata (it's the equivalent of fdatasync), so we may lose data
under the patch. That must be fixed, but I want to measure the advantage
first.
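For what it's worth, the sync_file_range approach could be sketched as follows on Linux (a sketch only; the helper name and the chunk-size parameter standing in for checkpoint_sync_size are illustrative, and the same metadata caveat as above applies):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Linux-only sketch of splitting file synchronization into units of
 * a fixed granularity with sync_file_range(), as described above.
 * Like fdatasync(), this flushes data but not file metadata.
 * Returns 0 on success, -1 on the first failure.
 */
int
sync_in_chunks(int fd, off_t file_len, off_t chunk_size)
{
    off_t offset;

    for (offset = 0; offset < file_len; offset += chunk_size)
    {
        off_t nbytes = file_len - offset;

        if (nbytes > chunk_size)
            nbytes = chunk_size;
        /* start writeback for this range and wait for it to finish */
        if (sync_file_range(fd, offset, nbytes,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            return -1;
        /* a real patch would sleep here between units */
    }
    return 0;
}
```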


I'm interested in which parameter is useful for each environment.
Any comments and testing reports will be appreciated.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2007-01-08 Thread Takayuki Tsunakawa
Happy new year

From: Simon Riggs [EMAIL PROTECTED]
 Have you tried setting deadline scheduler on the WAL device and CFQ on
 the data device? That should allow the I/Os to move through different
 queues and prevent interference.

No, I've not tried yet.  Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them.  I'll find information
about them (how to change the scheduler settings) and try the same
test.


- Original Message - 
From: Simon Riggs [EMAIL PROTECTED]
To: Takayuki Tsunakawa [EMAIL PROTECTED]
Cc: ITAGAKI Takahiro [EMAIL PROTECTED];
pgsql-hackers@postgresql.org
Sent: Thursday, December 28, 2006 7:07 AM
Subject: Re: [HACKERS] Load distributed checkpoint


 On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote:
 Hello, Itagaki-san, all

 Sorry for my long mail.  I've had trouble sending this mail because
 it's too long for pgsql-hackers to accept (I couldn't find how large
 a mail is accepted.)  So I'm trying to send it several times.
 Please see the attachment for the content.

 Your results for fsync are interesting.

 I've noticed that a checkpoint seems to increase the activity on the WAL
 drive as well as increasing I/O wait times. That doesn't correspond to
 any real increase in WAL traffic I'm aware of.

 Have you tried setting deadline scheduler on the WAL device and CFQ on
 the data device? That should allow the I/Os to move through different
 queues and prevent interference.

 -- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com








Re: [HACKERS] Load distributed checkpoint

2007-01-03 Thread Zeugswetter Andreas ADI SD

  I believe there's something similar for OS X as well. The question is:
  would it be better to do that, or to just delay calling fsync until the
  OS has had a chance to write things out.
 
 A delay is not going to help unless you can suppress additional writes
 to the file, which I don't think you can unless there's very little
 going on in the database --- dirty buffers have to get written to make
 room for other pages, checkpoint in progress or no.

But a checkpoint first writes all dirty pages, so we have more than the
average number of pages that can be replaced without a write. Thus we have
a window where we can wait until dirty pages have to be replaced again.
Since the bgwriter is sleeping until fsync, only pages that have to be
replaced will be written.
The question is how we time that window.

Andreas



Re: [HACKERS] Load distributed checkpoint

2006-12-29 Thread Jim C. Nasby
On Thu, Dec 28, 2006 at 09:28:48PM +, Heikki Linnakangas wrote:
 Tom Lane wrote:
 To my mind the problem with fsync is not that it gives us too little
 control but that it gives too much: we have to specify a particular
 order of writing out files.  What we'd really like is a version of
 sync(2) that tells us when it's done but doesn't constrain the I/O
 scheduler's choices at all.  Unfortunately there's no such API ...
 
 The problem I see with fsync is that it causes an immediate I/O storm as 
 the OS tries to flush everything out as quickly as possible. But we're 
 not in a hurry. What we'd need is a lazy fsync that would tell the 
 operating system "let me know when all these dirty buffers are written 
 to disk, but I'm not in a hurry, take your time." It wouldn't change the 
 scheduling of the writes, just inform the caller when they're done.
 
 If we wanted more precise control of the flushing, we could use 
 sync_file_range on Linux, but that's not portable. Nevertheless, I think 
  it would be OK to have an ifdef and use it on platforms that support 
 it, if it gave a benefit.
 
I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

 As a side note, with full_page_writes on, a checkpoint wouldn't actually 
 need to fsync those pages that have been written to WAL after the 
 checkpoint started. Doesn't make much difference in most cases, but we 
 could take that into account if we start taking more control of the 
 flushing.

Hrm, interesting point, but I suspect the window involved there is too
small to be worth worrying about.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Load distributed checkpoint

2006-12-29 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 I believe there's something similar for OS X as well. The question is:
 would it be better to do that, or to just delay calling fsync until the
 OS has had a chance to write things out.

A delay is not going to help unless you can suppress additional writes
to the file, which I don't think you can unless there's very little
going on in the database --- dirty buffers have to get written to make
room for other pages, checkpoint in progress or no.

regards, tom lane



Re: [HACKERS] Load distributed checkpoint

2006-12-29 Thread Bruce Momjian
Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  I believe there's something similar for OS X as well. The question is:
  would it be better to do that, or to just delay calling fsync until the
  OS has had a chance to write things out.
 
 A delay is not going to help unless you can suppress additional writes
 to the file, which I don't think you can unless there's very little
 going on in the database --- dirty buffers have to get written to make
 room for other pages, checkpoint in progress or no.

I am afraid a delay between write and fsync is the only portable option
we have right now --- there is hope that, since the checkpoint write, we
will not have a huge number of dirty buffers at the start of the
checkpoint that need to be written out.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2006-12-29 Thread Jim C. Nasby
On Fri, Dec 29, 2006 at 09:02:11PM -0500, Bruce Momjian wrote:
 Tom Lane wrote:
  Jim C. Nasby [EMAIL PROTECTED] writes:
   I believe there's something similar for OS X as well. The question is:
   would it be better to do that, or to just delay calling fsync until the
   OS has had a chance to write things out.
  
  A delay is not going to help unless you can suppress additional writes
  to the file, which I don't think you can unless there's very little
  going on in the database --- dirty buffers have to get written to make
  room for other pages, checkpoint in progress or no.
 
 I am afraid a delay between write and fsync is the only portable option
 we have right now --- there is hope that since the check point write, we
 will not have a huge number of dirty buffers at the start of the
 checkpoint that need to be written out.

We could potentially control that behavior, too. For example, trying to
suppress writes to a given file when we're getting ready to checkpoint
it (of course, separating checkpointing into a file-by-file operation is
a non-trivial change, but it should be possible).
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Ron Mayer
Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
 I have a new idea.  ...the BSD kernel...similar issue...to smooth writes:
 Linux has a more complex solution to this (of course) which has undergone a
 few generations over time. Older kernels had a user space daemon called
 bdflush which called an undocumented syscall every 5s. More recent ones have a
 kernel thread called pdflush. I think both have various mostly undocumented
 tuning knobs but neither makes any sort of guarantee about the amount of time
 a dirty buffer might live before being synced.

Earlier in this thread (around the 7th) was a discussion of
/proc/sys/vm/dirty_expire_centisecs and /proc/vm/dirty_writeback_centisecs
which seem to be the tunables that matter here.  Googling suggests that

  dirty_expire_centisecs specifies that data which has been dirty in memory
  for longer than this interval will be written out next time a pdflush
  daemon wakes up

and

  dirty_writeback_centisecs expresses the interval between those wakeups

It seems to me that the sum of the two times does determine the maximum
time before the kernel will start syncing a dirtied page.
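That sum can be stated directly (a hypothetical helper just restating the arithmetic above; the centisecond units are the kernel's):

```c
#include <assert.h>

/*
 * The arithmetic from the paragraph above, as a hypothetical helper:
 * a page dirtied just after a pdflush wakeup can sit for up to
 * dirty_expire_centisecs before it counts as "old", plus up to
 * dirty_writeback_centisecs until the next wakeup notices it, so the
 * worst-case delay before the kernel starts syncing it is roughly
 * the sum of the two tunables (an approximation, as discussed).
 */
double
max_kernel_sync_delay_secs(long dirty_expire_centisecs,
                           long dirty_writeback_centisecs)
{
    return (dirty_expire_centisecs + dirty_writeback_centisecs) / 100.0;
}
```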



Bottom line, though, is that it seems both PostgreSQL and the OS are
trying to delay writes in the hope of collapsing them, and that the
actual delay is the sum of the OS's delay and PostgreSQL's delay.  I
think Kevin Grittner's experimentation earlier in the thread did indeed
suggest that getting writes to the OS faster and letting it handle the
collapsing of the writes was an effective way of reducing painful
checkpoints.



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Jim C. Nasby
On Wed, Dec 27, 2006 at 10:54:57PM +, Simon Riggs wrote:
 On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:
  On Wed, Dec 27, 2006 at 09:24:06PM +, Simon Riggs wrote:
   On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
   
I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.  

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().  
   
   Well, its correct to say that the transfer to disk is the source of the
   problem, but that doesn't only occur when we fsync(). There are actually
   two disk storms that occur, because of the way the fs cache works. [Ron
   referred to this effect uplist]
  
  As someone looking from the outside:
  
  fsync only works on one file, so presumably the checkpoint process is
  opening each file one by one and fsyncing them. 
 
 Yes
 
  Does that make any
  difference here? Could you adjust the timing here?
 
 That's the hard bit with I/O storm 2. When you fsync a file you don't
 actually know how many blocks you're writing, plus there's no way to
 slow down those writes by putting delays between them (although it's
 possible your controller might know how to do this, I've never heard of
 one that does).

Any controller that sophisticated would likely also have a BBU and write
caching, which should greatly reduce the impact of at least the fsync
storm... unless you fill the cache. I suspect we might need a way to
control how much data we try and push out at a time to avoid that...

As for settings, I really like the simplicity of the Oracle system...
Just try to ensure recovery takes about X amount of seconds. I like
the idea of a creeping checkpoint even more; only writing a buffer out
when we need to checkpoint it makes a lot more sense to me than trying
to guess when we'll next dirty a buffer. Such a system would probably
also be a lot easier to tune than the current bgwriter, even if we
couldn't simplify it all the way to seconds for recovery.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Bruce Momjian
ITAGAKI Takahiro wrote:
 
 Bruce Momjian [EMAIL PROTECTED] wrote:
 
   566.973777
   327.158222 - (1) write()
   560.773868 - (2) sleep
   544.106645 - (3) fsync()
  
  OK, so you are saying that performance dropped only during the write(),
  and not during the fsync()?  Interesting.
 
 Almost yes, but there is a small drop in fsync. (560-540)
 
 
  I would like to know the
  results of a few tests just like you reported them above:
  
  1a) write spread out over 30 seconds
  1b) write with no delay
  
  2a) sleep(0)
  2b) sleep(30)
  
   3) fsync
  
  I would like to know the performance at each stage for each combination,
  e.g. when using 1b, 2a, 3, performance during the write() phase was X,
  during the sleep it was Y, and during the fsync it was Z. (Of course,
  sleep(0) has no stage timing.)
 
 I'm thinking about generalizing your idea: adding three parameters
 (checkpoint_write, checkpoint_naptime and checkpoint_fsync)
 to control the sleeps in each stage.
 
 1) write() spread out over 'checkpoint_write' seconds.
 2) sleep 'checkpoint_naptime' seconds between write() and fsync().
 3) fsync() spread out over 'checkpoint_fsync' seconds.
 
 If all three parameters are zero, checkpoints behave the same as now.
 If checkpoint_write = checkpoint_timeout and the other two are zero,
 it is just like my earlier proposal.
 
 
 As you might expect, I intend the above only for development purposes.
 The additional three parameters are hard for users to use. If we can pull
 out some proper values from the tests, we'd better set those values as the
 defaults. I assume we can derive them from the existing checkpoint_timeout.

Great idea, though I wouldn't bother with checkpoint_fsync.  I think
Simon's previous email spelled out the problems of trying to delay
fsyncs() --- in most cases, there will be one file with most of the I/O,
and that fsync is going to be the flood.  Basically, I think the
variability of table access is too great for the fsync delay to ever be
tunable by users.

To summarize, if we could have fsync() only write the dirty buffers that
happened as part of the checkpoint, we could delay the write() for the
entire time between checkpoints, but we can't do that, so we have to
make it user-tunable.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Jim C. Nasby
On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote:
 To summarize, if we could have fsync() only write the dirty buffers that
 happened as part of the checkpoint, we could delay the write() for the
 entire time between checkpoints, but we can't do that, so we have to
 make it user-tunable.

What about the mmap/msync(?)/munmap idea someone mentioned?
-- 
Jim Nasby    [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Bruce Momjian
Jim C. Nasby wrote:
 On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote:
  To summarize, if we could have fsync() only write the dirty buffers that
  happened as part of the checkpoint, we could delay the write() for the
  entire time between checkpoints, but we can't do that, so we have to
  make it user-tunable.
 
 What about the mmap/msync(?)/munmap idea someone mentioned?

I see that as similar to using O_DIRECT during checkpoint, which had
poor performance.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Jim C. Nasby wrote:
 What about the mmap/msync(?)/munmap idea someone mentioned?

 I see that as similar to using O_DIRECT during checkpoint, which had
 poor performance.

That's a complete nonstarter on portability grounds, even if msync gave
us the desired semantics, which it doesn't.  It's no better than fsync
for our purposes.

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files.  What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all.  Unfortunately there's no such API ...

regards, tom lane



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Jim C. Nasby wrote:
  What about the mmap/msync(?)/munmap idea someone mentioned?
 
  I see that as similar to using O_DIRECT during checkpoint, which had
  poor performance.
 
 That's a complete nonstarter on portability grounds, even if msync gave
 us the desired semantics, which it doesn't.  It's no better than fsync
 for our purposes.
 
 To my mind the problem with fsync is not that it gives us too little
 control but that it gives too much: we have to specify a particular
 order of writing out files.  What we'd really like is a version of
 sync(2) that tells us when it's done but doesn't constrain the I/O
 scheduler's choices at all.  Unfortunately there's no such API ...

Yea, we used to use sync() but that did all files, not just the
PostgreSQL ones.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2006-12-28 Thread Heikki Linnakangas

Tom Lane wrote:

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files.  What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all.  Unfortunately there's no such API ...


The problem I see with fsync is that it causes an immediate I/O storm as 
the OS tries to flush everything out as quickly as possible. But we're 
not in a hurry. What we'd need is a lazy fsync, that would tell the 
operating system let me know when all these dirty buffers are written 
to disk, but I'm not in a hurry, take your time. It wouldn't change the 
scheduling of the writes, just inform the caller when they're done.


If we wanted more precise control of the flushing, we could use 
sync_file_range on Linux, but that's not portable. Nevertheless, I think 
 it would be OK to have an ifdef and use it on platforms that support 
it, if it gave a benefit.
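As an illustration of what that ifdef'd use might look like, here is a minimal sketch for Linux. The `sync_file_range` call and the `SYNC_FILE_RANGE_WRITE` flag are real Linux API; the helper itself and its name are hypothetical, not code from any patch:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Dirty one page of a file, then ask the kernel to *start* writing it
 * back without waiting for completion.  On platforms where
 * sync_file_range is unavailable this degrades to a plain write.
 * Hypothetical helper, not actual PostgreSQL code. */
int
start_lazy_writeback(const char *path)
{
    char    page[8192];
    int     fd = open(path, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    memset(page, 'x', sizeof(page));
    if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
    {
        close(fd);
        return -1;
    }
#ifdef SYNC_FILE_RANGE_WRITE
    /* initiate writeback of the whole file, but do not wait for it */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
    {
        close(fd);
        return -1;
    }
#endif
    return close(fd);
}
```

Because `SYNC_FILE_RANGE_WRITE` only initiates writeback, a later fsync() would still be needed to know the data is durable; the point is that the later fsync should then find little left to do.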


As a side note, with full_page_writes on, a checkpoint wouldn't actually 
need to fsync those pages that have been written to WAL after the 
checkpoint started. Doesn't make much difference in most cases, but we 
could take that into account if we start taking more control of the 
flushing.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Load distributed checkpoint

2006-12-27 Thread Simon Riggs
On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:

 I assume other kernels have similar I/O smoothing, so that data sent to
 the kernel via write() gets to disk within 30 seconds.  
 
 I assume write() is not our checkpoint performance problem, but the
 transfer to disk via fsync().  

Well, it's correct to say that the transfer to disk is the source of the
problem, but that doesn't only occur when we fsync(). There are actually
two disk storms that occur, because of the way the fs cache works. [Ron
referred to this effect uplist]

Linux 2.6+ will attempt to write to disk any dirty blocks in excess of a
certain threshold, the dirty_background_ratio, which defaults to 10% of
RAM. So when the checkpoint issues lots of write() calls, we generally
exceed the threshold and then begin io storm number 1 to get us back
down to the dirty_background_ratio. When we issue the fsync() calls, we
then begin io storm number 2, which takes us down close to zero dirty
blocks (on a dedicated server). [Thanks to my colleague Richard Kennedy
for confirming this via investigation; he's been on leave throughout
this discussion, regrettably].
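For reference, the Linux knobs involved can be inspected like this (a sketch only; the /proc paths are as of Linux 2.6, and the helper names are made up):

```c
#include <stdio.h>

/* Print one of the Linux VM writeback knobs mentioned above; returns 1
 * if the knob was readable, 0 otherwise (e.g. on non-Linux systems). */
static int
show_knob(const char *path)
{
    FILE   *f = fopen(path, "r");
    char    buf[64];
    int     ok = 0;

    if (f != NULL)
    {
        if (fgets(buf, sizeof(buf), f) != NULL)
        {
            printf("%s = %s", path, buf);
            ok = 1;
        }
        fclose(f);
    }
    return ok;
}

int
inspect_writeback_tuning(void)
{
    int     n = 0;

    n += show_knob("/proc/sys/vm/dirty_background_ratio");
    n += show_knob("/proc/sys/vm/dirty_ratio");
    n += show_knob("/proc/sys/vm/dirty_expire_centisecs");
    return n;                   /* number of knobs successfully read */
}
```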

Putting delays in is very simple and does help; however, how much it
helps depends upon:
- the number of dirty blocks in shared_buffers 
- the dirty_background_ratio
- the number of dirty blocks in each file when we fsync()

For example, on a system with a very large RAM, and yet a medium write
workload, the dirty_background_ratio may never be exceeded. In that
case, all of the I/O happens during storm 2, so spreading out the
write() calls gains little. If you set dirty_background_ratio lower,
then most of the writes happen during storm 1.

During storm 2, the fsync calls write all dirty blocks in a file to
disk. In many cases, a few tables/files have all of the writes, so
adding a delay between fsyncs doesn't spread out the writes like you
would hope it would.

Most of the time, storm 1 and storm 2 run together in a continuous
stream, but sometimes you see a double peak. There is an overlap of a
few seconds between 1 and 2 in many cases.

Linux will also write blocks to disk after a period of inactivity,
dirty_expire_centisecs, which by default is 30 seconds. So putting a
delay between storm 1 and storm 2 should help matters somewhat, but 30
secs is probably almost exactly the wrong number (by chance), though I
do like the idea.

 Perhaps a simple solution is to do the
 write()'s of all dirty buffers as we do now at checkpoint time, but
 delay 30 seconds and then do fsync() on all the files. 

So yes, putting a short delay, say 10 seconds, in at that point should
help matters somewhat, sometimes. (But the exact number depends upon how
the OS is tuned.)

  The goal here is
 that during the 30-second delay, the kernel will be forcing data to the
 disk, so the fsync() we eventually do will only be for the write() of
 buffers during the 30-second delay, and because we wrote all dirty
 buffers 30 seconds ago, there shouldn't be too many of them.

...but not for that reason.

IMHO the best thing to do is to
1. put a short delay between the write() steps in the checkpoint
2. put a longer delay in between the write() phase and the fsync() phase
3. tune the OS writeback mechanism to help smoothing
Either set (1) and (2) as GUCs, or have code that reads the OS settings
and acts accordingly.

Or alternatively put, both Bruce and Itagaki-san have good ideas.

  So, in the real world, one conclusion seems to be that our existing
  method of tuning the background writer just isn't good enough for the
  average user:
  
  #bgwriter_delay = 200ms # 10-10000ms between rounds
  #bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
  #bgwriter_lru_maxpages = 5  # 0-1000 buffers max written/round
  #bgwriter_all_percent = 0.333   # 0-100% of all buffers scanned/round
  #bgwriter_all_maxpages = 5  # 0-1000 buffers max written/round
  
  These settings control what the bgwriter does, but they do not clearly
  relate to the checkpoint timing, which is the purpose of the bgwriter,
  and they don't change during the checkpoint interval, which is also less
  than ideal.  If set too aggressively, it writes too much, and if set too
  low, the checkpoint does too much I/O.

Yes, that's very clear.
 
  We clearly need more bgwriter activity as the checkpoint approaches

I'd put it that we should write a block to disk prior to checkpoint if
it appears that it won't be dirtied again if we do so. That doesn't
necessarily translate directly into *more* activity.

 , and
  one that is more auto-tuned, like many of our other parameters.  I think
  we created these settings to see how they worked in the field, so it's
  probably time to reevaluate them based on field reports.
  
  I think the bgwriter should keep track of how far it is to the next
  checkpoint, and use that information to increase write activity. 
  Basically now, during a checkpoint, the bgwriter does 

Re: [HACKERS] Load distributed checkpoint

2006-12-27 Thread Simon Riggs
On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote: 
 Hello, Itagaki-san, all
 
 Sorry for my long mail.  I've had trouble sending this mail because
 it's too long for pgsql-hackers to accept (I couldn't find out how large
 a mail is accepted).  So I'm trying to send it several times.
 Please see the attachment for the content.

Your results for fsync are interesting.

I've noticed that a checkpoint seems to increase the activity on the WAL
drive as well as increasing I/O wait times. That doesn't correspond to
any real increase in WAL traffic I'm aware of.

Have you tried setting deadline scheduler on the WAL device and CFQ on
the data device? That should allow the I/Os to move through different
queues and prevent interference.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Load distributed checkpoint

2006-12-27 Thread Martijn van Oosterhout
On Wed, Dec 27, 2006 at 09:24:06PM +, Simon Riggs wrote:
 On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
 
  I assume other kernels have similar I/O smoothing, so that data sent to
  the kernel via write() gets to disk within 30 seconds.  
  
  I assume write() is not our checkpoint performance problem, but the
  transfer to disk via fsync().  
 
 Well, it's correct to say that the transfer to disk is the source of the
 problem, but that doesn't only occur when we fsync(). There are actually
 two disk storms that occur, because of the way the fs cache works. [Ron
 referred to this effect uplist]

As someone looking from the outside:

fsync only works on one file, so presumably the checkpoint process is
opening each file one by one and fsyncing them. Does that make any
difference here? Could you adjust the timing here?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Load distributed checkpoint

2006-12-27 Thread Simon Riggs
On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:
 On Wed, Dec 27, 2006 at 09:24:06PM +, Simon Riggs wrote:
  On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
  
   I assume other kernels have similar I/O smoothing, so that data sent to
   the kernel via write() gets to disk within 30 seconds.  
   
   I assume write() is not our checkpoint performance problem, but the
   transfer to disk via fsync().  
  
  Well, it's correct to say that the transfer to disk is the source of the
  problem, but that doesn't only occur when we fsync(). There are actually
  two disk storms that occur, because of the way the fs cache works. [Ron
  referred to this effect uplist]
 
 As someone looking from the outside:
 
 fsync only works on one file, so presumably the checkpoint process is
 opening each file one by one and fsyncing them. 

Yes

 Does that make any
 difference here? Could you adjust the timing here?

That's the hard bit with io storm 2. When you fsync a file you don't
actually know how many blocks you're writing, plus there's no way to
slow down those writes by putting delays between them (although it's
possible your controller might know how to do this, I've never heard of
one that does).

If we put a delay after each fsync, that will space out all the ones
that don't need spacing out and do nothing to the ones that most need
it. Unfortunately.

IMHO there isn't any simple scheme that works all the time, for all OS
settings, default configurations and mechanisms.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Load distributed checkpoint

2006-12-27 Thread ITAGAKI Takahiro

Bruce Momjian [EMAIL PROTECTED] wrote:

  566.973777
  327.158222 - (1) write()
  560.773868 - (2) sleep
  544.106645 - (3) fsync()
 
 OK, so you are saying that performance dropped only during the write(),
 and not during the fsync()?  Interesting.

Almost yes, but there is a small drop in fsync. (560-540)


 I would like to know the
 results of a few tests just like you reported them above:
   
   1a) write spread out over 30 seconds
   1b) write with no delay
   
   2a) sleep(0)
   2b) sleep(30)
   
3) fsync
 
 I would like to know the performance at each stage for each combination,
 e.g. when using 1b, 2a, 3, performance during the write() phase was X,
 during the sleep it was Y, and during the fsync it was Z. (Of course,
 sleep(0) has no stage timing.)

I'm thinking about generalizing your idea; Adding three parameters
(checkpoint_write, checkpoint_naptime and checkpoint_fsync)
to control sleeps in each stage.

1) write() spread out over 'checkpoint_write' seconds.
2) sleep 'checkpoint_naptime' seconds between write() and fsync().
3) fsync() spread out over 'checkpoint_fsync' seconds.

If all three parameters are zero, checkpoints behave the same as now.
If checkpoint_write = checkpoint_timeout and the other two are zero,
it is just like my earlier proposal.
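Phase 1's pacing could be computed along these lines (an illustrative sketch with made-up names, not the patch itself):

```c
/* Spread nbuffers write() calls evenly over checkpoint_write seconds
 * by napping this many microseconds between writes.  Hypothetical
 * helper, not actual PostgreSQL code. */
long
pacing_nap_usecs(int nbuffers, double checkpoint_write)
{
    if (nbuffers <= 0 || checkpoint_write <= 0.0)
        return 0;               /* zero means "no pacing", as in the text */
    return (long) (checkpoint_write * 1000000.0 / nbuffers);
}
```

For example, 3000 dirty buffers spread over checkpoint_write = 30 seconds would give a 10 ms nap per write.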


As you might expect, I intend the above only for development purposes.
The three additional parameters would be hard for users to tune. If we can
pull proper values out of the tests, we had better set those values as the
defaults. I assume we can derive them from the existing checkpoint_timeout.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-26 Thread ITAGAKI Takahiro

Bruce Momjian [EMAIL PROTECTED] wrote:

 I assume write() is not our checkpoint performance problem, but the
 transfer to disk via fsync().  Perhaps a simple solution is to do the
 write()'s of all dirty buffers as we do now at checkpoint time, but
 delay 30 seconds and then do fsync() on all the files.

I think there are two platforms that have different problems in checkpoints.
It's in fsync() on one platform, and in write() on another. It is complex,
depending on the OS, the amount of memory, disks, writeback cache and so on.

 I think the basic difference between this and the proposed patch is that
 we do not put delays in the buffer write() or fsync() phases --- we just
 put a delay _between_ the phases, and wait for the kernel to smooth it
 out for us.  The kernel certainly knows more about what needs to get to
 disk, so it seems logical to let it do the I/O smoothing.

The two proposals do not conflict with each other. Also, solutions for either
platform do not have a bad effect on the other platform. Can we employ
both of them?

I tested your proposal, but it did not work on a write-critical machine.
However, if the idea works well on BSD or some other platforms, we would be
better off adopting it.

[pgbench results]
...
566.973777
327.158222 - (1) write()
560.773868 - (2) sleep
544.106645 - (3) fsync()
...

[changes in code]
  (This is a bad implementation because shutdown takes a long time!)
void
FlushBufferPool(void)
{
    BufferSync();   // (1) write -- about 20s

    time_t start = time(NULL);
    while (time(NULL) - start < 30) // (2) sleep -- 30s
    {
        pg_usleep(BgWriterDelay * 1000L);
        BgBufferSync();
        AbsorbFsyncRequests();
    }

    smgrsync(); // (3) fsync -- less than 200ms
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-26 Thread Bruce Momjian
ITAGAKI Takahiro wrote:
 
 Bruce Momjian [EMAIL PROTECTED] wrote:
 
  I assume write() is not our checkpoint performance problem, but the
  transfer to disk via fsync().  Perhaps a simple solution is to do the
  write()'s of all dirty buffers as we do now at checkpoint time, but
  delay 30 seconds and then do fsync() on all the files.
 
 I think there are two platforms that have different problems in checkpoints.
 It's in fsync() on one platform, and in write() on another. It is complex
 depending on OS, the amount of memory, disks, writeback-cache and so on.
 
  I think the basic difference between this and the proposed patch is that
  we do not put delays in the buffer write() or fsync() phases --- we just
  put a delay _between_ the phases, and wait for the kernel to smooth it
  out for us.  The kernel certainly knows more about what needs to get to
  disk, so it seems logical to let it do the I/O smoothing.
 
 Both proposals do not conflict each other. Also, solutions for either
 platform do not have bad effect on the other platform. Can we employ
 both of them?
 
 I tested your proposal but it did not work on write-critical machine.
 However, if the idea works well on BSD or some platforms, we would be
 better off buying it.
 
 [pgbench results]
 ...
 566.973777
 327.158222 - (1) write()
 560.773868 - (2) sleep
 544.106645 - (3) fsync()

OK, so you are saying that performance dropped only during the write(),
and not during the fsync()?  Interesting.  I would like to know the
results of a few tests just like you reported them above:

1a) write spread out over 30 seconds
1b) write with no delay

2a) sleep(0)
2b) sleep(30)

 3) fsync

I would like to know the performance at each stage for each combination,
e.g. when using 1b, 2a, 3, performance during the write() phase was X,
during the sleep it was Y, and during the fsync it was Z. (Of course,
sleep(0) has no stage timing.)

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2006-12-25 Thread Takayuki Tsunakawa
Hello, Inaam-san,

 There are four IO schedulers in Linux: anticipatory, CFQ (the default),
deadline, and noop. For typical OLTP-type loads, deadline is generally
recommended. If you are constrained on CPU and you have a good controller,
then it's better to use noop.
 Deadline attempts to merge requests by maintaining two red-black trees in
sector sort order, and it also ensures that a request is serviced within a
given time by using a FIFO. I don't expect it to do magic, but was wondering
whether it may dilute the issue of fsync() elbowing out WAL writes.
 You can look into /sys/block/<device>/queue/scheduler to see which
scheduler you are using.

Thank you for your information.
I could only find the following files in /sys/block/<device>/queue/:

iosched
max_hw_sectors_kb
max_sectors_kb
nr_requests
read_ahead_kb

In iosched, the following files exist:

quantum (the content is 4)
queued (the content is 8)


Re: [HACKERS] Load distributed checkpoint

2006-12-25 Thread Takayuki Tsunakawa
From: Bruce Momjian [EMAIL PROTECTED]
 On an idle system, would someone dirty a large file, and watch the disk
 I/O to see how long it takes for the I/O to complete to disk?

I ran "dd if=/dev/zero of=<file on DB disk> bs=8k count=`expr 1048576
/ 8`", that is, writing a 1GB file with 8KB write()s.  It took about 175
seconds for the kernel to flush the buffers.  I'll put below the result of
"iostat -x 5", which was started at the same time as dd.  175 seconds
corresponds to the 35 rows with nonzero %util times the 5-second interval.

Device:  rrqm/s  wrqm/s  r/s  w/s  rsec/s  wsec/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sde  0.00  981.56  0.00  50.90  0.00  8439.28  0.00  4219.64  165.80  34.38  525.01  6.25  31.82
sde  0.00  25737.00  0.00  118.60  0.00  219820.80  0.00  109910.40  1853.46  6529.74  1587.08  8.43  100.02
sde  0.00  1912.97  0.00  127.74  0.00  16325.75  0.00  8162.87  127.80  8192.67  6502.98  7.81  99.82
sde  0.00  1728.00  0.00  117.80  0.00  14745.60  0.00  7372.80  125.17  8209.36  11227.29  8.49  100.02
sde  0.00  1536.00  0.00  103.80  0.00  13107.20  0.00  6553.60  126.27  8209.10  16729.09  9.64  100.02
sde  0.00  1344.00  0.00  91.60  0.00  11468.80  0.00  5734.40  125.21  8208.69  21481.22  10.92  100.02
sde  0.00  1532.80  0.00  98.00  0.00  13081.60  0.00  6540.80  133.49  8209.34  26583.57  10.21  100.02
sde  0.00  1632.00  0.00  107.00  0.00  13926.40  0.00  6963.20  130.15  8208.89  31662.93  9.35  100.02
sde  0.00  1536.00  0.00  106.40  0.00  13107.20  0.00  6553.60  123.19  8209.66  36443.07  9.40  100.02
sde  0.00  1442.89  0.00  95.59  0.00  12312.63  0.00  6156.31  128.81  8227.23  41446.78  10.49  100.24
sde  0.00  1532.80  0.00  103.00  0.00  13081.60  0.00  6540.80  127.01  8210.77  46606.03  9.71  100.04
sde  0.00  1440.00  0.00  92.00  0.00  12288.00  0.00  6144.00  133.57  8208.82  51421.78  10.87  100.02
sde  0.00  1344.00  0.00  91.80  0.00  11468.80  0.00  5734.40  124.93  8209.86  56524.37  10.90  100.02
sde  0.00  1539.08  0.00  101.00  0.00  13133.47  0.00  6566.73  130.03  8225.59  61477.93  9.92  100.22
sde  0.00  1436.80  0.00  95.40  0.00  12262.40  0.00  6131.20  128.54  8208.88  66566.42  10.48  100.02
sde  0.00  1344.00  0.00  92.40  0.00  11468.80  0.00  5734.40  124.12  8209.47  71466.12  10.82  100.02
sde  0.00  1102.40  0.00  101.00  0.00  9408.38  0.00  4704.19  93.15  8174.36  76538.41  9.88  99.82
sde  0.00  0.00  0.00  89.00  0.00  0.00  0.00  0.00  0.00  7855.72  80795.64  11.24  100.02
sde  0.00  0.00  0.00  91.38  0.00  0.00  0.00  0.00  0.00  7422.53  81823.89  10.97  100.22
sde  0.00  0.00  0.00  96.80  0.00  0.00  0.00  0.00  0.00  6919.07  83194.91  10.33  100.02
sde  0.00  0.00  0.00  91.78  0.00  0.00  0.00  0.00  0.00  6480.85  84657.04  10.92  100.22
sde  0.00  0.00  0.00  92.60  0.00  0.00  0.00  0.00  0.00  5997.22  84749.79  10.80  100.02
sde  0.00  0.00  0.00  91.02  0.00  0.00  0.00  0.00  0.00  5528.97  85345.07  10.97  99.82
sde  0.00  0.00  0.00  102.61  0.00  0.00  0.00  0.00  0.00  5059.61  85057.91  9.77  100.22
sde  0.00  0.00  0.00  93.20  0.00  0.00  0.00  0.00  0.00  4572.57  85284.49  10.73  100.02
sde  0.00  0.00  0.00  98.20  0.00  0.00  0.00  0.00  0.00  4109.15  86086.50  10.21  100.22
sde  0.00  0.00  0.00  91.42  0.00  0.00  0.00  0.00  0.00  3611.72  86405.24  10.92  99.82
sde  0.00  0.00  0.00  100.00  0.00  0.00  0.00  0.00  0.00  3135.62  86292.49  10.00  100.02
sde  0.00  0.00  0.00  100.40  0.00  0.00  0.00  0.00  0.00  2652.63  86609.79  9.96  100.02
sde  0.00  0.00  0.00  92.80  0.00  0.00  0.00  0.00  0.00  2153.69  86168.58  10.78  100.02
sde  0.00  0.00  0.00  88.80  0.00  0.00  0.00  0.00  0.00  1694.74  86275.58  11.26  100.02
sde  0.00  0.00  0.00  98.20  0.00  0.00  0.00  0.00  0.00  1241.41  86708.40  10.19  100.02
sde  0.00  0.00  0.00  95.20  0.00  0.00  0.00  0.00  0.00  747.52  86505.59  10.51  100.02
sde  0.00  0.00  0.00  89.20  0.00  0.00  0.00  0.00  0.00  283.44  86551.11  11.21  100.02
sde  0.00  51.20  0.00  17.20  0.00  449.60  0.00  224.80  26.14  4.30  61572.65  9.05  15.56
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Takayuki Tsunakawa
From: Inaam Rana
 Which IO Shceduler (elevator) you are using?

Elevator?  Sorry, I'm not familiar with the kernel implementation, so I
don't know what it is.  My Linux distribution is Red Hat Enterprise Linux 4.0
for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp.  I probably haven't
changed any kernel settings, except for IPC settings to run PostgreSQL.





Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread ITAGAKI Takahiro
Takayuki Tsunakawa [EMAIL PROTECTED] wrote:

 (1) Default case (this is shown again for comparison and reminder)
 235  80  226  77  240
 (2) Default + WAL 1MB case
 302  328  82  330  85
 (3) Default + wal_sync_method=open_sync case
 162  67  176  67  164
 (4) (2)+(3) case
 322  350  85  321  84
 (5) (4) + /proc/sys/vm/dirty* tuning
 308  349  84  349  84

(3) is very strange. Your machine seems to be too restricted
by WAL so that other factors cannot be measured properly.


I'll send results on my machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)
- shared_buffers=1GB
- wal_sync_method = open_datasync
- wal_buffers = 1MB
- checkpoint_segments = 16
- checkpoint_timeout = 5min

I repeated pgbench -c16 -t500 -s50
and picked up results around checkpoints.

[HEAD]
...
560.8
373.5 - checkpoint is here
570.8
...

[with patch]
...
562.0
528.4 - checkpoint (fsync) is here
547.0
...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Takayuki Tsunakawa
From: Greg Smith [EMAIL PROTECTED]
 This is actually a question I'd been meaning to throw out myself to this
 list.  How hard would it be to add an internal counter to the buffer
 management scheme that kept track of the current number of dirty pages?
 I've been looking at the bufmgr code lately trying to figure out how to
 insert one as part of building an auto-tuning bgwriter, but it's unclear
 to me how I'd lock such a resource properly and scalably.  I have a
 feeling I'd be inserting a single-process locking bottleneck into that
 code with any of the naive implementations I considered.

To put it in an extreme way, how about making bgwriter count the dirty
buffers by periodically scanning all the buffers?  Do you know the book
Principles of Transaction Processing?  Jim Gray was one of the
reviewers of this book.

http://www.amazon.com/gp/aa.html?HMAC=CartId=Operation=ItemLookupItemId=1558604154ResponseGroup=Request,Large,VariationsbStyle=aaz.jpgMerchantId=Allisdetail=truebsi=Bookslogo=fooMarketplace=usAssociateTag=pocketpc

In chapter 8, the author describes fuzzy checkpoint combined with
two-checkpoint approach.  In his explanation, recovery manager (which
would be bgwriter in PostgreSQL) scans the buffers and records the
list of dirty buffers at each checkpoint.  This won't need any locking
in PostgreSQL if I understand correctly.  Then, the recovery manager
performs the next checkpoint after writing those dirty buffers.  In
two-checkpoint approach, crash recovery starts redoing from the second
to last checkpoint.  Two-checkpoint is described in Jim Gray's book,
too.  But they don't refer to how the recovery manager tunes the speed
of writing.


 slightly different from the proposals here.  What if all the database page
 writes (background writer, buffer eviction, or checkpoint scan) were
 counted and periodic fsync requests sent to the bgwriter based on that?
 For example, when I know I have a battery-backed caching controller that
 will buffer 64MB worth of data for me, if I forced a fsync after every
 6000 8K writes, no single fsync would get stuck waiting for the disk to
 write for longer than I'd like.

That seems interesting.
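The write-counting idea quoted above can be sketched as simple arithmetic (a hypothetical helper, not actual PostgreSQL code; the threshold name is made up):

```c
/* With an fsync forced after every fsync_every page writes, this is how
 * many fsyncs a run of page_writes writes causes -- one per completed
 * batch, bounding the dirty data behind any single fsync at roughly
 * fsync_every * 8KB. */
int
fsyncs_needed(int page_writes, int fsync_every)
{
    if (fsync_every <= 0)
        return 0;
    return page_writes / fsync_every;
}
```

With the 6000-write threshold from Greg's example, 12000 8K writes would force two fsyncs, each covering at most about 48MB of dirty data, comfortably inside a 64MB controller cache.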

 You can do sync writes with perfectly good performance on systems with a
 good battery-backed cache, but I think you'll get creamed in comparisons
 against MySQL on IDE disks if you start walking down that path; since
 right now a fair comparison with similar logging behavior is an even match
 there, that's a step backwards.
I wonder what characteristics SATA disks have compared to IDE.  Recent
PCs are equipped with SATA disks, aren't they?
How do you feel your approach compares to MySQL on IDE disks?

 Also on the topic of sync writes to the database proper: wouldn't using
 O_DIRECT for those potentially be counter-productive?  I was under the
 impression that one of the behaviors counted on by Postgres was that data
 evicted from its buffer cache, eventually intended for writing to disk,
 was still kept around for a bit in the OS buffer cache.  A subsequent read
 because the data was needed again might find the data already in the OS
 buffer, therefore avoiding an actual disk read; that substantially reduces
 the typical penalty for the database engine making a bad choice on what to
 evict.  I fear a move to direct writes would put more pressure on the LRU
 implementation to be very smart, and that's code that you really don't
 want to be more complicated.

I'm worried about this, too.






Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Takayuki Tsunakawa
 (3) is very strange. Your machine seems to be too restricted
 by WAL so that other factors cannot be measured properly.

Right... It takes as long as 15 seconds to fsync a 1GB file.  That's
strange.  This is a borrowed PC server, so the disk may be RAID 5?
However, the WAL disk and DB disks show the same throughput.  I'll
investigate.  I may have to find another machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)

Oh, Windows.  Maybe the fsync() problem Itagaki-san pointed out does
not exist.
BTW, your environment is showing an attractive result, isn't it?








Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Zeugswetter Andreas ADI SD

  If you use linux, try the following settings:
   1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.

You will need to pair this with bgwriter_* settings, else too few
pages are written to the OS in between checkpoints.

   2. Increase wal_buffers to reduce WAL flushing.

You will want single group writes to be able to reach 256kB, so the
default is not enough when you have plenty of RAM. You also want enough
buffers that new txns don't need to wait for an empty buffer (one that
is only freed by a write).

   3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().

O_SYNC's only advantage over fdatasync is that it saves a system call,
since it still passes through the OS cache, but the disadvantage is
that it does not let the OS group writes. Thus it is more susceptible
to too few wal_buffers. What you want is O_DIRECT + enough wal_buffers
to allow 256kB writes.

   4. Separate data and WAL files into different partitions or disks.

While this is generally suggested, I somehow doubt the validity when you
only have a few disk spindles. If, e.g., you only have 2-3 (mirrored)
disks, I wouldn't do it (at least on the average 70/30 read/write
system).
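Taken together, points 1-4 map onto a postgresql.conf fragment along
these lines; the values are illustrative assumptions only, not
recommendations:

```ini
# Illustrative only -- the right values depend on workload and hardware.
wal_buffers = 32               # 32 x 8kB pages = 256kB group writes (point 2)
wal_sync_method = open_sync    # point 3; fdatasync is the common default
# When lowering the kernel's dirty_ratio (point 1), raise the bgwriter
# settings in tandem so pages still reach the OS between checkpoints:
bgwriter_all_percent = 1.0
bgwriter_all_maxpages = 50
```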

Andreas



Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Simon Riggs
On Thu, 2006-12-21 at 18:46 +0900, ITAGAKI Takahiro wrote:
 Takayuki Tsunakawa [EMAIL PROTECTED] wrote:
 
  If you use Linux, it has very unpleasant behavior in fsync(); it locks all
  metadata of the file being fsync-ed. We have to wait for the completion of
  fsync when we do read(), write(), and even lseek().
  
  Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
  waiting for lseek() to complete when it committed.  But why does the
  backend which is syncing WAL/pg_control have to wait for syncing the
  data file?  They are, not to mention, different files, and WAL and
  data files are stored on separate disks.
 
 Backends call lseek() in planning, so they have to wait for fsync() on
 the table that they will access. Even if all of the data in the file is
 in the cache, lseek() conflicts with fsync(). You can see a lot of
 backends waiting in the planning phase during checkpoints, not the
 execution phase.

It isn't clear to me why you are doing planning during a test at all.

If you are doing replanning during test execution then the real
performance problem will be the planning, not the fact that the fsync
stops planning from happening.

Prepared queries are only replanned manually, so the chances of
replanning during a checkpoint are fairly low. So although it sounds
worrying, I'm not sure that we'll want to alter the use of lseek during
planning - though there may be other arguments also.

I have also seen cases where the WAL drive, even when separated, appears
to spike upwards during a checkpoint. My best current theory, so far
untested, is that the WAL and data drives are using the same CFQ
scheduler and that the scheduler actively slows down WAL requests when
it need not. Mounting the drives as separate block drives with separate
schedulers, CFQ for data and Deadline for WAL should help.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Bruce Momjian

I have a new idea.  Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution.  I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:


http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0

UPDATE(4)   BSD Programmer's Manual  
UPDATE(4)

NAME
 update - trickle sync filesystem caches to disk

DESCRIPTION
 At system boot time, the kernel starts filesys_syncer, process
 3.  This process helps protect the integrity of disk volumes
 by ensuring that volatile cached filesystem data are written
 to disk within the vfs.generic.syncdelay interval which defaults
 to thirty seconds (see sysctl(8)).  When a vnode is first
 written it is placed vfs.generic.syncdelay seconds down on
 the trickle sync queue.  If it still exists and has dirty data
 when it reaches the top of the queue, filesys_syncer writes
 it to disk.  This approach evens out the load on the underlying
 I/O system and avoids writing short-lived files.  The papers
 on trickle-sync tend to favor aging based on buffers rather
 than files.  However, BSD/OS synchronizes on file age rather
 than buffer age because the data structures are much smaller
 as there are typically far fewer files than buffers.  Although
 this can make the I/O bursty when a big file is written to
 disk, it is still much better than the wholesale writes that
 were being done by the historic update process which wrote
 all dirty data buffers every 30 seconds.  It also adapts much
 better to the soft update code which wants to control aging
 to improve performance (inodes age in one third of
 vfs.generic.syncdelay seconds, directories in one half of
 vfs.generic.syncdelay seconds).  This ordering ensures that
 most dependencies are gone (e.g., inodes are written when
 directory entries want to go to disk) reducing the amount
 of work that the soft update code needs to do.

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.  

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().  Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files.  The goal here is
that during the 30-second delay, the kernel will be forcing data to the
disk, so the fsync() we eventually do will only be for the write() of
buffers during the 30-second delay, and because we wrote all dirty
buffers 30 seconds ago, there shouldn't be too many of them.

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us.  The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

---

Bruce Momjian wrote:
 
 I have thought a while about this and I have some ideas.
 
 Ideally, we would be able to trickle the sync of individuals blocks
 during the checkpoint, but we can't because we rely on the kernel to
 sync all dirty blocks that haven't made it to disk using fsync().  We
 could trickle the fsync() calls, but that just extends the amount of
 data we are writing that has been dirtied post-checkpoint.  In an ideal
 world, we would be able to fsync() only part of a file at a time, and
 only those blocks that were dirtied pre-checkpoint, but I don't see that
 happening anytime soon (and one reason why many commercial databases
 bypass the kernel cache).
 
 So, in the real world, one conclusion seems to be that our existing
 method of tuning the background writer just isn't good enough for the
 average user:
 
   #bgwriter_delay = 200ms # 10-10000ms between rounds
   #bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
   #bgwriter_lru_maxpages = 5  # 0-1000 buffers max written/round
   #bgwriter_all_percent = 0.333   # 0-100% of all buffers scanned/round
   #bgwriter_all_maxpages = 5  # 0-1000 buffers max written/round
 
 These settings control what the bgwriter does, but they do not clearly
 relate to the checkpoint timing, which is the purpose of the bgwriter,
 and they don't change during the checkpoint interval, which is also less
 than ideal.  If set too aggressively, it writes too 

Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Inaam Rana

On 12/22/06, Takayuki Tsunakawa [EMAIL PROTECTED] wrote:


 From: Inaam Rana
 Which IO Shceduler (elevator) you are using?

Elevator?  Sorry, I'm not familiar with the kernel implementation, so I
don't know what it is.  My Linux distribution is Red Hat Enterprise
Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp.  I probably
haven't changed any kernel settings, except for IPC settings to run
PostgreSQL.



There are four IO schedulers in Linux: anticipatory, CFQ (the default),
deadline, and noop. For typical OLTP type loads, deadline is generally
recommended. If you are constrained on CPU and you have a good
controller, then it's better to use noop.
Deadline attempts to merge requests by maintaining two red-black trees
in sector sort order, and it also ensures that a request is serviced in
a given time by using a FIFO. I don't expect it to do magic, but I was
wondering whether it might dilute the issue of fsync() elbowing out WAL
writes.

You can look into /sys/block/device/queue/scheduler to see which scheduler
you are using.
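The active elevator is the bracketed entry in that file; a small helper
to pull it out (the helper and device names here are illustrative, not
standard tooling):

```shell
# Print the active elevator from a scheduler line; the kernel marks
# the active one with brackets, e.g. "noop anticipatory deadline [cfq]".
active_scheduler() {
    echo "$1" | sed 's/.*\[\(.*\)\].*/\1/'
}

# On a real system (device name is an example):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
# And to switch, as root, kernel permitting:
#   echo deadline > /sys/block/sda/queue/scheduler
```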

regards,
inaam


--
Inaam Rana
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Gregory Stark

Bruce Momjian [EMAIL PROTECTED] writes:

 I have a new idea.  Rather than increasing write activity as we approach
 checkpoint, I think there is an easier solution.  I am very familiar
 with the BSD kernel, and it seems they have a similar issue in trying to
 smooth writes:

Just to give a bit of context for this. The traditional mechanism for syncing
buffers to disk on BSD which this daemon was a replacement for was to simply
call sync every 30s. Compared to that this daemon certainly smooths the I/O
out over the 30s window...

Linux has a more complex solution to this (of course) which has undergone a
few generations over time. Older kernels had a user space daemon called
bdflush which called an undocumented syscall every 5s. More recent ones have a
kernel thread called pdflush. I think both have various mostly undocumented
tuning knobs but neither makes any sort of guarantee about the amount of time
a dirty buffer might live before being synced.

Your thinking is correct but that's already the whole point of bgwriter isn't
it? To get the buffers out to the kernel early in the checkpoint interval so
that come checkpoint time they're hopefully already flushed to disk. As long
as your checkpoint interval is well over 30s only the last 30s (or so, it's a
bit fuzzier on Linux) should still be at risk of being pending.

I think the main problem with an additional pause in the hopes of getting more
buffers synced is that during the 30s pause on a busy system there would be a
continual stream of new dirty buffers being created as bgwriter works and
other backends need to reuse pages. So when the fsync is eventually called
there will still be a large amount of i/o to do. Fundamentally the problem is
that fsync is too blunt an instrument. We only need to fsync the buffers we
care about, not the entire file.
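For what it's worth, recent Linux kernels (2.6.17 and later) do expose a
sharper instrument, sync_file_range(), which flushes just a byte range
of a file. A hedged sketch of that direction, not anything PostgreSQL
currently uses:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Start writeback for [start, start+len) and wait for it to complete,
 * leaving the rest of the file's dirty pages alone.  Linux-only. */
int flush_range(int fd, off_t start, off_t len)
{
    return sync_file_range(fd, start, len,
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Whether its semantics (no metadata flushing, no guarantee about disk
write caches) are strong enough for checkpoint durability is a separate
question.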


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Greg Smith

On Fri, 22 Dec 2006, Simon Riggs wrote:


I have also seen cases where the WAL drive, even when separated, appears
to spike upwards during a checkpoint. My best current theory, so far
untested, is that the WAL and data drives are using the same CFQ
scheduler and that the scheduler actively slows down WAL requests when
it need not. Mounting the drives as separate block drives with separate
schedulers, CFQ for data and Deadline for WAL should help.


The situation I've been seeing is that the database needs a new block to 
complete a query and issues a read request to get it, but that read is 
behind the big checkpoint fsync.  Client sits there for quite some time 
waiting for the fsync to finish before it gets the data it needs, and now 
your trivial select took seconds to complete.  It's fairly easy to 
replicate this problem using pgbench on Linux--I've seen a query sit there 
for 15 seconds when going out of my way to aggravate the behavior.  One of 
Takayuki's posts here mentioned a worst-case delay of 13 seconds, that's 
the problem rearing its ugly head.


You may be right that what you're seeing would be solved with a more 
complicated tuning on a per-device basis (which, by the way, isn't 
available unless you're running a more recent Linux kernel than many 
distributions have available).  You can tune the schedulers all day and 
not make a lick of difference to what I've been running into; I know, I 
tried.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Load distributed checkpoint

2006-12-22 Thread Bruce Momjian
Gregory Stark wrote:
 
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  I have a new idea.  Rather than increasing write activity as we approach
  checkpoint, I think there is an easier solution.  I am very familiar
  with the BSD kernel, and it seems they have a similar issue in trying to
  smooth writes:
 
 Just to give a bit of context for this. The traditional mechanism for syncing
 buffers to disk on BSD which this daemon was a replacement for was to simply
 call sync every 30s. Compared to that this daemon certainly smooths the I/O
 out over the 30s window...
 
 Linux has a more complex solution to this (of course) which has undergone a
 few generations over time. Older kernels had a user space daemon called
 bdflush which called an undocumented syscall every 5s. More recent ones have a
 kernel thread called pdflush. I think both have various mostly undocumented
 tuning knobs but neither makes any sort of guarantee about the amount of time
 a dirty buffer might live before being synced.
 
 Your thinking is correct but that's already the whole point of bgwriter isn't
 it? To get the buffers out to the kernel early in the checkpoint interval so
 that come checkpoint time they're hopefully already flushed to disk. As long
 as your checkpoint interval is well over 30s only the last 30s (or so, it's a
 bit fuzzier on Linux) should still be at risk of being pending.
 
 I think the main problem with an additional pause in the hopes of getting more
 buffers synced is that during the 30s pause on a busy system there would be a
 continual stream of new dirty buffers being created as bgwriter works and
 other backends need to reuse pages. So when the fsync is eventually called
 there will still be a large amount of i/o to do. Fundamentally the problem is
 that fsync is too blunt an instrument. We only need to fsync the buffers we
 care about, not the entire file.

Well, one idea would be for the bgwriter not to do many write()'s
between the massive checkpoint write()'s and the fsync()'s.  That would
cut down on the extra I/O that fsync() would have to do.

The problem I see with making the bgwriter do more writes between
checkpoints is the overhead of those scans, and the overhead of doing
writes for pages that will later be dirtied again before the
checkpoint.  With the delay-between-stages idea, we don't need to guess
how aggressive the bgwriter needs to be --- we can just do the writes,
and wait for a while.

On an idle system, would someone dirty a large file, and watch the disk
I/O to see how long it takes for the I/O to complete to disk?
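A rough way to run that experiment on Linux is to dirty a file with
plain buffered writes and watch the kernel's Dirty counter drain; the
sizes and sampling interval here are deliberately small for
illustration, and would be scaled up for a real test:

```shell
# dirty_file <path> <megabytes>: dirty a file with buffered writes,
# no fsync, so the pages sit in the OS cache until writeback runs.
dirty_file() {
    dd if=/dev/zero of="$1" bs=1M count="$2" 2>/dev/null
}

dirty_file bigfile 4
for i in 1 2 3; do
    grep '^Dirty' /proc/meminfo    # should fall as pdflush writes back
    sleep 1
done
```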

In what we have now, we are either having the bgwriter do too much I/O
between checkpoints, or guaranteeing an I/O storm during a checkpoint by
doing lots of write()'s and then calling fsync() right away.  I don't
see how we are ever going to get that properly tuned.

Would someone code up a patch and test it?

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
From: ITAGAKI Takahiro [EMAIL PROTECTED]
 You were running the test on the very memory-dependent machine.
 shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
 That would be why the patch did not work. I tested it with DBT-2,
 10GB of data and 2GB of memory. Storage is always the main part of
 performance here, even not in checkpoints.

Yes, I used half the size of RAM as the shared buffers, which is
reasonable.  And I cached all the data.  The effect of fsync() is a
heavier offence, isn't it?  System administrators would say I have
enough memory.  The data hasn't exhausted the DB cache yet.  But the
users complain to me about the response.  Why?  What should I do?
What?  Checkpoint??  Why doesn't PostgreSQL take care of frontend
users?
BTW, is DBT-2 an OLTP benchmark which randomly accesses some parts of
the data, or a batch application which accesses all data?  I'm not
familiar with it.  I know that IPA opens it to the public.

 If you use Linux, it has very unpleasant behavior in fsync(); it
 locks all metadata of the file being fsync-ed. We have to wait for
 the completion of fsync when we do read(), write(), and even lseek().
 Almost all of your data is in the accounts table and it was stored in
 a single file. All transactions must wait for fsync to the single
 largest file, so you saw the bottleneck was in the fsync.

Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed.  But why does the
backend which is syncing WAL/pg_control have to wait for syncing the
data file?  They are, not to mention, different files, and WAL and
data files are stored on separate disks.


 [Conclusion]
 I believe that the problem cannot be solved in a real sense by
 avoiding fsync/fdatasync().

 I think so, too. However, I assume we can resolve a part of the
 checkpoint spikes with smoothing of write() alone.

First, what's the goal (if possible, numerically)?  Have you explained
to community members why the patch would help many people?  At least,
I haven't heard it argued that fsync() can be seriously bad and that we
should close our eyes to what fsync() does.
By the way, what good results did you get with DBT-2?  If you don't
mind, can you show us?


 BTW, can we use the same way for fsync? We call fsync()s on all
 modified files without rest in mdsync(), but it's not difficult at
 all to insert sleeps between fsync()s. Do you think it helps us? One
 of the issues is that we have to sleep per file, which is maybe a
 rough granularity.

No, it definitely won't help us.  There is no reason why it would help.
It might help in some limited environments, though; how can we
characterize such environments?  Can we really tell users our approach
helps our environment but won't help yours, so try the kernel VM
settings, good luck!?
We have to consider this seriously.  I think it's time to face the
problem, and we should follow the approaches of experts like Jim Gray
and the DBMS vendors, unless we have a new clever idea of our own.





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread ITAGAKI Takahiro

Takayuki Tsunakawa [EMAIL PROTECTED] wrote:

  If you use Linux, it has very unpleasant behavior in fsync(); it locks all
  metadata of the file being fsync-ed. We have to wait for the completion of
  fsync when we do read(), write(), and even lseek().
 
 Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
 waiting for lseek() to complete when it committed.  But why does the
 backend which is syncing WAL/pg_control have to wait for syncing the
 data file?  They are, not to mention, different files, and WAL and
 data files are stored on separate disks.

Backends call lseek() in planning, so they have to wait for fsync() on
the table that they will access. Even if all of the data in the file is
in the cache, lseek() conflicts with fsync(). You can see a lot of
backends waiting in the planning phase during checkpoints, not the
execution phase.


  it's not difficult at all to insert sleeps between fsync()s.

 Can we say our approach helps our
 environments, but it won't help you.  The kernel VM settings may help
 you.  Good luck!?

I didn't say such a thing at all.
There are several opinions in the discussion:
  1. High bgwriter setting is enough!
  2. Change your OS :-)
  3. Use O_SYNC or O_DIRECT, but very poor performance.
  4. We may settle for single fsync(), but not many fsync()s in a short time.
I just suggested 4.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
To: Takayuki Tsunakawa [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org
Sent: Thursday, December 21, 2006 6:46 PM
Subject: Re: [HACKERS] Load distributed checkpoint



From: ITAGAKI Takahiro [EMAIL PROTECTED]
 Takayuki Tsunakawa [EMAIL PROTECTED] wrote:
 Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
 waiting for lseek() to complete when it committed.  But why does the
 backend which is syncing WAL/pg_control have to wait for syncing the
 data file?  They are, not to mention, different files, and WAL and
 data files are stored on separate disks.

 Backends call lseek() in planning, so they have to wait for fsync()
 on the table that they will access. Even if all of the data in the
 file is in the cache, lseek() conflicts with fsync(). You can see a
 lot of backends waiting in the planning phase during checkpoints, not
 the execution phase.

I see.  I found one backend like the following, but in my case only
one out of 16 backends.  Most of the others are waiting to acquire the
WALWriteLock.

#0  0x003a629c6902 in __lseek_nocancel () from
/lib64/tls/libc.so.6
#1  0x0056789f in FileSeek ()
#2  0x00574053 in mdnblocks ()
#3  0x00574f4a in smgrnblocks ()
#4  0x005489e8 in estimate_rel_size ()
#5  0x00548bee in get_relation_info ()
#6  0x0054aa3d in build_simple_rel ()
#7  0x00539c6b in add_base_rels_to_query ()
#8  0x0053b955 in query_planner ()
#9  0x0053c1c9 in grouping_planner ()
#10 0x0053d3b4 in subquery_planner ()
#11 0x0053d5b3 in planner ()
#12 0x005778fc in pg_plan_query ()
#13 0x0057798c in pg_plan_queries ()
#14 0x00577c53 in exec_simple_query ()






Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Martijn van Oosterhout
On Thu, Dec 21, 2006 at 06:46:36PM +0900, ITAGAKI Takahiro wrote:
  Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
  waiting for lseek() to complete when it committed.  But why does the
  backend which is syncing WAL/pg_control have to wait for syncing the
  data file?  They are, not to mention, different files, and WAL and
  data files are stored on separate disks.
 
 Backends call lseek() in planning, so they have to wait for fsync() on
 the table that they will access. Even if all of the data in the file is
 in the cache, lseek() conflicts with fsync(). You can see a lot of
 backends waiting in the planning phase during checkpoints, not the
 execution phase.

Hmm, there are other ways to sync parts of a file. For example, doing an
mmap()/msync()/munmap() cycle to start an asynchronous flush. But given
what you're saying, that might suffer from the same problem.
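The cycle described above might look something like this sketch (the
helper name is made up, and offsets must be page-aligned):

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Map one chunk of the file, ask the kernel to schedule writeback of
 * its dirty pages without waiting for completion (MS_ASYNC), and
 * unmap again. */
int flush_chunk_async(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED)
        return -1;
    int rc = msync(p, len, MS_ASYNC);
    munmap(p, len);
    return rc;
}
```

As noted, if the kernel serializes this against other operations on the
file the same way it does fsync(), it buys nothing.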

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Zeugswetter Andreas ADI SD

  You were running the test on the very memory-dependent machine.
  shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
  That would be why the patch did not work. I tested it with DBT-2,
  10GB of data and 2GB of memory. Storage is always the main part of
  performance here, even not in checkpoints.
 
 Yes, I used half the size of RAM as the shared buffers, which is
 reasonable.  And I cached all the data.

For pg, half RAM for shared_buffers is too much. The ratio is good for
other db software that does not use the OS cache.

Andreas



Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
- Original Message - 
From: Zeugswetter Andreas ADI SD [EMAIL PROTECTED]
To: Takayuki Tsunakawa [EMAIL PROTECTED]; ITAGAKI
Takahiro [EMAIL PROTECTED]
  Yes, I used half the size of RAM as the shared buffers, which is
  reasonable.  And I cached all the data.

 For pg, half RAM for shared_buffers is too much. The ratio is good
for
 other db software, that does not use the OS cache.

What percentage of RAM is recommended for shared buffers in general?
40%?  30%?  Or is the general recommendation more like according to
the amount of your data, this much RAM should be left for the kernel
cache?  But that's the story on Linux; it may be different for other
OSes.
Hmm, if that is so, it sounds hard for system designers/administrators
to judge.









Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread ITAGAKI Takahiro

Takayuki Tsunakawa [EMAIL PROTECTED] wrote:

  For pg, half RAM for shared_buffers is too much. The ratio is good for
  other db software, that does not use the OS cache.
 
 What percentage of RAM is recommended for shared buffers in general?
 40%?  30%?  Or, is the general recommendation like According to the
 amount of your data, this much RAM should be left for the kernel
 cache.  But that's the story on Linux.  It may be different for other
 OSes.?
 Hmm,  if it is so, it sounds hard for system designers/administrators
 to judge.

If you use Linux, try the following settings:
  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
  2. Increase wal_buffers to reduce WAL flushing.
  3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().
  4. Separate data and WAL files into different partitions or disks.

I suppose 1 is important for you, because the kernel will not write
dirty buffers until 10% of buffers become dirty with the default
settings. You have large memory (8GB), but a small data set (800MB).
So the kernel almost never writes buffers except at checkpoints.
Accumulated dirty buffers are then written in a burst by fsync().
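For reference, the knobs in point 1 live under /proc/sys/vm, and the
current values can be read without root (the defaults shown are the
ones from the environment discussed here):

```shell
# Percentage of memory that may be dirty before background writeback
# starts, and before writers are forced to flush synchronously:
cat /proc/sys/vm/dirty_background_ratio   # was 10 in this setup
cat /proc/sys/vm/dirty_ratio              # was 40 in this setup
# To make the kernel flush sooner (as root; values illustrative):
#   sysctl -w vm.dirty_background_ratio=1
#   sysctl -w vm.dirty_ratio=4
```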


We would be happy if we were free from such a difficult combination of
tuning. If you have *ideas for improvements*, please suggest them.
I think we've already understood the *problem itself*.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
Hello, Itagaki-san,

Thank you for an interesting piece of information.

From: ITAGAKI Takahiro [EMAIL PROTECTED]
 If you use Linux, try the following settings:
  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
  2. Increase wal_buffers to reduce WAL flushing.
  3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().
  4. Separate data and WAL files into different partitions or disks.

 I suppose 1 is important for you, because the kernel will not write
 dirty buffers until 10% of buffers become dirty with the default
 settings. You have large memory (8GB), but a small data set (800MB).
 So the kernel almost never writes buffers except at checkpoints.
 Accumulated dirty buffers are then written in a burst by fsync().

I'll show the results of this tuning to share information with people
who don't have experience of this kind.
The numbers shown below are the tps when running pgbench -c16 -t100
postgres five times in succession.

(1) Default case(this is show again for comparison and reminder)
The bgwriter_* and checkpoint_* are set to those defaults.
wal_buffers and wal_sync_method are also set to those defaults (64kB
and fdatasync respectively.)

235  80  226  77  240


(2) Default + WAL 1MB case
The configuration is the same as case (1) except that wal_buffers is
set to 1024kB.

302  328  82  330  85

This is better improvement than I expected.


(3) Default + wal_sync_method=open_sync case
The configuration is the same as case (1) except that wal_sync_method
is set to open_sync.

162  67  176  67  164

Too bad compared to case (2).  Do you know the reason?


(4) (2)+(3) case

322  350  85  321  84

This is good, too.


(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
changed from 40 to 4.

308  349  84  349  84

The tuning of the kernel cache doesn't appear to bring a performance
improvement in my environment.  Does the kernel still wait too long
before it starts flushing dirty buffers because the cache is large?  If
so, the increasing amount of RAM available may cause trouble more
frequently in the near future.  Do the dirty_*_ratio settings accept
values less than 1?

BTW, in case (1), the best response time of a transaction was 6
milliseconds.  On the other hand, the worst response time was 13
seconds.


 We would be happy if we would be free from a difficult combination
 of tuning. If you have *idea for improvements*, please suggest it.
 I think we've already understood *problem itself*.

I agree with you.  Let's make the ideas more concrete, doing some
experimentations.






Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Greg Smith

On Wed, 20 Dec 2006, Inaam Rana wrote:

Talking of bgwriter_* parameters I think we are missing a crucial 
internal counter i.e. number of dirty pages. How much work bgwriter has 
to do at each wakeup call should be a function of total buffers and 
currently dirty buffers.


This is actually a question I'd been meaning to throw out myself to this 
list.  How hard would it be to add an internal counter to the buffer 
management scheme that kept track of the current number of dirty pages? 
I've been looking at the bufmgr code lately trying to figure out how to 
insert one as part of building an auto-tuning bgwriter, but it's unclear 
to me how I'd lock such a resource properly and scalably.  I have a 
feeling I'd be inserting a single-process locking bottleneck into that 
code with any of the naive implementations I considered.
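One low-overhead possibility for such a counter is a single atomic integer bumped wherever a buffer's dirty flag transitions, which avoids taking any new lock. The sketch below is purely illustrative — C11 atomics rather than PostgreSQL's own primitives, and the function names are hypothetical, not bufmgr APIs:

```c
/* Hypothetical sketch of a lock-free dirty-buffer counter (C11 atomics).
 * Names are illustrative; this is not actual bufmgr code. */
#include <stdatomic.h>

static atomic_int ndirty_pages;

/* Call where a clean buffer first becomes dirty. */
void count_buffer_dirtied(void)
{
    atomic_fetch_add(&ndirty_pages, 1);
}

/* Call where a dirty buffer is written out and marked clean. */
void count_buffer_cleaned(void)
{
    atomic_fetch_sub(&ndirty_pages, 1);
}

/* Approximate current number of dirty pages (racy, but good enough
 * for sizing the bgwriter's next round of work). */
int dirty_page_count(void)
{
    return atomic_load(&ndirty_pages);
}
```

Since the value would only be used to size the bgwriter's next round of work, a slightly stale reading is harmless, so no heavier synchronization should be needed.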


The main problem I've been seeing is also long waits stuck behind a slow 
fsync on Linux.  What I've been moving toward testing is an approach 
slightly different from the proposals here.  What if all the database page 
writes (background writer, buffer eviction, or checkpoint scan) were 
counted and periodic fsync requests sent to the bgwriter based on that? 
For example, when I know I have a battery-backed caching controller that 
will buffer 64MB worth of data for me, if I forced a fsync after every 
6000 8K writes, no single fsync would get stuck waiting for the disk to 
write for longer than I'd like.


Give the admin a max_writes_before_sync parameter, make the default of 0 
work just like the current behavior, and off you go; a simple tunable that 
doesn't require a complicated scheme to implement or break anybody's 
existing setup.  Combined with a properly tuned background writer, that 
would solve the issues I've been running into.  It would even make the 
problem of Linux caching too many writes until checkpoint time go away (I 
know how to eliminate that by adjusting caching policy, but I have to be 
root to do it; a DBA should be able to work around that issue even if they 
don't have access to the kernel tunables.)


While I'm all for testing to prove me wrong, my gut feel is that going all 
the way to sync writes a la Oracle is a doomed approach, particularly on 
low-end hardware where they're super expensive.  Following The Oracle Way 
is a good roadmap for a lot of things, but I wouldn't put building a lean 
enough database to run on modest hardware on that list.  You can do sync 
writes with perfectly good performance on systems with a good 
battery-backed cache, but I think you'll get creamed in comparisons 
against MySQL on IDE disks if you start walking down that path; since 
right now a fair comparison with similar logging behavior is an even match 
there, that's a step backwards.


Also on the topic of sync writes to the database proper:  wouldn't using 
O_DIRECT for those be potentially counter-productive?  I was under the 
impression that one of the behaviors counted on by Postgres was that data 
evicted from its buffer cache, eventually intended for writing to disk, 
was still kept around for a bit in the OS buffer cache.  A subsequent read 
because the data was needed again might find the data already in the OS 
buffer, therefore avoiding an actual disk read; that substantially reduces 
the typical penalty for the database engine making a bad choice on what to 
evict.  I fear a move to direct writes would put more pressure on the LRU 
implementation to be very smart, and that's code that you really don't 
want to be more complicated.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Takayuki Tsunakawa
From: Takayuki Tsunakawa [EMAIL PROTECTED]
 (5) (4) + /proc/sys/vm/dirty* tuning
 dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
 changed from 40 to 4.

 308  349  84  349  84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showed the following tps for case (5).

323  350  340  59  225

The best response time was 4 msec, and the worst one was 16 seconds.


- Original Message - 
From: Takayuki Tsunakawa [EMAIL PROTECTED]
To: ITAGAKI Takahiro [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org
Sent: Friday, December 22, 2006 3:20 PM
Subject: Re: [HACKERS] Load distributed checkpoint


 Hello, Itagaki-san,

 Thank you for an interesting piece of information.

 From: ITAGAKI Takahiro [EMAIL PROTECTED]
 If you use linux, try the following settings:
  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
  2. Increase wal_buffers to reduce WAL flushing.
  3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().
  4. Separate data and WAL files into different partitions or disks.

 I suppose 1 is important for you, because the kernel will not write
 dirty buffers until 10% of buffers become dirty with the default
 settings.  You have large memory (8GB) but a small data set (800MB),
 so the kernel almost never writes buffers except in checkpoints.  The
 accumulated dirty buffers are then written in a burst by fsync().







Re: [HACKERS] Load distributed checkpoint

2006-12-21 Thread Inaam Rana

On 12/22/06, Takayuki Tsunakawa [EMAIL PROTECTED] wrote:


From: Takayuki Tsunakawa [EMAIL PROTECTED]
 (5) (4) + /proc/sys/vm/dirty* tuning
 dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
 changed from 40 to 4.

 308  349  84  349  84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showed the following tps for case (5).

323  350  340  59  225

The best response time was 4 msec, and the worst one was 16 seconds.




Which I/O scheduler (elevator) are you using?

--
Inaam Rana
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Takayuki Tsunakawa
Hello, Itagaki-san, all

I have to report a sad result.  Your patch didn't work.  Let's
consider the solution together.  What you are addressing is very
important for the system designers in the real world -- smoothing
response time.

Recall that unpatched PostgreSQL showed the following tps's in case
(1) (i.e. with default bgwriter_* and checkpoint_* settings.)

235  80  226 77  240

The patched PostgreSQL showed the following tps's:

230  228  77  209  66

[disk usage]
The same tendency can be seen as with the unpatched PostgreSQL.  That
is: when the tps is low, the %util of the disk for data files is high,
and the %util of the disk for WAL is low.  Why is transaction logging
disturbed by cleaning and/or syncing activity?  While the bgwriter is
fsync()ing, it does not lock any data structures that the transactions
want to access.  Even though they share the same SCSI controller and
bus, they are different disks.  The bandwidth does not appear to be
exhausted, since Ultra320 is said to have 256MB/s of bandwidth in
practice.
(Recall that WAL is on sdd and data files are on sde.)

Device:  rrqm/s  wrqm/s   r/s   w/s     rsec/s  wsec/s    rkB/s  wkB/s     avgrq-sz  avgqu-sz  await     svctm   %util
sdd      0.00    810.78   0.00  102.20  0.00    7306.99   0.00   3653.49   71.50     1.12      10.95     7.32    74.77
sde      0.00    25.35    0.00  6.19    0.00    252.30    0.00   126.15    40.77     0.50      81.32     5.94    3.67
sdd      0.00    884.20   0.00  126.00  0.00    8080.00   0.00   4040.00   64.13     1.26      10.00     7.11    89.64
sde      0.00    21.40    0.00  5.00    0.00    211.20    0.00   105.60    42.24     0.31      62.56     6.52    3.26
sdd      0.00    924.80   0.00  116.20  0.00    8326.40   0.00   4163.20   71.66     1.23      10.59     7.37    85.64
sde      0.00    27.60    0.00  26.60   0.00    433.60    0.00   216.80    16.30     4.24      159.29    2.44    6.50
sdd      0.00    721.20   0.00  102.40  0.00    6588.80   0.00   3294.40   64.34     0.99      9.71      7.07    72.40
sde      0.00    1446.80  0.00  101.60  0.00    20289.60  0.00   10144.80  199.70    1192.40   572.45    2.29    23.30
sdd      0.00    0.00     0.00  0.20    0.00    1.60      0.00   0.80      8.00      0.11      539.00    539.00  10.80
sde      0.00    0.00     0.00  452.10  0.00    0.00      0.00   0.00      0.00      3829.57   3715.83   2.22    100.22
sdd      0.00    0.00     0.00  0.00    0.00    0.00      0.00   0.00      0.00      0.00      0.00      0.00    0.00
sde      0.00    0.00     0.00  349.80  0.00    0.00      0.00   0.00      0.00      1745.52   8515.74   2.86    100.02
sdd      0.00    442.40   0.00  51.00   0.00    3948.80   0.00   1974.40   77.43     0.60      11.73     7.54    38.46
sde      0.00    2.80     0.00  184.00  0.00    25.60     0.00   12.80     0.14      277.52    12629.41  3.19    58.74
sdd      0.00    898.00   0.00  124.80  0.00    8182.40   0.00   4091.20   65.56     1.30      10.40     7.24    90.30
sde      0.00    19.20    0.00  3.80    0.00    184.00    0.00   92.00     48.42     0.24      62.11     14.11   5.36
sdd      0.00    842.28   0.00  109.02  0.00    7612.02   0.00   3806.01   69.82     1.33      12.26     8.35    91.02
sde      0.00    45.49    0.00  46.89   0.00    739.08    0.00   369.54    15.76     9.04      192.73    3.38    15.85
sdd      0.00    1198.41  0.00  71.51   0.00    10505.18  0.00   5252.59   146.90    128.19    99.76     13.48   96.43
sde      0.00    1357.77  0.00  199.80  0.00    19263.75  0.00   9631.87   96.41     2251.09   1179.42   2.39    47.81
sdd      0.00    0.00     0.00  7.20    0.00    0.00      0.00   0.00      0.00      203.87    5671.83   138.92  100.02
sde      0.00    0.00     0.00  409.60  0.00    0.00      0.00   0.00      0.00      3171.04   4779.83   2.44    100.02
sdd      0.00    0.00     0.00  17.80   0.00    0.00      0.00   0.00      0.00      137.87    10240.90  56.19   100.02
sde      0.00    0.00     0.00  240.60  0.00    0.00      0.00   0.00      0.00      1573.85   9815.29   4.16    100.02
sdd      0.00    109.80   0.00  35.40   0.00    1012.80   0.00   506.40    28.61     42.14     7974.47   27.86   98.64
sde      0.00    2.80     0.00  198.80  0.00    30.40     0.00   15.20     0.15      428.49    14474.39  4.30    85.56
sdd      0.00    466.20   0.00  62.80   0.00    4230.40   0.00   2115.20   67.36     0.59      9.49      6.79    42.62
sde      0.00    5.20     0.00  0.80    0.00    48.00     0.00   24.00     60.00     0.01      16.25     11.25   0.90
sdd      0.00    0.00     0.00  0.20    0.00    1.60      0.00   0.80      8.00      0.01      35.00     35.00   0.70
sde      0.00    0.00     0.00  0.00    0.00    0.00      0.00   0.00      0.00      0.00      0.00      0.00    0.00


I suspect that fsync() is the culprit, as I've feared.  I'll show
you some interesting data.  I collected the stack traces of backend
processes while a checkpoint was happening.


[bgwriter]
Oh, it is fsync()ing hard.

#0  0x003a629bfbb2 in __fsync_nocancel () from
/lib64/tls/libc.so.6
#1  0x005742a1 in mdsync ()
#2  0x005753d7 in smgrsync ()
#3  0x00564d65 in FlushBufferPool ()
...

[some backends]
They are forced to wait for some lock pertaining to WAL when they try
to insert a log record.


Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Martijn van Oosterhout
On Wed, Dec 20, 2006 at 08:10:56PM +0900, Takayuki Tsunakawa wrote:
 One question is the disk utilization.  While bgwriter is fsync()ing,
 %util of the WAL disk drops to almost 0.  But the bandwidth of
 Ultra320 SCSI does not appear to be used fully.  Any idea?

That implies that fsyncing a datafile blocks fsyncing the WAL.  That
seems terribly unlikely (although...).  What OS/kernel/filesystem is
this?  I note a sync bug in Linux for ext3 that may have relevance.

Have a nice day,

-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Takayuki Tsunakawa
 That implies that fsyncing a datafile blocks fsyncing the WAL.  That
 seems terribly unlikely (although...).  What OS/kernel/filesystem is
 this?  I note a sync bug in Linux for ext3 that may have relevance.

Oh, really?  What bug?  I've heard that ext3 reports wrong data to
iostat when it performs writes (the data is correct when performing
reads.)
My env is:

OS: RHEL 4.0 for AMD64/EM64T
kernel: 2.6.9-42.ELsmp
The file system is ext3.

Terribly unlikely?  But I've seen this disk utilization pattern quite often.






Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Martijn van Oosterhout
On Wed, Dec 20, 2006 at 09:14:50PM +0900, Takayuki Tsunakawa wrote:
  That implies that fsyncing a datafile blocks fsyncing the WAL.  That
  seems terribly unlikely (although...).  What OS/kernel/filesystem is
  this?  I note a sync bug in Linux for ext3 that may have relevance.
 
 Oh, really?  What bug?  I've heard that ext3 reports wrong data to
 iostat when it performs writes (the data is correct when performing
 reads.)

I was referring to this in the 2.6.6 changelog:

http://www.linuxhq.com/kernel/changelog/v2.6/6/index.html

   ext3's fsync/fdatasync implementation is currently syncing the inode via a
   full journal commit even if it was unaltered.
   
However you're running a later version so that's not it.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Inaam Rana

On 12/20/06, Takayuki Tsunakawa [EMAIL PROTECTED] wrote:


[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync().  We can't ignore what commercial databases
have done so far.  The kernel does as much as it likes when PostgreSQL
asks it to fsync().



I am new to the community and am very interested in the tests that you have
done. I am also working on resolving the sudden IO spikes at checkpoint
time. I agree with you that fsync() is the core issue here.

Being a new member, I was wondering if someone on this list has done testing
with O_DIRECT and/or O_SYNC for datafiles, as that seems to be the most
logical way of dealing with the fsync() flood at checkpoint time. If so, I'll
be very interested in the results. As mentioned in this thread, a single
bgwriter with O_DIRECT will not be able to keep pace with the cleaning
effort, causing backend writes. I think (i.e. IMHO) multiple bgwriters
and/or async I/O with O_DIRECT can resolve this issue.

Talking of bgwriter_* parameters, I think we are missing a crucial internal
counter, i.e. the number of dirty pages. How much work bgwriter has to do at
each wakeup call should be a function of total buffers and currently dirty
buffers. Relying on both these values instead of just the static NBuffers
should allow bgwriter to adapt more quickly to workload changes and ensure
that not much work is accumulated for the checkpoint.

--
Inaam Rana
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread Takayuki Tsunakawa
On 12/20/06, Takayuki Tsunakawa [EMAIL PROTECTED] wrote:
  [Conclusion]
  I believe that the problem cannot be solved in a real sense by
  avoiding fsync/fdatasync().  We can't ignore what commercial databases
  have done so far.  The kernel does as much as it likes when PostgreSQL
  asks it to fsync().


From: Inaam Rana
 I am new to the community and am very interested in the tests that you
have done. I am also working on resolving the sudden IO spikes at checkpoint
time. I agree with you that fsync() is the core issue here.

Thank you for understanding my bad English correctly.  Yes, what I've been
insisting on is that it is necessary to avoid fsync()/fdatasync() and to use
O_SYNC (plus O_DIRECT if supported on the target platform) to really
eliminate the big spikes.
The following sentence in my mail contained a small mistake.

I believe that the problem cannot be solved in a real sense by avoiding
fsync/fdatasync().

The correct sentence is:

I believe that the problem cannot be solved in a real sense without
avoiding fsync/fdatasync().


 Being a new member I was wondering if someone on this list has done
testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the
most logical way of dealing with fsync() flood at checkpoint time. If so,
I'll be very interested in the results.

Could you see the mail I sent on Dec 18?  Its content was so long that I
zipped the whole content and attached it to the mail.  I performed the same
test simply adding O_SYNC to open() in mdopen() and another function in
md.c.  I couldn't get it running with O_DIRECT, because O_DIRECT requires
the shared buffers to be aligned on a sector-size boundary.  To perform the
O_DIRECT test, a little more modification is necessary in the code where the
shared buffers are allocated.
The result was bad, but that's just a starting point.  We need some of the
improvements that commercial databases have made.  I think some approaches
we should take are:

(1) two-checkpoint (described in Jim Gray's textbook Transaction
Processing: Concepts and Techniques)
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to decrease the count
of write() calls
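The alignment requirement mentioned above — O_DIRECT needs the I/O buffer aligned to the device sector size — would mean allocating the shared buffers with something like posix_memalign() instead of plain malloc(). A minimal sketch, with an assumed 512-byte sector size:

```c
#include <stdlib.h>
#include <stdint.h>

#define DIO_ALIGN 512   /* assumed sector size; 4096 is the safer modern choice */

/* Allocate a buffer whose address is a multiple of DIO_ALIGN, as
 * required before reading/writing the file with O_DIRECT. */
void *alloc_directio_buffer(size_t nbytes)
{
    void *buf = NULL;

    if (posix_memalign(&buf, DIO_ALIGN, nbytes) != 0)
        return NULL;
    return buf;
}
```

The transfer length and file offset generally need the same alignment as the buffer address, which is why a plain malloc()'d buffer pool cannot simply be reused.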

 As mentioned in this thread that a single bgwriter with O_DIRECT will not
be able to keep pace with cleaning effort causing backend writes. I think
(i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this
issue.

I agree with you.  Oracle provides a parameter called DB_WRITER_PROCESSES to
set the number of database writer processes.  Oracle also provides
asynchronous I/O to solve the problem you are describing.  Please see
section 10.3.9 on the following page:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#sthref1049

 Talking of bgwriter_* parameters I think we are missing a crucial internal
counter i.e. number of dirty pages. How much work bgwriter has to do at each
wakeup call should be a function of total buffers and currently dirty
buffers. Relying on both these values instead of just one static NBuffers
should allow bgwriter to adapt more quickly to workload changes and ensure
that not much work is accumulated for checkpoint.

I agree with you in the sense that the current bgwriter is a bit careless
about the system load.  I believe that PostgreSQL should be gentler to
OLTP transactions -- and, as a result, to the many users of the system.  I
think the speed of WAL accumulation should also be taken into account.
Let's list up the problems and ideas.

-- 


Re: [HACKERS] Load distributed checkpoint

2006-12-20 Thread ITAGAKI Takahiro

Takayuki Tsunakawa [EMAIL PROTECTED] wrote:

 I have to report a sad result.  Your patch didn't work.  Let's
 consider the solution together.  What you are addressing is very
 important for the system designers in the real world -- smoothing
 response time.

You were running the test on a very memory-dependent configuration:
 shared_buffers = 4GB / a scaling factor of 50, i.e. 800MB of data.
That would be why the patch did not work. I tested it with DBT-2, 10GB of
data and 2GB of memory. Storage is always the main determinant of
performance there, even outside of checkpoints.

If you use Linux, it has a very unpleasant behavior in fsync(): it locks all
metadata of the file being fsync-ed. We have to wait for the completion of
the fsync when we do read(), write(), and even lseek().

Almost all of your data is in the accounts table, and it is stored in a
single file. All transactions must wait for the fsync of that single largest
file, so you saw the bottleneck in fsync.

 [Conclusion]
 I believe that the problem cannot be solved in a real sense by
 avoiding fsync/fdatasync().

I think so, too. However, I assume we can resolve part of the
checkpoint spikes by smoothing write() alone.

BTW, can we use the same approach for fsync? We currently call fsync() on
all modified files back-to-back in mdsync(), but it's not difficult at all
to insert sleeps between the fsync()s. Do you think that would help us?
One of the issues is that we would have to sleep per file, which is
perhaps too coarse a granularity.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-19 Thread ITAGAKI Takahiro
Takayuki Tsunakawa [EMAIL PROTECTED] wrote:

 I performed some simple tests, and I'll show the results below.

 (1) The default case
 235  80  226 77  240
 (2) No write case
 242  250  244  253  280
 (3) No checkpoint case
 229  252  256  292  276
 (4) No fsync() case
 236  112  215  216  221
 (5) No write by PostgreSQL, but fsync() by another program case
 9  223  260  283  292
 (6) case (5) + O_SYNC by write_fsync
 97  114  126  112  125
 (7) O_SYNC case
 182  103  41  50  74

I posted a patch to PATCHES. Please try it out.
It does write() smoothly, but fsync() in a burst.
I suppose the result will be between (3) and (5).

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-19 Thread Takayuki Tsunakawa
Hello, Itagaki-san

 I posted a patch to PATCHES. Please try it out.

Really!?  I've just joined pgsql-patches.  When did you post it --
yesterday?  I couldn't find the patch on the following page, which
lists this month's mails to pgsql-patches:

http://archives.postgresql.org/pgsql-patches/2006-12/index.php

Could you send me the patch if it is not already registered on the
above page?  I want to try the patch by all means, because smoothing
response times is very important if we are to recommend PostgreSQL
to system designers.  I really hope your patch will be the real
solution.

 It does write() smoothly, but fsync() in a burst.
 I suppose the result will be between (3) and (5).

Hmm...  I think some logical reasoning is needed to win the
understanding of the community members (sorry if they have already
agreed).  Excuse me for repeating myself, but I'm afraid fsync() will
become a problem at some point in some environments.  Success in one
test environment is not real success.






Re: [HACKERS] Load distributed checkpoint

2006-12-13 Thread Takayuki Tsunakawa
Hello,

From: Jim C. Nasby [EMAIL PROTECTED]
 Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
 of what buffer the bgwriter_all scan is looking at, which means that
 it should remember where it was at the end of the last scan; yet it's
 initialized to 0 every time BgBufferSync is called.  Is there someplace
 else that is remembering where the complete scan is leaving off when
 bgwriter_all_percent or bgwriter_all_maxpages is hit?  Or does the scan
 in fact just keep re-scanning the beginning of the buffers?

No. BgBufferSync() correctly keeps track of the position to restart
scanning at.  buf_id1 is not initialized to 0 every time BgBufferSync()
is called, because buf_id1 is a static local variable.  Please see the
following code. It prints:

a=0
a=1
a=2


#include <stdio.h>

void func(void)
{
    static int a = 0;   /* initialized once; keeps its value across calls */

    printf("a=%d\n", a);
    a++;
}

int main(void)
{
    func();
    func();
    func();

    return 0;
}






Re: [HACKERS] Load distributed checkpoint

2006-12-13 Thread Jim C. Nasby
On Wed, Dec 13, 2006 at 06:27:38PM +0900, Takayuki Tsunakawa wrote:
 No. BgBufferSync() correctly keeps track of the position to restart
 scanning at. bufid1 is not initialized to 0 every time BgBufferSync()
 is called, because bufid1 is a static local variable. Please see the
 following code. It prints:

Doh, I completely missed the static part of the declaration. Thanks for
the C tutorial. :)
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Load distributed checkpoint

2006-12-12 Thread Zeugswetter Andreas ADI SD

  One thing I do worry about is if both postgresql and the OS
  are both delaying write()s in the hopes of collapsing them
  at the same time.  If so, this would cause both to be experience
  bigger delays than expected, and make checkpoints worse.
  
 That is my concern.  Letting 30 seconds worth of dirty pages accumulate
 between checkpoints and then trying to smooth the writes within
 checkpoint code seems like a doomed effort.

30 seconds ??? You are supposed to avoid excessive checkpoints.
If you are reducing checkpoint_timeout to avoid the spike, you are imho
definitely tuning the wrong knob. You are supposed to increase 
checkpoint_timeout as far as you can to still have an acceptable
recovery delay after a crash.

If you cannot afford a huge spike during checkpoint, say every 30
minutes,
you have to make bgwriter more aggressive. This would generally be true
for 
both of the variants, db driven direct io and buffered filesystem io.

Andreas



Re: [HACKERS] Load distributed checkpoint

2006-12-12 Thread Kevin Grittner
 On Tue, Dec 12, 2006 at  3:22 AM, in message
[EMAIL PROTECTED],
Zeugswetter
Andreas ADI SD [EMAIL PROTECTED] wrote: 
 
  One thing I do worry about is if both postgresql and the OS
  are both delaying write()s in the hopes of collapsing them
  at the same time.  If so, this would cause both to be experience
  bigger delays than expected, and make checkpoints worse.
  
 That is my concern.  Letting 30 seconds worth of dirty pages accumulate
 between checkpoints and then trying to smooth the writes within
 checkpoint code seems like a doomed effort.
 
 30 seconds ??? You are supposed to avoid excessive checkpoints.
 If you are reducing checkpoint_timeout to avoid the spike, you are imho
 definitely tuning the wrong knob.
 
Sorry for the fuzzy language -- I was intending to describe a situation
where enough dirty pages accumulate by checkpoint time that they would
take 30 seconds to write to disk.  We were in this situation (and worse)
with the default bgwriter settings.
 
 you have to make bgwriter more aggressive.
 
This is what I've been saying.
 
I've also been saying that if the PostgreSQL way is to let the file
system handle the caching and I/O scheduling, we should trust it to know
what to do with dirty pages, and not try to second-guess it.  (Of course
there are knobs to tune the file system if needed.)  Our checkpoint
performance issues went away when we went to settings which basically
never leave a dirty page hidden from the file system for more than two
seconds.
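For concreteness, bgwriter settings in this spirit (a full sweep of all buffers roughly every two seconds) might look like the fragment below; these numbers are illustrative only, not the actual settings used:

```
# Illustrative example only -- tune for your own workload
bgwriter_delay = 200            # ms between rounds
bgwriter_all_percent = 10.0     # 10% of all buffers scanned per round,
                                #  i.e. a full sweep every ~2 seconds
bgwriter_all_maxpages = 800     # allow up to ~800 writes per round
```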
 
-Kevin
 




Re: [HACKERS] Load distributed checkpoint

2006-12-12 Thread Bruce Momjian

I have thought a while about this and I have some ideas.

Ideally, we would be able to trickle the sync of individual blocks
during the checkpoint, but we can't because we rely on the kernel to
sync all dirty blocks that haven't made it to disk using fsync().  We
could trickle the fsync() calls, but that just extends the amount of
data we are writing that has been dirtied post-checkpoint.  In an ideal
world, we would be able to fsync() only part of a file at a time, and
only those blocks that were dirtied pre-checkpoint, but I don't see that
happening anytime soon (and one reason why many commercial databases
bypass the kernel cache).

So, in the real world, one conclusion seems to be that our existing
method of tuning the background writer just isn't good enough for the
average user:

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333           # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly
relate to the checkpoint timing, which is the purpose of the bgwriter,
and they don't change during the checkpoint interval, which is also less
than ideal.  If set too aggressively, it writes too much, and if set too
low, the checkpoint does too much I/O.

We clearly need more bgwriter activity as the checkpoint approaches, and
one that is more auto-tuned, like many of our other parameters.  I think
we created these settings to see how they worked in the field, so it's
probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next
checkpoint, and use that information to increase write activity. 
Basically now, during a checkpoint, the bgwriter does a full buffer scan
and fsync's all dirty files, so it changes from the configuration
parameter-defined behavior right to 100% activity.  I think it would be
ideal if we could ramp up the writes so that when it is 95% to the next
checkpoint, it can be operating at 95% of the activity it would do
during a checkpoint.
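The ramp-up could be as simple as interpolating the bgwriter's page budget by the fraction of the checkpoint interval already elapsed — at 95% of the way there, roughly 95% of checkpoint-level activity. A hypothetical sketch (the names and the linear ramp are assumptions, not a patch):

```c
/* Return the number of pages the bgwriter may write this round,
 * ramping linearly from base_pages (just after a checkpoint) up to
 * max_pages (checkpoint-level activity) as the next checkpoint nears. */
int bgwriter_pages_this_round(double elapsed_secs, double interval_secs,
                              int base_pages, int max_pages)
{
    double frac = (interval_secs > 0) ? elapsed_secs / interval_secs : 1.0;

    if (frac > 1.0)
        frac = 1.0;              /* checkpoint overdue: full throttle */
    return base_pages + (int) ((max_pages - base_pages) * frac);
}
```

A nonlinear ramp might be preferable, since writes issued early have a higher chance of being wasted by re-dirtying, but the linear version shows the shape of the idea.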

My guess is if we can do that, we will have much smoother performance
because we have more WAL writes just after checkpoint for newly-dirtied
pages, and the new setup will give us more write activity just before
checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
the kernel cache, so we are sure data will be on disk by checkpoint
time.  This was avoided in the past because of the expense of
second-guessing the kernel disk I/O scheduling algorithms.

---

Tom Lane wrote:
 Kevin Grittner [EMAIL PROTECTED] writes:
  Jim C. Nasby [EMAIL PROTECTED] wrote: 
  Generally, I try and configure the all* settings so that you'll get 1
  clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
  have any actual tests to back that methodology up.
 
  We got to these numbers somewhat scientifically.  I studied I/O
  patterns under production load and figured we should be able to handle
  about 800 writes in per 200 ms without causing problems.  I have to
  admit that I based the percentages and the ratio between all and lru
  on gut feel after musing over the documentation.
 
 I like Kevin's settings better than what Jim suggests.  If the bgwriter
 only makes one sweep between checkpoints then it's hardly going to make
 any impact at all on the number of dirty buffers the checkpoint will
 have to write.  The point of the bgwriter is to reduce the checkpoint
 I/O spike by doing writes between checkpoints, and to have any
 meaningful impact on that, you'll need it to make the cycle several times.
 
 Another point here is that you want checkpoints to be pretty far apart
 to minimize the WAL load from full-page images.  So again, a bgwriter
 that's only making one loop per checkpoint is not gonna be doing much.
 
 I wonder whether it would be feasible to teach the bgwriter to get more
 aggressive as the time for the next checkpoint approaches?  Writes
 issued early in the interval have a much higher probability of being
 wasted (because the page gets re-dirtied later).  But maybe that just
 reduces to what Takahiro-san already suggested, namely that
 checkpoint-time writes should be done with the same kind of scheduling
 the bgwriter uses outside checkpoints.  We still have the problem that
 the real I/O storm is triggered by fsync() not write(), and we don't
 have a way to spread out the consequences of fsync().
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if 

Re: [HACKERS] Load distributed checkpoint

2006-12-12 Thread Gregory Stark

 Tom Lane wrote:
 
 I like Kevin's settings better than what Jim suggests.  If the bgwriter
 only makes one sweep between checkpoints then it's hardly going to make
 any impact at all on the number of dirty buffers the checkpoint will
 have to write.  The point of the bgwriter is to reduce the checkpoint
 I/O spike by doing writes between checkpoints, and to have any
 meaningful impact on that, you'll need it to make the cycle several times.
 
 Another point here is that you want checkpoints to be pretty far apart
 to minimize the WAL load from full-page images.  So again, a bgwriter
 that's only making one loop per checkpoint is not gonna be doing much.

I missed the previous message but it sounds like you're operating under a
different set of assumptions than the original poster. If you do a single
sweep through all of the buffers *and sync them* then you've just finished a
checkpoint -- the *previous* checkpoint. Not the subsequent one.

That is, rather than trying to spread the load of the checkpoint out by
getting the writes into the kernel sooner but make no attempt to sync them
until checkpoint time, start the checkpoint as soon as the previous checkpoint
finishes, and dribble the blocks of the checkpoint out slowly throughout an
entire checkpoint cycle, syncing them immediately using O_SYNC/O_DIRECT.

It's a fundamental shift in the idea of the purpose of bgwriter. Instead of
trying to suck i/o away from the subsequent checkpoint it would be responsible
for all the i/o of the previous checkpoint which would still be in progress
for the entire time of checkpoint_timeout. It would only complete when
bgwriter had finished doing its one full sweep.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Load distributed checkpoint

2006-12-12 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 It's a fundamental shift in the idea of the purpose of bgwriter. Instead of
 trying to suck i/o away from the subsequent checkpoint it would be responsible
 for all the i/o of the previous checkpoint which would still be in progress
 for the entire time of checkpoint_timeout. It would only complete when
 bgwriter had finished doing its one full sweep.

I think that's basically the same as the original suggestion, which is
to do checkpoints using less than the full I/O bandwidth of the machine
--- tying it exactly to the default bgwriter rate may or may not be
appropriate.

The difficulty with such schemes is that if you go over to using O_DIRECT
writes instead of fsync in the bgwriter, it's hard to avoid doing the
same when a random backend has to write a dirty buffer --- yet you'd
really rather that such a backend not have to wait for the ensuing I/O.
And this gets a lot worse if checkpoints are slowed down, because it gets
more likely that the bufmgr will run out of clean buffers and have to do
a write() from a backend.

regards, tom lane



Re: [HACKERS] Load distributed checkpoint

2006-12-12 Thread Jim C. Nasby
On Fri, Dec 08, 2006 at 11:43:27AM -0500, Tom Lane wrote:
 Kevin Grittner [EMAIL PROTECTED] writes:
  Jim C. Nasby [EMAIL PROTECTED] wrote: 
  Generally, I try to configure the all* settings so that you'll get 1
  clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
  have any actual tests to back that methodology up.
 
  We got to these numbers somewhat scientifically.  I studied I/O
  patterns under production load and figured we should be able to handle
  about 800 writes per 200 ms without causing problems.  I have to
  admit that I based the percentages and the ratio between all and lru
  on gut feel after musing over the documentation.
 
 I like Kevin's settings better than what Jim suggests.  If the bgwriter
 only makes one sweep between checkpoints then it's hardly going to make
 any impact at all on the number of dirty buffers the checkpoint will
 have to write.  The point of the bgwriter is to reduce the checkpoint
 I/O spike by doing writes between checkpoints, and to have any
 meaningful impact on that, you'll need it to make the cycle several times.

It would be good if the docs included more detailed info on how exactly
the bgwriter goes about flushing stuff to disk. You can certainly read
them and think that the bgwriter just goes through and issues writes for
any dirty buffers it finds. Though, looking at BgBufferSync, I think it
actually does write out pages during the all scan, regardless of what
usage_count says.

 I wonder whether it would be feasible to teach the bgwriter to get more
 aggressive as the time for the next checkpoint approaches?  Writes
 issued early in the interval have a much higher probability of being
 wasted (because the page gets re-dirtied later).  But maybe that just
 reduces to what Takahiro-san already suggested, namely that
 checkpoint-time writes should be done with the same kind of scheduling
 the bgwriter uses outside checkpoints.  We still have the problem that
 the real I/O storm is triggered by fsync() not write(), and we don't
 have a way to spread out the consequences of fsync().

Would the ramp-up of write activity push the kernel to actually write
stuff? My understanding is that most OSes have a time limit on how long
they'll let a write-request sit in cache, so ISTM a better way to smooth
out disk IO is to write things in a steady stream.

If the bgwriter takes the buffer access counter into account when
deciding what to write out, it might make sense to write more recently
accessed pages as checkpoint nears. The idea being that odds are good
those buffers are about to get flushed by BufferSync() anyway.

Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
of what buffer the bgwriter_all scan is looking at, which means that
it should remember where it was at the end of the last scan; yet it's
initialized to 0 every time BgBufferSync is called. Is there someplace
else that is remembering where the complete scan is leaving off when
bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan
in fact just keep re-scanning the beginning of the buffers?
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Simon Riggs
On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
 I understand that checkpoints occur during crash
 recovery and PITR, so time for those operations would get longer. 

A restorepoint happens during recovery, not a checkpoint. The recovery
is merely repeating the work of the checkpoint that occurred in the
original WAL stream. Elongating the checkpoint would not have any effect
on a restorepoint: we only record the checkpoint when it is complete and
we only create a restorepoint when we see the checkpoint record.

Crash recovery and PITR use almost exactly the same code path (by
design), so there isn't anything special to say about PITR either.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread ITAGAKI Takahiro

Kevin Grittner [EMAIL PROTECTED] wrote:

 We have not experienced any increase in I/O, just a smoothing.  Keep in
 mind that the file system cache will collapse repeated writes to the
 same location until things settle, and the controller's cache also has a
 chance of doing so.  If we just push dirty pages out to the OS as soon
 as possible, and let the file system do its job, I think we're in better
 shape than if we try to micro-manage it within our buffer pages.

Maybe we have two entirely different tuning approaches:
  1. Retain dirty buffers in database, and keep OS buffers clean.
  2. Keep database clean, and entrust OS to manage dirty buffers.

I suggested the 1st one, and you did the 2nd. The bottleneck in checkpoints
differs between the approaches: write() will be worse in the 1st, fsync() in the 2nd.

Distributed write() is easier than distributed fsync(), because we can
issue write() on a page basis, but fsync() only on a file basis.
Also, the database has its own access-frequency information for its buffers,
so I think the 1st approach behaves better in handling re-dirtying of buffers.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Takayuki Tsunakawa
Hello,

From: ITAGAKI Takahiro [EMAIL PROTECTED]
Takayuki Tsunakawa [EMAIL PROTECTED] wrote:
 I'm afraid it is difficult for system designers to expect steady
 throughput/response time, as long as PostgreSQL depends on the
 flushing of file system cache. How does Oracle provide stable
 performance?
 Though I'm not sure, isn't it the key to use O_SYNC so that write()s
 transfer data to disk?

 AFAIK, other databases use write() and fsync() in combination. They call
 fsync() immediately after they write buffers in some small batches.
 Otherwise, they use asynchronous and direct I/O options. Therefore, dirty
 pages in kernel buffers are kept low at any time.


Oracle seems to use O_SYNC. I've found a related page in the Oracle
manuals.

--
http://download-west.oracle.com/docs/cd/B19306_01/win.102/b15688/ap_unix.htm
Direct Writes to Disk
On both UNIX and Windows platforms, bypassing the file system buffer
cache ensures data is written to disk.
On UNIX, Oracle Database uses the O_SYNC flag to bypass the file
system buffer cache. The flag name depends on the UNIX port.
On Windows, Oracle Database bypasses the file system buffer cache
completely.
--

As Itagaki-san says, asynchronous+direct I/O provides the best
performance, I believe. Oracle supplies the combination as follows:

--
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref870
Oracle Database supports kernel asynchronous I/O. This feature is
disabled by default.
By default, the DISK_ASYNCH_IO initialization parameter in the
parameter file is set to TRUE to enable asynchronous I/O on raw
devices. To enable asynchronous I/O on file system files:
Ensure that all Oracle Database files are located on file systems that
support asynchronous I/O.
Set the FILESYSTEMIO_OPTIONS initialization parameter in the parameter
file to one of the following values:

Linux Distribution Recommended Value
SUSE Linux Enterprise Server 9 SETALL
Other distributions ASYNCH
--

I believe SQL Server also uses direct+asynchronous I/O, because
Microsoft recommends in MSDN that the combination plus appropriate
multi-threading provides best performance.

I tested Oracle9i on RHEL 2.1. I straced DBWn (the database writer, which
is analogous to the bgwriter of PostgreSQL) while creating tables, indexes,
etc. and shutting down the database server. Oracle does use O_SYNC, as the
trace shows, but it doesn't use fsync().

24462 open(/work4/ora/tuna/users01.dbf, O_RDWR|O_SYNC|O_LARGEFILE) = 16

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
referring to DB2?






Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Takayuki Tsunakawa
Mr. Riggs,

Thank you for teaching me the following. I seem to have misunderstood.
I'll learn more.

From: Simon Riggs [EMAIL PROTECTED]
 On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
 I understand that checkpoints occur during crash
 recovery and PITR, so time for those operations would get longer.

 A restorepoint happens during recovery, not a checkpoint. The recovery
 is merely repeating the work of the checkpoint that occurred in the
 original WAL stream. Elongating the checkpoint would not have any effect
 on a restorepoint: we only record the checkpoint when it is complete and
 we only create a restorepoint when we see the checkpoint record.


Regards,



Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Inaam Rana



 I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
 referring to DB2?



DB2 would also open data files with O_SYNC option and page_cleaners
(counterparts of bgwriter) would exploit AIO if available on the system.

Inaam Rana
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Ron Mayer
ITAGAKI Takahiro wrote:
 Kevin Grittner [EMAIL PROTECTED] wrote:
 
 ...the file system cache will collapse repeated writes to the
 same location until things settle ...
 If we just push dirty pages out to the OS as soon as possible, 
 and let the file system do its job, I think we're in better
 shape...
 
 Maybe we have two entirely different tuning approaches:
   1. Retain dirty buffers in database, and keep OS buffers clean.
   2. Keep database clean, and entrust OS to manage dirty buffers.
 
 I suggested the 1st one, and you did the 2nd. The bottleneck in checkpoints
 differs between the approaches: write() will be worse in the 1st, fsync() in the 2nd.

The fsync() won't necessarily be worse in the second approach.  OS's have
quite a few tunable parameters that can encourage the system to physically
write the pending write()s before the fsync() - either in the background
or by the process doing the write() itself when there are too many
dirty pages.

For checkpoints, I think the main question is whether postgresql's
background writer is smarter or less smart than pdflush or the
equivalent on your system for database workloads.

 Also, the database has its own access-frequency information for its buffers,
 so I think the 1st approach behaves better in handling re-dirtying of buffers.

I'm curious what access-frequency info the OS and the database have.

One thing I do worry about is whether both postgresql and the OS
are delaying write()s in the hopes of collapsing them
at the same time.  If so, this would cause both to experience
bigger delays than expected, and make checkpoints worse.

I'm guessing that if you use approach 1 you might be better off
turning down the amount of buffering that the OS does with
dirty pages - and if you use approach 2, you might be better
off turning down the amount of delay that postgresql adds.



Re: [HACKERS] Load distributed checkpoint

2006-12-11 Thread Kevin Grittner
 On Mon, Dec 11, 2006 at  3:31 PM, in message
[EMAIL PROTECTED], Ron Mayer
[EMAIL PROTECTED] wrote: 
 
 One thing I do worry about is whether both postgresql and the OS
 are delaying write()s in the hopes of collapsing them
 at the same time.  If so, this would cause both to experience
 bigger delays than expected, and make checkpoints worse.
 
That is my concern.  Letting 30 seconds worth of dirty pages accumulate
between checkpoints and then trying to smooth the writes within
checkpoint code seems like a doomed effort.  Either we take over control
of everything, like many other products, and schedule it all ourselves,
or we push it out to the file system fairly quickly and let it do its
job.  Let's not stand with one foot on the pier and one on the boat.
 
It is entirely possible that the current setup, where a delay of a
couple of seconds may collapse several OS API calls, could have a
performance benefit; but let's be clear that we're talking about saving
that overhead, not disk I/O.
 
-Kevin
 

