[PATCHES] Load Distributed Checkpoints, revised patch

2007-06-15 Thread Heikki Linnakangas
Here's an updated WIP version of the LDC patch. It just spreads the 
writes, which achieves the goal of smoothing the checkpoint I/O spikes. I 
think sorting the writes etc. is interesting but falls into the category 
of further development and should be pushed to 8.4.


The documentation changes are not complete, GUC variables need 
descriptions, and some of the DEBUG elogs will go away in favor of the 
separate checkpoint logging patch that's in the queue. I'm fairly happy 
with the code now, but there's a few minor open issues:


- What units should we use for the new GUC variables? From an 
implementation point of view, it would be simplest if 
checkpoint_write_rate were given as pages/bgwriter_delay, similarly to 
the bgwriter_*_maxpages settings. I never liked those *_maxpages settings, 
though; a more natural unit from the user's perspective would be KB/s.
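For illustration only (the values and the second form of the setting are hypothetical, not taken from the patch), the two candidate units would look like this in postgresql.conf:

```
# pages written per bgwriter_delay (simplest to implement,
# mirrors the bgwriter_*_maxpages settings):
checkpoint_write_rate = 512

# versus a throughput-style unit, more natural for users:
#checkpoint_write_rate = 4096    # kB/s
```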


- The signaling between RequestCheckpoint and the bgwriter is a bit tricky. 
The bgwriter now needs to handle immediate checkpoint requests, like those 
coming from explicit CHECKPOINT or CREATE DATABASE commands, differently 
from those triggered by checkpoint_segments. I'm afraid there might be 
race conditions when a CHECKPOINT is issued at the same instant as 
checkpoint_segments triggers one. What might happen then is that the 
checkpoint is performed lazily, spreading the writes, and the CHECKPOINT 
command has to wait for that to finish, which might take a long time. I 
have not been able to convince myself either that the race condition 
exists or that it doesn't.
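To make the worry concrete, here is a minimal sketch of the suspect pattern; the struct and function names are invented for illustration and are not the patch's actual code:

```c
#include <stdbool.h>

/* Hypothetical request flags shared between backends and the bgwriter
 * (names invented, not from the patch). */
typedef struct
{
    bool ckpt_requested;    /* some checkpoint has been requested */
    bool ckpt_immediate;    /* request came from CHECKPOINT, CREATE DATABASE, ... */
} CheckpointRequest;

/*
 * Bgwriter-side consumption of a request.  Reading ckpt_immediate and
 * clearing the flags are separate steps: if a backend sets ckpt_immediate
 * between them, that request is silently dropped, the checkpoint runs
 * lazily, and the CHECKPOINT command waits for the spread-out writes.
 */
bool
consume_request(CheckpointRequest *req, bool *immediate)
{
    if (!req->ckpt_requested)
        return false;
    *immediate = req->ckpt_immediate;   /* window for the race opens here */
    req->ckpt_requested = false;
    req->ckpt_immediate = false;        /* a set in between is lost here */
    return true;
}
```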



A few notes about the implementation:

- in the bgwriter loop, CheckArchiveTimeout now always calls time(NULL), while 
previously it used the value returned by another call earlier in the 
same codepath. That means we now call time(NULL) twice instead of once 
per bgwriter iteration when archive_timeout is set. That doesn't seem 
significant to me, so I didn't try to optimize it.


- because of a small change in the meaning of the force_checkpoint flag in the 
bgwriter loop, checkpoints triggered by reaching checkpoint_segments 
call CreateCheckPoint(false, false) instead of CreateCheckPoint(false, 
true). The second argument is the force-flag: if it's false, 
CreateCheckPoint skips the checkpoint if there's been no WAL activity 
since the last checkpoint. That doesn't matter in this case, since there 
surely has been WAL activity if we reach checkpoint_segments, and doing 
the check isn't expensive.


- to coordinate the writes with checkpoint_segments, we need to 
read the WAL insertion location. To do that, we need to acquire the 
WALInsertLock. That means that in the worst case, WALInsertLock is 
acquired once per bgwriter_delay while a checkpoint is in progress. I don't 
think that's a problem, since it's only held for a very short duration, but I 
thought I'd mention it.
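As a rough sketch of the coordination (the helper and its inputs are simplified stand-ins, not the patch's code), the write pace is compared against WAL progress; the WAL-side fraction is the part that requires reading the insert location under WALInsertLock once per bgwriter_delay:

```c
/*
 * Simplified pacing sketch for a spread checkpoint (invented helper):
 * compare how far the buffer writes have progressed against how much of
 * the checkpoint_segments budget the WAL has consumed.
 */
double
checkpoint_write_lead(int bufs_written, int bufs_total,
                      double segs_used, double segs_target)
{
    double write_progress = (double) bufs_written / bufs_total;
    double wal_progress = segs_used / segs_target;

    /* > 0: writes are ahead of WAL consumption, the bgwriter can sleep;
     * <= 0: writes are behind, flush more buffers this round. */
    return write_progress - wal_progress;
}
```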


- How should we deal with changing GUC variables that affect LDC, on the 
fly when a checkpoint is in progress? The attached patch finishes the 
in-progress checkpoint ASAP, and reloads the config after that. We could 
reload the config immediately, but making the new settings effective 
immediately is not trivial.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml	7 Jun 2007 19:19:56 -	1.126
--- doc/src/sgml/config.sgml	15 Jun 2007 09:35:32 -
***
*** 1565,1570 
--- 1565,1586 
/listitem
   /varlistentry
  
+  <varlistentry id="guc-checkpoint-write-percent" xreflabel="checkpoint_write_percent">
+   <term><varname>checkpoint_write_percent</varname> (<type>floating point</type>)</term>
+   <indexterm>
+    <primary><varname>checkpoint_write_percent</varname> configuration parameter</primary>
+   </indexterm>
+   <listitem>
+    <para>
+     To spread out the work of a checkpoint, each checkpoint spreads its
+     writes of all dirty buffers in the shared buffer pool over the
+     specified fraction of the checkpoint interval.
+     The default value is 50.0 (50% of <varname>checkpoint_timeout</varname>).
+     This parameter can only be set in the <filename>postgresql.conf</filename>
+     file or on the server command line.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
    <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
    <indexterm>
Index: src/backend/access/transam/xlog.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c	31 May 2007 15:13:01 -	1.272
--- src/backend/access/transam/xlog.c	15 Jun 2007 08:14:18 -
***
*** 398,404 
  static void exitArchiveRecovery(TimeLineID endTLI,
  	uint32 

Re: [PATCHES] WIP: script binaries renaming

2007-06-15 Thread Andrew Dunstan



Zdenek Kotala wrote:


I think this patch has no (or small) impact on functionality and it 
should be committed to 8.3





That's really not the test we apply. We are in feature freeze, which 
means the only things not on the queue that should get in are bug fixes. 
If we don't stick to that we'll never get a release out.


cheers

andrew

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PATCHES] WIP: script binaries renaming

2007-06-15 Thread Tom Lane
Zdenek Kotala [EMAIL PROTECTED] writes:
 I think this patch has no (or small) impact on functionality and it 
 should be committed to 8.3

This missed the feature freeze deadline by well over two months.
It is not a candidate for 8.3.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] Load Distributed Checkpoints, revised patch

2007-06-15 Thread Alvaro Herrera
Heikki Linnakangas wrote:

 - The signaling between RequestCheckpoint and the bgwriter is a bit tricky. 
 The bgwriter now needs to handle immediate checkpoint requests, like those 
 coming from explicit CHECKPOINT or CREATE DATABASE commands, differently 
 from those triggered by checkpoint_segments. I'm afraid there might be 
 race conditions when a CHECKPOINT is issued at the same instant as 
 checkpoint_segments triggers one. What might happen then is that the 
 checkpoint is performed lazily, spreading the writes, and the CHECKPOINT 
 command has to wait for that to finish, which might take a long time. I 
 have not been able to convince myself either that the race condition 
 exists or that it doesn't.

Isn't it just a matter of having a flag to tell whether the checkpoint
should be quick or spread out, and have a command set the flag if a
checkpoint is already running?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, revised patch

2007-06-15 Thread Heikki Linnakangas

Michael Paesold wrote:

Heikki Linnakangas wrote:
Here's an updated WIP version of the LDC patch. It just spreads the 
writes, which achieves the goal of smoothing the checkpoint I/O spikes. 
I think sorting the writes etc. is interesting but falls into the 
category of further development and should be pushed to 8.4.


Why do you think so? Is it too much risk to adopt the sorted writes? The 
numbers shown by ITAGAKI Takahiro looked quite impressive, at least for 
large shared_buffers configurations. The reactions were rather 
positive, too.


Well, it is a very recent idea, and it's not clear that it's a win under 
all circumstances. Adopting it would need more testing, and at this 
late stage I'd like to just wrap up what we have and come back to the 
idea for 8.4.


If someone performs the tests with different hardware and workloads, I 
would be willing to consider it; the patch is still at an early stage, 
but it's a very isolated change and should therefore be easy to review. 
But if someone has the hardware and time to perform those tests, I'd like 
them to do more testing with just LDC first. As Josh pointed out, 
it would be good to test it with more oddball workloads; all the tests 
I've done thus far have been with DBT-2.


In general, I am hoping that this patch, together with "Automatic 
adjustment of bgwriter_lru_maxpages", will finally make default 
PostgreSQL configurations experience much less impact from checkpoints. 
For my taste, PostgreSQL has recently got way too many knobs which one must 
tweak by hand... I welcome any approach to auto-tuning (and autovacuum!).


Sure, but that's another topic.


Patch status says waiting on update from author:
http://archives.postgresql.org/pgsql-patches/2007-04/msg00331.php
Any updates on this?


No. I'm not actually clear on what we're waiting for and from whom; I know 
I haven't had the time to review it in detail yet. IIRC we've seen two 
very similar patches in the discussions, one from Itagaki-san and one 
from Greg Smith. I'm not sure which of them we should use; they both 
implement roughly the same thing. But the biggest thing that patch 
still needs is testing with different workloads.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PATCHES] Load Distributed Checkpoints, revised patch

2007-06-15 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Heikki Linnakangas wrote:

- The signaling between RequestCheckpoint and the bgwriter is a bit tricky. 
The bgwriter now needs to handle immediate checkpoint requests, like those 
coming from explicit CHECKPOINT or CREATE DATABASE commands, differently 
from those triggered by checkpoint_segments. I'm afraid there might be 
race conditions when a CHECKPOINT is issued at the same instant as 
checkpoint_segments triggers one. What might happen then is that the 
checkpoint is performed lazily, spreading the writes, and the CHECKPOINT 
command has to wait for that to finish, which might take a long time. I 
have not been able to convince myself either that the race condition 
exists or that it doesn't.


Isn't it just a matter of having a flag to tell whether the checkpoint
should be quick or spread out, and have a command set the flag if a
checkpoint is already running?


Hmm. Thinking about this some more, the core problem is that when 
starting the checkpoint, the bgwriter needs to read and clear the flag, 
which is not atomic as the patch stands.


I think we already have a race condition with ckpt_time_warn. The code 
to test and clear the flag does this:



if (BgWriterShmem->ckpt_time_warn && elapsed_secs < CheckPointWarning)
    ereport(LOG,
            (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                    elapsed_secs),
             errhint("Consider increasing the configuration parameter "
                     "\"checkpoint_segments\".")));
BgWriterShmem->ckpt_time_warn = false;


In the extremely unlikely event that RequestCheckpoint sets 
ckpt_time_warn right before it's cleared, after the test in the 
if-statement, the warning is missed. That's a harmless and purely 
theoretical event: you'd have to run CHECKPOINT (or another command that 
triggers a checkpoint) at the same instant that an xlog switch triggers 
one, and all that happens is that you don't get the message in the log 
when you should. So this is not something to worry about in this case, 
but it would be more severe if we had the same problem in deciding whether 
a checkpoint should be spread out or not.


I think we just have to protect those signaling flags with a lock. It's 
not like it's on a critical path, and though we don't know what locks 
the callers of RequestCheckpoint hold, as long as we don't acquire any 
other locks while holding the new proposed lock, there's no danger of 
deadlocks.
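A sketch of that approach, with a pthread mutex standing in for the proposed new lock (in PostgreSQL this would be an LWLock in shared memory; all names here are invented for illustration):

```c
#include <pthread.h>
#include <stdbool.h>

/* Lock-protected signaling flags (illustrative stand-ins). */
static pthread_mutex_t ckpt_lock = PTHREAD_MUTEX_INITIALIZER;
static bool ckpt_requested = false;
static bool ckpt_immediate = false;

/* Called by any backend; may upgrade an in-flight lazy request. */
void
request_checkpoint(bool immediate)
{
    pthread_mutex_lock(&ckpt_lock);
    ckpt_requested = true;
    if (immediate)
        ckpt_immediate = true;
    pthread_mutex_unlock(&ckpt_lock);
}

/* Called by the bgwriter; read-and-clear is now atomic w.r.t. setters,
 * so an "immediate" request can no longer fall into the gap. */
bool
consume_checkpoint_request(bool *immediate)
{
    bool requested;

    pthread_mutex_lock(&ckpt_lock);
    requested = ckpt_requested;
    *immediate = ckpt_immediate;
    ckpt_requested = false;
    ckpt_immediate = false;
    pthread_mutex_unlock(&ckpt_lock);
    return requested;
}
```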


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, revised patch

2007-06-15 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 Alvaro Herrera wrote:

  if (BgWriterShmem->ckpt_time_warn && elapsed_secs < CheckPointWarning)
  ereport(LOG,
  (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
  elapsed_secs),
   errhint("Consider increasing the configuration parameter "
   "\"checkpoint_segments\".")));
  BgWriterShmem->ckpt_time_warn = false;
 
 In the extremely unlikely event that RequestCheckpoint sets 
 ckpt_time_warn right before it's cleared, after the test in the 
 if-statement, the warning is missed.

I think this code should look like this:

if (BgWriterShmem->ckpt_time_warn)
{
    BgWriterShmem->ckpt_time_warn = false;
    if (elapsed_secs < CheckPointWarning)
        ereport(LOG,
                (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                        elapsed_secs),
                 errhint("Consider increasing the configuration parameter "
                         "\"checkpoint_segments\".")));
}

That way seems safer.  (I am assuming that a process other than the
bgwriter is able to set the ckpt_time_warn bit; otherwise there is no
point.)  The same pattern is also used in pmsignal.c.  Of course, as you
say, this case is probably very harmless, but in the other case it is
important to get it right.

-- 
Alvaro Herrera   http://www.PlanetPostgreSQL.org/
Hackers share the surgeon's secret pleasure in poking about in gross innards,
the teenager's secret pleasure in popping zits. (Paul Graham)



Re: [PATCHES] Maintaining cluster order on insert

2007-06-15 Thread Jaime Casanova

On 5/27/07, Jim C. Nasby [EMAIL PROTECTED] wrote:

On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:

 IOW it's working as designed. But maybe it's not the desired behavior.
 Should we have a special case and always respect the fillfactor when
 inserting to the last page of the heap?

I think that would be following with least surprise.


What's the status of this patch? Are we waiting for an update?

AFAIU, it's not fair to say that the patch maintains cluster order... 
it just tries to keep similar rows on the same page if possible (which is 
not the same thing)... if it can't, then it simply inserts at the last 
page as usual, but we have wasted time in the attempt...

So the real question is whether there is any performance win from this... 
do you have some numbers?

Another question: if the fillfactor is 100%, then it is a complete waste 
of time to look for a suggested block. Maybe we could check for that?

--
regards,
Jaime Casanova

Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning.
  Richard Cook
