[PATCHES] New features for pgbench

2007-02-11 Thread Greg Smith

The attached adds two new command line switches to pgbench:

-x:  Generate extended detail in the latency log, including a timestamp 
for each transaction


-X:  Do extra cleanup after the run (vacuum on all tables, checkpoint) 
before stopping the clock.  This gives substantially more consistency in 
results between runs.  Most pgbench results I see people present are so 
short that they're skewed considerably by whether there was a checkpoint 
in the middle of the run.  This also allows testing situations with 
various autovacuum settings fairly.
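
Functionally, the -X cleanup amounts to something like the following, run 
before the final timing is taken (a sketch using libpq rather than the 
actual patch code; a bare VACUUM covers every table in the database):

#include <libpq-fe.h>

/* sketch of the post-run cleanup -X performs before the clock stops */
static void
doCleanup(PGconn *con)
{
    PGresult   *res;

    res = PQexec(con, "vacuum");        /* vacuums every table in the database */
    PQclear(res);

    res = PQexec(con, "checkpoint");    /* flush the remaining dirty buffers out */
    PQclear(res);
}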


There's an update to the README describing the features, as well as 
correcting/extending some of the existing documentation.


I generated the patch from the 8.2.3 release.  Since pgbench runs the same 
way unless you pass it one of the new flags, I was hoping this would be 
considered for the next 8.2 update.  I have a series of additional scripts 
I'll be releasing shortly that do interesting analysis of this extended 
latency data from pgbench (graphs of TPS and latency, that sort of thing), 
and I'd hate for that to only be available on 8.3.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: contrib/pgbench/README.pgbench
===
RCS file: /var/lib/pgsql/cvs/postgresql/contrib/pgbench/README.pgbench,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 README.pgbench
--- contrib/pgbench/README.pgbench  11 Feb 2007 23:57:37 -  1.1.1.1
+++ contrib/pgbench/README.pgbench  12 Feb 2007 00:07:35 -
@@ -94,10 +94,14 @@
default is 1.  NOTE: scaling factor should be at least
as large as the largest number of clients you intend
to test; else you'll mostly be measuring update contention.
+   Regular (not initializing) runs using one of the
+   built-in tests will detect scale based on the number of 
+   branches in the database.  For custom (-f) runs this can 
+   be manually specified.
 
-D varname=value
-   Define a variable. It can be refereed to by a script
-   provided by using -f option. Multile -D options are allowed.
+   Define a variable. It can be refered to by a script
+   provided by using -f option. Multiple -D options are allowed.
 
-U login
Specify db user's login name if it is different from
@@ -143,6 +147,24 @@
 
where time is measured in microseconds.
 
+   -x
+   If a logfile is being generated with -l, use an
+   extended format for the log that is:
+
+   client_id transaction_no time file_no time-epoch time-us
+
+   where time is measured in microseconds, the file_no is
+   which test file was used (useful when multiple were 
+   specified with -f), and time-epoch/time-us are a 
+   UNIX epoch format timestamp followed by an offset
+   in microseconds (suitable for creating a ISO 8601
+   timestamp with a fraction of a second) of when
+   the transaction completed.
+
+   -X  Perform extra cleanup after the test has completed
+   (vacuums and a checkpoint) to make test results
+   more uniform
+
-d
debug option.
 
Index: contrib/pgbench/pgbench.c
===
RCS file: /var/lib/pgsql/cvs/postgresql/contrib/pgbench/pgbench.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 pgbench.c
--- contrib/pgbench/pgbench.c   11 Feb 2007 23:57:37 -  1.1.1.1
+++ contrib/pgbench/pgbench.c   12 Feb 2007 00:07:28 -
@@ -76,6 +76,10 @@
 
 bool   use_log;/* log transaction latencies to 
a file */
 
+bool   extended_log;   /* provide extended detailed in 
latency log */
+
+bool   extra_cleanup;  /* do database cleanup after 
test is over */
+
 intremains;/* number of remaining 
clients */
 
 intis_connect; /* establish connection 
 for each transaction */
@@ -455,7 +459,13 @@
diff = (int) (now.tv_sec - st->txn_begin.tv_sec) * 
100.0 +
(int) (now.tv_usec - st->txn_begin.tv_usec);
 
-   fprintf(LOGFILE, "%d %d %.0f\n", st->id, st->cnt, diff);
+   if (extended_log)
+   fprintf(LOGFILE, "%d %d %.0f %d %ld %ld\n", 
+   st->id, st->cnt, diff, 
+   st->use_file, now.tv_sec,now.tv_usec);
+   else
+   fprintf(LOGFILE, "%

Re: [PATCHES] New features for pgbench

2007-02-11 Thread Greg Smith

On Sun, 11 Feb 2007, Tom Lane wrote:


Does this have any impact on the reported results (by slowing pg_bench
itself)?


I put as little code as possible into the transaction path to avoid any 
slowdown.  I didn't convert the timestamp to a human-readable format or do 
anything intensive like that which could impact the pgbench results; it's 
just dumping some data that was already sitting there.


There is an extra if statement for each transaction, and a slightly longer 
fprintf when running with the extra latency output in place.  That's it. 
The file gets "%d %d %.0f %d %ld %ld\n" instead of "%d %d %.0f\n".
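
For example, a line that used to look like this (all values hypothetical):

0 1500 2688

now comes out as something like:

0 1500 2688 0 1171238432 908348

with the original three fields unchanged and the test file number, epoch 
seconds, and microseconds appended.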


The main drawback to logging more as the default is about twice as much 
disk I/O for writing the latency log out.  That's small change compared 
with the WAL/database writes that must be going on to generate that 
transaction, and I sure haven't been able to measure any change in 
results.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] New features for pgbench

2007-02-12 Thread Greg Smith

On Mon, 12 Feb 2007, NikhilS wrote:

From your patch I see that it augments the -l flag. IMHO it does not 
make sense to add another flag. We can save the "if" check and log the 
extended contents as part of -l itself.


I wanted something people could apply to 8.2 without breaking existing 
scripts (regardless of whether it was accepted into 8.2).  And I expected 
some concern over whether this change affects results.  By putting in a 
switch, it's possible to test both ways, with only the if being added to 
the default case.


If I read it correctly, are you suggesting that the same database with a 
prior pgbench run be used for further pgbench runs? How is it useful? 
How can one guarantee consistency of observed tps values with this in 
place?


Right now when you run pgbench, the results vary considerably from run to 
run even if you completely rebuild the database every time.  I've found 
that a lot of that variation comes from two things:


1) If your run is so small that it usually doesn't generate a checkpoint, 
the runs that do encounter one will be slower--possibly a lot slower if 
you have a large buffer cache.  Similarly, runs that are just long enough 
to normally encounter one checkpoint will take longer if they happen to 
run into two, and so on.  There are games you can play with improving 
pgbench performance by using more checkpoint_segments and a bigger 
shared_buffer cache that look like they dramatically improve results. 
But what you're mainly doing is just making checkpoints less frequent, 
reducing the odds that you'll run into one during the pgbench test itself.


2) The standard pgbench test does 3 UPDATEs per transaction.  That leaves 
behind a lot of dead rows that need to be vacuumed.  The amount of 
autovacuum that occurs during the run will vary.  This means that some 
runs finish with more dead space left behind than others.  It really isn't 
fair that a pgbench run that involves cleaning up more of its own mess 
during the test will get a lower TPS result than one that just generates a 
bunch of dead tuples and leaves the mess hanging around.  Right now, 
optimal pgbench results are generated with the autovacuum turned 
completely off; that's just not realistic.


In order to get a completely fair comparison, I've adopted a policy that 
says the run isn't over until the database has been cleaned up such that 
it's in a similar state to how it started:  all tables are vacuumed, and 
all updates have been written to disk.  The new -X behavior forces this 
cleanup to be considered part of the test.  Whether or not you choose to 
use it for your regular tests, I suggest trying it out.  You may be as 
surprised as I was at exactly how much vacuuming work is leftover after a 
long pgbench run, and how dramatically it lowers TPS results if you 
consider that cleanup essential to the test.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] New features for pgbench

2007-02-12 Thread Greg Smith

On Mon, 12 Feb 2007, Tom Lane wrote:


This is a real issue, but I think your proposed patch does not fix it.


I certainly wouldn't claim that my patch _fixes_ the problem in the 
general case; it provides one way to measure it.  Currently it's not 
obvious to new pgbench users that the problem even exists at all.  I feel 
it's important to draw attention to the fact that it's something you 
should be aware of, even if an automatic resolution to the problem isn't 
obvious yet.


In the context I run pgbench in, it is also a workable fix.  I don't even 
pay attention to pgbench results unless I'm popping 10,000 (desktop) to 
100,000 (server) transactions through it.  In that context, I believe it 
fairly penalizes the transactions for the data they leave behind for 
maintenance.  I completely agree that people doing short runs shouldn't 
use this switch.


Anyway, I like your idea of describing the lower TPS number as including 
maintenance; that matches the terminology used within the documentation 
better.  I will reformat the output to use that term.


Here's what I'm gonna do.  The patch I submitted was prepared with the 
goal of possibly being implemented in 8.2.  I thought a change to contrib/ 
that added a feature turned off by default might have a shot at a 
backport, and I wanted something people could use on the current release 
to be available.  Now that I know it's never going into an official 8.2, I 
will prepare a slightly different patch aimed at 8.3--incorporating all 
the feedback I've gotten here as either code changes or additional 
documentation--and resubmit in another week or so.


Thanks for the feedback.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] New features for pgbench

2007-02-12 Thread Greg Smith

On Tue, 13 Feb 2007, Takayuki Tsunakawa wrote:

The Win32 APIs that pgbench is using for gettimeofday() (in 
src/port/gettimeofday.c) are much lower in resolution than on Linux.


I wasn't aware of this issue, and it certainly makes the whole latency 
side of pgbench pretty useless on Win32.  There is code in 
src/include/executor/instrument.h that uses a higher resolution Windows 
timer API than gettimeofday() does (as you point out, that one only 
resolves to one Windows tick, about 15 ms).  If I can get a Windows build 
environment set up, I'll see if I can borrow that solution for pgbench.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] WIP patch - INSERT-able log statements

2007-02-17 Thread Greg Smith

On Sat, 17 Feb 2007, FAST PostgreSQL wrote:


#log_output_type = 'text' #Valid values are 'SQL' or 'text'
Defaults to 'text' which is status quo. If it is set to 'SQL' log will 
be output as INSERT commands.


This assumes someone wants either the INSERT-able logs or the current, 
easily readable ones.  I know I don't want either--I want both.  There are 
times I want to look through the logs with a text editor, there are times 
where I want to query against them.


I would suggest treating this similarly to how the Windows eventlog is 
handled:  make SQL INSERT format another option available to 
log_destination, so it can be combined with the existing formats.  In 
addition to the syslog concerns you already mentioned (which are 
themselves a showstopper for using this feature in companies that rely on 
or aggregate syslogs), I know I'd want to keep the existing logs rolling 
in parallel while I tested out the SQL-based version for a while, before 
cutting over to exclusively INSERT format logs.


I've thought a bit about how to implement this TODO already (I have a log 
file parser and I hate maintaining it), and the only thing that made sense 
to me was giving a new parameter with the filename to output to in this 
format.  For example, make a new log_sql_filename with the same syntax 
already used for log_filename.  There will probably need to be a second 
parameter for the table name to insert into as you've already commented 
on.  And like Joshua has already suggested, the main useful applications 
for this feature I've thought of all involve reading from the INSERT-able 
logs in real-time, using something like "tail -f", and pumping that data 
immediately into a logger table.


Also, I feel that supporting the whole log_line_prefix syntax for this 
feature is not just overkill, it's a bad idea.  Output everything in a 
standard, complete format instead, and then it becomes easy for the 
community at large to build tools on top of that to analyze the log 
database entries instead of having so many ad-hoc approaches.  You want a 
subset, use a view or copy just the fields you want into another table. 
I would guess this simplifies the patch as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] WIP patch - INSERT-able log statements

2007-02-18 Thread Greg Smith

On Tue, 20 Feb 2007, FAST PostgreSQL wrote:

I think adding the 'format' of the log as an option in the 'destination' 
may be confusing.  We can have a new boolean variable like 
'output_sql_log' or 'log_sql_format' which will trigger the output of 
INSERT-able log in addition to syslog/stderr/eventlog in text format as 
it is now.


What's confusing about it?  Consider this hypothetical postgresql.conf 
snippet:


log_destination = 'stderr,sql' # Valid values are combinations of
# stderr, syslog, sql, and eventlog,
# depending on platform.

# These are relevant when logging to sql:
log_sql_table = 'pg_log'  # Table SQL formatted logs INSERT into
# Default is 'pg_log'

Adding new GUC variables isn't without cost; no reason to add a new one 
when there's a perfectly good destination one already whose syntax is 
completely appropriate for this task.


I personally would just ignore the problem of duration producing two 
entries per statement and make that the log analyzer software's job to fix, but I'm 
likely biased here because I don't ever do anything with that data.


My preference would be for the sql-logs to use the same variables as 
that of redirect_stderr. In the directory mentioned by the user on 
log_directory we just output the sql logs in a file with an '.SQL' 
appended to the filename specified by the user in log_filename. This 
also means we can make use of the log_truncation and log_rotate options 
(I hope. I will have a look at that).


You're talking along the same lines here I was trying to suggest in my 
previous message.  Keep all the current behavior as far as truncation and 
rotation go, just provide a different name for the file.  If you just 
appended a suffix like .sql to the existing name, that would remove 
another GUC variable requirement.  I think eventually people will complain 
about this, and want a separately formatted filename altogether, but 
there's nothing wrong with the approach you suggest for a first version of 
this feature.  I know I'm so desperate for a better solution to log 
importing that I'd be happy with any workable improvement.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] WIP patch - INSERT-able log statements

2007-02-20 Thread Greg Smith

On Tue, 20 Feb 2007, Tom Lane wrote:


A smaller problem is that this forces people to incur a gettimeofday
call for every message logged


I'm stumped trying to think of an application that would require importing 
the logs into a database to analyze them, but not need the timestamp. 
I'd expect it to be the primary key on the data.



Is it worth providing a knob to determine the set of columns emitted?


Guillaume and I felt that having the format be standardized had 
significant value from a downstream application perspective; it would be 
nice to know that everyone can work together to write one simple tool 
chain to process these things and it would work everywhere.  The current 
level of log output customization is part of what makes log analysis tools 
so much of a pain.


How about this as a simple way to proceed:  have the patch include 
everything, as Arul already planned.  When it's done, do some benchmarking 
with it turned on or off.  If it really seems like a drag, then consider a 
GUC addition to trim it down.  Why optimize prematurely?  It's not like 
this will be on by default. My guess is that the person sophisticated enough to 
analyze their logs probably has an installation that can support the 
overhead.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] WIP patch - INSERT-able log statements

2007-02-21 Thread Greg Smith

On Thu, 22 Feb 2007, FAST PostgreSQL wrote:

As we are triggering the sql output in log_destination, if the user 
gives 'syslog,sql' as options he is going to get two different looking 
logs (in terms of contents) depending upon his settings.


Yes, exactly; it's a good thing.  People add and remove things from the 
text logs to make them easier to read.  It's one of the reasons they're 
harder to process.  Since readability isn't a requirement for the SQL 
formatted ones, you can pack a lot more into there and make it available 
easily anyway.


I keep having every part of this conversation twice, so here's take two on 
this one.  The things that people want out of the text logs are not 
necessarily the same things they want from the SQL ones.  For example, I 
have a situation where the semantics of the syslog output is being driven 
by Sarbanes-Oxley related mechanics.  But the SQL logs are be based on my 
requirements, which is to include close enough to everything that it might 
as well be the whole set, in case I forgot something I find I need later. 
The SQL logs are *completely* different from the syslog setup.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-03-25 Thread Greg Smith

On Fri, 23 Mar 2007, ITAGAKI Takahiro wrote:


Here is the latest version of Load distributed checkpoint patch.


Couple of questions for you:

-Is it still possible to get the original behavior by adjusting your 
tunables?  It would be nice to do a before/after without having to 
recompile, and I know I'd be concerned about something so different 
becoming the new default behavior.


-Can you suggest a current test case to demonstrate the performance 
improvement here?  I've tried several variations on stretching out 
checkpoints like you're doing here and they all made slow checkpoint 
issues even worse on my Linux system.  I'm trying to evaluate this fairly.


-This code operates on the assumption you have a good value for the 
checkpoint timeout.  Have you tested its behavior when checkpoints are 
being triggered by checkpoint_segments being reached instead?


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-03-25 Thread Greg Smith

On Mon, 26 Mar 2007, ITAGAKI Takahiro wrote:

I'm assuming checkpoints are triggered by timeout in normal use -- and 
it's my recommended configuration whether the patch is installed or not.


I'm curious what other people running fairly serious hardware do in this 
area for write-heavy loads, whether it's timeout or segment limits that 
normally trigger their checkpoints.


I'm testing on a slightly different class of machine than your sample 
results, something that is in the 1500 TPS range running the pgbench test 
you describe.  Running that test, I always hit the checkpoint_segments 
wall well before any reasonable timeout.  With 64 segments, I get a 
checkpoint every two minutes or so.
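
(With the default 16 MB per WAL segment, 64 segments is roughly 1 GB of WAL 
every two minutes, or on the order of 8-9 MB/s of WAL traffic.)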


There's something I'm working on this week that may help out other people 
trying to test your patch out.  I've put together some simple scripts that 
graph (patched) pgbench results, which make it very easy to see what 
changes when you alter the checkpoint behavior.  The edges are still rough but 
the scripts work for me; I will be polishing and testing over the next few 
days:


http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm

(Note that the example graphs there aren't from the production system I 
mentioned above, they're from my server at home, which is similar to the 
system your results came from).


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



[PATCHES] pgbench transaction timestamps

2007-03-31 Thread Greg Smith
This patch changes the way pgbench outputs its latency log files so that 
every transaction gets a timestamp and notes which transaction type was 
executed.  It's a one-line change that just dumps some additional 
information that was already sitting in that area of code. I also made a 
couple of documentation corrections and clarifications on some of the more 
confusing features of pgbench.


It's straightforward to parse log files in this format to analyze what 
happened during the test at a higher level than was possible with the 
original format.  You can find some rough sample code to convert this 
latency format into CSV files and then into graphs at 
http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm which I'll 
be expanding on once I get all my little patches sent in here.
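
As an illustration of how simple that parsing is, here's a minimal converter 
from the extended latency log to CSV, written against nothing but the log 
format documented in the patch (the real scripts at that URL may differ):

/* latency2csv.c: read an extended pgbench latency log on stdin and write
 * CSV on stdout; a sketch only */
#include <stdio.h>
#include <time.h>

int
main(void)
{
    char    line[256];

    printf("timestamp,client,transaction,latency_us,file\n");
    while (fgets(line, sizeof(line), stdin))
    {
        int     client, xact, file_no;
        double  latency;
        long    epoch, usec;
        char    stamp[64];
        time_t  t;

        /* client_id transaction_no time file_no time-epoch time-us */
        if (sscanf(line, "%d %d %lf %d %ld %ld",
                   &client, &xact, &latency, &file_no, &epoch, &usec) != 6)
            continue;           /* skip lines in the old three-field format */

        t = (time_t) epoch;
        strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", localtime(&t));
        printf("%s.%06ld,%d,%d,%.0f,%d\n",
               stamp, usec, client, xact, latency, file_no);
    }
    return 0;
}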


If you recall the earlier version of this patch I submitted, it added a 
cleanup feature that did a vacuum and checkpoint after the test was 
finished and reported two TPS results.  The idea was to quantify how much 
of a hit the eventual table maintenance required to clean up after the 
test would take.  While those things do influence results and cause some 
of the run-to-run variation in TPS (checkpoints are particularly visible 
in the graphs), after further testing I concluded running a VACUUM VERBOSE 
and CHECKPOINT in a script afterwards and analyzing the results was more 
useful than integrating something into pgbench itself.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: contrib/pgbench/README.pgbench
===
RCS file: /projects/cvsroot/pgsql/contrib/pgbench/README.pgbench,v
retrieving revision 1.15
diff -c -r1.15 README.pgbench
*** contrib/pgbench/README.pgbench  21 Oct 2006 06:31:28 -  1.15
--- contrib/pgbench/README.pgbench  1 Apr 2007 00:57:48 -
***
*** 94,103 
default is 1.  NOTE: scaling factor should be at least
as large as the largest number of clients you intend
to test; else you'll mostly be measuring update contention.
  
-D varname=value
!   Define a variable. It can be refereed to by a script
!   provided by using -f option. Multile -D options are allowed.
  
-U login
Specify db user's login name if it is different from
--- 94,107 
default is 1.  NOTE: scaling factor should be at least
as large as the largest number of clients you intend
to test; else you'll mostly be measuring update contention.
+   Regular (not initializing) runs using one of the
+   built-in tests will detect scale based on the number of
+   branches in the database.  For custom (-f) runs it can
+   be manually specified with this parameter.
  
-D varname=value
!   Define a variable. It can be refered to by a script
!   provided by using -f option. Multiple -D options are allowed.
  
-U login
Specify db user's login name if it is different from
***
*** 139,147 
with the name "pgbench_log.xxx", where xxx is the PID
of the pgbench process. The format of the log is:
  
!   client_id transaction_no time
  
!   where time is measured in microseconds.
  
-d
debug option.
--- 143,157 
with the name "pgbench_log.xxx", where xxx is the PID
of the pgbench process. The format of the log is:
  
!   client_id transaction_no time file_no time-epoch time-us
  
!   where time is measured in microseconds, , the file_no is
!   which test file was used (useful when multiple were
!   specified with -f), and time-epoch/time-us are a
!   UNIX epoch format timestamp followed by an offset
!   in microseconds (suitable for creating a ISO 8601
!   timestamp with a fraction of a second) of when
!   the transaction completed.
  
-d
debug option.
***
*** 163,168 
--- 173,180 
  
(7) end;
  
+ If you specify -N, (4) and (5) aren't included in the transaction.
+ 
  o -f option
  
This supports for reading transaction script from a specified
Index: contrib/pgbench/pgbench.c
===
RCS file: /projects/cvsroot/pgsql/contrib/pgbench/pgbench.c,v
retrieving revision 1.62
diff -c -r1.62 pgbench.c
*** contrib/pgbench/pgbench.c   13 Mar 2007 09:06:35 -  1.62
--- contrib/pgbench/pgbench.c   1 Apr 2007 00:57:49 -
***
*** 461,467 
diff = (int) (now.tv_sec - st->txn_begin.tv_sec) * 
100.0 +
   

[PATCHES] Add usage counts to pg_buffercache

2007-03-31 Thread Greg Smith
This patch adds the usage count statistic to the information available in 
contrib/pg_buffercache.  Earlier this month a discussion about my first 
attempt to instrument the background writer had Tom asking for details 
about the usage histogram I was seeing, and this patch proved to be the 
easiest way I found to take a look at that.


In situations where one is trying to optimize the background writer, it's 
very hard to adjust how much to rely on the LRU writer versus the one that 
writes everything unless you know whether your dirty buffers are typically 
used heavily (like index blocks) or not (like new INSERT data).  Some 
statistics about the usage counts in your buffer cache are extremely 
helpful in making that decision.


I'll even pass along an ugly but fun query that utilizes this.  The 
following will give you a summary of your buffer cache broken into 32 
sections.  Each line shows the average usage count of that section, as a 
positive number if most buffers dirty and a negative one if most are 
clean.  If you refresh this frequently enough, you can actually watch 
things like how checkpoints move through the buffer cache:


SELECT current_timestamp,
       -- Split into 32 bins of data
       round(bufferid / (cast((select setting from pg_settings
                                where name='shared_buffers') as int) / (32 - 1.0)))
           as section,
       round(
           -- Average usage count, capped at 5
           case when avg(usagecount) > 5 then 5 else avg(usagecount) end *
           -- -1 when the majority are clean records, 1 when most are dirty
           (case when sum(case when isdirty then 1 else -1 end) > 0
                 then 1 else -1 end)
       ) as color_intensity
FROM pg_buffercache
GROUP BY round(bufferid / (cast((select setting from pg_settings
                                 where name='shared_buffers') as int) / (32 - 1.0)));


The 32 can be changed to anything; that's just what fits on my screen. 
The main idea of the above is that if you dump all this to a file 
regularly, it's possible to produce a graph of it showing how the cache 
has changed over time by assigning a different color intensity based on 
the usage count--at a massive cost in overhead, of course.  I'll be 
passing along all that code once I get it ready for other people to use.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: README.pg_buffercache
===
RCS file: /projects/cvsroot/pgsql/contrib/pg_buffercache/README.pg_buffercache,v
retrieving revision 1.3
diff -c -r1.3 README.pg_buffercache
*** README.pg_buffercache   26 Apr 2006 22:50:17 -  1.3
--- README.pg_buffercache   1 Apr 2007 02:17:15 -
***
*** 40,46 
 reldatabase| pg_database.oid  | Database for the relation.
 relblocknumber |  | Offset of the page in the relation.
 isdirty|  | Is the page dirty?
! 
  
There is one row for each buffer in the shared cache. Unused buffers are
shown with all fields null except bufferid.
--- 40,46 
 reldatabase| pg_database.oid  | Database for the relation.
 relblocknumber |  | Offset of the page in the relation.
 isdirty|  | Is the page dirty?
!usagecount |  | Page LRU count
  
There is one row for each buffer in the shared cache. Unused buffers are
shown with all fields null except bufferid.
***
*** 60,79 
  
regression=# \d pg_buffercache;
 View "public.pg_buffercache"
!Column |  Type   | Modifiers
!   +-+---
!bufferid   | integer |
!relfilenode| oid |
!reltablespace  | oid |
!reldatabase| oid |
!relblocknumber | bigint  |
!isdirty| boolean |
View definition:
 SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase, 
!   p.relblocknumber, p.isdirty
   FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid, 
   reltablespace oid, reldatabase oid, relblocknumber bigint, 
!  isdirty boolean);
  
regression=# SELECT c.relname, count(*) AS buffers
 FROM pg_class c INNER JOIN pg_buffercache b
--- 60,81 
  
regression=# \d pg_buffercache;
 View "public.pg_buffercache"
!Column |  Type| Modifiers
!   +--+---
!bufferid   | integer  |
!relfilenode| oid  |
!reltablespace  | oid  |
!reldatabase| oid  |
!relblocknumber | bigint   |
!isdirty| boolean  |
!usagecount | smallint |
! 
View definition:
 SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase, 
!   p.relblocknumber, p.isdirty, p.usagecount
   FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid, 
   reltablespace oid, reldatabase oid, relblocknumber bigint, 
!  isdirty boo

Re: [PATCHES] Add usage counts to pg_buffercache

2007-03-31 Thread Greg Smith

On Sun, 1 Apr 2007, Russell Smith wrote:

Currently the max buffer count is 5. But is that a completely safe 
assumption?  Maybe a compile time check that BM_MAX_USAGE_COUNT is < 16k 
would ensure that things don't go wrong?


I actually wasn't even aware that was a hard limit; I just assumed that 
all my systems just never got over 5 before the LRU hit them.  How 'bout 
that.


I'm all for paranoia, but the idea of letting LRU usage counts go over 16 
bits seems pretty unlikely with the current approach.  That would mean 
those pages would need 64K passes over the buffer cache before they could 
get evicted, which would take quite a while.  My bet is that bufferid gets 
expanded from 32 bits before that happens, which would break 
pg_buffercache similarly and isn't being checked for either.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



[PATCHES] Logging checkpoints and other slowdown causes

2007-03-31 Thread Greg Smith
This patch puts entries in the log file around some backend events that 
regularly cause client pauses, most of which are visible only when running 
with log_min_messages = debug2.  The main focus is regularly occurring 
checkpoints, from either running out of segments or hitting a timeout. 
This is an example of what gets written to the logs at checkpoint time:


DEBUG2:  checkpoint required (timeout passed)
DEBUG2:  checkpoint starting
DEBUG2:  checkpoint flushing buffer pool
DEBUG2:  checkpoint database fsync starting
DEBUG2:  checkpoint database fsync complete
DEBUG1:  checkpoint buffers dirty=16.7 MB (13.0%) write=174.8 ms 
sync=3716.5 ms
DEBUG2:  checkpoint complete; 0 transaction log file(s) added, 0 removed, 
0 recycled


The idea here is that if you want to really watch what your system is 
doing, either to track down a problem or for benchmarking, you log at 
DEBUG2 and timestamp the logs with something like log_line_prefix = 
'[%m]:'.  Then line up your timestamped transactions with the logs to 
figure out what happened during bad periods.  This is very easy to do with 
the timestamped pgbench latency logs for example, so you can put them in 
time sequence order and see something like "oh, those long transactions 
all came in between the fsync start and end".  The summary line with the 
write/sync timing information has a broader use and suggests when you need 
to go deeper, which is why I put that one at DEBUG1.


I also adjust some other log messages to make it easier to run the system 
at DEBUG2 without going completely crazy, like moving individual WAL 
segment recycles to DEBUG3 and standardizing the format of the fork/exit 
messages so they're easier to filter out.


My main concern about this code is how it adds several gettimeofday calls 
and complicated string building inside FlushBufferPool, all of which is 
wasted if this data isn't actually going to get logged.  I'd like to have 
something that tells me whether I should even bother, but I didn't want to 
poke into the elog variables for fear of adding a dependency on its 
current structure to here.  Ideally, it would be nice to call something 
like WillLog(DEBUG2) and get a boolean back saying whether something at 
that log level will be output anywhere; I don't know enough about the 
logging code to add such a thing to it myself.
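
For illustration, here's the sort of guard I have in mind.  WillLog() is 
hypothetical, not an existing backend function, and the comparison assumes 
elog's convention that a message reaches the server log when its level is at 
or above log_min_messages; a complete version would also need to account for 
client_min_messages and any other output targets:

/* hypothetical helper, not an existing backend API */
static bool
WillLog(int elevel)
{
    /* assumes messages at or above log_min_messages reach the server log;
     * ignores client_min_messages and other destinations */
    return elevel >= log_min_messages;
}

FlushBufferPool could then skip its gettimeofday() calls and string building 
entirely whenever WillLog(DEBUG2) returns false.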


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: src/backend/access/transam/xlog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.265
diff -c -r1.265 xlog.c
*** src/backend/access/transam/xlog.c   3 Mar 2007 20:02:26 -   1.265
--- src/backend/access/transam/xlog.c   1 Apr 2007 04:55:19 -
***
*** 1591,1600 
if (new_highbits != old_highbits ||
new_segno >= old_segno + 
(uint32) (CheckPointSegments-1))
{
! #ifdef WAL_DEBUG
!   if (XLOG_DEBUG)
!   elog(LOG, "time for a 
checkpoint, signaling bgwriter");
! #endif
RequestCheckpoint(false, true);
}
}
--- 1591,1598 
if (new_highbits != old_highbits ||
new_segno >= old_segno + 
(uint32) (CheckPointSegments-1))
{
!   ereport(DEBUG2,(errmsg(
!   "checkpoint required 
(wrote checkpoint_segments)")));
RequestCheckpoint(false, true);
}
}
***
*** 1858,1863 
--- 1856,1863 
(errcode_for_file_access(),
 errmsg("could not create file \"%s\": %m", 
tmppath)));
  
+   ereport(DEBUG2, (errmsg("WAL creating and filling new file on disk")));
+ 
/*
 * Zero-fill the file.  We have to do this the hard way to ensure that 
all
 * the file space has really been allocated --- on platforms that allow
***
*** 2565,2571 

   true, &max_advance,

   true))
{
!   ereport(DEBUG2,
(errmsg("recycled 
trans

Re: [PATCHES] bgwriter stats

2007-03-31 Thread Greg Smith

On Thu, 29 Mar 2007, Magnus Hagander wrote:


I've included a couple more counters per ideas from Greg Smith in his
logging patch.


I just submitted a patch that logs the remaining things of value from my 
original that couldn't be derived from the information you're collecting. 
Between them I'm happy that a lot of previously hidden performance issues 
can now be monitored--not necessarily easily, but it's at least possible.


I made one small change to your code in there as well I wanted to 
highlight here.  You updated the buffers written by checkpoints one at a 
time as they wrote out.  When I tried to develop something that monitored 
pg_stat_bgwriter looking for when checkpoints happened, this made it 
difficult to answer the important question "how many buffers did the last 
checkpoint write?" just from the stats structure because I assumed it's 
possible to get a view in the middle of the checkpoint.  It took watching 
both the total and the checkpoint count, and that was hard to work with.


I modified things so that the checkpoint buffers written number gets 
updated in one shot when the buffer flush is done.  No partial results, 
much easier to monitor:  when the buffers_checkpoint value changes, the 
difference from the previous value is what the last checkpoint wrote.  I 
needed that total anyway which is why I just slipped it into the other 
patch.
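
As a rough sketch of the kind of monitoring this enables (assuming the 
buffers_checkpoint column from Magnus's patch; the connection string and 
polling interval are arbitrary):

/* poll pg_stat_bgwriter and report how much each checkpoint wrote */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *con = PQconnectdb("dbname=postgres");
    long    prev = -1;

    if (PQstatus(con) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(con));
        return 1;
    }

    for (;;)
    {
        PGresult *res = PQexec(con,
                "SELECT buffers_checkpoint FROM pg_stat_bgwriter");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            long    cur = atol(PQgetvalue(res, 0, 0));

            if (prev >= 0 && cur != prev)
                printf("last checkpoint wrote %ld buffers\n", cur - prev);
            prev = cur;
        }
        PQclear(res);
        sleep(10);
    }
}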


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] bgwriter stats

2007-04-01 Thread Greg Smith

On Sun, 1 Apr 2007, Magnus Hagander wrote:


The data in pg_stat_bgwriter certainly doesn't update *during* a
checkpoint, if that's what you're saying.


Scratch previous message, replace with "dude ur code rulez!"

I was working on the assumption it was possible to get a partial result 
because I haven't had enough time to track all the code paths involved to 
prove otherwise. Glad to hear it was never an issue.  Doesn't change what 
I submitted though.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Logging checkpoints and other slowdown causes

2007-04-03 Thread Greg Smith

On Tue, 3 Apr 2007, Peter Eisentraut wrote:


Something that is aimed at a user should not be enabled at a "debug"
level.  Debug levels are for debugging, not for very high verbosity.


I asked for feedback about what level to log at when I initially sent the first 
version of this in and didn't hear anything back on that part, so I pushed 
these in line with other log messages I saw.  The messages for when 
checkpoints start and stop were both logged at DEBUG2, so I put progress 
reports on the other significant phases of the process there as well.


I don't expect these messages will be helpful for a normal user--that's 
what the new data in pg_stat_bgwriter is for.  The main purpose of this 
patch is debugging checkpoint related performance issues at a level I'd 
expect only a developer to work at; they're also helpful for someone 
writing benchmark code.


There are several patches in process floating around that aim to adjust 
either the background writer or the checkpoint process to reduce the 
impact of checkpoints.  This logging allows grading their success at that. 
As my tests with this patch in place suggest this problem is far from 
solved with any of the current suggestions, I'd like to get other 
developers looking at that problem the same way I have been; that's why 
I'd like to see some standardization on how checkpoints are instrumented. 
The fact that really advanced users might also use this for 
troubleshooting I consider a bonus rather than the main focus here.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Greg Smith

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

Unfortunately because of the recent instrumentation and CheckpointStartLock 
patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix 
the bitrot and send an updated patch, please?


The "Logging checkpoints and other slowdown causes" patch I submitted 
touches some of the same code as well; that's another possible merge 
coming depending on what order this all gets committed in.  Running into 
what I dubbed perpetual checkpoints was one of the reasons I started 
logging timing information for the various portions of the checkpoint, to 
tell when it was bogged down with slow writes versus being held up in sync 
for various (possibly fixed with your CheckpointStartLock) issues.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Greg Smith

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:


Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time
Load distributed checkpoints will do 2. in a much better way than the 
bgwriter_all_* guc options. I think we should remove that aspect of bgwriter 
in favor of this patch.


My first question about the LDC patch was whether I could turn it off and 
return to the existing mechanism.  I would like to see a large pile of 
data proving this new approach is better before the old one goes away.  I 
think everyone needs to do some more research and measurement here before 
assuming the problem can be knocked out so easily.


The reason I've been busy working on patches to gather statistics on this 
area of code is because I've tried most simple answers to getting the 
background writer to work better and made little progress, and I'd like to 
see everyone else doing the same at least collecting the right data.


Let me suggest a different way of looking at this problem.  At any moment, 
some percentage of your buffer pool is dirty.  Whether it's 0% or 100% 
dramatically changes what the background writer should be doing.  Whether 
most of the data is usage_count>0 or not also makes a difference.  None of 
the current code has any idea what type of buffer pool they're working 
with, and therefore they don't have enough information to make a 
well-informed prediction about what is going to happen in the near future.


I'll tell you what I did to the all-scan.  I ran a few hundred hours worth 
of background writer tests to collect data on what it does wrong, then 
wrote a prototype automatic background writer that resets the all-scan 
parameters based on what I found.  It keeps a running estimate of how 
dirty the pool at large is using a weighted average of the most recent 
scan with the past history.  From there, I have a simple model that 
predicts how much of the buffer we can scan in any interval, and intends 
to enforce a maximum bound on the amount of physical I/O you're willing to 
stream out.  The beta code is sitting at 
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to 
see what I've done so far.  The parts that are done work fine--as long as 
you give it a reasonable % to scan by default, it will correct 
all_max_pages and the interval in real time to meet the scan rate you 
requested given how much is currently dirty; the I/O rate is 
computed but doesn't limit properly yet.


Why haven't I brought this all up yet?  Two reasons.  The first is because 
it doesn't work on my system; checkpoints and overall throughput get worse 
when you try to shorten them by running the background writer at optimal 
aggressiveness.  Under really heavy load, the writes slow down as all the 
disk caches fill, the background writer fights with reads on the data that 
isn't in the mostly dirty cache (introducing massive seek delays), it 
stops cleaning effectively, and it's better for it to not even try.  My 
next generation of code was going to start with the LRU flush and then 
only move onto the all-scan if there's time leftover.


The second is that I just started to get useful results here in the last 
few weeks, and I assumed it's too big of a topic to start suggesting major 
redesigns to the background writer mechanism at that point (from me at 
least!).  I was waiting for 8.3 to freeze before even trying.  If you want 
to push through a redesign there, maybe you can get away with it at this 
late moment.  But I ask that you please don't remove anything from the 
current design until you have significant test results to back up that 
change.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Greg Smith

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

The purpose of the bgwriter_all_* settings is to shorten the duration of 
the eventual checkpoint. The reason to shorten the checkpoint duration 
is to limit the damage to other I/O activity it causes. My thinking is 
that assuming the LDC patch is effective (agreed, needs more testing) at 
smoothening the checkpoint, the duration doesn't matter anymore. Do you 
want to argue there's other reasons to shorten the checkpoint duration?


My testing results suggest that LDC doesn't smooth the checkpoint usefully 
when under a high (>30 client here) load, because (on Linux at least) the 
way the OS caches writes clashes badly with how buffers end up being 
evicted if the buffer pool fills back up before the checkpoint is done. 
In that context, anything that slows down the checkpoint duration is going 
to make the problem worse rather than better, because it makes it more 
likely that the tail end of the checkpoint will have to fight with the 
clients for write bandwidth, at which point they both suffer.  If you just 
get the checkpoint done fast, the clients can't fill the pool as fast as 
the BufferSync is writing it out, and things are as happy as they can be 
without a major rewrite to all this code.  I can get a tiny improvement in 
some respects by delaying 2-5 seconds between finishing the writes and 
calling fsync, because that gives Linux a moment to usefully spool some of 
the data to the disk controller's cache; beyond that any additional delay 
is a problem.


Since it's only the high load cases I'm having trouble dealing with, this 
basically makes it a non-starter for me.  The focus on checkpoint_timeout 
and ignoring checkpoint_segments in the patch is also a big issue for me. 
At the same time, I recognize that the approach taken in LDC probably is a 
big improvement for many systems, it's just a step backwards for my 
highest throughput one.  I'd really enjoy hearing some results from 
someone else.


The number of buffers evicted by normal backends in a bgwriter_delay period 
is simple to keep track of, just increase a counter in StrategyGetBuffer and 
reset it when bgwriter wakes up.


I see you've already found the other helpful Itagaki patch in this area. 
I know I would like to see his code for tracking evictions committed, then 
I'd like that to be added as another counter in pg_stat_bgwriter (I 
mentioned that to Magnus in passing when he was setting the stats up but 
didn't press it because of the patch dependency).  Ideally, and this idea 
was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug 
hook, I think it's important that you know how every buffer written to 
disk got there--was it a background writer, a checkpoint, or an eviction 
that wrote it out?  Track all those and you can really learn something 
about your write performance, data that's impossible to collect right now.


However, as Itagaki himself points out, doing something useful with 
bgwriter_lru_maxpages is only one piece of automatically tuning the 
background writer.  I hate to join in on chopping his patches up, but 
without some additional work I don't think the exact auto-tuning logic he 
then applies will work in all cases, which could make it more a problem 
than the current crude yet predictable method.  The whole way 
bgwriter_lru_maxpages and num_to_clean play off each other in his code 
currently has a number of failure modes I'm concerned about.  I'm not sure 
if a re-write using a moving-average approach (as I did in my auto-tuning 
writer prototype and as Tom just suggested here) will be sufficient to fix 
all of them.  Was already on my to-do list to investigate that further.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-04-05 Thread Greg Smith

On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

could anyone evaluate O_SYNC approach again that commercial databases 
use and tell me if and why PostgreSQL's fsync() approach is better than 
theirs?


I noticed a big improvement switching the WAL to use O_SYNC (+O_DIRECT) 
instead of fsync on my big and my little servers with battery-backed 
cache, so I know sync writes perform reasonably well on my hardware. 
Since I've had problems with the fsync at checkpoint time, I did a similar 
test to yours recently, adding O_SYNC to the open calls and pulling the 
fsyncs out to get a rough idea how things would work.


Performance was reasonable most of the time, but when I hit a checkpoint 
with a lot of the buffer cache dirty it was incredibly bad.  It took 
minutes to write everything out, compared with a few seconds for the 
current case, and the background writer was too sluggish as well to help. 
This appears to match your data.


If you compare how Oracle handles their writes and checkpoints to the 
Postgres code, it's obvious they have a different architecture that 
enables them to support sync writing usefully.  I'd recommend the Database 
Writer Process section of 
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm 
as an introduction for those not familiar with that; it's interesting 
reading for anyone tinkering with background writer code.


It would be great to compare performance of the current PostgreSQL code 
with a fancy multiple background writer version using the latest sync 
methods or AIO; there have actually been multiple updates to improve 
O_SYNC writes within Linux during the 2.6 kernel series that make this 
more practical than ever on that platform.  But as you've already seen, 
the performance hurdle to overcome is significant, and it would have to be 
optional as a result.  When you add all this up--have to keep the current 
non-sync writes around as well, need to redesign the whole background 
writer/checkpoint approach around the idea of sync writes, and the 
OS-specific parts that would come from things like AIO--it gets real 
messy.  Good luck drumming up support for all that when the initial 
benchmarks suggest it's going to be a big step back.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V3

2007-04-06 Thread Greg Smith

On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:


Hmm... what makes you think that sync writes is useful for Oracle and
not for PostgreSQL?


They do more to push checkpoint-time work in advance, batch writes up more 
efficiently, and never let clients do the writing.  All of which make for 
a different type of checkpoint.


Like Simon points out, even if it were conceivable to mimic their design 
it might not even be legally feasible.  The point I was trying to make is 
this:  you've been saying that Oracle's writing technology has better 
performance in this area, which is probably true, and suggesting the cause 
of that was their using O_SYNC writes.  I wanted to believe that and even 
tested out a prototype.  The reality here appears to be that their 
checkpoints go smoother *despite* using the slower sync writes because 
they've built their design around the limitations of that write method.


I suspect it would take a similar scale of redesign to move Postgres in 
that direction; the issues you identified (the same ones I ran into) are 
not so easy to resolve.  You're certainly not going to move anybody in 
that direction by throwing a random comment into a discussion on the 
patches list about a feature useful *right now* in this area.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V4

2007-04-23 Thread Greg Smith

On Thu, 19 Apr 2007, Heikki Linnakangas wrote:

In the sync phase, we sleep between each fsync until enough time/segments 
have passed, assuming that the time to fsync is proportional to the file 
length. I'm not sure that's a very good assumption.


I've been making scatter plots of fsync time vs. amount written to the 
database for a couple of months now, and while there's a trend there it's 
not a linear one based on data written.  Under Linux, to make a useful 
prediction about how long a fsync will take you first need to consider how 
much dirty data is already in the OS cache (the "Dirty:" figure in 
/proc/meminfo) before the write begins, relative to the kernel parameters 
that control write behavior.  Combine that with some knowledge of the 
caching behavior of the controller/disk combination you're using, and it's 
just barely possible to make a reasonable estimate.  Any less information 
than all that and you really have very little basis on which to guess how 
long it's going to take.
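
For reference, the "Dirty:" figure mentioned above can be sampled with 
something like this (Linux-specific, and only a sketch):

#include <stdio.h>

/* return the current dirty page cache size in kB, or -1 if unknown */
static long
linux_dirty_kb(void)
{
    FILE   *f = fopen("/proc/meminfo", "r");
    char    line[128];
    long    kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f))
    {
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}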


Other operating systems are going to give completely different behavior 
here, which of course makes the problem even worse.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Logging checkpoints and other slowdown causes

2007-05-11 Thread Greg Smith

On Fri, 11 May 2007, Heikki Linnakangas wrote:

Printing the messages at LOG-level would bring the code in line with the 
documentation, but I don't think we want to fill the log with checkpoint 
chatter unless the DBA explicitly asks for that.  How about INFO?


In putting that together, I pushed everything toward the lowest DEBUG 
level that it was possible for it to run at without the messages I wanted 
to see being completely lost.  I didn't want this to get rejected just on 
the basis of being too chatty for regular users.  Putting a single line in 
for each checkpoint at INFO, with all the individual progress points being 
at DEBUG1 or 2, would be close to optimal as far as I'm concerned.  More 
on this below.


Or we could add a GUC variable similar to log_connections or log_statement to 
control if the messages are printed or not, and use LOG.


If the levels for the individual messages are adjusted usefully the 
ability to control whether they show up or not falls out of the existing 
log level adjustments; I haven't ever felt a strong need for GUC when 
using this code.  If, as you say, there's already a move in this 
direction, then fine--another GUC would be better.


We don't need to print the times elapsed in each phase on a separate 
line, that's just derived information from the other lines, unless we 
use different log-levels for detail lines


Let me turn this around for a second and ask you this:  what do you think 
people are going to do with this information?  I can tell you what I do. 
I parse the logs and look for that single line with the summary 
information.  I then take the three pieces of information it provides (MB 
written, time to write, time to fsync) and save them into a table.  From 
there I generate stats, scatter plots, all sorts of useful stuff.
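
The extraction step can be as simple as this, written against the summary 
line format shown in my earlier patch posting and assuming the 
log_line_prefix and "DEBUG1:" tag have already been stripped off (a sketch 
only):

#include <stdio.h>

/* pull the MB written, write time, and fsync time out of one summary line */
static int
parse_checkpoint_summary(const char *line,
                         double *mb, double *write_ms, double *sync_ms)
{
    double  pct;

    return sscanf(line,
           "checkpoint buffers dirty=%lf MB (%lf%%) write=%lf ms sync=%lf ms",
                  mb, &pct, write_ms, sync_ms) == 4;
}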


If you know when the checkpoint ended, and you know how long each of the 
pieces took, you can reconstruct the other times easily.  The way you 
describe this it is true--that the summary is redundant given the 
detail--but if you put yourself in the shoes of a log file parser the 
other way around is easier to work with.  Piecing together log entries is 
a pain, splitting them is easy.


If I had to only keep one line out of this, it would be the one with the 
summary.  It would be nice to have it logged at INFO.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages

2007-05-12 Thread Greg Smith
Attached are two patches that try to recast the ideas of Itagaki 
Takahiro's auto bgwriter_lru_maxpages patch in the direction I think this 
code needs to move.  Epic-length commentary follows.


The original code came from before there was a pg_stat_bgwriter.  The 
first patch (buf-alloc-stats) takes the two most interesting pieces of 
data the original patch collected, the number of buffers allocated 
recently and the number that the clients wrote out, and ties all that into 
the new stats structure.  With this patch applied, you can get a feel for 
things like churn/turnover in the buffer pool that were very hard to 
quantify before.  Also, it makes it easy to measure how well your 
background writer is doing at writing buffers so the clients don't have 
to.  Applying this would complete one of my personal goals for the 8.3 
release, which was having stats to track every type of buffer write.


I split this out because I think it's very useful to have regardless of 
whether the automatic tuning portion is accepted, and I think these 
smaller patches make the review easier.  The main thing I would recommend 
someone check is how am_bg_writer is (mis?)used here.  I spliced some of 
the debugging-only code from the original patch, and I can't tell if the 
result is a robust enough approach to solving the problem of having every 
client indirectly report their activity to the background writer.  Other 
than that, I think this code is ready for review and potentially 
committing.


The second patch (limit-lru) adds on top of that a constraint on the LRU 
writer so that it doesn't do any more work than it has to.  Note that I 
left verbose debugging code in here because I'm much less confident this 
patch is complete.


It predicts upcoming buffer allocations using a 16-period weighted moving 
average of recent activity, which you can think of as the last 3.2 seconds 
at the default interval.  After testing on a few systems, that seemed a decent 
compromise of smoothing in both directions.  I found the 2X overallocation 
fudge factor of the original patch way too aggressive, so the code just picks the 
larger of the most recent allocation amount or the smoothed value.  The 
main thing that throws off the allocation estimation is when you hit a 
checkpoint, which can give a big spike after the background writer returns 
to BgBufferSync and notices all the buffers that were allocated during the 
checkpoint write; the code then tries to find more buffers it can recycle 
than it needs to.  Since the checkpoint itself normally leaves a large 
wake of reusable buffers behind it, I didn't find this to be a serious 
problem.


There's another communication issue here, which is that SyncOneBuffer 
needs to return more information about the buffer than it currently does 
once it gets it locked.  The background writer needs to know more than 
just whether the buffer was written in order to tune itself.  The original patch used a clever 
trick for this which worked but I found confusing.  I happen to have a 
bunch of other background writer tuning code I'm working on, and I had to 
come up with a more robust way to communicate buffer internals back via 
this channel.  I used that code here; it's a bitmask setup similar to how 
flags like BM_DIRTY are used.  It's overkill for solving this particular 
problem, but I think the interface is clean and it helps support future 
enhancements in intelligent background writing.


Now we get to the controversial part.  The original patch removed the 
bgwriter_lru_maxpages parameter and updated the documentation accordingly. 
I didn't do that here.  The reason is that after playing around in this 
area I'm not convinced yet I can satisfy all the tuning scenarios I'd like 
to be able to handle that way.  I describe this patch as enforcing a 
constraint instead; it allows you to set the LRU parameters much higher 
than was reasonable before without having to be as concerned about the LRU 
writer wasting resources.


I already brought up some issues in this area on -hackers ( 
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php ) but my 
work hasn't advanced as fast as I'd hoped.  I wanted to submit what I've 
finished anyway because I think any approach here is going to have to cope 
with the issues addressed in these two patches, and I'm happy now with how 
they're solved here.  It's only a one-line delete to disable the LRU 
limiting behavior of the second patch, at which point it's strictly 
internals code with no expected functional impact that alternate 
approaches might be built on.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /d3/pgsql/cvs/pgsql-local/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 bufmgr.c
*** s

Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages

2007-05-13 Thread Greg Smith
| buffer_state == BUF_REUSABLE_WRITTEN)


And that was a pain all around; I kept having to stare at the table above 
to make sure the code was correct.  Also, in order to pass back full 
usage_count information I was back to either pointers or bitshifting 
anyway.  While this particular patch doesn't need the usage count, the 
later ones I'm working on do, and I'd like to get this interface complete 
while it's being tinkered with anyway.


Or how about moving the checks for dirty and pinned buffers from 
SyncOneBuffer to the callers?


There are 3 callers to SyncOneBuffer, and almost all the code is shared 
between them.  Trying to push even just the dirty/pinned stuff back into 
the callers would end up being a cut and paste job that would duplicate 
many lines.  That's on top of the fact that the buffer is cleanly 
locked/unlocked all in one section of code right now, and I didn't see how 
to move any parts of that to the callers without disrupting that clean 
interface.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages

2007-05-14 Thread Greg Smith

On Mon, 14 May 2007, ITAGAKI Takahiro wrote:


BTW, your patch will cut LRU writes short, but will not encourage it to
do more work. So should we set more aggressive values for bgwriter_lru_percent
and bgwriter_lru_maxpages as defaults?


Setting a bigger default maximum is one possibility I was thinking about. 
Since the whole background writer setup is kind of complicated, the other 
thing I was working on is writing a guide on how to use the new 
pg_stat_bgwriter information to figure out if you need to increase 
bgwriter_[all|lru]_pages (and the other parameters too).  It makes it much 
easier to write that if you can say "You can safely set 
bgwriter_lru_maxpages high because it only writes what it needs to based 
on your usage".


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] [DOCS] OS/X startup scripts

2007-05-14 Thread Greg Smith

On Mon, 14 May 2007, Gregory Stark wrote:


Personally I find CVS so terribly slow for large trees like Postgres that it's
essential to use rsync to maintain a local CVS repository. That makes 'cvs
diff' remarkably fast.


Having recently tried to get this to work right and not quite nailed it 
down yet, I know it would be a help to new developers if someone using 
this technique would write out a suggested setup/workflow for generating 
patches this way.  Right now the best instructions out there are those 
that go with the buildfarm


http://pgfoundry.org/docman/view.php/140/4/PGBuildFarm-HOWTO.txt

and those aren't quite focused right if the goal is to work on new patches 
while keeping in sync with the repository.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages

2007-05-14 Thread Greg Smith

On Mon, 14 May 2007, Heikki Linnakangas wrote:


If it's safe to set it high, let's default it to infinity.


The maximum right now is 1000, and that would be a reasonable new default. 
You really don't want to write more than 1000 per interval anyway without 
taking a break for checkpoints; the more writes you do at once, the higher 
the chances are you'll have the whole thing stall because the OS makes you 
wait for a write (this is not a theoretical comment; I've watched it 
happen when I try to get the BGW doing too much).


If someone has so much activity that they're allocating more than that 
during a period, they should shrink the delay instead.  The kinds of 
systems where 1000 isn't high enough for bgwriter_lru_maxpages are going 
to be compelled to adjust these parameters anyway for good performance.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages

2007-05-14 Thread Greg Smith

On Mon, 14 May 2007, Tom Lane wrote:


If you can write something like that, why do we need the parameter at all?


Couple of reasons:

-As I already mentioned in my last message, I think it's unwise to let the 
LRU writes go completely unbounded.  I still think there should be a 
maximum, and if there is one it should be tunable.  You can get into 
situations where the only way to get the LRU writer to work at all is to 
set the % to scan fairly high, but that exposes you to way more writes 
than you might want per interval in situations where buffers to write are 
easy to find.


-There is considerable coupling between how the LRU and the all background 
writers work.  There are workloads where the LRU writer is relatively 
ineffective, and only the all one really works well.  If there is a 
limiter on the writes from the all writer, but not on the LRU, admins may 
not be able to get the balance between the two they want.  I know I 
wouldn't.


-Just because I can advise what is generally the right move, that doesn't 
mean it's always the right one.  Someone may notice that the maximum pages 
written limit is being nailed and not care.


The last system I really got deep into the background writer mechanics on, 
it could be very effective at improving performance and reducing 
checkpoint spikes under low to medium loads.  But under heavy load, it 
just got in the way of the individual backends running, which was 
absolutely necessary in order to execute the LRU mechanics (usage_count--) 
so less important buffers could be kicked out.  I would like people to 
still be able to set a tuning such that the background writers were useful 
under average loads, but didn't ever try to do too much.  It's much more 
difficult to do that if bgwriter_lru_maxpages goes away.


I realized recently the task I should take on here is to run some more 
experiments with the latest code and pass along suggested techniques for 
producing/identifying the kind of problem conditions I've run into in the 
past; then we can see if other people can reproduce them.  I got a new 
8-core server I need to thrash anyway and will try and do just that 
starting tomorrow.


For all I know my concerns are strictly a rare edge case.  But since the 
final adjustments to things like whether there is an upper limit or not 
are very small patches compared to what's already been done here, I sent 
in what I thought was ready to go because I didn't want to hold up 
reviewing the bulk of the code over some of these fine details.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] COPY-able csv log outputs

2007-05-20 Thread Greg Smith
I got a chance to review this patch over the weekend.  Basic API seems 
good, met all my requirements, no surprises with how the GUC variable 
controlled the feature.


The most fundamental issue I have with the interface is that using COPY 
makes it difficult to put any unique index on the resulting table.  I like 
to have a unique index on my imported log table because it rejects the 
dupe records if you accidentally import the same section of log file 
twice.  COPY tosses the whole thing if there's an index violation, which 
is a problem during a regular import because you will occasionally come 
across lines with the same timestamp that are similar in every way except 
for their statement; putting an index on the timestamp+statement seems 
impractical.


I've had a preference for INSERT from the beginning here that this 
reinforces.  I'm planning to just work around this issue by doing the COPY 
into a temporary table and then INSERTing from there.  I didn't want to 
just let the concern pass by without mentioning it though.  It crosses my 
mind that inserting some sort of unique log file line ID number would 
prevent the dupe issue and make for better ordering (it's possible to have 
two lines with the same timestamp show up in the wrong order now), not 
sure that's a practical idea to consider.
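
Here's a sketch of that workaround; the staging table and the dedup key 
(timestamp plus pid) are just hypothetical, since the final csvlog column 
list is still being settled:

-- Stage the raw CSV, then move over only rows not already imported,
-- so a duplicated section of log file doesn't abort the whole load.
CREATE TEMP TABLE pglog_stage (LIKE pglog);

COPY pglog_stage FROM '/opt/pgsql/testlog.csv' WITH CSV;

INSERT INTO pglog
SELECT s.*
  FROM pglog_stage s
  LEFT JOIN pglog p ON p.log_time = s.log_time
                   AND p.process_id = s.process_id
 WHERE p.log_time IS NULL;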


The basic coding of the patch seemed OK to me, but someone who is much 
more familiar than myself with the mechanics of pipes should take a look 
at that part of the patch before committing; it's complicated code and I 
can't comment on it.  There are some small formatting issues that need to 
be fixed, particularly in the host+port mapping.  I can fix those myself 
and submit a slightly updated patch.  There's some documentation 
improvements I want to make before this goes in as well.


The patch is actually broken fairly hard right now because of the switch 
from INSERT to COPY FROM CSV as the output format at the last minute.  It 
outputs missing fields as NULL (fine for INSERT), which chokes the CSV 
import when the session_start timestamp is missing.  All of those NULL 
values need to be replaced with nothing for proper CSV syntax; there 
should just be the comma for the next field.  I worked around this with


copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';

I can fix that too when I'm revising.  I plan to have a version free of 
obvious bugs to re-submit ready by next weekend.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] COPY-able csv log outputs

2007-05-21 Thread Greg Smith

On Sun, 20 May 2007, Andrew Dunstan wrote:


Does the format not include the per-process line number?


It does not, and I never noticed that among the prefix 
possibilities---it never seemed important before!  The combination of 
timestamp/pid/line (%t %p %l) looks like a useful and unique key here, so 
I'll add another column for the line number to the output.  Thanks for 
pointing that out, I can finish cleaning up all the functional 
implementation work on this patch myself now.
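
For reference, the sort of key that combination suggests is below; the 
column names are placeholders for whatever the final CSV column list ends 
up being:

CREATE UNIQUE INDEX pglog_natural_key
    ON pglog (log_time, process_id, session_line_num);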


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] COPY-able csv log outputs

2007-05-28 Thread Greg Smith

On Sun, 20 May 2007, Andrew Dunstan wrote:

I've had a preference for INSERT from the beginning here that this 
reinforces.
COPY is our standard bulk insert mechanism. I think arguing against it would 
be a very hard sell.


Let me say my final piece on this subject...if I considered this data to 
be strictly bulk insert, then I'd completely agree here.  Most of the 
really interesting applications I was planning to build on top of this 
mechanism are more interactive than that though.  Here's a sample:


-Write a daemon that lives on the server, connects to a logging database, 
and pops into an idle loop based on LISTEN.
-A client app wants to see the recent log files.  It uses NOTIFY to ask 
the daemon for them and LISTENs for a response.
-The daemon wakes up, reads all the log files since it last did something, 
and appends those records to the log file table.  It sends out a NOTIFY to 
say the log file table is current.


That enables remote clients to grab the log files from the server whenever 
they please, so they can actually monitor themselves.  Benchmarking is the 
initial app I expect to call this, and with some types of tests I expect 
the daemon to be importing every 10 minutes or so.
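
A bare-bones sketch of the signaling half of that design, using made-up 
channel names (the daemon side sits in a LISTEN loop in whatever client 
language it's written in):

-- Client: ask for fresh logs, then wait for the daemon's answer
LISTEN logs_current;
NOTIFY logs_wanted;

-- Daemon (already listening on logs_wanted): import the new log lines
-- into the log table, then signal that the table is current
NOTIFY logs_current;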


Assuming a unique index on the data to prevent duplication is a required 
feature, I can build this using the COPY format logs as well, but that 
requires I either a) am 100% perfect in making sure I never pass over the 
same data twice, which is particularly annoying when the daemon gets 
restarted, or b) break the COPY into single lines and insert them one at a 
time, at which point I'm not bulk loading at all.  If these were INSERT 
statements instead, I'd have a lot more tolerance for error, because the 
worst problem I'd ever run into is spewing some unique key violation 
errors into the logs if I accidentally imported too much.  With COPY, any 
mistake or synchronization issue and I lose the whole import.


I don't mean to try and stir this back up again as an argument 
(particularly not on this list).  There are plenty of other apps where 
COPY is clearly the best approach, you can easily make a case that my app 
is a fringe application rather than a mainstream one, and on the balance 
this job is still far easier than my current approach of parsing the logs. 
I just wanted to give a sample of how using COPY impacts the dynamics of 
how downstream applications will have to work with this data, so you can 
see that my contrary preference isn't completely random here.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Seq scans status update

2007-05-28 Thread Greg Smith

On Mon, 28 May 2007, Tom Lane wrote:

But maybe that could be fixed if the clock sweep doesn't touch the 
usage_count of a pinned buffer.  Which in fact it may not do already --- 
didn't look.


StrategyGetBuffer doesn't care whether the buffer is pinned or not; it 
decrements the usage_count regardless.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Logging checkpoints and other slowdown causes

2007-05-28 Thread Greg Smith
I'll take another stab at refining this can of worms I opened.  The one 
thing I noticed on a quick review is that it's almost possible to skip all 
the calls to gettimeofday if log_checkpoints is off now.  I'd like to make 
that a specific goal, because that will make me feel better that adding 
this code has almost no performance impact relative to now unless you turn 
the feature on.


I agree with Simon that tracking create/drop separately is unnecessary. 
As for why all the timing info is in ms, given the scale of the numbers 
typically encountered I found it easier to work with.  I originally wanted 
resolution down to 0.1ms if the underlying OS supports it, which means 4 
figures to the right of the decimal point if the unit was switched to 
seconds.  Quite often the times reported are less than 100ms, so you'll 
normally be dealing with fractional part of a second.  If we take Heikki's 
example:


LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms 
sync=32.0 ms


And switch it to seconds:

LOG:  checkpoint complete; buffers written=3.1 MB (9.6%) write=0.0968 s 
sync=0.0320 s


I don't find that as easy to work with.  The only way a timing in seconds 
would look OK is if the resolution of the whole thing is reduced to ms, 
which then makes 3 decimal points--easy to read as ms instead.  Having 
stared at a fair amount of this data now, that's probably fine; I'll 
collect up some more data on it from a fast server this week to confirm 
whether it's worthless precision or worth capturing.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] WIP: 2nd-generation buffer ring patch

2007-05-29 Thread Greg Smith

On Tue, 29 May 2007, Tom Lane wrote:

Do we have any decent way of measuring the effectiveness of the 
clock-sweep allocation algorithm?


I put a view on top of the current pg_buffercache (now that it include 
usage_count) that shows what the high usage_count buffers consist of. 
Since they were basically what I hoped for (like plenty of index blocks on 
popular tables) that seemed a reasonable enough measure of effectiveness 
for my purposes.  I briefly looked into adding some internal measurements 
in this area, like how many buffers are scanned on average to satisfy an 
allocation request; that would actually be easy to add to the buffer 
allocation stats part of the auto bgwriter_max_pages patch I submitted 
recently.
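
The view I use isn't in any patch; a minimal version of the idea looks 
something like this, using the contrib/pg_buffercache columns (the 
relfilenode join only resolves names for relations in the current 
database):

CREATE VIEW buffercache_summary AS
SELECT c.relname,
       b.usagecount,
       count(*) AS buffers,
       sum(CASE WHEN b.isdirty THEN 1 ELSE 0 END) AS dirty_buffers
  FROM pg_buffercache b
  JOIN pg_class c ON c.relfilenode = b.relfilenode
 WHERE b.reldatabase = (SELECT oid FROM pg_database
                        WHERE datname = current_database())
 GROUP BY c.relname, b.usagecount
 ORDER BY b.usagecount DESC, buffers DESC;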


Based on my observations of buffer cache statistics, the number of pinned 
buffers at any time is small enough that in a reasonably sized buffer 
cache, I wouldn't expect a change in the pinned usage_count behavior to 
have any serious impact.  With what you're adjusting, the only time I can 
think of that there would be a noticable shift in fairness would be if 
one's buffer cache was very small relative to the number of clients, which 
is kind of an unreasonable situation to go out of your way to accommodate.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Logging checkpoints and other slowdown causes

2007-05-29 Thread Greg Smith

On Tue, 29 May 2007, Heikki Linnakangas wrote:

The checkpoint will take at least a couple of seconds on any interesting 
system, so 0.1 s resolution should be enough IMHO.


You may be underestimating the resources some interesting systems are 
willing to put into speeding up checkpoints.  I'm sometimes dumping into a 
SAN whose cache is bigger than the shared_buffer cache in the server, and 
0.1s isn't really enough resolution in that situation.  A second is a 
really long checkpoint there.  Since even that's limited by fiber-channel 
speeds, I know it's possible to do better than what I'm seeing with 
something like a PCIe host adapter having on-board cache in the GB range 
(which isn't that expensive nowadays).


Also, even if the checkpoint total takes seconds, much of that is in the 
sync phase; the write time can still be in the small number of ms range, 
and I wouldn't want to see that truncated too much.


Anyway, I have a bunch of data on this subject being collected at this 
moment, and I'll rescale the results based on what I see after analyzing 
that this week.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] COPY-able csv log outputs

2007-06-01 Thread Greg Smith

On Fri, 1 Jun 2007, Andrew Dunstan wrote:


Greg Smith wrote:
Since the rotation size feature causes other issues anyway 
that make importing more complicated, documenting the issue seemed 
sufficient.


What are the other issues? I'm not happy about producing files with split 
lines.


Just that it's fairly simple and predictable to know what your log files 
are going to be named and when they'll rollover if you use something like 
a date-based naming convention--but when you add size-based rotation into 
the mix figuring out what files you need to import and when you should 
import them gets more complicated.


Clearly fixing this issue altogether would be better, and I gather the 
problem may extend to any time there is a switch to a new log file; my 
"workaround" doesn't appear good enough anyway.  I'm very glad I caught 
and mentioned this now.


Because of the extra naming/import complexity, it still might be 
worthwhile to suggest people not combine size-based rotation and the 
csvlog though.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] trace_checkpoint parameter patch

2007-06-12 Thread Greg Smith

On Tue, 12 Jun 2007, Satoshi Nagayasu wrote:


Here is a brand new patch to log a checkpointing load information
to tune the bgwriter parameter.


There is a work in progress patch that logs this and additional checkpoint 
information being reviewed in hopes of making it into the 8.3 build.  See 
http://archives.postgresql.org/pgsql-patches/2007-05/msg00455.php for the 
latest version.


The idea of using pg_rusage_init is a new one though; I hadn't thought the 
CPU usage info was interesting enough to figure out how to collect it. 
The way the patch mentioned above works, it would be hard to squeeze it into 
the line usefully for formatting reasons.


I attempted to print block in / out using getrusage(), but I couldn't 
get them because they were always zero (on my linux).


I don't know what's wrong, but the I/O here is pretty simple:  the 
checkpoint wrote some amount of data that you can compute the size of 
easily within the code knowing the block size.  That's already done in the 
patch under review.


If you're interested in this area, you should check out the 
pg_stat_bgwriter feature already in the 8.3 CVS, look through the 
pgsql-hackers archives for the discussion this week on the topic 
"Controlling Load Distributed Checkpoints", and check out the "Automatic 
adjustment of bgwriter_lru_maxpages" patch whose latest version is at 
http://archives.postgresql.org/pgsql-patches/2007-05/msg00142.php


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Greg Smith

On Fri, 22 Jun 2007, Tom Lane wrote:


Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.


The entire background writer mess needs a rewrite, and the best way to 
handle that is going to shift considerably with LDC applied.


As the person who was complaining about corner cases I'm not in a position 
to talk more explicitly about, I can at least summarize my opinion of how 
I feel everyone should be thinking about this patch and you can take what 
you want from that.  In the context of getting an 8.3 release finalized, I 
think you should be building a LDC implementation that accomplishes the 
main goal of spreading the checkpoints out, which is clearly working, and 
erring on the side of more knobs in cases where it's unsure if they're 
needed or not.  It's clear what my position is on which non-obvious knobs I 
think are important for some people.


Which is worse:  putting in a tuning setting that it's discovered 
everybody just uses the same value for, or discovering after release that 
there's a use-case for that setting and by hard-coding it you've made the 
DBAs for a class of applications you didn't think about miserable?  In the 
cases where there's good evidence so far of the right setting, just make 
that the default, and the only harm is GUC bloat.


Nothing should be done that changes the existing behavior if the LDC 
feature is turned off, so anything more obtrusive to the background writer 
is right out.  Make reducing the knobs, optimizing the default behavior, 
and rewriting the background writer to better fit into its new context a 
major goal of 8.4.  I know I've got a whole notebook full of stuff on that 
topic I've been ignoring as not to distract you guys from getting 8.3 
done.


That's the low risk plan, and the design/beta/release period here is short 
enough that I think going too experimental beyond that is a bad idea.  To 
pick an example, when I read this idea from Heikki:


You would have a checkpoint running practically all the time, and you 
would use checkpoint_timeout/checkpoint_segments to control how long it 
takes... If we do that, we should remove bgwriter_all_* settings


Whether or not I think this is an awesome idea, the very idea of a change 
that big at this point gives me the willies.  Just off the top of my head, 
there's a whole class of issues involving recycling xlog segments this 
would introduce I would be really unhappy with the implications of.  Did 
anyone else ever notice that when a new xlog segment is created, the write 
to clear it out doesn't happen via direct I/O like the rest of the xlog 
writes do?  That write goes through the regular buffered path instead. 
The checkpoint logging patch I submitted logs when this happens 
specifically because that particular issue messed with some operations and 
I found it important to be aware of.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Greg Smith

On Fri, 22 Jun 2007, Tom Lane wrote:

Greg had worried about being able to turn this behavior off, so we'd 
still need at least a bool, and we might as well expose the fraction 
instead.  I agree with removing the non-LRU part of the bgwriter's write 
logic though


If you accept that being able to turn LDC off is important, people who 
aren't turning it on may still want the existing bgwriter_all_* settings 
so they can keep running things the way they are now.  It's certainly 
reasonable to skip that code path when doing things the LDC way.


True, you'd have to replay 1.5 checkpoint intervals on average instead 
of 0.5 (more or less, assuming checkpoints had been short).  I don't 
think we're in the business of optimizing crash recovery time though.


If you're not, I think you should be.  Keeping that replay interval time 
down was one of the reasons why the people I was working with were 
displeased with the implications of the very spread out style of some LDC 
tunings.  They were already unhappy with the implied recovery time of how 
high they had to set checkpoint_segments for good performance, and making 
it that much bigger aggravates the issue.  Given a knob where the LDC can 
be spread out a bit but not across the entire interval, that makes it 
easier to control how much expansion there is relative to the current 
behavior.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-23 Thread Greg Smith
This message is going to come off as kind of angry, and I hope you don't 
take that personally.  I'm very frustrated with this whole area right now 
but am unable to do anything to improve that situation.


On Fri, 22 Jun 2007, Tom Lane wrote:

If you've got specific evidence why any of these things need to be 
parameterized, let's see it.


All I'm trying to suggest here is that you might want to pause and 
consider whether you want to make a change that might break existing, 
happily working installations based just on the small number of tests that 
have been done on this patch so far.  A nice stack of DBT2 results is very 
informative, but the DBT2 workload is not everybody's workload.


Did you see anybody else predicting issues with the LDC patch on 
overloaded systems as are starting to be seen in the 150 warehouse/90% 
latency figures in Heikki's most recent results?  The way I remember that, 
it was just me pushing to expose that problem, because I knew it was there 
from my unfortunately private tests, but it was difficult to encounter the 
issue on other types of benchmarks (thanks again to Greg Stark and Heikki 
for helping with that).  But that's fine, if you want to blow off the rest 
of my suggestions now just because the other things I'm worried about are 
also very hard problem to expose and I can't hand you over a smoking gun, 
that's your decision.


Personally I think that we have a bad track record of exposing GUC 
variables as a substitute for understanding performance issues at the 
start, and this approach isn't doing any favors for DBAs.


I think this project has an awful track record of introducing new GUC 
variables and never having a plan to follow through with a process to 
figure out how they should be set.  The almost complete lack of 
standardization and useful tools for collecting performance information 
about this database boggles my mind, and you're never going to get the 
performance related sections of the GUC streamlined without it.


We were just talking about the mess that is effective_cache_size recently. 
As a more topical example here, the background writer was officially 
released in early 2005, with a bizarre collection of tunables.  I had to 
help hack on that code myself, over two years later, to even start 
exposing the internal statistics data needed to optimize it correctly. 
The main reason I can't prove some of my concerns is that I got so 
side-tracked adding the infrastructure needed to even show they exist that 
I wasn't able to nail down exactly what was going on well enough to 
generate a public test case before the project that exposed the issues 
wrapped up.


Right at the moment the best thing to do seems to be to enable LDC with 
a low minimum write rate and a high target duration, and remove the 
thereby-obsoleted "all buffers" scan of the existing bgwriter logic.


I have reason to believe there's a set of use cases where a more 
accelerated LDC approach than everyone seems to be leaning toward is 
appropriate, which would then reinvigorate the need for the all-scan BGW 
component.  I have a whole new design for the non-LRU background writer 
that fixes most of what's wrong with it I'm waiting for 8.4 to pass out 
and get feedback on, but if everybody is hell bent on just yanking the 
whole thing out in preference to these really lazy checkpoints go ahead 
and do what you want.  My life would be easier if I just tossed all that 
out and forgot about the whole thing, and I'm real close to doing just 
that right now.



Did anyone else ever notice that when a new xlog segment is created,
the write to clear it out doesn't happen via direct I/O like the rest
of the xlog writes do?

It's not supposed to matter, because that path isn't supposed to be
taken often.


Yes, but during the situation it does happen in--when checkpoints take so 
much longer than expected that more segments have to be created, or in an 
archive logger failure--it badly impacts an already unpleasant situation.



there's a whole class of issues involving recycling xlog segments this
would introduce I would be really unhappy with the implications of.

Really?  Name one.


You already mentioned expansion of the log segments used which is a 
primary issue.  Acting like all the additional segments used for some of 
the more extreme checkpoint spreading approaches are without cost is 
completely unrealistic IMHO.  In the situation I just described above, I 
also noticed the way O_DIRECT sync writes get mixed with buffered WAL 
writes seems to cause some weird I/O scheduling issues in Linux that can 
make worst-case latency degrade.  But since I can't prove that, I guess I 
might as well not even mention that either.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Greg Smith

On Sun, 24 Jun 2007, Simon Riggs wrote:


I can't see why anyone would want to turn off smoothing: If they are
doing many writes, then they will be effected by the sharp dive at
checkpoint, which happens *every* checkpoint.


There are service-level agreement situations where a short and sharp 
disruption is more acceptable than a more prolonged one.  As some of the 
overloaded I/O tests are starting to show, the LDC may be a backward step 
for someone in that sort of environment.


I am not a fan of introducing a replacement feature based on what I 
consider too limited testing, and I don't feel this one has been beaten on 
long enough yet to start pruning features that would allow better backward 
compatibility/transitioning.  I think that's introducing an unnecessary 
risk to the design.


We won't need to set checkpoint_segments so high, since performance is 
smoothed across checkpoints by LDC and its OK to allow them more 
frequently. So this concern need not apply with LDC.


Performance *should* be smoothed across checkpoints by LDC and my 
concern *may* not apply.  I think assuming it will always help based on 
the limited number of test results presented so far is extrapolation.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Greg Smith

On Mon, 25 Jun 2007, Tom Lane wrote:

I'm not sure why you hold such strong allegiance to the status quo.  We 
know that the status quo isn't working very well.


Don't get me wrong here; I am a big fan of this patch, think it's an 
important step forward, and it's exactly the fact that I'm so shell 
shocked from abuse by the status quo that I'm still mixed up in this mess 
(I really should be ignoring the lot of you and writing new code instead).


LDC certainly makes things better in almost every case.  My "allegiance" 
comes from having seen a class of transactions where LDC made things worse 
on a fast/overloaded system, in that it made some types of service 
guarantees harder to meet, and I just don't know who else might run into 
problems in that area.  I'm worried that if it's not adjustable, you're 
introducing a risk that you'll take a step backward for some of this 
code's users, and that will be hard to undo given the way releases are 
structured here.


I spent some time trading stocks for a living.  There are sometimes 
situations you can get into there where there is a tiny chance that 
something very bad can happen with a trade, and many people get wiped out 
by such things.  If it's possible in that situation to remove that risk 
with something inexpensive, you do it, even though the net expected value 
of the change might be slightly negative.  This seems like such a 
situation to me.  If it's possible to take away the risk of other people 
running into an unforseen problem with the LDC patch just by keeping a 
knob that's already there, unless that's an expensive operation my opinion 
is that you should pick a good default but not remove it yet.


And if you think that the current code had enormous amounts of testing 
before it went in, I've got to disillusion you :-(


It's having been on the painful receiving end of that fact that makes me 
so paranoid now :)


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Greg Smith

On Sun, 24 Jun 2007, Simon Riggs wrote:

Greg can't choose to use checkpoint_segments as the limit and then 
complain about unbounded recovery time, because that was clearly a 
conscious choice.


I'm complaining only because everyone seems content to wander in a 
direction where the multiplier on checkpoint_segments for how many 
segments are actually active at once will go up considerably, which can 
make a known problem (recovery time) worse.  I'm thinking about things 
like how the release notes for 8.3 are going to address this.  Is the plan 
to say something like "whatever you set checkpoint_segments to before so 
you were happy with the crash recovery time, make sure to re-run all those 
tests again because we made a big change there, it may take a lot longer 
now with the same value, and too bad if you don't like it because there's 
no way to get the old behavior back.  Suck it up, decrease 
checkpoint_segments, and get used to having more checkpoints if you have 
to keep the same recovery time".


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Greg Smith

On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

Greg, is this the kind of workload you're having, or is there some other 
scenario you're worried about?


The way transitions between completely idle and all-out bursts happen were 
one problematic area I struggled with.  Since the LRU point doesn't move 
during the idle parts, and the lingering buffers have a usage_count>0, the 
LRU scan won't touch them; the only way to clear out a bunch of dirty 
buffers leftover from the last burst is with the all-scan.  Ideally, you 
want those to write during idle periods so you're completely clean when 
the next burst comes.  My plan for the code I wanted to put into 8.4 one 
day was to have something like the current all-scan that defers to the LRU 
and checkpoint, such that if neither of them are doing anything it would 
go searching for buffers it might blow out.  Because the all-scan mainly 
gets in the way under heavy load right now I've only found mild settings 
helpful, but if it had a bit more information about what else was going on 
it could run much harder during slow spots.  That's sort of the next stage 
to the auto-tuning LRU writer code in the grand design floating through my 
head.


As a general comment on this subject, a lot of the work in LDC presumes 
you have an accurate notion of how close the next checkpoint is.  On 
systems that can dirty buffers and write WAL really fast, I've found hyper 
bursty workloads are a challenge for it to cope with.  You can go from 
thinking you have all sorts of time to stream the data out to discovering 
the next checkpoint is coming up fast in only seconds.  In that situation, 
you'd have been better off had you been writing faster during the period 
preceding the burst when the code thought it should be "smooth"[1]. 
That falls into the category of things I haven't found a good way for 
other people to test (I happened to have an internal bursty app that 
aggravated this area to use).


[1] This is actually a reference to "Yacht Rock", one of my favorite web 
sites:  http://www.channel101.com/shows/show.php?show_id=152


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Greg Smith

On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the 
hand isn't moving, it keeps scanning the same buffers over and over again. 
You can crank it all the way up to 100%, though, in which case it would work, 
but that starts to get expensive CPU-wise.


In addition to being a CPU pig, that still won't necessarily work because 
of the way the LRU writer ignores things with a non-zero usage_count.  If 
it's dirty, it's probably been used recently as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Greg Smith

On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

Please describe the class of transactions and the service guarantees so 
that we can reproduce that, and figure out what's the best solution.


I'm confident you're already moving in that direction by noticing how the 
90th percentile numbers were kind of weird with your 150 warehouse DBT2 
tests, and I already mentioned how that could be usefully fleshed out by 
more tests during beta.  That number is the kind of service guarantee I'm 
talking about--if before 90% of transactions were <4.5ms, but now that 
number is closer to 6ms, that could be considered worse performance by 
some service metrics even if the average and worst-case performance were 
improved.


The only thing I can think of if you wanted to make the problem more like 
what I was seeing would be switching the transaction mix on that around to 
do more UPDATEs relative to the other types of transactions; having more 
of those seemed to aggravate my LDC-related issues because they leave a 
different pattern of dirty+used buffers around than other operations.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Greg Smith

On Mon, 25 Jun 2007, Tom Lane wrote:

right now, BgBufferSync starts over from the current clock-sweep point 
on each call --- that is, each bgwriter cycle.  So it can't really be 
made to write very many buffers without excessive CPU work.  Maybe we 
should redefine it to have some static state carried across bgwriter 
cycles


The LRU portion restarts like that, and it's certainly not optimal.  But 
the auto-tuning LRU patch that's already near application makes this much 
less of an issue because it only does real work when buffers have been 
allocated, so the sweep point will have moved along.  I'll add this idea 
to my list of things that would be nice to have as part of a larger 
rewrite; I think it's more trouble than it's worth to chase right now.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Greg Smith

On Tue, 26 Jun 2007, Tom Lane wrote:


I have no doubt that there are scenarios such as you are thinking about,
but it definitely seems like a corner case that doesn't justify keeping
the all-buffers scan.  That scan is costing us extra I/O in ordinary
non-corner cases, so it's not free to keep it.


And scenarios I'm concerned about but can't diagram as easily fall into 
this category as well.  I agree that an LDC-enabled config would ship with 
the all-buffers scan turned off as redundant and wasteful, in which case the 
only cost to keep it is code baggage.  But the fact that there are corner 
cases floating around this area is what makes me feel that removing it 
altogether is still a bit premature.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, final patch

2007-06-26 Thread Greg Smith

On Tue, 26 Jun 2007, Gregory Stark wrote:


What exactly happens if a checkpoint takes so long that the next checkpoint
starts. Aside from it not actually helping is there much reason to avoid this
situation? Have we ever actually tested it?


More segments get created, and because of how they are cleared at the 
beginning this causes its own mini-I/O storm through the same buffered 
write channel the checkpoint writes are going into (which way or may not 
be the same way normal WAL writes go, depending on whether you're using 
O_[D]SYNC WAL writes).  I've seen some weird and intermittant breakdowns 
from the contention that occurs when this happens, and it's certainly 
something to be avoided.


To test it you could just use a big buffer cache, write like mad to it, 
and make checkpoint_segments smaller than it should be for that workload. 
It's easy enough to kill yourself exactly this way right now though, and 
the fact that LDC gives you a parameter to aim this particular foot-gun 
more precisely isn't a big deal IMHO as long as the documentation is 
clear.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Greg Smith

On Tue, 26 Jun 2007, Tom Lane wrote:

I'm not impressed with the idea of writing buffers because we might need 
them someday; that just costs extra I/O due to re-dirtying in too many 
scenarios.


This is kind of an interesting statement to me because it really 
highlights the difference in how I'm thinking about this problem from how 
you see it.  As far as I'm concerned, there's a hierarchy of I/O the 
database needs to finish that goes like this:


1) Client back-end writes (blocked until a buffer appears)
2) LRU writes so (1) doesn't happen
3) Checkpoint writes
4) Dirty pages with a non-zero usage count

In my view of the world, there should be one parameter for a target rate 
of how much I/O you can stand under normal use, and the background writer 
should work its way as far down this chain as it can until it meets that. 
If there's plenty of clean buffers for the expected new allocations and 
there's no checkpoint going on, by all means write out some buffers we 
might re-dirty if there's I/O to spare.  If you write them twice, so what? 
You didn't even get to that point as an option until all the important 
stuff was taken care of and the system was near idle.


The elimination of the all-scan background writer means that true hot and 
dirty spots in the buffer cache, like popular index blocks on a heavily 
updated table that never get a zero usage_count, are never going to be 
written out other than as part of the checkpoint process.  That's OK for 
now, but I'd like it to be the case that one day the database's I/O 
scheduling would eventually get to those, in order to optimize performance 
in the kind of bursty scenarios I've been mentioning lately.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, final patch

2007-06-26 Thread Greg Smith

On Tue, 26 Jun 2007, Heikki Linnakangas wrote:

I'm scheduling more DBT-2 tests at a high # of warehouses per Greg Smith's 
suggestion just to see what happens, but I doubt that will change my mind on 
the above decisions.


I don't either; at worst I'd expect a small documentation update, perhaps 
with some warnings based on what's discovered there.  The form you've 
added checkpoint_completion_target in is sufficient to address all the 
serious concerns I had; I can turn it off, I can smooth just a bit without 
increasing recovery time too much, or I can go all out smooth.


Certainly no one should consider waiting for the tests I asked you about a 
hurdle to getting this patch committed; slowing that down was never my 
intention by bringing that up.  I'm just curious to see if anything 
scurries out of some the darker corners in this area when they're 
illuminated.  I'd actually like to see this get committed relatively soon 
because there's two interleaved merges stuck behind this one (the more 
verbose logging patch and the LRU modifications).


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, final patch

2007-06-28 Thread Greg Smith

On Wed, 27 Jun 2007, Tom Lane wrote:


Also, the question of redesigning the bgwriter's LRU scan is
still open.  I believe that's on Greg's plate, too.


Greg's plate was temporarily fried after his house was hit by lightning 
yesterday.  I just got everything back on-line again, so no coding 
progress, but I think I finished assimilating your "epiphany" during that 
time.  Now I realize that what you're suggesting is that under healthy 
low-load conditions, the LRU really should be able to keep up right behind 
the clock sweep point.  Noting how far behind it is serves as a 
measurement of it failing to match the rate buffers that could be re-used 
are being dirtied, and the only question is how fast and far it should try 
to drive the point it has cleaned to forward when that happens.


Once you've built up enough XLOG segments, the system isn't too bad 
about recycling them, but there will be a nasty startup transient where 
foreground processes have to stop and make the things.


Exactly.  I found it problematic in four situations:

1) Slow checkpoint doesn't finish in time and new segments are being 
created while the checkpoint is also busy (this is the really bad one)


2) Archive logger stops doing anything (usually because the archive disk is 
filled) and nothing gets recycled until that's fixed.


3) checkpoint_segments is changed, so then performance is really sluggish 
for a bit until all the segments are built back up again


4) You ran an early manual checkpoint which doesn't seem to recycle as 
many segments usefully


Any change that would be more proactive about creating segments in these 
situations than the current code would be a benefit, even though these are 
not common paths people encounter.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Checkpoint logging, revised patch

2007-06-29 Thread Greg Smith

On Fri, 29 Jun 2007, Heikki Linnakangas wrote:

LOG:  checkpoint complete; buffers written=5869 (35.8%); write=2.081 s, 
sync=4.851 s, total=7.066 s


My original patch converted the buffers written to MB. Easier to estimate 
MB/s by eye; I really came to hate multiplying by 8K. And people who have 
multiple boxes with different BLCKSZ could merge their logs together and 
not have to worry about adjusting for that.


I found this simpler to present the results to non-PostgreSQL people 
without having to explain the buffer size.  It's easy for any IT person to 
follow; the idea I was working toward was "see, this log entry shows the 
long pauses are because it's writing 700MB of data all at once here, right 
next to that statement that took 10 seconds we found with 
log_min_duration_statement".


The current equivalent of what I had would be 
CheckpointStats.ckpt_bufs_written * BLCKSZ / (1024*1024) formatted as 
"%.1f MB"


One thing that's missing, that I originally hoped to achieve with this, 
is logging the cause of a checkpoint.


The way pg_stat_bgwriter keeps two separate counts for requested (which 
includes timeouts) and required (out of segments) checkpoints now 
satisfies the main reason I originally collected that info.  I felt it was 
critical when I wrote that patch for it to be possible to distinguish 
between which of the two was happening more, it's only in the nice to have 
category now.


Now, onto some slightly different things here, which were all aimed at 
developer-level troubleshooting.  With the nice reworking for logging 
checkpoints better, there were three tiny things my original patch did 
that got lost along the way that I'd suggest Tom might want to consider 
putting back during the final apply.  I'll include mini pseudo-diffs here 
for those so no one has to find them in my original patch, they're all 
one-liners:


1) Log every time a new WAL file was created, which ties into the recent 
discussion here that being a possible performance issue.  At least you can 
look for it happening this way:


src/backend/access/transam/xlog.c
--- 1856,1863 ----
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", tmppath)));

+	ereport(DEBUG2, (errmsg("WAL creating and filling new file on disk")));

	/*
	 * Zero-fill the file.  We have to do this the hard way to ensure that all


2) Add a lower-level DEBUG statement when autovacuum is finished, which 
helped me in several cases figure out if that was the cause of a problem 
(when really doing low-level testing, I would see a vacuum start, not know 
if it was done, and then wonder if that was the cause of a slow 
statement):


*** src/backend/postmaster/autovacuum.c
--- 811,814 ----
		do_autovacuum();
+		ereport(DEBUG2,
+				(errmsg("autovacuum: processing database \"%s\" complete", dbname)));
	}

3) I fixed a line in postmaster.c so it formatted fork PIDs the same way 
most other log statements do; most statements report it as (PID %d) and 
the difference in this form seemed undesirable (I spent a lot of time at 
DEBUG2 and these little things started to bug me):


*** src/backend/postmaster/postmaster.c
*** 2630,2636 
		/* in parent, successful fork */
		ereport(DEBUG2,
!				(errmsg_internal("forked new backend, pid=%d socket=%d",
								 (int) pid, port->sock)));

--- 2630,2636 
		/* in parent, successful fork */
		ereport(DEBUG2,
!				(errmsg_internal("forked new backend (PID %d) socket=%d",
								 (int) pid, port->sock)));


Little stuff, but all things I've found valuable on several occasions, 
which suggests eventually someone else may appreciate them as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Checkpoint logging, revised patch

2007-06-30 Thread Greg Smith

On Sat, 30 Jun 2007, Heikki Linnakangas wrote:

I don't think we currently use MB in any other log messages. If we go down 
that route, we need to consider switching to MB everywhere.


Are there a lot of messages that speak in terms of buffer pages though? 
Other than the debug-level stuff, which this checkpoint message has now 
been elevated out of, I don't recall seeing enough things that spoke in 
those terms to consider it a standard.  Someone feel free to correct me if 
I'm wrong here.



Did you check out log_autovacuum? Doesn't it do what you need?


I have not (most of this patch was done against 8.2 and then ported 
forward).  If that already shows clearly the start and end of each 
autovacuum section, ignore that I even brought this up.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



[PATCHES] Tracking buffer allocation stats (new LRU method base code)

2007-07-02 Thread Greg Smith
Now that all the other checkpoint-related patches seem to have settled 
down, I've been busy working my way through refreshing and fixing merge 
issues in what started as Itagaki Takahiro's auto bgwriter_lru_maxpages 
patch.  I split that into two patches before, and have now finished what 
revising and testing I can do on the first of those.


The attached patch adds counters for the number of buffers allocated and 
the number written out by backends, and exposes all of that via 
pg_stat_bgwriter.  Here's a sample from after a period of benchmarking; 
the buffers_backend and buffers_alloc are the two new things here:


 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean
-------------------+-----------------+--------------------+---------------
                15 |             274 |            7616495 |       6737430

 buffers_backend | buffers_alloc | maxwritten_clean
-----------------+---------------+------------------
        35398659 |      24389383 |            91631

To show how helpful this is, what I've been doing with all this is taking 
snapshots of the structure at the beginning and end of an individual 
benchmark test, then producing the delta to see the balance of who wrote 
what during the test (a sketch of that delta bookkeeping follows the table 
below).  Here's an example of a test with 200,000 executions of the UPDATE 
statement from the pgbench test.  The first run had no background 
writer, while the second had a moderately active one, and you can see that 
the counting all works as expected--and that you can learn quite a bit 
about how effective the background writer cleaner was from these numbers 
(obviously I had checkpoint_segments set high so there weren't any 
checkpoints during the test):


 clients | tps  | chkpts | buf_check | buf_clean | buf_backend | buf_alloc
---------+------+--------+-----------+-----------+-------------+-----------
       1 | 1487 |      0 |         0 |         0 |       70934 |     85859
       1 | 1414 |      0 |         0 |     39005 |       38542 |    100963
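
The delta step itself is nothing fancier than subtracting two snapshots; 
here's a minimal sketch of that bookkeeping, using invented field names 
rather than the real pg_stat_bgwriter columns and the numbers from the 
second row above:

#include <stdio.h>

/* illustrative snapshot of the counters being compared */
typedef struct BgSnapshot
{
	long		buffers_checkpoint;
	long		buffers_clean;
	long		buffers_backend;
	long		buffers_alloc;
} BgSnapshot;

static void
print_delta(const BgSnapshot *before, const BgSnapshot *after)
{
	printf("buf_check=%ld buf_clean=%ld buf_backend=%ld buf_alloc=%ld\n",
		   after->buffers_checkpoint - before->buffers_checkpoint,
		   after->buffers_clean - before->buffers_clean,
		   after->buffers_backend - before->buffers_backend,
		   after->buffers_alloc - before->buffers_alloc);
}

int
main(void)
{
	BgSnapshot	start = {0, 0, 0, 0};
	BgSnapshot	end = {0, 39005, 38542, 100963};

	print_delta(&start, &end);
	return 0;
}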

This patch was last submitted here:
http://archives.postgresql.org/pgsql-patches/2007-05/msg00142.php

At that time, Itagaki and Heikki seemed to approve the basic direction I'd 
gone and how I'd split the original code into monitoring and functional 
pieces.  The differences between that version of the patch and the 
attached one are that I fixed the race-condition bug and terminology issue 
Heikki noticed, along with merging to current HEAD.


Rather than wait until I'd finished testing the next layer on top of this 
(retuning the automatic was-LRU-now-cleaner code with Tom's latest insight 
on that topic) I figured I might as well send this part now.  So far this 
has been independent of the code that builds on it, and I'm done with this 
section.  I think it will take a serious look by someone who might commit 
it to make any more progress and I want to keep those queues moving.


As for issues in this code I'm still concerned about after reviewing it, 
most of them come from my not having completely internalized the engine's 
forking model yet:


1) I'm not sure the way am_bg_writer was changed here is kosher.

2) The way the buffers are counted in the freelist code and sent back to 
the background writer feels like a bit of a hack to me.


3) The recent strategy changes in freelist.c left me unsure how to count 
some of what it does; I marked the section I'm concerned about with an XXX 
comment.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

buf-alloc-stats-2.patch
Description: Binary data



Re: [PATCHES] Maintaining cluster order on insert

2007-07-10 Thread Greg Smith

On Mon, 9 Jul 2007, Tom Lane wrote:

The hardware is just a Dell x86_64 workstation with crappy IDE disk, so 
maybe things would look better elsewhere, but it's all I have to work 
with.


Do you have write-caching turned off on the drive so INSERTs are being 
rate-limited by WAL syncs?  Trying to characterize the class of setup 
you're using.


The part that seemed curious to me about your results in the unpatched 
version is why the first INSERT takes 2.4 seconds, while the second takes 
12.2 and later ones settle in at 17-23 s.  I could understand the 12-23 
variation, but that the first one fires off in 2.4 seems kind of fishy; 
why so fast?  Is there something that just fits in memory in that 
case but doesn't in the patched version?


results-head:
...
executed DELETE in  14.770937 sec
executed VACUUM in  10.663301 sec
executed INSERT in   2.449248 sec (1st)
...
executed INSERT in  12.212027 sec (2nd)


results-patch:
...
executed DELETE in  18.062664 sec
executed VACUUM in  28.487570 sec
executed INSERT in  25.638022 sec (1st)
...
executed INSERT in  40.759404 sec (2nd)

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Maintaining cluster order on insert

2007-07-10 Thread Greg Smith

On Tue, 10 Jul 2007, Tom Lane wrote:


It's just desktop-grade junk :-(.  Feel free to repeat the test on
something more serious.


Right, but even such junk can be set up such that the disks honor commits; 
I just wanted to confirm you didn't go out of your way to do that--sounds 
like you didn't.  My mini-lab at home has two PG test systems, one with a 
real write cache, the other has an IDE drive with the cache turned off so 
commits actually wait to hit the platter.  The painful slowness of writes 
in that situation keeps me grounded as to how much data integrity actually 
costs if you're not accelerating it, and the differences in benchmarks 
between the two systems (which are otherwise a similar class of hardware) 
can be informative.


Anyway the numbers seemed relatively consistent after the setup and 
first test cycle, so I think the part I was actually trying to draw 
conclusions from was probably real enough.


Agreed, just figuring out the test ramp-up situation, and your explanation 
for that quirk sounds reasonable.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] HOT latest patch - version 8

2007-07-15 Thread Greg Smith

On Sun, 15 Jul 2007, Simon Riggs wrote:

Our main test case for OLTP is DBT-2 which follows TPC-C in being 
perfectly scalable with no hot spots in the heap and limited hot spots 
in the indexes. As such it's a poor test of real world applications, 
where Benfold's Law holds true.


I assume this is a typo for Benford's Law: 
http://en.wikipedia.org/wiki/Benford's_law which notes there are far more 
ones in real-world data sets.


If there were a Benfold's Law, it would surely involve the number 5 
instead.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Async Commit, v21 (now: v22)

2007-07-24 Thread Greg Smith

On Tue, 24 Jul 2007, Gregory Stark wrote:

Do we really want the walwriter doing the majority of the wal-flushing 
work for normal commits? It seems like that's not going to be any 
advantage over just having some random backend do the commit.


Might there be some advantage in high-throughput/multi-[cpu|core] 
situations due to an improved ability to keep that code on a single 
processor?  CPU cache issues are turning into scalability bottlenecks in 
so many designs I've come across lately.  A distinct walwriter might be 
more likely to be (or even be explicitly) bound to a processor and stay 
there than when the code executes on any random backend.  The obvious flip 
side is that moving data between processors more often may balance out 
or even negate any potential improvement there.


More on the system tuning side, I've found that having the background 
writer as a separate process is enormously helpful because of how it 
allows monitoring its activity level relative to the whole.  There are 
enough WAL-bound systems out there (particularly when there's no caching 
disk controller) that I could see similar value in being able to watch a 
distinct walwriter.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] HOT patch - version 15

2007-09-06 Thread Greg Smith

On Thu, 6 Sep 2007, Heikki Linnakangas wrote:


I wonder if pruning in bgwriter only is enough to make HOT work
efficiently. On a very frequently updated page, bgwriter will have to
work hard to keep up.


One of the goals for how I rebuilt the just-in-time BGW was to try and 
make smaller values for bgwriter_delay feasible.  None of the tunables 
*should* have to be adjusted if you need to run the bgwriter much more 
often to support some new HOT-related activity in there as well; this is 
actually the next area I wanted to do a bit more testing on myself.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] pgbench - startup delay

2007-12-10 Thread Greg Smith

On Mon, 10 Dec 2007, Tom Lane wrote:


I concur with Alvaro that this case seems adequately covered by
PGOPTIONS="-W n" pgbench ...


I started to disagree with this, but ultimately realized anyone who is 
running pgbench for long enough to get useful results shouldn't have their 
TPS impacted much at all by a few overhead seconds tacked onto the server 
startup.


I once wrote a similar patch to the one Dave submitted here and feel like 
it's worth committing at least a documentation patch to show how to deal 
with this.  It's not obvious that pgbench pays attention to the 
environment variables at all, and it's even less obvious that you can pass 
what look like server options in there.  I just poked around the 
documentation a bit and I didn't find anything that cleared up which 
options you can pass from a client; in addition to -W, I can imagine 
pgbench users might also want to use -S (sort memory) or -f (forbid 
scan/join types).  If I can get someone to clarify what is supported there 
I can put together a pgbench doc patch that addresses this topic.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] pgbench - startup delay

2007-12-10 Thread Greg Smith

On Mon, 10 Dec 2007, Neil Conway wrote:


Perhaps one of the slightly unfortunate consequences of the postmaster
=> postgres merge is that there is less of a clear distinction between
"postmaster options" and "postgres" options...


I'd already read all of the documentation that you and Tom suggested just 
before I sent my previous message, and I didn't find this subject clear at 
all.


On Mon, 10 Dec 2007, Tom Lane wrote:


It's not pgbench that is paying attention to this, it's libpq.


Right, but I wouldn't expect a typical pgbench user to know that.

Anything you'd be allowed to SET can be set from PGOPTIONS (-c or --var 
syntax...the restrictions are the same as for the underlying variable.


That clarifies the situation well enough for me.  I think this is a two 
part problem then.  It's not necessarily obvious that pgbench will use 
PGOPTIONS.  In addition to that, the current documentation is less clear 
than it could be on the subject of what you can usefully put into 
PGOPTIONS.  That's two small documentation patches I should be able to 
handle.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] [HACKERS] SSL over Unix-domain sockets

2008-01-18 Thread Greg Smith

On Thu, 17 Jan 2008, Tom Lane wrote:


BTW, is a symlink's atime changed by accessing it?


It seems so in the cases I've tried or researched, but it's complicated. 
After burning through a bunch of time looking into this I wanted to drop 
some notes so nobody else has to wander down the specific dead-ends I just 
followed.


I figured I'd just run some experiments to figure this out for my Linux 
system, but that didn't go so well.  The process of running anything that 
shows the atime:


  ls -l --time=atime 
  ls -lu 
  stat 

actually updates the atime to right now along the way.  I hacked up 
something with perl that directly calls lstat() and it did the same thing.
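
The perl hack was essentially the following C, included here as a sketch in 
case anyone wants to repeat the experiment on their own filesystem (pass it 
the symlink's path):

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int
main(int argc, char **argv)
{
	struct stat st;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s symlink\n", argv[0]);
		return 1;
	}
	if (lstat(argv[1], &st) != 0)
	{
		perror("lstat");
		return 1;
	}
	/* on the setups described above, even this lookup bumps the atime */
	printf("atime of %s: %s", argv[1], ctime(&st.st_atime));
	return 0;
}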


Mystified, I found this thread suggesting the same thing is true on Mac OS 
X:  http://lists.apple.com/archives/darwin-kernel/2006/Dec/msg00054.html


The point made in there is that how symlinks are encoded varies not just 
from OS to OS but from filesystem to filesystem, and that encoding changes 
how things like atime work.  On Linux with ext2, I found this note:


"Symbolic links are also filesystem objects with inodes.  They deserve 
special mention because the data for them is stored within the inode 
itself if the symlink is less than 60 bytes long.  It uses the fields 
which would normally be used to store the pointers to data blocks."


So what I think is happening is:  the process of doing anything at all 
with a Linux symlink references the inode that has the link.  That updates 
the atime on that inode.  But since there's no actual data underneath that 
lookup in cases where the link is less than 60 bytes, the inode atime is 
the link atime, so that just updated the link's atime to right now as 
well.  I have no idea how any tmp cleaner could ever find a short symlink 
it can delete if I'm understanding this correctly.


I left behind the link I was just playing with and I'll see if I can get 
tmpwatch to eat it tomorrow; that seems like the most appropriate test 
here.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Proposing correction to posix_fadvise() usage in xlog.c

2008-03-01 Thread Greg Smith

On Fri, 29 Feb 2008, Mark Wong wrote:

Basically posix_fadvise() is being called right before the WAL segment 
file is closed, which effectively doesn't do anything as opposed to when 
the file is opened.  This proposed correction calls posix_fadvise() in 
three locations...


Actually, posix_fadvise is called nowhere; the one place it's referenced 
is #ifdef'd out.  There's a comment in the code just above there as to 
why:  posix_fadvise only works on a limited number of platforms, and as 
far as I know nobody has ever put the time into figuring out where it's 
safe or not safe so that configure can be taught that.  I think this may 
be a dead item because in most places where posix_fadvise works correctly, 
you can use O_SYNC and get O_DIRECT right now to do the same thing.
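
For reference, the call being discussed looks like the sketch below; the 
file name is just a placeholder, and whether the hint accomplishes anything 
is exactly the platform-dependent part that configure would need to learn 
about:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	int			fd = open("/tmp/fake-wal-segment", O_RDWR | O_CREAT, 0600);
	int			rc;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	/* ... the segment would get written and fsync'd here ... */

	/* hint that we don't expect to re-read these pages any time soon */
	rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

	close(fd);
	return 0;
}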


If memory serves, one of the wins here is suppose to be that in a 
scenario where we are not expecting to re-read writes to the WAL we also 
do not want the writes to disk to flush out other data from the 
operating system disk cache.


Right, but you can get that already with O_SYNC on platforms where 
O_DIRECT is supported.


There's a related TODO here which is to use directio(3C) on Solaris, which 
Jignesh reports is needed instead of O_DIRECT to get the same behavior on 
that platform.


I am guessing that SATA drives have write cache enabled by default so it 
seems to make sense that using POSIX_FADV_DONTNEED will cause writes to 
be slower by writing through the disk cache.


I've never heard of a SATA drive that had its write cache disabled by 
default.  They're always on unless you force them off, and even then they 
can turn themselves back on again if there's a device reset and you didn't 
change the drive's default using the manufacturer's adjustment utility.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] CopyReadLineText optimization

2008-03-06 Thread Greg Smith

On Thu, 6 Mar 2008, Heikki Linnakangas wrote:

At the most conservative end, we could fall back to the current method 
on the first escape, quote or backslash character.


I would just count the number of escaped/quote characters on each line, 
and then at the end of the line switch modes between the current code and 
the new version based on what the previous line looked like.  That way the 
only additional overhead is a small bit when escapes show up often, 
plus a touch more just once per line.  Barely noticeable in the case where 
nothing is escaped, a very small regression for escape-heavy stuff, but 
certainly better than the drop you reported in the last rev of this patch.


Rev two of that design would keep a weighted moving average of the total 
number of escaped characters per line (say wma=(7*wma+current)/8) and 
switch modes based on that instead of just the previous line.  There's 
enough play in the crossover point between where the two approaches each 
work better that it should be easy to get a decent transition. 
Based on your data I would put the transition at wma>4, which should keep 
the old code in play even if only half the lines have the bad regression 
that shows up with >8 escapes per line.
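
In rough C, the mode switch I'm describing amounts to the sketch below; the 
threshold of 4 is just the guess above and the function names are invented 
for illustration:

#include <stdbool.h>
#include <stdio.h>

static double wma = 0.0;		/* weighted moving average of escapes per line */

/* fold in how many escape/quote characters the just-finished line contained */
static void
note_line(int escapes)
{
	wma = (7.0 * wma + escapes) / 8.0;
}

/* pick the parsing mode for the next line: fast path while escapes stay rare */
static bool
use_fast_path(void)
{
	return wma <= 4.0;
}

int
main(void)
{
	int			sample[] = {0, 0, 12, 10, 9, 11, 0, 0, 0, 0};
	int			i;

	for (i = 0; i < 10; i++)
	{
		note_line(sample[i]);
		printf("line %d: wma=%.2f fast path next=%d\n",
			   i + 1, wma, (int) use_fast_path());
	}
	return 0;
}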


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-03 Thread Greg Smith

On Thu, 3 Apr 2008, Robert Treat wrote:

You can plug a single item graphed over time into things like rrdtool to 
get good trending information. And it's often easier to do this using 
sql interfaces to get the data than pulling it out of log files (almost 
like the db was designed for that :-)


The pg_stat_bgwriter value for buffers_checkpoint was intentionally 
implemented in 8.3 such that it jumps in one big lump when the checkpoint 
is done.  While it's not the ideal interface for what you're looking for, 
the reason for that is to make it possible to build a "when was the last 
checkpoint finished?" interface via some remote monitoring tool just by 
determining the last time that the value jumped upwards.  You can easily 
see them just by graphing that value; it shouldn't be too hard to teach 
something with rrdtool guts to find them.
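
The detection logic such a tool needs is tiny.  A sketch, assuming you're 
already sampling buffers_checkpoint on some interval (the sample values 
here are just illustrative):

#include <stdio.h>
#include <time.h>

static long last_value = -1;
static time_t last_checkpoint_seen = 0;

/* call with each new sample of pg_stat_bgwriter.buffers_checkpoint */
static void
note_sample(long buffers_checkpoint, time_t sample_time)
{
	/* the counter only moves when a checkpoint finishes, so any jump marks one */
	if (last_value >= 0 && buffers_checkpoint > last_value)
		last_checkpoint_seen = sample_time;
	last_value = buffers_checkpoint;
}

int
main(void)
{
	long		samples[] = {7616495, 7616495, 7620000, 7620000};
	int			i;
	time_t		now = time(NULL);

	for (i = 0; i < 4; i++)
		note_sample(samples[i], now + i * 60);

	printf("last checkpoint finished at %s", ctime(&last_checkpoint_seen));
	return 0;
}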


Since checkpoints have a fairly predictable duration in 8.3, as long as 
you catch the start or end of them you can make a reasonable guess where 
the other side was.  The case you're trying to avoid here, the system 
going a long time without checkpointing, can be detected by looking for 
a begin or end regularly; you don't need to track both.  As long as 
there's a checkpoint finish "pulse" in buffers_checkpoint showing up 
regularly you're fine.  The only situation I can think of where this might 
be problematic is where the system has been idle enough to not have any 
buffers to write at checkpoint time, but I recall a code path there where 
checkpoints stop altogether unless there's been activity so even tracking 
the time may not change that.


Ultimately a lot of the other questions you might ask (e.g. "how many 
buffers have been written per hour by checkpoints?") require processing 
the numbers in this way anyway, and I thought this implementation was good 
enough to monitor the situation you're trying to avoid--presuming you're 
using some sort of moderately powerful remote monitoring tool.  Theo's 
patch would make it easier to answer with a simple command which has some 
value; a little SQL in a cron job would be good enough to trigger an alert 
rather than needing a real monitoring probe.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-03 Thread Greg Smith

On Thu, 3 Apr 2008, Joshua D. Drake wrote:


For knowing how long checkpoints are taking. If they are taking too
long you may need to adjust your bgwriter settings, and it is a
serious drag to parse postgresql logs for this info.


There's some disconnect here between what I think you want here and what 
Theo's patch does.


If I were running an 8.2 system, there's a whole lot of diagnostic 
information I'd like to have handy in order to optimize checkpoints, as 
you describe.  But this patch doesn't help 8.2 users, because the whole 
pg_stat_bgwriter structure is 8.3 only--it provides much of what is needed 
in this area.


In 8.3, there is very little background writer tuning to do.  What tuning 
you can do is easy to figure out from what's in pg_stat_bgwriter.  The new 
checkpoint_completion_target feature works better than tuning the 8.2 
background writer to reduce checkpoint spikes ever did anyway, and that's 
the main thing that drives how long checkpoints take to process now.


So in the only situation where this patch can be applied, 8.3, I don't 
think the scenario you describe is likely to pop up.  And the combination 
of two 8.3 features (log_checkpoints and the CSV logs; it's no coincidence 
I worked on both) should allow a good enough way to hack around this area 
when it does.  Just make the logs rotate fairly regularly, and every time 
they do import the last CSV section that's now finished into a table so 
you can compute statistics about how the checkpoints are doing there.  I 
don't see why that sort of stuff should go into core when it's now easy to 
do outside of it.  I have a whole stack of scripts in that area I plan to 
release over the summer.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-03 Thread Greg Smith

On Thu, 3 Apr 2008, Tom Lane wrote:

As of PG 8.3, the bgwriter tries very hard to make the elapsed time of a 
checkpoint be just about checkpoint_timeout * 
checkpoint_completion_target, regardless of load factors.


That's true in the cases where the timing of checkpoint writes is timeout driven. 
When they're checkpoint_segments driven the time is derived from that 
drumbeat instead.  And those cases are the more interesting ones, because 
those are the ones that will speed up during really busy periods.


I think it's completely relevant to track how often checkpoints are 
happening because they do still vary even in 8.3.  I'm just not sure if 
the current methods available for that really aren't good enough, or if 
it's just the case that not everyone is aware of all of them.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-03 Thread Greg Smith

On Thu, 3 Apr 2008, Tom Lane wrote:

"the system stopped checkpointing" does not strike me as a routine 
occurrence that we should be making provisions for DBAs to watch for. 
What, pray tell, is the DBA supposed to do when and if he notices that?


Schedule downtime rather than wait for it to happen unpredictably when the 
inevitable crash happens.



I'd much rather be spending our time and effort on understanding what
broke for you, and fixing the code so it doesn't happen again.


(Here I start laughing all over again as I recall Robert's talk, which we 
really need to get you in particular the video of)  Their situation had 
possible causes that included a bit flipping in bad memory, which is 
pretty hard to code around (unless you're a Core Wars veteran).  I'm 
familiar with that part of the checkpoint code path, and everything I was 
able to think of when hearing the outline of events was already considered 
and rejected as not being a likely cause.  This patch comes out of 
pragmatic acceptance that, sometimes, stuff will happen you can't easily 
explain, but that doesn't mean it's not worth keeping an eye on it anyway 
so it doesn't sneak up on you again.


Anyway, I think this whole thing would be better handled by a larger 
internals view that this whole codebase could use a dose of anyway.  What 
I really want is an interface like this:


psql> select pg_internals('last_checkpoint_time');

and then start sprinkling exports of those probe points in some popular 
places people would like to look at.


I will apologize now for suggesting this, followed by not having enough 
time to code it in the near future.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-03 Thread Greg Smith

On Fri, 4 Apr 2008, Tom Lane wrote:


(And you still didn't tell me what the actual failure case was.)


Database stops checkpointing.  WAL files pile up.  In the middle of 
backup, system finally dies, and when it starts recovery there's a bad 
record in the WAL files--which there are now thousands of to apply, and 
the bad one is 4 hours of replay in.  Believe it or not, it goes downhill 
from there.


It's what kicked off the first step that's the big mystery.  The only code 
path I thought of that can block checkpoints like this is when the 
archive_command isn't working anymore, and that wasn't being used.  Given 
some of the other corruption found later and the bad memory issues 
discovered, a bit flipping in the "do I need to checkpoint now?" code or 
data seems just as likely as any other explanation.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Expose checkpoint start/finish times into SQL.

2008-04-04 Thread Greg Smith

On Fri, 4 Apr 2008, Tom Lane wrote:


The actual advice I'd give to a DBA faced with such a case is to
kill -ABRT the bgwriter and send the stack trace to -hackers.


And that's a perfect example of where they're trying to get to.  They 
didn't notice the problem until after the crash.  The server didn't come 
back up (busy processing WALs) and that downtime was caught by a 
monitoring system.  At that point it was too late to collect debugging 
information on what was wrong inside the server processes that might have 
given a clue what happened.


If they'd have noticed it while the server was up, perhaps because the 
"last checkpoint" value hadn't changed in a long time (which seems like it 
might be available via stats even if, as you say, the background writer is 
out of its mind at that point), they could have done such a kill and 
collected some actual useful information here.  That's the theory at 
least.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Sorting writes during checkpoint

2008-04-15 Thread Greg Smith

On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:


2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)


What is the disk controller in this system?  I'm specifically curious 
about what write cache was involved, so I can get a better feel for the 
hardware your results came from.


I'm busy rebuilding my performance testing systems right now; once that's 
done I can review this on a few platforms.  One thing that jumped out at 
me just reading the code is this happening inside BufferSync:


buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));

If shared_buffers (=NBuffers) is set to something big, this could give some 
memory churn.  And I think it's a bad idea to allocate something this 
large at checkpoint time, because what happens if that fails?  Really not 
the time you want to discover there's no RAM left.


Since you're always going to need this much memory for the system to 
operate, and the current model has the system running a checkpoint >50% of 
the time, the only thing that makes sense to me is to allocate it at 
server start time once and be done with it.  That should improve 
performance over the original patch as well.


BufAndTag is a relatively small structure (5 ints).  Let's call it 40 
bytes; even that's only a 0.5% overhead relative to the shared buffer 
allocation.  If we can speed checkpoints significantly with that much 
overhead it sounds like a good tradeoff to me.
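
A sketch of the shape I'm suggesting, with the structure and names 
simplified/invented here rather than taken from the actual patch:

#include <stdio.h>
#include <stdlib.h>

typedef struct BufAndTag
{
	int			buf_id;
	int			spcNode;
	int			dbNode;
	int			relNode;
	int			blockNum;
} BufAndTag;					/* stand-in for the patch's struct: 5 ints */

static BufAndTag *buf_to_write = NULL;

/* do the allocation once at startup, when a failure is still cheap to handle */
static void
startup_alloc(int nbuffers)
{
	buf_to_write = malloc((size_t) nbuffers * sizeof(BufAndTag));
	if (buf_to_write == NULL)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);				/* the real code would ereport(ERROR, ...) */
	}
}

int
main(void)
{
	int			nbuffers = 16384;	/* e.g. shared_buffers worth of pages */

	startup_alloc(nbuffers);
	/* checkpoint-time code would then just reuse buf_to_write every cycle */
	free(buf_to_write);
	return 0;
}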


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Sorting writes during checkpoint

2008-04-16 Thread Greg Smith

On Wed, 16 Apr 2008, ITAGAKI Takahiro wrote:


Dirty region of database was probably larger than disk controller's cache.


Might be worthwhile to run with log_checkpoints on and collect some 
statistics there next time you're running these tests.  It's a good habit 
to get other testers into regardless; it's nice to be able to say 
something like "during the 15 checkpoints encountered during this test, 
the largest dirty area was 516MB while the median was 175MB".



Hmm, but I think we need to copy buffer tags into bgwriter's local memory
in order to avoid locking taga many times in the sorting. Is it better to
allocate sorting buffers at the first time and keep and reuse it from then on?


That's what I was thinking:  allocate the memory when the background writer 
starts and just always have it there; the allocation you're doing is 
always the same size.  If it's in use 50% of the time anyway (which it is 
if you have checkpoint_completion_target at its default), why introduce 
the risk that an allocation will fail at checkpoint time?  Just allocate 
it once and keep it around.



It is 0.25% of shared buffers; when shared_buffers is set to 10GB,
it takes 25MB of process local memory.


Your numbers are probably closer to correct.  I was being pessimistic 
about the size of all the integers just to demonstrate that it's not 
really a significant amount of memory even if they're large.


If we want to consume less memory for it, RelFileNode in BufferTag could 
be hashed and packed into an integer


I personally don't feel it's worth making the code any more complicated 
than it needs to be just to save a fraction of a percent of the total 
memory used by the buffer pool.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] configure option for XLOG_BLCKSZ

2008-05-02 Thread Greg Smith

On Fri, 2 May 2008, Tom Lane wrote:

The case for varying BLCKSZ is marginal already, and I've seen none at 
all for varying XLOG_BLCKSZ.


I recall someone on the performance list who felt it useful to increase 
XLOG_BLCKSZ to support a high-write environment with WAL shipping, just to 
make sending the files over the network more efficient.  Can't seem to 
find a reference in the archives though.


If you look at things like the giant Sun system tests, there was 
significant tuning getting all the block sizes to line up better with the 
underlying hardware.  I would not be surprised to discover that sort of 
install gains a bit from slinging WAL files around in larger chunks as 
well.  They're already using small values for commit_delay just to get the 
typical WAL write to be in larger blocks.


As PostgreSQL makes its way into higher throughput environments, it 
wouldn't surprise me to discover more of these situations where switching 
WAL segments every 16MB turns into a bottleneck.  Right now, it may only 
be a few people in the world, but saying "that's big enough" for an 
allocation of anything usually turns out wrong if you wait long enough.


One real concern I have with making this easier to adjust is that I'd hate 
to let people pick any old block size with the default wal_sync_method, 
only to have them later discover they can't turn on any direct I/O write 
method because they botched the alignment restrictions.


Another issue though is whether it makes sense for XLOG_BLCKSZ to be 
different from BLCKSZ at all, at least in the default case.  They are 
both the unit of I/O and it's not clear why you'd want different units.


There are lots of people who use completely different physical or logical 
disk setups for the WAL disk than the regular database.  That's going to 
get even more varied moving forward as SSD starts getting used more, since 
those devices have a very different set of block size optimization 
characteristics compared with traditional RAID setups.  They prefer 
smaller blocks to match the underlying flash better, and you don't pay as 
much of a penalty for writing that way because lining up with the spinning 
disk isn't important.  Someone who put one of DB/WAL on SSD and the other 
on traditional disk might end up with very different DB/WAL block sizes to 
match.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Sorting writes during checkpoint

2008-05-04 Thread Greg Smith

On Sun, 4 May 2008, Tom Lane wrote:


However, I am completely unable to measure any performance improvement
from it.  Given the possible risk of out-of-memory failures, I think the
patch should not be applied without some direct proof of performance
benefits, and I don't see any.


Fair enough.  There were some pgbench results attached to the original 
patch submission that gave me a good idea how to replicate the situation 
where there's some improvement.  I expect I can take a shot at quantifying 
that independently near the end of this month if nobody else gets to it 
before then (I'm stuck sorting out a number of OS-level issues right now 
before my testing system is online again).  I was planning to take a longer 
look at Greg Stark's prefetching work at that point as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Sorting writes during checkpoint

2008-05-04 Thread Greg Smith

On Sun, 4 May 2008, Tom Lane wrote:

Well, I tried a pgbench test similar to that one --- on smaller hardware 
than was reported, so it was a bit smaller test case, but it should have 
given similar results.


My pet theory on cases where sorting will help suggests you may need a 
write-caching controller for this patch to be useful.  I expect we'll see 
the biggest improvement in situations where the total amount of dirty 
buffers is larger than the write cache and the cache becomes blocked.  If 
you're not offloading to another device like that, the OS-level elevator 
sorting will handle sorting for you close enough to optimally that I doubt 
this will help much (and in fact may just get in the way).



Of course it's notoriously hard to get consistent numbers out of pgbench
anyway, so I'd rather see some other test case ...


I have some tools to run pgbench tests many times and look for patterns, 
which works fairly well for the consistency part.  pgbench will dirty a very 
high percentage of the buffer cache by checkpoint time relative to how 
much work it does, which makes it close to a best case for confirming 
there is a potential improvement here.


I think a reasonable approach is to continue trying to quantify some 
improvement using pgbench with an eye toward also doing DBT2 tests, which 
provoke similar behavior at checkpoint time.  I suspect someone who 
already has a known good DBT2 lab setup with caching controller hardware 
(EDB?) might be able to do a useful test of this patch without too much 
trouble on their part.


Unless someone can volunteer to test sooner, I think we should drop this 
item from the current commitfest queue.


This patch took a good step forward toward being committed this round with 
your review, which is the important part from my perspective (as someone 
who would like this to be committed if it truly works).  I expect that 
performance-related patches will often take more than one commitfest to 
pass through.


From the perspective of keeping the committer's plates clean, a reasonable 
system for this situation might be for you to bounce this into the 
rejected pile as "Returned for testing" immediately, to clearly remove it 
from the main queue.  A reasonable expectation there is that you might 
consider it again during May if someone gets back with said testing 
results before the 'fest ends.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Sorting writes during checkpoint

2008-05-04 Thread Greg Smith

On Mon, 5 May 2008, Tom Lane wrote:

It bothers me a bit that the patch forces writes to be done "all of file 
A in order, then all of file B in order, etc".  We don't know enough 
about the disk layout of the files to be sure that that's good. (This 
might also mean that whether there is a win is going to be platform and 
filesystem dependent ...)


I think most platform and filesystem implementations have disk location 
correlated enough with block order that this particular issue isn't a 
large one.  If the writes are mainly going to one logical area (a single 
partition or disk array), it should be a win as long as the sorting step 
itself isn't introducing a delay.  I am concerned that in a more 
complicated case than pgbench, say where the writes are spread across 
multiple arrays, forcing writes in order may slow things down.
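
For reference, the "all of file A, then all of file B" ordering comes from 
sorting on the file identity first and the block number second, along the 
lines of this sketch (field names invented here, not the patch's actual 
ones):

#include <stdio.h>
#include <stdlib.h>

typedef struct WriteTag
{
	int			rel;			/* stand-in for the relation/file identity */
	int			block;			/* block number within that file */
} WriteTag;

/* group every write for one file together, ordered by block within the file */
static int
tag_cmp(const void *a, const void *b)
{
	const WriteTag *x = a;
	const WriteTag *y = b;

	if (x->rel != y->rel)
		return (x->rel < y->rel) ? -1 : 1;
	if (x->block != y->block)
		return (x->block < y->block) ? -1 : 1;
	return 0;
}

int
main(void)
{
	WriteTag	tags[] = {{2, 10}, {1, 5}, {2, 3}, {1, 9}};
	int			i;

	qsort(tags, 4, sizeof(WriteTag), tag_cmp);
	for (i = 0; i < 4; i++)
		printf("file %d block %d\n", tags[i].rel, tags[i].block);
	return 0;
}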


Example:  let's say there's two tablespaces mapped to two arrays, A and B, 
that the data is being written to at checkpoint time.  In the current 
case, that I/O might be AABAABABBBAB, which is going to keep both arrays 
busy writing.  The sorted case would instead make that AABB so 
only one array will be active at a time.  It may very well be the case 
that the improvement from lowering seeks on the writes to A and B is less 
than the loss coming from not keeping both continuously busy.


I think I can simulate this by using a modified pgbench script that works 
against an accounts1 and accounts2 with equal frequency, where 1&2 are 
actually on different tablespaces on two disks.



Right, that's in the ground rules for commitfests: if the submitter can
respond to complaints before the fest is over, we'll reconsider the
patch.


The small optimization I was trying to suggest was that you just bounce 
this type of patch automatically to the "rejected for " section of the 
commitfest wiki page in cases like these.  The standard practice on this 
sort of queue is to automatically reclassify when someone has made a pass 
over the patch, leaving the original source to re-open with more 
information.  That keeps the unprocessed part of the queue always 
shrinking, and as long as people know that they can get it reconsidered by 
submitting new results it's not unfair to them.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] [GENERAL] pgbench not setting scale size correctly?

2008-05-07 Thread Greg Smith

On Wed, 7 May 2008, Bruce Momjian wrote:


Patch attached that issues a warning.


This doesn't take into account the -F case and the warning isn't quite 
right because of that as well.  When I get a break later today I'll create 
a new patch that handles that correctly; I see where it should go now that 
I look at this again.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] [GENERAL] pgbench not setting scale size correctly?

2008-05-09 Thread Greg Smith

On Wed, 7 May 2008, Bruce Momjian wrote:


Tom Lane wrote:

Greg Smith <[EMAIL PROTECTED]> writes:

The way the option parsing code is done would make complaining in the case
where your parameter is ignored a bit of a contortion.


Yeah.  But couldn't we have that part issue a warning if -s had been set
on the command line?


Patch attached that issues a warning.


Turns out it wasn't so contorted.  Updated patch attached that only warns 
in the exact cases where the setting is ignored, and the warning says how 
it's actually setting the scale.  I tested all the run types and it 
correctly complains only when warranted; samples:


$ ./pgbench -s 200 -i pgbench
creating tables...
1 tuples done. ...

$ ./pgbench -s 100 pgbench
Scale setting ignored by standard tests, using database branch count
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 200 ...

$ ./pgbench -s 100 -f select.sql pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 100 ...

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Index: contrib/pgbench/pgbench.c
===
RCS file: /home/gsmith/cvsrepo/pgsql/contrib/pgbench/pgbench.c,v
retrieving revision 1.79
diff -u -r1.79 pgbench.c
--- contrib/pgbench/pgbench.c   19 Mar 2008 03:33:21 -  1.79
+++ contrib/pgbench/pgbench.c   9 May 2008 07:12:21 -
@@ -1645,6 +1645,9 @@
exit(0);
}
 
+	if (scale && (ttype != 3))
+		fprintf(stderr, "Scale setting ignored by standard tests, using database branch count\n");
+
remains = nclients;
 
if (getVariable(&state[0], "scale") == NULL)



Re: [PATCHES] posix advises ...

2008-05-13 Thread Greg Smith

On Sun, 11 May 2008, Hans-Juergen Schoenig wrote:


we also made some simple autoconf hack to check for broken posix_fadvise.


Because of how you did that, your patch is extremely difficult to even 
test.  You really should at least scan the output from diff you're about 
to send before submitting a patch to make sure it makes sense--yours is 
over 30,000 lines long just to implement a small improvement.  The reason 
for that is that you regenerated "configure" using a later version of 
Autoconf than the official distribution, and the diff for the result is 
gigantic.  You even turned off the check in configure.in that specifically 
prevents you from making that mistake so it's not like you weren't warned.


You should just show the necessary modifications to configure.in in your 
patch, leave out the resulting configure file if you didn't use the same 
version of Autoconf, and certainly shouldn't submit a patch that subverts 
the checks there.


I find the concept behind this patch very useful and I'd like to see a 
working version re-submitted.  I'm in the middle of setting up some new 
hardware this month and was planning to test the existing fadvise patches 
Greg Stark sent out as part of that.  Having another one to test for 
accelerating multiple sequential scans would be extremely helpful to add 
to that, because then I think I can reuse some of the test cases Jeff 
Davis put together for the 8.3 improvements in that area to see how well 
it works.  It wasn't as clear to me how to test the existing bitmap scan 
patch, yours seems a more straightforward patch to use as a testing ground 
for fadvise effectiveness.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] posix advises ...

2008-06-19 Thread Greg Smith

On Thu, 19 Jun 2008, Zoltan Boszormenyi wrote:


This patch (revisited and ported to current CVS HEAD) is indeed using
Greg's original patch and also added another patch written by Mark Wong
that helps evicting closed XLOGs from memory faster.


Great, that will save me some trouble.  I've got a stack of Linux 
performance testing queued up (got stuck behind a kernel bug impacting 
pgbench) for the next couple of weeks and I'll include this in that 
testing.  I think I've got a similar class of hardware as you tested on 
for initial evaluation--I'm getting around 200MB/s sequential I/O right 
now out of my small RAID setup.


I added your patch to the queue for next month's CommitFest and listed 
myself as the initial reviewer, but a commit that soon is unlikely. 
Performance tests like this usually take a while to converge, and since 
this is using a less popular API I expect a round of portability concerns, 
too.


Where did Mark's patch come from?  I'd like to be able to separate out 
that change from the rest if necessary.


Also, if you have any specific test cases you ran that I could start by 
trying to replicate a speedup on, those would be handy as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] pgbench minor fixes

2008-07-05 Thread Greg Smith

On Sun, 6 Jul 2008, Russell Smith wrote:


Simon Riggs wrote:

Minor patch on pgbench

1. -i option should run vacuum analyze only on pgbench tables, not *all*
tables in database.


How does this work with custom scripts?


That's the initialization step.  It creates the 4 tables, populates them, 
then runs VACUUM ANALYZE.  There is no notion of what script you'll end up 
executing yet.  If you have a truly custom script that works against other 
data instead of the pgbench tables, you won't even be executing this 
initialization bit.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Sorting writes during checkpoint

2008-07-09 Thread Greg Smith

On Fri, 4 Jul 2008, Simon Riggs wrote:


No action on this seen since last commitfest, but I think we should do
something with it, rather than just ignore it.


Just no action worth reporting yet.  Over the weekend I finally reached 
the point where I've got a system that should be capable of independently 
replicating the setup where the improvement was reported, and I've started 
performance testing of the patch.  Getting useful checkpoint test results 
from pgbench is really a pain.



Sorting by file might have inadvertently shown benefit at the tablespace
level on a larger server with spread out data whereas on Tom's test
system I would guess just a single tablespace was used.


I doubt this has anything to do with it, only because the pgbench schema 
doesn't split into tablespaces usefully.  Almost all of the real action is 
on a single table, accounts.


My suspicion is that sorting only benefits in situations where you have a 
disk controller with a significant amount of RAM on it, but the server RAM 
is much larger.  In that case the sorting horizon of the controller itself 
is smaller than what the server can do, and the sorting makes it less 
likely you'll end up with the controller filled with unsorted stuff that 
takes a long time to clear.


In Tom's test, there's probably only 8 or 16MB worth of cache on the disk 
itself, so you can't get a large backlog of unsorted writes clogging the 
write pipeline.  But most server systems have 256MB or more of RAM there, 
and if you get that filled with seek-heavy writes (which might only clear 
at a couple of MB a second) the delay for that cache to empty can be 
considerable.


That said, I've got a 256MB controller here and have a very similar disk 
setup to the one positive results were reported on, but so far I don't see 
any significant difference after applying the sorted writes patch.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS][PATCHES] odd output in restore mode

2008-07-28 Thread Greg Smith

On Wed, 23 Jul 2008, Kevin Grittner wrote:


In our scripts we handle this by copying to a temp directory on the
same mount point as the archive directory and doing a mv to the
archive location when the copy is successfully completed.  I think
that this even works on Windows.  Could that just be documented as a
strong recommendation for the archive script?


This is exactly what I always do.  I think the way cp is shown in the 
examples promotes what's really a bad practice for lots of reasons, this 
particular problem being just one of them.


I've been working on an improved archive_command shell script that I 
expect to submit for comments and potential inclusion in the documentation 
as a better base for other people to build on.  This is one of the options 
for how it can operate.  It would be painful but not impossible to convert 
a subset of that script to run under Windows as well, at least enough to 
cover this particular issue.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
