[PATCHES] New features for pgbench
The attached adds two new command line switches to pgbench:

-x: Generate extended detail in the latency log, including a timestamp for each transaction.

-X: Do extra cleanup after the run (vacuum on all tables, checkpoint) before stopping the clock. This gives substantially more consistency in results between runs. Most pgbench results I see people present are so short that they're skewed considerably by whether there was a checkpoint in the middle of the run. This also allows testing situations with various autovacuum settings fairly.

There's an update to the README describing the features, as well as correcting/extending some of the existing documentation.

I generated the patch from the 8.2.3 release. Since pgbench runs the same way unless you pass it one of the new flags, I was hoping this would be considered for the next 8.2 update. I have a series of additional scripts I'll be releasing shortly that do interesting analysis of this extended latency data from pgbench (graphs of TPS and latency, that sort of thing), and I'd hate for that to only be available on 8.3.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: contrib/pgbench/README.pgbench
===================================================================
RCS file: /var/lib/pgsql/cvs/postgresql/contrib/pgbench/README.pgbench,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 README.pgbench
--- contrib/pgbench/README.pgbench	11 Feb 2007 23:57:37 -0000	1.1.1.1
+++ contrib/pgbench/README.pgbench	12 Feb 2007 00:07:35 -0000
@@ -94,10 +94,14 @@
 	default is 1. NOTE: scaling factor should be at least
 	as large as the largest number of clients you intend
 	to test; else you'll mostly be measuring update contention.
+	Regular (not initializing) runs using one of the
+	built-in tests will detect scale based on the number of
+	branches in the database. For custom (-f) runs this can
+	be manually specified.
 
 -D varname=value
-	Define a variable. It can be refereed to by a script
-	provided by using -f option. Multile -D options are allowed.
+	Define a variable. It can be referred to by a script
+	provided by using -f option. Multiple -D options are allowed.
 
 -U login
 	Specify db user's login name if it is different from
@@ -143,6 +147,24 @@
 	where time is measured in microseconds.
 
+-x
+	If a logfile is being generated with -l, use an
+	extended format for the log that is:
+
+	client_id transaction_no time file_no time-epoch time-us
+
+	where time is measured in microseconds, the file_no is
+	which test file was used (useful when multiple were
+	specified with -f), and time-epoch/time-us are a
+	UNIX epoch format timestamp followed by an offset
+	in microseconds (suitable for creating an ISO 8601
+	timestamp with a fraction of a second) of when
+	the transaction completed.
+
+-X	Perform extra cleanup after the test has completed
+	(vacuums and a checkpoint) to make test results
+	more uniform
+
 -d
 	debug option.
Index: contrib/pgbench/pgbench.c
===================================================================
RCS file: /var/lib/pgsql/cvs/postgresql/contrib/pgbench/pgbench.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 pgbench.c
--- contrib/pgbench/pgbench.c	11 Feb 2007 23:57:37 -0000	1.1.1.1
+++ contrib/pgbench/pgbench.c	12 Feb 2007 00:07:28 -0000
@@ -76,6 +76,10 @@
 bool	use_log;		/* log transaction latencies to a file */
 
+bool	extended_log;	/* provide extended detail in latency log */
+
+bool	extra_cleanup;	/* do database cleanup after test is over */
+
 int		remains;		/* number of remaining clients */
 
 int		is_connect;		/* establish connection for each transaction */
@@ -455,7 +459,13 @@
 			diff = (int) (now.tv_sec - st->txn_begin.tv_sec) * 1000000.0 +
 				(int) (now.tv_usec - st->txn_begin.tv_usec);
 
-			fprintf(LOGFILE, "%d %d %.0f\n", st->id, st->cnt, diff);
+			if (extended_log)
+				fprintf(LOGFILE, "%d %d %.0f %d %ld %ld\n",
+						st->id, st->cnt, diff,
+						st->use_file, now.tv_sec, now.tv_usec);
+			else
+				fprintf(LOGFILE, "%
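A side note for anyone who wants to consume the extended format directly: the time-epoch/time-us pair is easy to turn into an ISO 8601 style timestamp with fractional seconds. A minimal sketch in C, using only the standard library (this is illustration for log parsers, not part of the patch):

    #include <stdio.h>
    #include <time.h>

    /*
     * Format the last two fields of an extended latency log line
     * (UNIX epoch seconds plus a microsecond offset) as something
     * like "2007-02-12 00:04:08.324710".
     */
    static void
    format_log_timestamp(long epoch, long usec, char *out, size_t outlen)
    {
        char    datebuf[32];
        time_t  seconds = (time_t) epoch;

        strftime(datebuf, sizeof(datebuf), "%Y-%m-%d %H:%M:%S",
                 localtime(&seconds));
        snprintf(out, outlen, "%s.%06ld", datebuf, usec);
    }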
Re: [PATCHES] New features for pgbench
On Sun, 11 Feb 2007, Tom Lane wrote:

> Does this have any impact on the reported results (by slowing pg_bench
> itself)?

I didn't put more code than I had to in the transaction path, to avoid any slowdown. I didn't convert the timestamp to human readable format or anything intensive like that to avoid impacting the pgbench results. It's just dumping some data that was already sitting there.

There is an extra if statement for each transaction, and a slightly longer fprintf when running with the extra latency output in place. That's it. The file gets "%d %d %.0f %d %ld %ld\n" instead of "%d %d %.0f\n".

The main drawback to logging more as the default is about twice as much disk I/O for writing the latency log out. That's small change compared with the WAL/database writes that must be going on to generate that transaction, and I sure haven't been able to measure any change in results.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] New features for pgbench
On Mon, 12 Feb 2007, NikhilS wrote:

> From your patch I see that it augments the -l flag. IMHO it does not
> make sense to add another flag. We can save the "if" check and log the
> extended contents as part of -l itself.

I wanted something people could apply to 8.2 without breaking existing scripts (regardless of whether it was accepted into 8.2). And I expected some concern over whether this change affects results. By putting in a switch, it's possible to test both ways, with only the if being added to the default case.

> If I read it correctly, are you suggesting that the same database with a
> prior pgbench run be used for further pgbench runs? How is it useful?
> How can one guarantee consistency of observed tps values with this in
> place?

Right now when you run pgbench, the results vary considerably from run to run even if you completely rebuild the database every time. I've found that a lot of that variation comes from two things:

1) If your run is so small that it usually doesn't generate a checkpoint, the runs that do encounter one will be slower--possibly a lot slower if you have a large buffer cache. Similarly, runs that are just long enough to normally encounter one checkpoint will take longer if they happen to run into two, and so on. There are games you can play with improving pgbench performance by using more checkpoint_segments and a bigger shared_buffers cache that look like they dramatically improve results. But what you're mainly doing is just making checkpoints less frequent, reducing the odds that you'll run into one during the pgbench test itself.

2) The standard pgbench test does 3 UPDATEs per transaction. That leaves behind a lot of dead rows that need to be vacuumed. The amount of autovacuum that occurs during the run will vary. This means that some runs finish with more dead space left behind than others. It really isn't fair that a pgbench run that involves cleaning up more of its own mess during the test will get a lower TPS result than one that just generates a bunch of dead tuples and leaves the mess hanging around. Right now, optimal pgbench results are generated with autovacuum turned completely off; that's just not realistic.

In order to get a completely fair comparison, I've adopted a policy that says the run isn't over until the database has been cleaned up such that it's in a similar state to how it started: all tables are vacuumed, and all updates have been written to disk. The new -X behavior forces this cleanup to be considered part of the test; a sketch of what it amounts to follows this message.

Whether or not you choose to use it for your regular tests, I suggest trying it out. You may be as surprised as I was at exactly how much vacuuming work is left over after a long pgbench run, and how dramatically it lowers TPS results if you consider that cleanup essential to the test.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
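To be concrete about what that cleanup involves, this is the shape of it, written here against plain libpq rather than pgbench's internal helpers (the function and its error handling are illustrative, not the patch's actual code; the table names are the ones the standard 8.2-era pgbench test uses):

    #include <stdio.h>
    #include <libpq-fe.h>

    /*
     * Post-test cleanup before stopping the clock: vacuum the tables the
     * standard test dirties, then force a checkpoint so the run's writes
     * are on disk before TPS is computed.
     */
    static void
    run_extra_cleanup(PGconn *con)
    {
        static const char *const commands[] = {
            "VACUUM branches",
            "VACUUM tellers",
            "VACUUM accounts",
            "VACUUM history",
            "CHECKPOINT"
        };
        int     i;

        for (i = 0; i < 5; i++)
        {
            PGresult   *res = PQexec(con, commands[i]);

            if (PQresultStatus(res) != PGRES_COMMAND_OK)
                fprintf(stderr, "cleanup \"%s\" failed: %s",
                        commands[i], PQerrorMessage(con));
            PQclear(res);
        }
    }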
Re: [PATCHES] New features for pgbench
On Mon, 12 Feb 2007, Tom Lane wrote:

> This is a real issue, but I think your proposed patch does not fix it.

I certainly wouldn't claim that my patch _fixes_ the problem in the general case; it provides one way to measure it. Currently it's not obvious to new pgbench users that the problem even exists at all. I feel it's important to draw attention to the fact that it's something you should be aware of, even if an automatic resolution to the problem isn't obvious yet.

In the context I run pgbench in, it is also a workable fix. I don't even pay attention to pgbench results unless I'm popping 10,000 (desktop) to 100,000 (server) transactions through it. In that context, I believe it fairly penalizes the transactions for the data they leave behind for maintenance. I completely agree that people doing short runs shouldn't use this switch.

Anyway, I like your idea of describing the lower TPS number as including maintenance; that matches the terminology used within the documentation better. I will reformat the output to use that term.

Here's what I'm gonna do. The patch I submitted was prepared with the goal of possibly being implemented in 8.2. I thought a change to contrib/ that added a feature turned off by default might have a shot at a backport, and I wanted something people could use on the current release to be available. Now that I know it's never going into an official 8.2, I will prepare a slightly different patch aimed at 8.3--incorporating all the feedback I've gotten here as either code changes or additional documentation--and resubmit in another week or so. Thanks for the feedback.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] New features for pgbench
On Tue, 13 Feb 2007, Takayuki Tsunakawa wrote:

> The Win32 APIs that pgbench is using for gettimeofday() (in
> src/port/gettimeofday.c) are much lower in resolution than on Linux.

I wasn't aware of this issue, and it certainly makes the whole latency side of pgbench pretty useless on Win32. There is code in src/include/executor/instrument.h that uses a higher resolution Windows timer API than gettimeofday() does (as you point out, that one only resolves to one Windows tick, about 15 ms). If I can get a Windows build environment set up, I'll see if I can borrow that solution for pgbench; the sketch below shows the general idea.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
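The usual way to get sub-tick timing on Windows is the QueryPerformanceCounter API. A rough, untested sketch of a substitute interval timer (note the counter runs from system boot, so this only measures elapsed time, not wall-clock timestamps):

    #ifdef WIN32
    #include <windows.h>

    /*
     * Microseconds elapsed since system boot, using the high-resolution
     * performance counter instead of the ~15 ms system tick.  The split
     * into seconds and remainder avoids 64-bit overflow on long uptimes.
     */
    static __int64
    hires_time_usec(void)
    {
        LARGE_INTEGER ticks;
        LARGE_INTEGER freq;

        QueryPerformanceCounter(&ticks);
        QueryPerformanceFrequency(&freq);
        return (ticks.QuadPart / freq.QuadPart) * 1000000 +
               ((ticks.QuadPart % freq.QuadPart) * 1000000) / freq.QuadPart;
    }
    #endif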
Re: [PATCHES] WIP patch - INSERT-able log statements
On Sat, 17 Feb 2007, FAST PostgreSQL wrote:

> #log_output_type = 'text'	# Valid values are 'SQL' or 'text'
>
> Defaults to 'text' which is status quo. If it is set to 'SQL' log will
> be output as INSERT commands.

This assumes someone wants either the INSERT-able logs or the current, easily readable ones. I know I don't want either--I want both. There are times I want to look through the logs with a text editor, there are times where I want to query against them. I would suggest treating this similarly to how the Windows eventlog is handled: make SQL INSERT format another option available to log_destination, so it can be combined with the existing formats.

In addition to the syslog concerns you already mentioned (which are themselves a showstopper for using this feature in companies that rely on or aggregate syslogs), I know I'd want to keep the existing logs rolling in parallel while I tested out the SQL-based version for a while, before cutting over to exclusively INSERT format logs.

I've thought a bit about how to implement this TODO already (I have a log file parser and I hate maintaining it), and the only thing that made sense to me was giving a new parameter with the filename to output to in this format. For example, make a new log_sql_filename with the same syntax already used for log_filename. There will probably need to be a second parameter for the table name to insert into, as you've already commented on.

And like Joshua has already suggested, the main useful applications for this feature I've thought of all involve reading from the INSERT-able logs in real-time, using something like "tail -f", and pumping that data immediately into a logger table.

Also, I feel that supporting the whole log_line_prefix syntax for this feature is not just overkill, it's a bad idea. Output everything in a standard, complete format instead, and then it becomes easy for the community at large to build tools on top of that to analyze the log database entries instead of having so many ad-hoc approaches. You want a subset, use a view or copy just the fields you want into another table. I would guess this simplifies the patch as well.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] WIP patch - INSERT-able log statements
On Tue, 20 Feb 2007, FAST PostgreSQL wrote:

> I think adding the 'format' of the log as an option in the 'destination'
> may be confusing. We can have a new boolean variable like
> 'output_sql_log' or 'log_sql_format' which will trigger the output of
> INSERT-able log in addition to syslog/stderr/eventlog in text format as
> it is now.

What's confusing about it? Consider this hypothetical postgresql.conf snippet:

log_destination = 'stderr,sql'	# Valid values are combinations of
				# stderr, syslog, sql, and eventlog,
				# depending on platform.

# These are relevant when logging to sql:
log_sql_table = 'pg_log'	# Table SQL formatted logs INSERT into
				# Default is 'pg_log'

Adding new GUC variables isn't without cost; there's no reason to add a new one when there's a perfectly good destination one already whose syntax is completely appropriate for this task. I personally would just ignore the two-duration-entries-per-statement problem and make that the log analyzer software's job to fix, but I'm likely biased here because I don't ever do anything with that data.

> My preference would be for the sql-logs to use the same variables as
> that of redirect_stderr. In the directory mentioned by the user on
> log_directory we just output the sql logs in a file with an '.SQL'
> appended to the filename specified by the user in log_filename. This
> also means we can make use of the log_truncation and log_rotate options
> (I hope. I will have a look at that).

You're talking along the same lines here I was trying to suggest in my previous message. Keep all the current behavior as far as truncation and rotation go, just provide a different name for the file. If you just appended a suffix like .sql to the existing name, that would remove another GUC variable requirement. I think eventually people will complain about this and want a separately formatted filename altogether, but there's nothing wrong with the approach you suggest for a first version of this feature. I know I'm so desperate for a better solution to log importing that I'd be happy with any workable improvement.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
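As for what actually lands in that file, the mechanics are nothing more than formatting each log entry as a complete INSERT statement. A hypothetical sketch (the pg_log table name and the column list here are illustrative, not from the patch):

    #include <stdio.h>

    /*
     * Emit one log entry as an INSERT command.  A real implementation
     * must escape embedded single quotes in every field; that's skipped
     * here to keep the idea visible.
     */
    static void
    write_sql_log_entry(FILE *logfile, const char *log_time,
                        const char *username, const char *dbname,
                        const char *severity, const char *message)
    {
        fprintf(logfile,
                "INSERT INTO pg_log (log_time, username, database, "
                "severity, message) VALUES "
                "('%s', '%s', '%s', '%s', '%s');\n",
                log_time, username, dbname, severity, message);
    }

A stream of lines in that form is exactly what makes the "tail -f into psql" style of real-time import workable.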
Re: [PATCHES] WIP patch - INSERT-able log statements
On Tue, 20 Feb 2007, Tom Lane wrote:

> A smaller problem is that this forces people to incur a gettimeofday
> call for every message logged

I'm stumped trying to think of an application that would require importing the logs into a database to analyze them, but not need the timestamp. I'd expect it to be the primary key on the data.

> Is it worth providing a knob to determine the set of columns emitted?

Myself and Guillaume felt that having the format be standardized had significant value from a downstream application perspective; it would be nice to know that everyone can work together to write one simple tool chain to process these things and it would work everywhere. The current level of log output customization is part of what makes log analysis tools so much of a pain.

How about this as a simple way to proceed: have the patch include everything, as Arul already planned. When it's done, do some benchmarking with it turned on or off. If it really seems like a drag, then consider a GUC addition to trim it down. Why optimize prematurely? It's not like this will be on by default. My guess is that a person sophisticated enough to analyze their logs probably has an installation that can support the overhead.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] WIP patch - INSERT-able log statements
On Thu, 22 Feb 2007, FAST PostgreSQL wrote:

> As we are triggering the sql output in log_destination, if the user
> gives 'syslog,sql' as options he is going to get two different looking
> logs (in terms of contents) depending upon his settings.

Yes, exactly; it's a good thing. People add and remove things from the text logs to make them easier to read. It's one of the reasons they're harder to process. Since readability isn't a requirement for the SQL formatted ones, you can pack a lot more in there and make it available easily anyway.

I keep having every part of this conversation twice, so here's take two on this one. The things that people want out of the text logs are not necessarily the same things they want from the SQL ones. For example, I have a situation where the semantics of the syslog output are being driven by Sarbanes-Oxley related mechanics. But the SQL logs are based on my requirements, which are to include close enough to everything that it might as well be the whole set, in case I forgot something I find I need later. The SQL logs are *completely* different from the syslog setup.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
On Fri, 23 Mar 2007, ITAGAKI Takahiro wrote:

> Here is the latest version of Load distributed checkpoint patch.

Couple of questions for you:

- Is it still possible to get the original behavior by adjusting your tunables? It would be nice to do a before/after without having to recompile, and I know I'd be concerned about something so different becoming the new default behavior.

- Can you suggest a current test case to demonstrate the performance improvement here? I've tried several variations on stretching out checkpoints like you're doing here and they all made slow checkpoint issues even worse on my Linux system. I'm trying to evaluate this fairly.

- This code operates on the assumption you have a good value for the checkpoint timeout. Have you tested its behavior when checkpoints are being triggered by checkpoint_segments being reached instead?

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
On Mon, 26 Mar 2007, ITAGAKI Takahiro wrote:

> I'm assuming checkpoints are triggered by timeout in normal use -- and
> it's my recommended configuration whether the patch is installed or not.

I'm curious what other people running fairly serious hardware do in this area for write-heavy loads, whether it's timeout or segment limits that normally trigger their checkpoints. I'm testing on a slightly different class of machine than your sample results, something that is in the 1500 TPS range running the pgbench test you describe. Running that test, I always hit the checkpoint_segments wall well before any reasonable timeout. With 64 segments (a gigabyte of WAL at 16 MB per segment), I get a checkpoint every two minutes or so.

There's something I'm working on this week that may help out other people trying to test your patch. I've put together some simple scripts that graph (patched) pgbench results, which make it very easy to see what changes when you alter the checkpoint behavior. Edges are still rough, but the scripts work for me; I will be polishing and testing over the next few days:

http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm

(Note that the example graphs there aren't from the production system I mentioned above; they're from my server at home, which is similar to the system your results came from.)

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
[PATCHES] pgbench transaction timestamps
This patch changes the way pgbench outputs its latency log files so that every transaction gets a timestamp and notes which transaction type was executed. It's a one-line change that just dumps some additional information that was already sitting in that area of code. I also made a couple of documentation corrections and clarifications on some of the more confusing features of pgbench.

It's straightforward to parse log files in this format to analyze what happened during the test at a higher level than was possible with the original format. You can find some rough sample code to convert this latency format into CSV files and then into graphs at http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm which I'll be expanding on once I get all my little patches sent in here.

If you recall the earlier version of this patch I submitted, it added a cleanup feature that did a vacuum and checkpoint after the test was finished and reported two TPS results. The idea was to quantify how much of a hit the eventual table maintenance required to clean up after the test would take. While those things do influence results and cause some of the run-to-run variation in TPS (checkpoints are particularly visible in the graphs), after further testing I concluded that running a VACUUM VERBOSE and CHECKPOINT in a script afterwards and analyzing the results was more useful than integrating something into pgbench itself.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: contrib/pgbench/README.pgbench
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pgbench/README.pgbench,v
retrieving revision 1.15
diff -c -r1.15 README.pgbench
*** contrib/pgbench/README.pgbench	21 Oct 2006 06:31:28 -0000	1.15
--- contrib/pgbench/README.pgbench	1 Apr 2007 00:57:48 -0000
***************
*** 94,103 ****
  	default is 1. NOTE: scaling factor should be at least
  	as large as the largest number of clients you intend
  	to test; else you'll mostly be measuring update contention.
  
  -D varname=value
! 	Define a variable. It can be refereed to by a script
! 	provided by using -f option. Multile -D options are allowed.
  
  -U login
  	Specify db user's login name if it is different from
--- 94,107 ----
  	default is 1. NOTE: scaling factor should be at least
  	as large as the largest number of clients you intend
  	to test; else you'll mostly be measuring update contention.
+ 	Regular (not initializing) runs using one of the
+ 	built-in tests will detect scale based on the number of
+ 	branches in the database. For custom (-f) runs it can
+ 	be manually specified with this parameter.
  
  -D varname=value
! 	Define a variable. It can be referred to by a script
! 	provided by using -f option. Multiple -D options are allowed.
  
  -U login
  	Specify db user's login name if it is different from
***************
*** 139,147 ****
  	with the name "pgbench_log.xxx", where xxx is the PID
  	of the pgbench process. The format of the log is:
  
! 	client_id transaction_no time
  
! 	where time is measured in microseconds.
  
  -d
  	debug option.
--- 143,157 ----
  	with the name "pgbench_log.xxx", where xxx is the PID
  	of the pgbench process. The format of the log is:
  
! 	client_id transaction_no time file_no time-epoch time-us
  
! 	where time is measured in microseconds, the file_no is
! 	which test file was used (useful when multiple were
! 	specified with -f), and time-epoch/time-us are a
! 	UNIX epoch format timestamp followed by an offset
! 	in microseconds (suitable for creating an ISO 8601
! 	timestamp with a fraction of a second) of when
! 	the transaction completed.
  
  -d
  	debug option.
***************
*** 163,168 ****
--- 173,180 ----
  	(7) end;
  
+ 	If you specify -N, (4) and (5) aren't included in the transaction.
+ 
  o -f option
  	This supports for reading transaction script from a specified

Index: contrib/pgbench/pgbench.c
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pgbench/pgbench.c,v
retrieving revision 1.62
diff -c -r1.62 pgbench.c
*** contrib/pgbench/pgbench.c	13 Mar 2007 09:06:35 -0000	1.62
--- contrib/pgbench/pgbench.c	1 Apr 2007 00:57:49 -0000
***************
*** 461,467 ****
  			diff = (int) (now.tv_sec - st->txn_begin.tv_sec) * 1000000.0 +
[PATCHES] Add usage counts to pg_buffercache
This patch adds the usage count statistic to the information available in contrib/pg_buffercache. Earlier this month a discussion about my first attempt to instrument the background writer had Tom asking for details about the usage histogram I was seeing, and this patch proved to be the easiest way I found to take a look at that.

In situations where one is trying to optimize the background writer, it's very hard to adjust how much to rely on the LRU writer versus the one that writes everything unless you know whether your dirty buffers are typically used heavily (like index blocks) or not (like new INSERT data). Some statistics about the usage counts in your buffer cache are extremely helpful in making that decision.

I'll even pass along an ugly but fun query that utilizes this. The following will give you a summary of your buffer cache broken into 32 sections. Each line shows the average usage count of that section, as a positive number if most buffers are dirty and a negative one if most are clean. If you refresh this frequently enough, you can actually watch things like how checkpoints move through the buffer cache:

SELECT current_timestamp,
  -- Split into 32 bins of data
  round(bufferid / (cast((select setting from pg_settings
    where name='shared_buffers') as int) / (32 - 1.0))) as section,
  round(
    -- Average usage count, capped at 5
    case when avg(usagecount) > 5 then 5 else avg(usagecount) end *
    -- -1 when the majority are clean records, 1 when most are dirty
    (case when sum(case when isdirty then 1 else -1 end) > 0
     then 1 else -1 end)) as color_intensity
FROM pg_buffercache
GROUP BY round(bufferid / (cast((select setting from pg_settings
  where name='shared_buffers') as int) / (32 - 1.0)));

The 32 can be changed to anything; that's just what fits on my screen. The main idea of the above is that if you dump all this to a file regularly, it's possible to produce a graph showing how the cache has changed over time by assigning a different color intensity based on the usage count--at a massive cost in overhead, of course. I'll be passing along all that code once I get it ready for other people to use.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: README.pg_buffercache
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_buffercache/README.pg_buffercache,v
retrieving revision 1.3
diff -c -r1.3 README.pg_buffercache
*** README.pg_buffercache	26 Apr 2006 22:50:17 -0000	1.3
--- README.pg_buffercache	1 Apr 2007 02:17:15 -0000
***************
*** 40,46 ****
   reldatabase    | pg_database.oid | Database for the relation.
   relblocknumber |                 | Offset of the page in the relation.
   isdirty        |                 | Is the page dirty?
! 
  There is one row for each buffer in the shared cache. Unused buffers are
  shown with all fields null except bufferid.
--- 40,46 ----
   reldatabase    | pg_database.oid | Database for the relation.
   relblocknumber |                 | Offset of the page in the relation.
   isdirty        |                 | Is the page dirty?
!  usagecount     |                 | Page LRU count
 
  There is one row for each buffer in the shared cache. Unused buffers are
  shown with all fields null except bufferid.
***************
*** 60,79 ****
  regression=# \d pg_buffercache;
        View "public.pg_buffercache"
!      Column     |  Type   | Modifiers
! ----------------+---------+-----------
!  bufferid       | integer |
!  relfilenode    | oid     |
!  reltablespace  | oid     |
!  reldatabase    | oid     |
!  relblocknumber | bigint  |
!  isdirty        | boolean |
  View definition:
   SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase,
!  p.relblocknumber, p.isdirty
     FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid,
          reltablespace oid, reldatabase oid, relblocknumber bigint,
!         isdirty boolean);
 
  regression=# SELECT c.relname, count(*) AS buffers
               FROM pg_class c INNER JOIN pg_buffercache b
--- 60,81 ----
  regression=# \d pg_buffercache;
        View "public.pg_buffercache"
!      Column     |   Type   | Modifiers
! ----------------+----------+-----------
!  bufferid       | integer  |
!  relfilenode    | oid      |
!  reltablespace  | oid      |
!  reldatabase    | oid      |
!  relblocknumber | bigint   |
!  isdirty        | boolean  |
!  usagecount     | smallint |
! 
  View definition:
   SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase,
!  p.relblocknumber, p.isdirty, p.usagecount
     FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid,
          reltablespace oid, reldatabase oid, relblocknumber bigint,
!         isdirty boolean, usagecount smallint);
Re: [PATCHES] Add usage counts to pg_buffercache
On Sun, 1 Apr 2007, Russell Smith wrote:

> Currently the max buffer count is 5. But is that a completely safe
> assumption? Maybe a compile time check that BM_MAX_USAGE_COUNT is < 16k
> would ensure that things don't go wrong?

I actually wasn't even aware that was a hard limit; I just assumed that all my systems never got over 5 before the LRU hit them. How 'bout that.

I'm all for paranoia, but the idea of letting LRU usage counts go over 16 bits seems pretty unlikely with the current approach. That would mean those pages would need 64K passes over the buffer cache before they could get evicted, which would take quite a while. My bet is that bufferid gets expanded from 32 bits before that happens, which would break pg_buffercache similarly and isn't being checked for either.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
[PATCHES] Logging checkpoints and other slowdown causes
This patch puts entries in the log file around some backend events that regularly cause client pauses, most of which are visible only when running with log_min_messages = debug2. The main focus is regularly occurring checkpoints, from either running out of segments or hitting a timeout. This is an example of what gets written to the logs at checkpoint time:

DEBUG2: checkpoint required (timeout passed)
DEBUG2: checkpoint starting
DEBUG2: checkpoint flushing buffer pool
DEBUG2: checkpoint database fsync starting
DEBUG2: checkpoint database fsync complete
DEBUG1: checkpoint buffers dirty=16.7 MB (13.0%) write=174.8 ms sync=3716.5 ms
DEBUG2: checkpoint complete; 0 transaction log file(s) added, 0 removed, 0 recycled

The idea here is that if you want to really watch what your system is doing, either to track down a problem or for benchmarking, you log at DEBUG2 and timestamp the logs with something like log_line_prefix = '[%m]:'. Then line up your timestamped transactions with the logs to figure out what happened during bad periods. This is very easy to do with the timestamped pgbench latency logs, for example, so you can put them in time sequence order and see something like "oh, those long transactions all came in between the fsync start and end". The summary line with the write/sync timing information has a broader use and suggests when you need to go deeper, which is why I put that one at DEBUG1.

I also adjust some other log messages to make it easier to run the system at DEBUG2 without going completely crazy, like moving individual WAL segment recycles to DEBUG3 and standardizing the format of the fork/exit messages so they're easier to filter out.

My main concern about this code is how it adds several gettimeofday calls and complicated string building inside FlushBufferPool, all of which is wasted if this data isn't actually going to get logged. I'd like to have something that tells me whether I should even bother, but I didn't want to poke into the elog variables for fear of adding a dependency on its current structure here. Ideally, it would be nice to call something like WillLog(DEBUG2) and get a boolean back saying whether something at that log level will be output anywhere; I don't know enough about the logging code to add such a thing to it myself.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.265
diff -c -r1.265 xlog.c
*** src/backend/access/transam/xlog.c	3 Mar 2007 20:02:26 -0000	1.265
--- src/backend/access/transam/xlog.c	1 Apr 2007 04:55:19 -0000
***************
*** 1591,1600 ****
  	if (new_highbits != old_highbits ||
  		new_segno >= old_segno + (uint32) (CheckPointSegments-1))
  	{
! #ifdef WAL_DEBUG
! 		if (XLOG_DEBUG)
! 			elog(LOG, "time for a checkpoint, signaling bgwriter");
! #endif
  		RequestCheckpoint(false, true);
  	}
  }
--- 1591,1598 ----
  	if (new_highbits != old_highbits ||
  		new_segno >= old_segno + (uint32) (CheckPointSegments-1))
  	{
! 		ereport(DEBUG2,(errmsg(
! 				"checkpoint required (wrote checkpoint_segments)")));
  		RequestCheckpoint(false, true);
  	}
  }
***************
*** 1858,1863 ****
--- 1856,1863 ----
  				(errcode_for_file_access(),
  				 errmsg("could not create file \"%s\": %m", tmppath)));
  
+ 	ereport(DEBUG2, (errmsg("WAL creating and filling new file on disk")));
+ 
  	/*
  	 * Zero-fill the file. We have to do this the hard way to ensure that all
  	 * the file space has really been allocated --- on platforms that allow
***************
*** 2565,2571 ****
  							   true, &max_advance,
  							   true))
  		{
! 			ereport(DEBUG2, (errmsg("recycled trans
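Coming back to the WillLog idea from the message above, the guard I have in mind would look something like this. WillLog is hypothetical; log_min_messages is the real parameter, but whether testing it directly covers every output route is exactly the part I don't know:

    /*
     * Hypothetical check for "would a message at this level be logged
     * anywhere?".  This only consults log_min_messages; a correct
     * version probably needs to look at the other elog destinations too.
     */
    static bool
    WillLog(int elevel)
    {
        return log_min_messages <= elevel;
    }

    /* ...so the instrumentation can be skipped when nobody will see it: */
    if (WillLog(DEBUG2))
        gettimeofday(&flush_start, NULL);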
Re: [PATCHES] bgwriter stats
On Thu, 29 Mar 2007, Magnus Hagander wrote:

> I've included a couple more counters per ideas from Greg Smith in his
> logging patch.

I just submitted a patch that logs the remaining things of value from my original that couldn't be derived from the information you're collecting. Between them I'm happy that a lot of previously hidden performance issues can now be monitored--not necessarily easily, but it's at least possible.

I made one small change to your code in there as well that I wanted to highlight here. You updated the buffers written by checkpoints one at a time as they wrote out. When I tried to develop something that monitored pg_stat_bgwriter looking for when checkpoints happened, this made it difficult to answer the important question "how many buffers did the last checkpoint write?" just from the stats structure, because I assumed it's possible to get a view in the middle of the checkpoint. It took watching both the total and the checkpoint count, and that was hard to work with.

I modified things so that the checkpoint buffers written number gets updated in one shot when the buffer flush is done. No partial results, much easier to monitor: when the buffers_checkpoint value changes, the difference from the previous value is what the last checkpoint wrote. I needed that total anyway, which is why I just slipped it into the other patch.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] bgwriter stats
On Sun, 1 Apr 2007, Magnus Hagander wrote:

> The data in pg_stat_bgwriter certainly doesn't update *during* a
> checkpoint, if that's what you're saying.

Scratch previous message, replace with "dude ur code rulez!" I was working on the assumption it was possible to get a partial result, because I haven't had enough time to track all the code paths involved to prove otherwise. Glad to hear it was never an issue. Doesn't change what I submitted though.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Logging checkpoints and other slowdown causes
On Tue, 3 Apr 2007, Peter Eisentraut wrote:

> Something that is aimed at a user should not be enabled at a "debug"
> level. Debug levels are for debugging, not for very high verbosity.

I asked for feedback about where to log when I initially sent the first version of this in and didn't hear anything back on that part, so I pushed these in line with other log messages I saw. The messages for when checkpoints start and stop were both logged at DEBUG2, so I put progress reports on the other significant phases of the process there as well.

I don't expect these messages will be helpful for a normal user--that's what the new data in pg_stat_bgwriter is for. The main purpose of this patch is debugging checkpoint-related performance issues at a level I'd expect only a developer to work at; the messages are also helpful for someone writing benchmark code.

There are several patches in process floating around that aim to adjust either the background writer or the checkpoint process to reduce the impact of checkpoints. This logging allows grading their success at that. As my tests with this patch in place suggest this problem is far from solved with any of the current suggestions, I'd like to get other developers looking at that problem the same way I have been; that's why I'd like to see some standardization on how checkpoints are instrumented. The fact that really advanced users might also use this for troubleshooting I consider a bonus rather than the main focus here.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> Unfortunately because of the recent instrumentation and
> CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD
> anymore. Could you fix the bitrot and send an updated patch, please?

The "Logging checkpoints and other slowdown causes" patch I submitted touches some of the same code as well, so that's another possible merge coming, depending on what order this all gets committed in. Running into what I dubbed perpetual checkpoints was one of the reasons I started logging timing information for the various portions of the checkpoint, to tell when it was bogged down with slow writes versus being held up in sync for various (possibly fixed with your CheckpointStartLock) issues.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> Bgwriter has two goals:
> 1. keep enough buffers clean that normal backends never need to do a write
> 2. smooth checkpoints by writing buffers ahead of time
> Load distributed checkpoints will do 2. in a much better way than the
> bgwriter_all_* guc options. I think we should remove that aspect of
> bgwriter in favor of this patch.

My first question about the LDC patch was whether I could turn it off and return to the existing mechanism. I would like to see a large pile of data proving this new approach is better before the old one goes away. I think everyone needs to do some more research and measurement here before assuming the problem can be knocked out so easily. The reason I've been busy working on patches to gather statistics on this area of code is that I've tried most simple answers to getting the background writer to work better and made little progress, and I'd like to see everyone else doing the same, at least collecting the right data.

Let me suggest a different way of looking at this problem. At any moment, some percentage of your buffer pool is dirty. Whether it's 0% or 100% dramatically changes what the background writer should be doing. Whether most of the data is usage_count > 0 or not also makes a difference. None of the current code has any idea what type of buffer pool it's working with, and therefore it doesn't have enough information to make a well-informed prediction about what is going to happen in the near future.

I'll tell you what I did to the all-scan. I ran a few hundred hours worth of background writer tests to collect data on what it does wrong, then wrote a prototype automatic background writer that resets the all-scan parameters based on what I found. It keeps a running estimate of how dirty the pool at large is, using a weighted average of the most recent scan with the past history. From there, I have a simple model that predicts how much of the buffer pool we can scan in any interval, and intends to enforce a maximum bound on the amount of physical I/O you're willing to stream out. The beta code is sitting at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to see what I've done so far. The parts that are done work fine--as long as you give it a reasonable % to scan by default, it will correct all_max_pages and the interval in real-time to meet the scan rate requested given how much is currently dirty; the I/O rate is computed but doesn't limit properly yet.

Why haven't I brought this all up yet? Two reasons. The first is that it doesn't work on my system; checkpoints and overall throughput get worse when you try to shorten them by running the background writer at optimal aggressiveness. Under really heavy load, the writes slow down as all the disk caches fill, the background writer fights with reads on the data that isn't in the mostly dirty cache (introducing massive seek delays), it stops cleaning effectively, and it's better for it to not even try. My next generation of code was going to start with the LRU flush and then only move onto the all-scan if there's time left over.

The second is that I just started to get useful results here in the last few weeks, and I assumed it's too big of a topic to start suggesting major redesigns to the background writer mechanism at that point (from me at least!). I was waiting for 8.3 to freeze before even trying. If you want to push through a redesign there, maybe you can get away with it at this late moment. But I ask that you please don't remove anything from the current design until you have significant test results to back up that change.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> The purpose of the bgwriter_all_* settings is to shorten the duration of
> the eventual checkpoint. The reason to shorten the checkpoint duration
> is to limit the damage to other I/O activity it causes. My thinking is
> that assuming the LDC patch is effective (agreed, needs more testing) at
> smoothening the checkpoint, the duration doesn't matter anymore. Do you
> want to argue there's other reasons to shorten the checkpoint duration?

My testing results suggest that LDC doesn't smooth the checkpoint usefully when under a high (>30 client here) load, because (on Linux at least) the way the OS caches writes clashes badly with how buffers end up being evicted if the buffer pool fills back up before the checkpoint is done. In that context, anything that slows down the checkpoint is going to make the problem worse rather than better, because it makes it more likely that the tail end of the checkpoint will have to fight with the clients for write bandwidth, at which point they both suffer. If you just get the checkpoint done fast, the clients can't fill the pool as fast as BufferSync is writing it out, and things are as happy as they can be without a major rewrite to all this code. I can get a tiny improvement in some respects by delaying 2-5 seconds between finishing the writes and calling fsync, because that gives Linux a moment to usefully spool some of the data to the disk controller's cache; beyond that, any additional delay is a problem. Since it's only the high load cases I'm having trouble dealing with, this basically makes it a non-starter for me. The focus on checkpoint_timeout and ignoring checkpoint_segments in the patch is also a big issue for me. At the same time, I recognize that the approach taken in LDC probably is a big improvement for many systems; it's just a step backwards for my highest throughput one. I'd really enjoy hearing some results from someone else.

> The number of buffers evicted by normal backends in a bgwriter_delay
> period is simple to keep track of, just increase a counter in
> StrategyGetBuffer and reset it when bgwriter wakes up.

I see you've already found the other helpful Itagaki patch in this area; a sketch of that counter idea follows below. I know I would like to see his code for tracking evictions committed, and then I'd like that to be added as another counter in pg_stat_bgwriter (I mentioned that to Magnus in passing when he was setting the stats up, but didn't press it because of the patch dependency). Ideally, and this idea was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug hook, I think it's important that you know how every buffer written to disk got there--was it the background writer, a checkpoint, or an eviction that wrote it out? Track all those and you can really learn something about your write performance, data that's impossible to collect right now.

However, as Itagaki himself points out, doing something useful with bgwriter_lru_maxpages is only one piece of automatically tuning the background writer. I hate to join in on chopping his patches up, but without some additional work I don't think the exact auto-tuning logic he then applies will work in all cases, which could make it more of a problem than the current crude yet predictable method. The whole way bgwriter_lru_maxpages and num_to_clean play off each other in his code currently has a number of failure modes I'm concerned about. I'm not sure if a rewrite using a moving-average approach (as I did in my auto-tuning writer prototype and as Tom just suggested here) will be sufficient to fix all of them. That was already on my to-do list to investigate further.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
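Restating Heikki's counter suggestion as code, it amounts to something along these lines (names are illustrative; the real buffer strategy code keeps its state in shared memory with locking, all omitted here):

    /*
     * Sketch of the eviction counter: backends bump it whenever they have
     * to evict a buffer themselves, and the bgwriter reads and resets it
     * once per bgwriter_delay.
     */
    static volatile int BackendEvictions = 0;

    /* Called from StrategyGetBuffer() when a backend evicts a buffer. */
    static void
    count_backend_eviction(void)
    {
        BackendEvictions++;
    }

    /* Called once per round from the bgwriter loop. */
    static int
    read_and_reset_evictions(void)
    {
        int     evicted = BackendEvictions;

        BackendEvictions = 0;
        return evicted;
    }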
Re: [PATCHES] Load distributed checkpoint V3
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

> could anyone evaluate O_SYNC approach again that commercial databases
> use and tell me if and why PostgreSQL's fsync() approach is better than
> theirs?

I noticed a big improvement switching the WAL to use O_SYNC (+O_DIRECT) instead of fsync on both my big and my little servers with battery-backed cache, so I know sync writes perform reasonably well on my hardware. Since I've had problems with the fsync at checkpoint time, I did a similar test to yours recently, adding O_SYNC to the open calls and pulling the fsyncs out to get a rough idea how things would work; the gist of the change is sketched below. Performance was reasonable most of the time, but when I hit a checkpoint with a lot of the buffer cache dirty it was incredibly bad. It took minutes to write everything out, compared with a few seconds for the current case, and the background writer was too sluggish as well to help. This appears to match your data.

If you compare how Oracle handles their writes and checkpoints to the Postgres code, it's obvious they have a different architecture that enables them to support sync writing usefully. I'd recommend the Database Writer Process section of http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm as an introduction for those not familiar with that; it's interesting reading for anyone tinkering with background writer code.

It would be great to compare performance of the current PostgreSQL code with a fancy multiple background writer version using the latest sync methods or AIO; there have actually been multiple updates to improve O_SYNC writes within Linux during the 2.6 kernel series that make this more practical than ever on that platform. But as you've already seen, the performance hurdle to overcome is significant, and it would have to be optional as a result. When you add all this up--having to keep the current non-sync writes around as well, needing to redesign the whole background writer/checkpoint approach around the idea of sync writes, and the OS-specific parts that would come from things like AIO--it gets real messy. Good luck drumming up support for all that when the initial benchmarks suggest it's going to be a big step back.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
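For anyone who wants to repeat that quick experiment, the whole thing reduces to changing the flags the data files are opened with, roughly like this (a Linux-flavored sketch; O_DIRECT needs _GNU_SOURCE there, requires the write buffer to be suitably aligned, and error handling is omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Open a data file for synchronous writes: each write() only returns
     * once the data has been forced out, so the later fsync() calls can
     * be removed.  O_DIRECT additionally bypasses the OS cache.
     */
    int
    open_sync(const char *path)
    {
        return open(path, O_RDWR | O_SYNC | O_DIRECT, 0600);
    }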
Re: [PATCHES] Load distributed checkpoint V3
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

> Hmm... what makes you think that sync writes is useful for Oracle and
> not for PostgreSQL?

They do more to push checkpoint-time work in advance, batch writes up more efficiently, and never let clients do the writing. All of which make for a different type of checkpoint. Like Simon points out, even if it were conceivable to mimic their design, it might not even be legally feasible.

The point I was trying to make is this: you've been saying that Oracle's writing technology has better performance in this area, which is probably true, and suggesting the cause of that was their use of O_SYNC writes. I wanted to believe that and even tested out a prototype. The reality here appears to be that their checkpoints go smoother *despite* using the slower sync writes, because they've built their design around the limitations of that write method. I suspect it would take a similar scale of redesign to move Postgres in that direction; the issues you identified (the same ones I ran into) are not so easy to resolve. You're certainly not going to move anybody in that direction by throwing a random comment into a discussion on the patches list about a feature useful *right now* in this area.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V4
On Thu, 19 Apr 2007, Heikki Linnakangas wrote:

> In the sync phase, we sleep between each fsync until enough
> time/segments have passed, assuming that the time to fsync is
> proportional to the file length.

I'm not sure that's a very good assumption. I've been making scatter plots of fsync time vs. amount written to the database for a couple of months now, and while there's a trend there, it's not a linear one based on data written.

Under Linux, to make a useful prediction about how long an fsync will take, you first need to consider how much dirty data is already in the OS cache (the "Dirty:" figure in /proc/meminfo) before the write begins, relative to the kernel parameters that control write behavior; a sketch of grabbing that number follows below. Combine that with some knowledge of the caching behavior of the controller/disk combination you're using, and it's just barely possible to make a reasonable estimate. With any less information than all that, you really have very little basis on which to guess how long it's going to take. Other operating systems are going to give completely different behavior here, which of course makes the problem even worse.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
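The kernel's dirty-memory figure is trivial to sample; something like this does it (Linux-specific, with deliberately simplistic parsing):

    #include <stdio.h>

    /*
     * Return the "Dirty:" figure from /proc/meminfo in kilobytes,
     * or -1 if it can't be read.
     */
    static long
    linux_dirty_kb(void)
    {
        FILE   *f = fopen("/proc/meminfo", "r");
        char    line[128];
        long    kb = -1;

        if (f == NULL)
            return -1;
        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
                break;
        }
        fclose(f);
        return kb;
    }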
Re: [PATCHES] Logging checkpoints and other slowdown causes
On Fri, 11 May 2007, Heikki Linnakangas wrote:

> Printing the messages at LOG-level would bring the code in line with the
> documentation, but I don't think we want to fill the log with checkpoint
> chatter unless the DBA explicitly asks for that. How about INFO?

In putting that together, I pushed everything toward the lowest DEBUG level it was possible to run at without the messages I wanted to see being completely lost. I didn't want this to get rejected just on the basis of being too chatty for regular users. Putting a single line in for each checkpoint at INFO, with all the individual progress points being at DEBUG1 or 2, would be close to optimal as far as I'm concerned. More on this below.

> Or we could add a GUC variable similar to log_connections or
> log_statement to control if the messages are printed or not, and use LOG.

If the levels for the individual messages are adjusted usefully, the ability to control whether they show up or not falls out of the existing log level adjustments; I haven't ever felt a strong need for a GUC when using this code. If, as you say, there's already a move in this direction, then fine--another GUC would be better.

> We don't need to print the times elapsed in each phase on a separate
> line, that's just derived information from the other lines, unless we
> use different log-levels for detail lines

Let me turn this around for a second and ask you this: what do you think people are going to do with this information? I can tell you what I do. I parse the logs and look for that single line with the summary information. I then take the three pieces of information it provides (MB written, time to write, time to fsync) and save them into a table; the scraping amounts to something like the fragment below. From there I generate stats, scatter plots, all sorts of useful stuff.

If you know when the checkpoint ended, and you know how long each of the pieces took, you can reconstruct the other times easily. The way you describe this it is true--the summary is redundant given the detail--but if you put yourself in the shoes of a log file parser, the other way around is easier to work with. Piecing together log entries is a pain; splitting them is easy. If I had to keep only one line out of this, it would be the one with the summary. It would be nice to have it logged at INFO.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
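The scraping itself is nothing fancy: in C it reduces to one sscanf per line against the summary format shown earlier in this thread. This assumes any log_line_prefix has already been stripped, and the names here are placeholders rather than my actual script:

    #include <stdio.h>

    /*
     * Pull the three interesting numbers out of a summary line like:
     * DEBUG1: checkpoint buffers dirty=16.7 MB (13.0%) write=174.8 ms sync=3716.5 ms
     * Returns 1 on a successful match, 0 otherwise.
     */
    static int
    parse_checkpoint_summary(const char *line,
                             double *mb, double *write_ms, double *sync_ms)
    {
        double  pct;

        return sscanf(line,
                      "DEBUG1: checkpoint buffers dirty=%lf MB (%lf%%)"
                      " write=%lf ms sync=%lf ms",
                      mb, &pct, write_ms, sync_ms) == 4;
    }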
Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages
Attached are two patches that try to recast the ideas of Itagaki Takahiro's auto bgwriter_lru_maxpages patch in the direction I think this code needs to move. Epic-length commentary follows. The original code came from before there was a pg_stat_bgwriter. The first patch (buf-alloc-stats) takes the two most interesting pieces of data the original patch collected, the number of buffers allocated recently and the number that the clients wrote out, and ties all that into the new stats structure. With this patch applied, you can get a feel for things like churn/turnover in the buffer pool that were very hard to quantify before. Also, it makes it easy to measure how well your background writer is doing at writing buffers so the clients don't have to. Applying this would complete one of my personal goals for the 8.3 release, which was having stats to track every type of buffer write. I split this out because I think it's very useful to have regardless of whether the automatic tuning portion is accepted, and I think these smaller patches make the review easier. The main thing I would recommend someone check is how am_bg_writer is (mis?)used here. I spliced some of the debugging-only code from the original patch, and I can't tell if the result is a robust enough approach to solving the problem of having every client indirectly report their activity to the background writer. Other than that, I think this code is ready for review and potentially comitting. The second patch (limit-lru) adds on top of that a constraint of the LRU writer so that it doesn't do any more work than it has to. Note that I left verbose debugging code in here because I'm much less confident this patch is complete. It predicts upcoming buffer allocations using a 16-period weighted moving average of recent activity, which you can think of as the last 3.2 seconds at the default interval. After testing a few systems that seemed a decent compromise of smoothing in both directions. I found the 2X overallocation fudge factor of the original patch way too aggressive, and just pick the larger of the most recent allocation amount or the smoothed value. The main thing that throws off the allocation estimation is when you hit a checkpoint, which can give a big spike after the background writer returns to BgBufferSync and notices all the buffers that were allocated during the checkpoint write; the code then tries to find more buffers it can recycle than it needs to. Since the checkpoint itself normally leaves a large wake of reusable buffers behind it, I didn't find this to be a serious problem. There's another communication issue here, which is that SyncOneBuffer needs to return more information about the buffer than it currently does once it gets it locked. The background writer needs to know more than just if it was written to tune itself. The original patch used a clever trick for this which worked but I found confusing. I happen to have a bunch of other background writer tuning code I'm working on, and I had to come up with a more robust way to communicate buffer internals back via this channel. I used that code here, it's a bitmask setup similar to how flags like BM_DIRTY are used. It's overkill for solving this particular problem, but I think the interface is clean and it helps support future enhancements in intelligent background writing. Now we get to the controversial part. The original patch removed the bgwriter_lru_maxpages parameter and updated the documentation accordingly. I didn't do that here. 
Now we get to the controversial part. The original patch removed the bgwriter_lru_maxpages parameter and updated the documentation accordingly. I didn't do that here. The reason is that after playing around in this area I'm not convinced yet I can satisfy all the tuning scenarios I'd like to be able to handle that way. I describe this patch as enforcing a constraint instead; it allows you to set the LRU parameters much higher than was reasonable before without having to be as concerned about the LRU writer wasting resources. I already brought up some issues in this area on -hackers ( http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php ) but my work hasn't advanced as fast as I'd hoped. I wanted to submit what I've finished anyway because I think any approach here is going to have to cope with the issues addressed in these two patches, and I'm happy now with how they're solved here. It's only a one-line delete to disable the LRU limiting behavior of the second patch, at which point it's strictly internals code with no expected functional impact, which alternate approaches might be built on.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

Index: src/backend/storage/buffer/bufmgr.c === RCS file: /d3/pgsql/cvs/pgsql-local/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.1.1.1 diff -c -r1.1.1.1 bufmgr.c *** s
Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages
> | buffer_state == BUF_REUSABLE_WRITTEN)

And that was a pain all around; I kept having to stare at the table above to make sure the code was correct. Also, in order to pass back full usage_count information I was back to either pointers or bitshifting anyway. While this particular patch doesn't need the usage count, the later ones I'm working on do, and I'd like to get this interface complete while it's being tinkered with anyway.

> Or how about moving the checks for dirty and pinned buffers from SyncOneBuffer to the callers?

There are 3 callers to SyncOneBuffer, and almost all the code is shared between them. Trying to push even just the dirty/pinned stuff back into the callers would end up being a cut and paste job that would duplicate many lines. That's on top of the fact that the buffer is cleanly locked/unlocked all in one section of code right now, and I didn't see how to move any parts of that to the callers without disrupting that clean interface.
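To show the style of interface I mean, here's a tiny self-contained mock-up; the flag names and the checks are stand-ins rather than the actual patch code:

    #include <stdio.h>

    /* Bit flags in the style of the BM_DIRTY family */
    #define BUF_WRITTEN   0x01    /* buffer was dirty and got written out */
    #define BUF_REUSABLE  0x02    /* buffer unpinned with usage_count == 0 */

    /* Stand-in for SyncOneBuffer: report findings via flag bits so each
     * caller tests exactly the conditions it cares about. */
    static int
    sync_one_buffer(int dirty, int pinned, int usage_count)
    {
        int state = 0;

        if (!pinned && usage_count == 0)
            state |= BUF_REUSABLE;
        if (dirty && !pinned)
            state |= BUF_WRITTEN;    /* pretend the write happened */
        return state;
    }

    int
    main(void)
    {
        int state = sync_one_buffer(1, 0, 0);

        if (state & BUF_WRITTEN)
            printf("wrote a buffer\n");
        if ((state & BUF_REUSABLE) && (state & BUF_WRITTEN))
            printf("cleaned a reusable buffer\n");   /* no enum decoding */
        return 0;
    }

Unlike the enum version quoted at the top, passing back another piece of state later (like the usage count) just means defining more bits instead of doubling the number of enum values.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD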
Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages
On Mon, 14 May 2007, ITAGAKI Takahiro wrote:

> BTW, your patch will cut LRU writes short, but will not encourage to do more works. So should set more aggressive values to bgwriter_lru_percent and bgwriter_lru_maxpages as defaults?

Setting a bigger default maximum is one possibility I was thinking about. Since the whole background writer setup is kind of complicated, the other thing I was working on is writing a guide on how to use the new pg_stat_bgwriter information to figure out if you need to increase bgwriter_[all|lru]_maxpages (and the other parameters too). It makes it much easier to write that if you can say "You can safely set bgwriter_lru_maxpages high because it only writes what it needs to based on your usage".
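As a sketch of the kind of check that guide would suggest (assuming the pg_stat_bgwriter columns from the buf-alloc-stats patch are in place):

    -- If backends are writing a large share of buffers themselves, the
    -- LRU writer isn't keeping up; consider raising bgwriter_lru_maxpages
    -- or shrinking bgwriter_delay.
    SELECT buffers_clean, buffers_backend, maxwritten_clean,
           round(100.0 * buffers_backend /
                 nullif(buffers_clean + buffers_backend, 0), 1)
               AS pct_written_by_backends
      FROM pg_stat_bgwriter;

A steadily climbing maxwritten_clean is the other hint that the maximum is being hit regularly.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD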
Re: [PATCHES] [DOCS] OS/X startup scripts
On Mon, 14 May 2007, Gregory Stark wrote:

> Personally I find CVS so terribly slow for large trees like Postgres that it's essential to use rsync to maintain a local CVS repository. That makes 'cvs diff' remarkably fast.

Having recently tried to get this working right, and not quite nailed it down yet, I know it would be a help to new developers if someone using this technique would write out a suggested setup/workflow for generating patches this way. Right now the best instructions out there are those that go with the buildfarm http://pgfoundry.org/docman/view.php/140/4/PGBuildFarm-HOWTO.txt and those aren't quite focused right if the goal is to work on new patches while keeping in sync with the repository.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages
On Mon, 14 May 2007, Heikki Linnakangas wrote:

> If it's safe to set it high, let's default it to infinity.

The maximum right now is 1000, and that would be a reasonable new default. You really don't want to write more than 1000 per interval anyway without taking a break for checkpoints; the more writes you do at once, the higher the chances are you'll have the whole thing stall because the OS makes you wait for a write (this is not a theoretical comment; I've watched it happen when I try to get the BGW doing too much). If someone has so much activity that they're allocating more than that during a period, they should shrink the delay instead. The kinds of systems where 1000 isn't high enough for bgwriter_lru_maxpages are going to be compelled to adjust these parameters anyway for good performance.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Automatic adjustment of bgwriter_lru_maxpages
On Mon, 14 May 2007, Tom Lane wrote:

> If you can write something like that, why do we need the parameter at all?

A couple of reasons:

- As I already mentioned in my last message, I think it's unwise to let the LRU writes go completely unbounded. I still think there should be a maximum, and if there is one it should be tunable. You can get into situations where the only way to get the LRU writer to work at all is to set the % to scan fairly high, but that exposes you to way more writes than you might want per interval in situations where buffers to write are easy to find.

- There is considerable coupling between how the LRU and the all background writers work. There are workloads where the LRU writer is relatively ineffective, and only the all one really works well. If there is a limiter on the writes from the all writer, but not on the LRU, admins may not be able to get the balance between the two they want. I know I wouldn't.

- Just because I can advise what is generally the right move, that doesn't mean it's always the right one. Someone may notice that the maximum pages written limit is being nailed and not care. The last system I really got deep into the background writer mechanics on, it could be very effective at improving performance and reducing checkpoint spikes under low to medium loads. But under heavy load, it just got in the way of the individual backends running, which was absolutely necessary in order to execute the LRU mechanics (usage_count--) so less important buffers could be kicked out. I would like people to still be able to set a tuning such that the background writers were useful under average loads, but didn't ever try to do too much. It's much more difficult to do that if bgwriter_lru_maxpages goes away.

I realized recently the task I should take on here is to run some more experiments with the latest code and pass along suggested techniques for producing/identifying the kind of problem conditions I've run into in the past; then we can see if other people can reproduce them. I got a new 8-core server I need to thrash anyway and will try to do just that starting tomorrow. For all I know my concerns are strictly a rare edge case. But since the final adjustments to things like whether there is an upper limit or not are very small patches compared to what's already been done here, I sent in what I thought was ready to go because I didn't want to hold up reviewing the bulk of the code over some of these fine details.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] COPY-able csv log outputs
I got a chance to review this patch over the weekend. Basic API seems good, met all my requirements, no surprises with how the GUC variable controlled the feature.

The most fundamental issue I have with the interface is that using COPY makes it difficult to put any unique index on the resulting table. I like to have a unique index on my imported log table because it rejects the dupe records if you accidentally import the same section of log file twice. COPY tosses the whole thing if there's an index violation, which is a problem during a regular import because you will occasionally come across lines with the same timestamp that are similar in every way except for their statement; putting an index on the timestamp+statement seems impractical. I've had a preference for INSERT from the beginning here that this reinforces. I'm planning to just work around this issue by doing the COPY into a temporary table and then INSERTing from there (sketched below). I didn't want to just let the concern pass by without mentioning it though. It crosses my mind that inserting some sort of unique log file line ID number would prevent the dupe issue and make for better ordering (it's possible to have two lines with the same timestamp show up in the wrong order now); I'm not sure that's a practical idea to consider.

The basic coding of the patch seemed OK to me, but someone who is much more familiar than myself with the mechanics of pipes should take a look at that part of the patch before committing; it's complicated code and I can't comment on it. There are some small formatting issues that need to be fixed, particularly in the host+port mapping. I can fix those myself and submit a slightly updated patch. There are some documentation improvements I want to make before this goes in as well.

The patch is actually broken fairly hard right now because of the switch from INSERT to COPY FROM CSV as the output format at the last minute. It outputs missing fields as NULL (fine for INSERT), which chokes the CSV import when the session_start timestamp is missing. All of those NULL values need to be replaced with nothing for proper CSV syntax; there should just be the comma for the next field. I worked around this with

    copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';

I can fix that too when I'm revising. I plan to have a version free of obvious bugs ready to re-submit by next weekend.
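The temp-table workaround would look something like this; the table and column names here are hypothetical, and the NOT EXISTS test is what quietly drops anything already imported:

    CREATE TEMP TABLE pglog_import (LIKE pglog);
    COPY pglog_import FROM '/opt/pgsql/testlog.csv' WITH CSV NULL AS 'NULL';
    INSERT INTO pglog
        SELECT i.*
          FROM pglog_import i
         WHERE NOT EXISTS (SELECT 1 FROM pglog p
                            WHERE p.log_time = i.log_time
                              AND p.statement = i.statement);

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD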
Re: [PATCHES] COPY-able csv log outputs
On Sun, 20 May 2007, Andrew Dunstan wrote:

> Does the format not include the per-process line number?

It does not, and I never noticed that among the prefix possibilities; it never seemed important before! The combination of timestamp/pid/line (%t %p %l) looks like a useful and unique key here, so I'll add another column for the line number to the output. Thanks for pointing that out; I can finish cleaning up all the functional implementation work on this patch myself now.
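With that column added, the unique key could be created along these lines (again with hypothetical table and column names):

    -- %t %p %l: timestamp, process ID, per-process line number
    CREATE UNIQUE INDEX pglog_line_key
        ON pglog (log_time, process_id, line_num);

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD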
Re: [PATCHES] COPY-able csv log outputs
On Sun, 20 May 2007, Andrew Dunstan wrote:

>> I've had a preference for INSERT from the beginning here that this reinforces.
> COPY is our standard bulk insert mechanism. I think arguing against it would be a very hard sell.

Let me say my final piece on this subject... if I considered this data to be strictly bulk insert, then I'd completely agree here. Most of the really interesting applications I was planning to build on top of this mechanism are more interactive than that though. Here's a sample (the signaling for it is sketched below):

- Write a daemon that lives on the server, connects to a logging database, and pops into an idle loop based on LISTEN.
- A client app wants to see the recent log files. It uses NOTIFY to ask the daemon for them and LISTENs for a response.
- The daemon wakes up, reads all the log files since it last did something, and appends those records to the log file table. It sends out a NOTIFY to say the log file table is current.

That enables remote clients to grab the log files from the server whenever they please, so they can actually monitor themselves. Benchmarking is the initial app I expect to call this, and with some types of tests I expect the daemon to be importing every 10 minutes or so.

Assuming a unique index on the data to prevent duplication is a required feature, I can build this using the COPY format logs as well, but that requires that I either a) am 100% perfect in making sure I never pass over the same data twice, which is particularly annoying when the daemon gets restarted, or b) break the COPY into single lines and insert them one at a time, at which point I'm not bulk loading at all. If these were INSERT statements instead, I'd have a lot more tolerance for error, because the worst problem I'd ever run into is spewing some unique key violation errors into the logs if I accidentally imported too much. With COPY, any mistake or synchronization issue and I lose the whole import.

I don't mean to try and stir this back up again as an argument (particularly not on this list). There are plenty of other apps where COPY is clearly the best approach, you can easily make a case that my app is a fringe application rather than a mainstream one, and on balance this job is still far easier than my current approach of parsing the logs. I just wanted to give a sample of how using COPY impacts the dynamics of how downstream applications will have to work with this data, so you can see that my contrary preference isn't completely random here.
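The LISTEN/NOTIFY skeleton for that design is tiny; the channel names are made up, and at this point NOTIFY carries no payload, just the event:

    -- Daemon session: sleep until a client asks for fresh logs
    LISTEN logs_wanted;
    -- ...on wakeup: import the new log lines, then announce them:
    NOTIFY logs_current;

    -- Client session: ask, then wait for the table to be current
    LISTEN logs_current;
    NOTIFY logs_wanted;

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD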
Re: [PATCHES] Seq scans status update
On Mon, 28 May 2007, Tom Lane wrote:

> But maybe that could be fixed if the clock sweep doesn't touch the usage_count of a pinned buffer. Which in fact it may not do already --- didn't look.

StrategyGetBuffer doesn't care whether the buffer is pinned or not; it decrements the usage_count regardless.
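A toy model of the current sweep shows the behavior in question (simplified; the real code also keeps a trycounter so it can fail cleanly if every buffer is pinned):

    /* Simplified model of StrategyGetBuffer's clock sweep: usage_count
     * is decremented whether or not the buffer is pinned. */
    typedef struct
    {
        int refcount;       /* pin count */
        int usage_count;
    } BufDesc;

    static int
    clock_sweep(BufDesc *buffers, int nbuffers, int *hand)
    {
        for (;;)
        {
            BufDesc *buf = &buffers[*hand];

            *hand = (*hand + 1) % nbuffers;
            if (buf->refcount == 0 && buf->usage_count == 0)
                return (int) (buf - buffers);   /* found a victim */
            if (buf->usage_count > 0)
                buf->usage_count--;             /* pinned or not */
        }
    }

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD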
Re: [PATCHES] Logging checkpoints and other slowdown causes
I'll take another stab at refining this can of worms I opened.

The one thing I noticed on a quick review is that it's almost possible to skip all the calls to gettimeofday if log_checkpoints is off now. I'd like to make that a specific goal, because that will make me feel better that adding this code has almost no performance impact relative to now unless you turn the feature on. I agree with Simon that tracking create/drop separately is unnecessary.

As for why all the timing info is in ms: given the scale of the numbers typically encountered, I found it easier to work with. I originally wanted resolution down to 0.1 ms if the underlying OS supports it, which means 4 figures to the right of the decimal point if the unit were switched to seconds. Quite often the times reported are less than 100 ms, so you'll normally be dealing with the fractional part of a second. If we take Heikki's example:

    LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=96.8 ms sync=32.0 ms

And switch it to seconds:

    LOG: checkpoint complete; buffers written=3.1 MB (9.6%) write=0.0968 s sync=0.0320 s

I don't find that as easy to work with. The only way a timing in seconds would look OK is if the resolution of the whole thing is reduced to ms, which then makes 3 decimal points--easy to read as ms instead. Having stared at a fair amount of this data now, that's probably fine; I'll collect up some more data on it from a fast server this week to confirm whether it's worthless precision or worth capturing.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] WIP: 2nd-generation buffer ring patch
On Tue, 29 May 2007, Tom Lane wrote:

> Do we have any decent way of measuring the effectiveness of the clock-sweep allocation algorithm?

I put a view on top of the current pg_buffercache (now that it includes usage_count) that shows what the high usage_count buffers consist of. Since they were basically what I hoped for (like plenty of index blocks on popular tables) that seemed a reasonable enough measure of effectiveness for my purposes. I briefly looked into adding some internal measurements in this area, like how many buffers are scanned on average to satisfy an allocation request; that would actually be easy to add to the buffer allocation stats part of the auto bgwriter_lru_maxpages patch I submitted recently.

Based on my observations of buffer cache statistics, the number of pinned buffers at any time is small enough that in a reasonably sized buffer cache, I wouldn't expect a change in the pinned usage_count behavior to have any serious impact. With what you're adjusting, the only time I can think of that there would be a noticeable shift in fairness would be if one's buffer cache was very small relative to the number of clients, which is kind of an unreasonable situation to go out of your way to accommodate.
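The view itself is nothing fancy; one way it might look (the usagecount threshold here is arbitrary, and matching pg_class by relfilenode only resolves relations in the current database):

    CREATE VIEW high_usage_buffers AS
        SELECT c.relname, b.usagecount, count(*) AS buffers
          FROM pg_buffercache b
               JOIN pg_class c ON b.relfilenode = c.relfilenode
         WHERE b.usagecount >= 3
         GROUP BY c.relname, b.usagecount
         ORDER BY b.usagecount DESC, count(*) DESC;

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD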
Re: [PATCHES] Logging checkpoints and other slowdown causes
On Tue, 29 May 2007, Heikki Linnakangas wrote:

> The checkpoint will take at least a couple of seconds on any interesting system, so 0.1 s resolution should be enough IMHO.

You may be underestimating the resources some interesting systems are willing to put into speeding up checkpoints. I'm sometimes dumping into a SAN whose cache is bigger than the shared_buffers cache in the server, and 0.1 s isn't really enough resolution in that situation. A second is a really long checkpoint there. Since even that's limited by fiber-channel speeds, I know it's possible to do better than what I'm seeing with something like a PCIe host adapter having on-board cache in the GB range (which isn't that expensive nowadays). Also, even if the checkpoint total takes seconds, much of that is in the sync phase; the write time can still be in the small number of ms range, and I wouldn't want to see that truncated too much. Anyway, I have a bunch of data on this subject being collected at this moment, and I'll rescale the results based on what I see after analyzing that this week.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] COPY-able csv log outputs
On Fri, 1 Jun 2007, Andrew Dunstan wrote:

>> Since the rotation size feature causes other issues anyway that make importing more complicated, documenting the issue seemed sufficient.
> What are the other issues? I'm not happy about producing files with split lines.

Just that it's fairly simple and predictable to know what your log files are going to be named and when they'll roll over if you use something like a date-based naming convention--but when you add size-based rotation into the mix, figuring out what files you need to import and when you should import them gets more complicated. Clearly fixing this issue altogether would be better, and I gather the problem may extend to any time there is a switch to a new log file; my "workaround" doesn't appear good enough anyway. I'm very glad I caught and mentioned this now. Because of the extra naming/import complexity, it still might be worthwhile to suggest people not combine size-based rotation and the csvlog though.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] trace_checkpoint parameter patch
On Tue, 12 Jun 2007, Satoshi Nagayasu wrote:

> Here is a brand new patch to log a checkpointing load information to tune the bgwriter parameter.

There is a work in progress patch that logs this and additional checkpoint information being reviewed in hopes of making it into the 8.3 build. See http://archives.postgresql.org/pgsql-patches/2007-05/msg00455.php for the latest version. The idea of using pg_rusage_init is a new one though; I hadn't thought the CPU usage info was interesting enough to figure out how to collect it. The way the patch mentioned above works, it would be hard to squeeze that into the line usefully for formatting reasons.

> I attempted to print block in / out using getrusage(), but I couldn't get them because they were always zero (on my linux). I don't know what's wrong,

but the I/O here is pretty simple: the checkpoint wrote some amount of data that you can compute the size of easily within the code knowing the block size. That's already done in the patch under review.

If you're interested in this area, you should check out the pg_stat_bgwriter feature already in the 8.3 CVS, look through the pgsql-hackers archives for the discussion this week on the topic "Controlling Load Distributed Checkpoints", and check out the "Automatic adjustment of bgwriter_lru_maxpages" patch whose latest version is at http://archives.postgresql.org/pgsql-patches/2007-05/msg00142.php

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 22 Jun 2007, Tom Lane wrote:

> Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. The entire background writer mess needs a rewrite, and the best way to handle that is going to shift considerably with LDC applied.

As the person who was complaining about corner cases I'm not in a position to talk more explicitly about, I can at least summarize my opinion of how I feel everyone should be thinking about this patch, and you can take what you want from that.

In the context of getting an 8.3 release finalized, I think you should be building a LDC implementation that accomplishes the main goal of spreading the checkpoints out, which is clearly working, and erring on the side of more knobs in cases where it's unsure whether they're needed or not. It's clear what my position is on which non-obvious knobs I think are important for some people. Which is worse: putting in a tuning setting that it's discovered everybody just uses the same value for, or discovering after release that there's a use-case for that setting, and by hard-coding it you've made the DBAs for a class of applications you didn't think about miserable? In the cases where there's good evidence so far of the right setting, just make that the default, and the only harm is GUC bloat. Nothing should be done that changes the existing behavior if the LDC feature is turned off, so anything more obtrusive to the background writer is right out.

Make reducing the knobs, optimizing the default behavior, and rewriting the background writer to better fit into its new context a major goal of 8.4. I know I've got a whole notebook full of stuff on that topic I've been ignoring so as not to distract you guys from getting 8.3 done. That's the low risk plan, and the design/beta/release period here is short enough that I think going too experimental beyond that is a bad idea. To pick an example, when I read this idea from Heikki:

> You would have a checkpoint running practically all the time, and you would use checkpoint_timeout/checkpoint_segments to control how long it takes... If we do that, we should remove bgwriter_all_* settings

Whether or not I think this is an awesome idea, the very idea of a change that big at this point gives me the willies. Just off the top of my head, there's a whole class of issues involving recycling xlog segments this would introduce that I would be really unhappy with the implications of. Did anyone else ever notice that when a new xlog segment is created, the write to clear it out doesn't happen via direct I/O like the rest of the xlog writes do? That write goes through the regular buffered path instead. The checkpoint logging patch I submitted logs when this happens specifically because that particular issue messed with some operations, and I found it important to be aware of.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 22 Jun 2007, Tom Lane wrote:

> Greg had worried about being able to turn this behavior off, so we'd still need at least a bool, and we might as well expose the fraction instead. I agree with removing the non-LRU part of the bgwriter's write logic though

If you accept that being able to turn LDC off is important, people who aren't turning it on may still want the existing bgwriter_all_* settings so they can keep running things the way they are now. It's certainly reasonable to skip that code path when doing things the LDC way.

> True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though.

If you're not, I think you should be. Keeping that replay interval time down was one of the reasons why the people I was working with were displeased with the implications of the very spread out style of some LDC tunings. They were already unhappy with the implied recovery time of how high they had to set checkpoint_segments for good performance, and making it that much bigger aggravates the issue. Given a knob where the LDC can be spread out a bit but not across the entire interval, that makes it easier to control how much expansion there is relative to the current behavior.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
This message is going to come off as kind of angry, and I hope you don't take that personally. I'm very frustrated with this whole area right now but am unable to do anything to improve that situation.

On Fri, 22 Jun 2007, Tom Lane wrote:

> If you've got specific evidence why any of these things need to be parameterized, let's see it.

All I'm trying to suggest here is that you might want to pause and consider whether you want to make a change that might break existing, happily working installations based just on the small number of tests that have been done on this patch so far. A nice stack of DBT2 results is very informative, but the DBT2 workload is not everybody's workload. Did you see anybody else predicting issues with the LDC patch on overloaded systems, as are starting to be seen in the 150 warehouse/90% latency figures in Heikki's most recent results? The way I remember that, it was just me pushing to expose that problem, because I knew it was there from my unfortunately private tests, but it was difficult to encounter the issue on other types of benchmarks (thanks again to Greg Stark and Heikki for helping with that). But that's fine; if you want to blow off the rest of my suggestions now just because the other things I'm worried about are also very hard problems to expose and I can't hand you over a smoking gun, that's your decision.

> Personally I think that we have a bad track record of exposing GUC variables as a substitute for understanding performance issues at the start, and this approach isn't doing any favors for DBAs.

I think this project has an awful track record of introducing new GUC variables and never having a plan to follow through with a process to figure out how they should be set. The almost complete lack of standardization and useful tools for collecting performance information about this database boggles my mind, and you're never going to get the performance related sections of the GUC streamlined without it. We were just talking about the mess that is effective_cache_size recently. As a more topical example here, the background writer was officially released in early 2005, with a bizarre collection of tunables. I had to help hack on that code myself, over two years later, to even start exposing the internal statistics data needed to optimize it correctly. The main reason I can't prove some of my concerns is that I got so side-tracked adding the infrastructure needed to even show they exist that I wasn't able to nail down exactly what was going on well enough to generate a public test case before the project that exposed the issues wrapped up.

> Right at the moment the best thing to do seems to be to enable LDC with a low minimum write rate and a high target duration, and remove the thereby-obsoleted "all buffers" scan of the existing bgwriter logic.

I have reason to believe there's a set of use cases where a more accelerated LDC approach than everyone seems to be leaning toward is appropriate, which would then reinvigorate the need for the all-scan BGW component. I have a whole new design for the non-LRU background writer that fixes most of what's wrong with it that I'm waiting for 8.4 to pass out and get feedback on, but if everybody is hell bent on just yanking the whole thing out in preference to these really lazy checkpoints, go ahead and do what you want. My life would be easier if I just tossed all that out and forgot about the whole thing, and I'm real close to doing just that right now.
>> Did anyone else ever notice that when a new xlog segment is created, the write to clear it out doesn't happen via direct I/O like the rest of the xlog writes do?
> It's not supposed to matter, because that path isn't supposed to be taken often.

Yes, but during the situations it does happen in--when checkpoints take so much longer than expected that more segments have to be created, or in an archive logger failure--it badly impacts an already unpleasant situation.

>> there's a whole class of issues involving recycling xlog segments this would introduce I would be really unhappy with the implications of.
> Really? Name one.

You already mentioned expansion of the log segments used, which is a primary issue. Acting like all the additional segments used for some of the more extreme checkpoint spreading approaches are without cost is completely unrealistic IMHO. In the situation I just described above, I also noticed that the way O_DIRECT sync writes get mixed with buffered WAL writes seems to cause some weird I/O scheduling issues in Linux that can make worst-case latency degrade. But since I can't prove that, I guess I might as well not even mention that either.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Sun, 24 Jun 2007, Simon Riggs wrote:

> I can't see why anyone would want to turn off smoothing: If they are doing many writes, then they will be effected by the sharp dive at checkpoint, which happens *every* checkpoint.

There are service-level agreement situations where a short and sharp disruption is more acceptable than a more prolonged one. As some of the overloaded I/O tests are starting to show, LDC may be a backward step for someone in that sort of environment. I am not a fan of introducing a replacement feature based on what I consider too limited testing, and I don't feel this one has been beaten on long enough yet to start pruning features that would allow better backward compatibility/transitioning. I think that's introducing an unnecessary risk to the design.

> We won't need to set checkpoint_segments so high, since performance is smoothed across checkpoints by LDC and its OK to allow them more frequently. So this concern need not apply with LDC.

Performance *should* be smoothed across checkpoints by LDC, and my concern *may* not apply. I think assuming it will always help based on the limited number of test results presented so far is extrapolation.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Tom Lane wrote:

> I'm not sure why you hold such strong allegiance to the status quo. We know that the status quo isn't working very well.

Don't get me wrong here; I am a big fan of this patch, think it's an important step forward, and it's exactly the fact that I'm so shell shocked from abuse by the status quo that I'm still mixed up in this mess (I really should be ignoring the lot of you and writing new code instead). LDC certainly makes things better in almost every case. My "allegiance" comes from having seen a class of transactions where LDC made things worse on a fast/overloaded system, in that it made some types of service guarantees harder to meet, and I just don't know who else might run into problems in that area. I'm worried that if it's not adjustable, you're introducing a risk that you'll take a step backward for some of this code's users, and that will be hard to undo given the way releases are structured here.

I spent some time trading stocks for a living. There are sometimes situations you can get into there where there is a tiny chance that something very bad can happen with a trade, and many people get wiped out by such things. If it's possible in that situation to remove that risk with something inexpensive, you do it, even though the net expected value of the change might be slightly negative. This seems like such a situation to me. If it's possible to take away the risk of other people running into an unforeseen problem with the LDC patch just by keeping a knob that's already there, unless that's an expensive operation my opinion is that you should pick a good default but not remove it yet.

> And if you think that the current code had enormous amounts of testing before it went in, I've got to disillusion you :-(

It's having been on the painful receiving end of that fact that makes me so paranoid now :)

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Sun, 24 Jun 2007, Simon Riggs wrote:

> Greg can't choose to use checkpoint_segments as the limit and then complain about unbounded recovery time, because that was clearly a conscious choice.

I'm complaining only because everyone seems content to wander in a direction where the multiplier on checkpoint_segments, for how many segments are actually active at once, will go up considerably, which can make a known problem (recovery time) worse. I'm thinking about things like how the release notes for 8.3 are going to address this. Is the plan to say something like "whatever you set checkpoint_segments to before so that you were happy with the crash recovery time, make sure to re-run all those tests again because we made a big change there; it may take a lot longer now with the same value, and too bad if you don't like it because there's no way to get the old behavior back. Suck it up, decrease checkpoint_segments, and get used to having more checkpoints if you have to keep the same recovery time"?

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> Greg, is this the kind of workload you're having, or is there some other scenario you're worried about?

The way transitions between completely idle and all-out bursts happen was one problematic area I struggled with. Since the LRU point doesn't move during the idle parts, and the lingering buffers have a usage_count>0, the LRU scan won't touch them; the only way to clear out a bunch of dirty buffers leftover from the last burst is with the all-scan. Ideally, you want those to write during idle periods so you're completely clean when the next burst comes. My plan for the code I wanted to put into 8.4 one day was to have something like the current all-scan that defers to the LRU and checkpoint, such that if neither of them is doing anything it would go searching for buffers it might blow out. Because the all-scan mainly gets in the way under heavy load right now, I've only found mild settings helpful, but if it had a bit more information about what else was going on it could run much harder during slow spots. That's sort of the next stage to the auto-tuning LRU writer code in the grand design floating through my head.

As a general comment on this subject, a lot of the work in LDC presumes you have an accurate notion of how close the next checkpoint is. On systems that can dirty buffers and write WAL really fast, I've found hyper bursty workloads are a challenge for it to cope with. You can go from thinking you have all sorts of time to stream the data out to discovering the next checkpoint is coming up fast in only seconds. In that situation, you'd have been better off had you been writing faster during the period preceding the burst, when the code thought it should be "smooth"[1]. That falls into the category of things I haven't found a good way for other people to test (I happened to have an internal bursty app that aggravated this area to use).

[1] This is actually a reference to "Yacht Rock", one of my favorite web sites: http://www.channel101.com/shows/show.php?show_id=152

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the hand isn't moving, it keeps scanning the same buffers over and over again. You can crank it all the way up to 100%, though, in which case it would work, but that starts to get expensive CPU-wise.

In addition to being a CPU pig, that still won't necessarily work, because of the way the LRU writer ignores things with a non-zero usage_count. If it's dirty, it's probably been used recently as well.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> Please describe the class of transactions and the service guarantees so that we can reproduce that, and figure out what's the best solution.

I'm confident you're already moving in that direction by noticing how the 90th percentile numbers were kind of weird with your 150 warehouse DBT2 tests, and I already mentioned how that could be usefully fleshed out by more tests during beta. That number is the kind of service guarantee I'm talking about--if before 90% of transactions were <4.5 ms, but now that number is closer to 6 ms, that could be considered worse performance by some service metrics even if the average and worst-case performance were improved. The only thing I can think of, if you wanted to make the problem more like what I was seeing, would be switching the transaction mix around to do more UPDATEs relative to the other types of transactions; having more of those seemed to aggravate my LDC-related issues because they leave a different pattern of dirty+used buffers around than other operations.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Tom Lane wrote:

> right now, BgBufferSync starts over from the current clock-sweep point on each call --- that is, each bgwriter cycle. So it can't really be made to write very many buffers without excessive CPU work. Maybe we should redefine it to have some static state carried across bgwriter cycles

The LRU portion restarts like that, and it's certainly not optimal. But the auto-tuning LRU patch that's already near application makes this much less of an issue, because it only does real work when buffers have been allocated, so the sweep point will have moved along. I'll add this idea to my list of things that would be nice to have as part of a larger rewrite; I think it's more trouble than it's worth to chase right now.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Tue, 26 Jun 2007, Tom Lane wrote:

> I have no doubt that there are scenarios such as you are thinking about, but it definitely seems like a corner case that doesn't justify keeping the all-buffers scan. That scan is costing us extra I/O in ordinary non-corner cases, so it's not free to keep it.

And scenarios I'm concerned about but can't diagram as easily fall into this category as well. I agree that an LDC-enabled config would ship with the all-buffers scan turned off as redundant and wasteful, in which case the only cost of keeping it is code baggage. But the fact that there are corner cases floating around this area is what makes me feel that removing it altogether is still a bit premature.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, final patch
On Tue, 26 Jun 2007, Gregory Stark wrote:

> What exactly happens if a checkpoint takes so long that the next checkpoint starts. Aside from it not actually helping is there much reason to avoid this situation? Have we ever actually tested it?

More segments get created, and because of how they are cleared at the beginning this causes its own mini I/O storm, through the same buffered write channel the checkpoint writes are going into (which may or may not be the same way normal WAL writes go, depending on whether you're using O_[D]SYNC WAL writes). I've seen some weird and intermittent breakdowns from the contention that occurs when this happens, and it's certainly something to be avoided. To test it you could just use a big buffer cache, write like mad to it, and make checkpoint_segments smaller than it should be for that workload. It's easy enough to kill yourself exactly this way right now though, and the fact that LDC gives you a parameter to aim this particular foot-gun more precisely isn't a big deal IMHO as long as the documentation is clear.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Tue, 26 Jun 2007, Tom Lane wrote:

> I'm not impressed with the idea of writing buffers because we might need them someday; that just costs extra I/O due to re-dirtying in too many scenarios.

This is kind of an interesting statement to me because it really highlights the difference in how I think about this problem from how you see it. As far as I'm concerned, there's a hierarchy of I/O the database needs to finish that goes like this:

1) Client back-end writes (blocked until a buffer appears)
2) LRU writes so (1) doesn't happen
3) Checkpoint writes
4) Dirty pages with a non-zero usage count

In my view of the world, there should be one parameter for a target rate of how much I/O you can stand under normal use, and the background writer should work its way as far down this chain as it can until it meets that (see the sketch below). If there's plenty of clean buffers for the expected new allocations and there's no checkpoint going on, by all means write out some buffers we might re-dirty if there's I/O to spare. If you write them twice, so what? You didn't even get to that point as an option until all the important stuff was taken care of and the system was near idle.

The elimination of the all-scan background writer means that true hot and dirty spots in the buffer cache, like popular index blocks on a heavily updated table that never get a zero usage_count, are never going to be written out other than as part of the checkpoint process. That's OK for now, but I'd like it to be the case that one day the database's I/O scheduling would eventually get to those, in order to optimize performance in the kind of bursty scenarios I've been mentioning lately.
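A minimal sketch of that budget idea; every name here is hypothetical, and this is the shape of the logic rather than anything committed:

    /* Stub tiers; each returns how many writes it actually used up */
    static int write_lru_buffers(int budget)        { (void) budget; return 0; }  /* tier 2 */
    static int write_checkpoint_buffers(int budget) { (void) budget; return 0; }  /* tier 3 */
    static int write_hot_dirty_buffers(int budget)  { (void) budget; return 0; }  /* tier 4 */

    static void
    bgwriter_cycle(int io_target)
    {
        int budget = io_target;    /* the one knob: tolerable background I/O */

        /* Tier 2 runs first so tier 1 (blocked backends) never happens */
        budget -= write_lru_buffers(budget);
        if (budget > 0)
            budget -= write_checkpoint_buffers(budget);
        if (budget > 0)
            write_hot_dirty_buffers(budget);  /* may be re-dirtied: last */
    }

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD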
Re: [PATCHES] Load Distributed Checkpoints, final patch
On Tue, 26 Jun 2007, Heikki Linnakangas wrote:

> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg Smith's suggestion just to see what happens, but I doubt that will change my mind on the above decisions.

I don't either; at worst I'd expect a small documentation update, perhaps with some warnings based on what's discovered there. The form you've added checkpoint_completion_target in is sufficient to address all the serious concerns I had; I can turn it off, I can smooth just a bit without increasing recovery time too much, or I can go all out smooth. Certainly no one should consider waiting for the tests I asked you about a hurdle to getting this patch committed; slowing that down was never my intention in bringing that up. I'm just curious to see if anything scurries out of some of the darker corners in this area when they're illuminated. I'd actually like to see this get committed relatively soon because there are two interleaved merges stuck behind this one (the more verbose logging patch and the LRU modifications).

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load Distributed Checkpoints, final patch
On Wed, 27 Jun 2007, Tom Lane wrote:

> Also, the question of redesigning the bgwriter's LRU scan is still open. I believe that's on Greg's plate, too.

Greg's plate was temporarily fried after his house was hit by lightning yesterday. I just got everything back on-line again, so no coding progress, but I think I finished assimilating your "epiphany" during that time. Now I realize that what you're suggesting is that under healthy low-load conditions, the LRU really should be able to keep up right behind the clock sweep point. Noting how far behind it is serves as a measurement of it failing to match the rate at which buffers that could be re-used are being dirtied, and the only question is how fast and far it should try to drive the point it has cleaned to forward when that happens.

> Once you've built up enough XLOG segments, the system isn't too bad about recycling them, but there will be a nasty startup transient where foreground processes have to stop and make the things.

Exactly. I found it problematic in four situations:

1) Slow checkpoint doesn't finish in time and new segments are being created while the checkpoint is also busy (this is the really bad one)
2) Archive logger stops doing anything (usually because the archive disk is filled) and nothing gets recycled until that's fixed
3) checkpoint_segments is changed, so then performance is really sluggish for a bit until all the segments are built back up again
4) You ran an early manual checkpoint, which doesn't seem to recycle as many segments usefully

Any change that would be more proactive about creating segments in these situations than the current code would be a benefit, even though these are not common paths people encounter.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Checkpoint logging, revised patch
On Fri, 29 Jun 2007, Heikki Linnakangas wrote:

> LOG: checkpoint complete; buffers written=5869 (35.8%); write=2.081 s, sync=4.851 s, total=7.066 s

My original patch converted the buffers written to MB. Easier to estimate MB/s by eye; I really came to hate multiplying by 8K. And people who have multiple boxes with different BLCKSZ could merge their logs together and not have to worry about adjusting for that. I found this simpler for presenting the results to non-PostgreSQL people without having to explain the buffer size. It's easy for any IT person to follow; the idea I was working toward was "see, this log entry shows the long pauses are because it's writing 700MB of data all at once here, right next to that statement that took 10 seconds we found with log_min_duration_statement". The current equivalent of what I had would be

    CheckpointStats.ckpt_bufs_written * BLCKSZ / (1024 * 1024)

formatted as "%.1f MB".

One thing that's missing, that I originally hoped to achieve with this, is logging the cause of a checkpoint. The way pg_stat_bgwriter keeps two separate counts for requested (which includes timeouts) and required (out of segments) checkpoints now satisfies the main reason I originally collected that info. I felt it was critical when I wrote that patch for it to be possible to distinguish between which of the two was happening more; it's only in the nice to have category now.

Now, onto some slightly different things here, which were all aimed at developer-level troubleshooting. With the nice reworking for logging checkpoints better, there were three tiny things my original patch did that got lost along the way that I'd suggest Tom might want to consider putting back during the final apply. I'll include mini pseudo-diffs here for those so no one has to find them in my original patch; they're all one-liners:

1) Log every time a new WAL file is created, which ties into the recent discussion here of that being a possible performance issue. At least you can look for it happening this way:

    src/backend/access/transam/xlog.c
    --- 1856,1863
                     (errcode_for_file_access(),
                      errmsg("could not create file \"%s\": %m", tmppath)));
    +     ereport(DEBUG2, (errmsg("WAL creating and filling new file on disk")));
          /*
           * Zero-fill the file. We have to do this the hard way to ensure that all

2) Add a lower-level DEBUG statement when autovacuum has finished, which helped me in several cases figure out if that was the cause of a problem (when really doing low-level testing, I would see a vacuum start, not know if it was done, and then wonder if that was the cause of a slow statement):

    *** src/backend/postmaster/autovacuum.c
    --- 811,814
          do_autovacuum();
    +     ereport(DEBUG2,
    +             (errmsg("autovacuum: processing database \"%s\" complete", dbname)));
      }

3) I fixed a line in postmaster.c so it formats fork PIDs the same way most other log statements do; most statements report it as (PID %d) and the difference in this form seemed undesirable (I spent a lot of time at DEBUG2 and these little things started to bug me):

    *** src/backend/postmaster/postmaster.c
    *** 2630,2636
          /* in parent, successful fork */
          ereport(DEBUG2,
     !            (errmsg_internal("forked new backend, pid=%d socket=%d",
                                   (int) pid, port->sock)));
    --- 2630,2636
          /* in parent, successful fork */
          ereport(DEBUG2,
     !            (errmsg_internal("forked new backend (PID %d) socket=%d",
                                   (int) pid, port->sock)));

Little stuff, but all things I've found valuable on several occasions, which suggests eventually someone else may appreciate them as well.
-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Checkpoint logging, revised patch
On Sat, 30 Jun 2007, Heikki Linnakangas wrote:

> I don't think we currently use MB in any other log messages. If we go down that route, we need to consider switching to MB everywhere.

Are there a lot of messages that speak in terms of buffer pages though? Other than the debug-level stuff, which this checkpoint message has now been elevated out of, I don't recall seeing enough things that spoke in those terms to consider it a standard. Someone feel free to correct me if I'm wrong here.

> Did you check out log_autovacuum? Doesn't it do what you need?

I have not (most of this patch was done against 8.2 and then ported forward). If that already shows clearly the start and end of each autovacuum section, ignore that I even brought this up.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
[PATCHES] Tracking buffer allocation stats (new LRU method base code)
Now that all the other checkpoint-related patches seem to have settled down, I've been busy working my way through refreshing and fixing merge issues in what started as Itagaki Takahiro's auto bgwriter_lru_maxpages patch. I split that into two patches before, and am now finished with what I can do revising and testing the first of those. The attached patch adds counters for the number of buffers allocated and the number written out by backends, and exposes all of that via pg_stat_bgwriter. Here's a sample from after a period of benchmarking; the buffers_backend and buffers_alloc are the two new things here:

     checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean
    -------------------+-----------------+--------------------+---------------
                    15 |             274 |            7616495 |       6737430

     buffers_backend | buffers_alloc | maxwritten_clean
    -----------------+---------------+------------------
            35398659 |      24389383 |            91631

To show how helpful this is, what I've been doing with all this is taking snapshots of the structure at the beginning and end of an individual benchmark test, then producing the delta to see the balance of who wrote what during the test (the technique is sketched below). Here's an example of a test running 200,000 of the UPDATE statements from the pgbench test. The first run had no background writer, while the second had a moderately active one, and you can see that the counting all works as expected--and that you can learn quite a bit about how effective the background writer cleaner was from these numbers (obviously I had checkpoint_segments set high so there weren't any checkpoints during the test):

     clients | tps  | chkpts | buf_check | buf_clean | buf_backend | buf_alloc
    ---------+------+--------+-----------+-----------+-------------+-----------
           1 | 1487 |      0 |         0 |         0 |       70934 |     85859
           1 | 1414 |      0 |         0 |     39005 |       38542 |    100963

This patch was last submitted here: http://archives.postgresql.org/pgsql-patches/2007-05/msg00142.php At that time, Itagaki and Heikki seemed to approve the basic direction I'd gone and how I'd split the original code into monitoring and functional pieces. The differences between that version of the patch and the attached one are that I fixed the race-condition bug and terminology issue Heikki noticed, along with the merge to current HEAD.

Rather than wait until I'd finished testing the next layer on top of this (retuning the automatic was-LRU-now-cleaner code with Tom's latest insight on that topic), I figured I might as well send this part now. So far this has been independent of the code that builds on it, and I'm done with this section. I think it will take a serious look by someone who might commit it to make any more progress, and I want to keep those queues moving.

As for issues in this code I am concerned about having reviewed, most of them come from my not having completely internalized the engine's forking model yet:

1) I'm not sure the way am_bg_writer was changed here is kosher.
2) The way the buffers are counted in the freelist code and sent back to the background writer feels like a bit of a hack to me.
3) The recent strategy changes in freelist.c left me unsure how to count some of what it does; I marked the section I'm concerned about with an XXX comment.
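The snapshot/delta trick mentioned above is just this (the temp table name is arbitrary):

    CREATE TEMP TABLE bgw_before AS SELECT * FROM pg_stat_bgwriter;
    -- ...run the benchmark...
    SELECT post.buffers_checkpoint - pre.buffers_checkpoint AS buf_check,
           post.buffers_clean      - pre.buffers_clean      AS buf_clean,
           post.buffers_backend    - pre.buffers_backend    AS buf_backend,
           post.buffers_alloc      - pre.buffers_alloc      AS buf_alloc
      FROM pg_stat_bgwriter post, bgw_before pre;

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

buf-alloc-stats-2.patch Description: Binary data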
Re: [PATCHES] Maintaining cluster order on insert
On Mon, 9 Jul 2007, Tom Lane wrote:

> The hardware is just a Dell x86_64 workstation with crappy IDE disk,
> so maybe things would look better elsewhere, but it's all I have to
> work with.

Do you have write-caching turned off on the drive, so INSERTs are being rate-limited by WAL syncs? Trying to characterize the class of setup you're using.

The part that seemed curious to me about your results in the unpatched version is why the first INSERT takes 2.4 seconds, while the second takes 12.2, then later ones settle from 17-23 s. I could understand the 12-23 variation, but that the first one fires off in 2.4 seems kind of fishy; why so fast? Is there something that fits in memory in that case which is missing in the patched version?

results-head:
...
executed DELETE in 14.770937 sec
executed VACUUM in 10.663301 sec
executed INSERT in 2.449248 sec (1st)
...
executed INSERT in 12.212027 sec (2nd)

results-patch:
...
executed DELETE in 18.062664 sec
executed VACUUM in 28.487570 sec
executed INSERT in 25.638022 sec (1st)
...
executed INSERT in 40.759404 sec (2nd)

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Maintaining cluster order on insert
On Tue, 10 Jul 2007, Tom Lane wrote:

> It's just desktop-grade junk :-(. Feel free to repeat the test on
> something more serious.

Right, but even such junk can be set up so that the disks honor commits; I just wanted to confirm you didn't go out of your way to do that--sounds like you didn't. My mini-lab at home has two PG test systems, one with a real write cache, the other with an IDE drive whose cache is turned off so commits actually wait to hit the platter. The painful slowness of writes in that situation keeps me grounded as to how much data integrity actually costs if you're not accelerating it, and the differences in benchmarks between the two systems (which are otherwise a similar class of hardware) can be informative.

> Anyway the numbers seemed relatively consistent after the setup and
> first test cycle, so I think the part I was actually trying to draw
> conclusions from was probably real enough.

Agreed, I was just figuring out the test ramp-up situation, and your explanation for that quirk sounds reasonable.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] HOT latest patch - version 8
On Sun, 15 Jul 2007, Simon Riggs wrote:

> Our main test case for OLTP is DBT-2 which follows TPC-C in being
> perfectly scalable with no hot spots in the heap and limited hot
> spots in the indexes. As such it's a poor test of real world
> applications, where Benfold's Law holds true.

I assume this is a typo for Benford's Law:

http://en.wikipedia.org/wiki/Benford's_law

which notes there are far more ones in real-world data sets. If there were a Benfold's Law, it would surely involve the number 5 instead.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Async Commit, v21 (now: v22)
On Tue, 24 Jul 2007, Gregory Stark wrote:

> Do we really want the walwriter doing the majority of the wal-flushing
> work for normal commits? It seems like that's not going to be any
> advantage over just having some random backend do the commit.

Might there be some advantage in high-throughput/multi-[cpu|core] situations due to improved ability to keep that code in a single processor? CPU cache issues are turning into scalability bottlenecks in so many designs I've come across lately. A distinct walwriter might be more likely to be (or even be explicitly) bound to a processor and stay there than when the code executes on any random backend. The obvious flip side is that moving data between processors more often may balance or even negate any potential improvement there.

More on the system tuning side, I know I've found that having the background writer as a separate process is enormously helpful because of how it allows monitoring its activity level relative to the whole. There are enough WAL-bound systems out there (particularly when there's no caching disk controller) that I could see similar value in being able to watch a distinct walwriter.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] HOT patch - version 15
On Thu, 6 Sep 2007, Heikki Linnakangas wrote:

> I wonder if pruning in bgwriter only is enough to make HOT work
> efficiently. On a very frequently updated page, bgwriter will have to
> work hard to keep up.

One of the goals for how I rebuilt the just-in-time BGW was to try and make smaller values for bgwriter_delay feasible. None of the tunables *should* have to be adjusted if you need to run the bgwriter much more often to support some new HOT-related activity in there as well; this is actually the next area I wanted to do a bit more testing on myself.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] pgbench - startup delay
On Mon, 10 Dec 2007, Tom Lane wrote:

> I concur with Alvaro that this case seems adequately covered by
> PGOPTIONS="-W n" pgbench ...

I started to disagree with this, but ultimately realized anyone who is running pgbench for long enough to get useful results shouldn't have their TPS impacted much at all by a few overhead seconds tacked onto the server startup. I once wrote a similar patch to the one Dave submitted here, and feel like it's worth committing at least a documentation patch to show how to deal with this. It's not obvious that pgbench pays attention to the environment variables at all, and it's even less obvious that you can pass what look like server options in there.

I just poked around the documentation a bit and I didn't find anything that cleared up which options you can pass from a client; in addition to -W, I can imagine pgbench users might also want to use -S (sort memory) or -f (forbid scan/join types). If I can get someone to clarify what is supported there, I can put together a pgbench doc patch that addresses this topic.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] pgbench - startup delay
On Mon, 10 Dec 2007, Neil Conway wrote:

> Perhaps one of the slightly unfortunate consequences of the postmaster
> => postgres merge is that there is less of a clear distinction between
> "postmaster options" and "postgres" options...

I'd already read all of the documentation that you and Tom suggested just before I sent my previous message, and I didn't find this subject clear at all.

On Mon, 10 Dec 2007, Tom Lane wrote:

> It's not pgbench that is paying attention to this, it's libpq.

Right, but I wouldn't expect a typical pgbench user to know that.

> Anything you'd be allowed to SET can be set from PGOPTIONS (-c or
> --var syntax ... the restrictions are the same as for the underlying
> variable).

That clarifies the situation well enough for me. I think this is a two-part problem, then. It's not necessarily obvious that pgbench will use PGOPTIONS. In addition to that, the current documentation is less clear than it could be on the subject of what you can usefully put into PGOPTIONS. That's two small documentation patches I should be able to handle.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] [HACKERS] SSL over Unix-domain sockets
On Thu, 17 Jan 2008, Tom Lane wrote:

> BTW, is a symlink's atime changed by accessing it?

It seems so in the cases I've tried or researched, but it's complicated. After burning through a bunch of time looking into this, I wanted to drop some notes so nobody else has to wander down the specific dead-ends I just followed.

I figured I'd just run some experiments to figure this out for my Linux system, but that didn't go so well. The process of running anything that shows the atime:

ls -l --time=atime
ls -lu
stat

actually updates the atime to right now along the way. I hacked up something with perl that directly calls lstat() and it did the same thing (a minimal C version of that experiment is below). Mystified, I found this thread suggesting the same thing is true on Mac OS X:

http://lists.apple.com/archives/darwin-kernel/2006/Dec/msg00054.html

The point made in there is that how symlinks are encoded varies not just from OS to OS but from filesystem to filesystem, and that encoding changes how things like atime work. On Linux with ext2, I found this note: "Symbolic links are also filesystem objects with inodes. They deserve special mention because the data for them is stored within the inode itself if the symlink is less than 60 bytes long. It uses the fields which would normally be used to store the pointers to data blocks."

So what I think is happening is: the process of doing anything at all with a Linux symlink references the inode that holds the link. That updates the atime on that inode. But since there's no actual data underneath that lookup in cases where the link is less than 60 bytes, the inode atime is the link atime, so that just updated the link's atime to right now as well. I have no idea how any tmp cleaner could ever find a short symlink it can delete if I'm understanding this correctly. I left behind the link I was just playing with and I'll see if I can get tmpwatch to eat it tomorrow; that seems like the most appropriate test here.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
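Here's a minimal sketch of that lstat() experiment in C, for anyone who wants to repeat it ("some.link" is a hypothetical path; the point being made above is that on Linux/ext2 even this lookup may itself bump the atime it reports):

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int
main(void)
{
    struct stat st;

    /* lstat() examines the link itself rather than following it */
    if (lstat("some.link", &st) != 0)
    {
        perror("lstat");
        return 1;
    }
    printf("symlink atime: %s", ctime(&st.st_atime));
    return 0;
}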
Re: [PATCHES] Proposing correction to posix_fadvise() usage in xlog.c
On Fri, 29 Feb 2008, Mark Wong wrote:

> Basically posix_fadvise() is being called right before the WAL segment
> file is closed, which effectively doesn't do anything as opposed to
> when the file is opened. This proposed correction calls
> posix_fadvise() in three locations...

Actually, posix_fadvise is called nowhere; the one place it's referenced is #ifdef'd out. There's a comment in the code just above there as to why: posix_fadvise only works on a limited number of platforms, and as far as I know nobody has ever put the time into figuring out where it's safe or not safe so that configure can be taught that. I think this may be a dead item, because on most platforms where posix_fadvise works correctly, you can use O_SYNC and O_DIRECT right now to do the same thing.

> If memory serves, one of the wins here is supposed to be that in a
> scenario where we are not expecting to re-read writes to the WAL, we
> also do not want the writes to disk to flush out other data from the
> operating system disk cache.

Right, but you can get that already with O_SYNC on platforms where O_DIRECT is supported. There's a related TODO here, which is to use directio(3C) on Solaris; Jignesh reports that's needed instead of O_DIRECT to get the same behavior on that platform. (A sketch of the fadvise placement issue follows below.)

> I am guessing that SATA drives have write cache enabled by default so
> it seems to make sense that using POSIX_FADV_DONTNEED will cause
> writes to be slower by writing through the disk cache.

I've never heard of a SATA drive that had its write cache disabled by default. They're always on unless you force them off, and even then they can turn themselves back on again if there's a device reset and you didn't change the drive's default using the manufacturer's adjustment utility.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
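For reference, here's a sketch of the placement issue being discussed (hypothetical function; assumes a platform where posix_fadvise() actually works). The advice only helps while there is still cached data for the kernel to act on--issuing it immediately before close() is too late to matter for most of the segment's I/O:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

static void
flush_wal_write(int fd, const char *buf, size_t len)
{
    write(fd, buf, len);    /* error handling omitted in this sketch */
    fsync(fd);
    /* We don't expect to re-read this data, and we don't want it
     * evicting more useful pages from the OS cache. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}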
Re: [PATCHES] CopyReadLineText optimization
On Thu, 6 Mar 2008, Heikki Linnakangas wrote:

> At the most conservative end, we could fall back to the current method
> on the first escape, quote or backslash character.

I would just count the number of escaped/quote characters on each line, and then at the end of the line switch modes between the current code and the new version based on what the previous line looked like. That way the only additional overhead is a small bit that appears only when escapes show up often, plus a touch more just once per line. Barely noticeable in the case where nothing is escaped, a very small regression for escape-heavy stuff, but certainly better than the drop you reported in the last rev of this patch.

Rev two of that design would keep a weighted moving average of the total number of escaped characters per line (say wma=(7*wma+current)/8) and switch modes based on that instead of just the previous line; see the sketch below. There's enough play in the region where the two approaches cross over that it should be easy to get a decent transition. Based on your data I would put the transition at wma>4, which should keep the old code in play even if only half the lines have the bad regression that shows up with >8 escapes per line.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
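A minimal sketch of that second design, with hypothetical names rather than the actual CopyReadLineText code:

#include <stdbool.h>

static int wma = 0;     /* weighted moving average of escapes per line */

/* Decide which parser to use for the next line.  Per the numbers
 * above, the optimized scan wins until roughly 4 escapes per line. */
static bool
use_optimized_scan(void)
{
    return wma <= 4;
}

/* Call once per completed line with the escape/quote count seen. */
static void
line_finished(int escapes_this_line)
{
    /* wma = (7*wma + current)/8 */
    wma = (7 * wma + escapes_this_line) / 8;
}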
Re: [PATCHES] Expose checkpoint start/finish times into SQL.
On Thu, 3 Apr 2008, Robert Treat wrote:

> You can plug a single item graphed over time into things like rrdtool
> to get good trending information. And it's often easier to do this
> using sql interfaces to get the data than pulling it out of log files
> (almost like the db was designed for that :-)

The pg_stat_bgwriter value for buffers_checkpoint was intentionally implemented in 8.3 such that it jumps in one big lump when the checkpoint is done. While it's not the ideal interface for what you're looking for, the reason for that is to make it possible to build a "when was the last checkpoint finished?" interface via some remote monitoring tool, just by determining the last time that the value jumped upwards. You can easily see checkpoints just by graphing that value, and it shouldn't be too hard to teach something with rrdtool guts to find them; a sketch of the idea is below. Since checkpoints have a fairly predictable duration in 8.3, as long as you catch the start or end of one you can make a reasonable guess where the other side was.

The case you're trying to avoid here, the system going a long time without checkpointing, can be detected by looking for a begin or an end regularly; you don't need to track both. As long as there's a checkpoint finish "pulse" in buffers_checkpoint showing up regularly, you're fine. The only situation I can think of where this might be problematic is where the system has been idle enough to not have any buffers to write at checkpoint time, but I recall a code path there where checkpoints stop altogether unless there's been activity, so even tracking the time may not change that.

Ultimately, a lot of the other questions you might ask (i.e. "how many buffers have been written per hour by checkpoints?") require processing the numbers in this way anyway, and I thought this implementation was good enough to monitor the situation you're trying to avoid--presuming you're using some sort of moderately powerful remote monitoring tool. Theo's patch would make it easier to answer with a simple command, which has some value; a little SQL in a cron job would be good enough to trigger an alert rather than needing a real monitoring probe.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
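The pulse detection itself is trivial; here's a toy sketch in C (the helper name is hypothetical, and the counter value is assumed to come from polling pg_stat_bgwriter however your monitoring tool prefers):

#include <time.h>

static long   last_buffers_checkpoint = -1;
static time_t last_checkpoint_finished;

/* Feed in one sample per polling interval; any upward jump in
 * buffers_checkpoint means a checkpoint completed since last time.
 * (A checkpoint that wrote zero buffers produces no jump, which is
 * the idle-system caveat mentioned above.) */
static void
bgwriter_sample(long buffers_checkpoint, time_t now)
{
    if (last_buffers_checkpoint >= 0 &&
        buffers_checkpoint > last_buffers_checkpoint)
        last_checkpoint_finished = now;
    last_buffers_checkpoint = buffers_checkpoint;
}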
Re: [PATCHES] Expose checkpoint start/finish times into SQL.
On Thu, 3 Apr 2008, Joshua D. Drake wrote:

> For knowing how long checkpoints are taking. If they are taking too
> long you may need to adjust your bgwriter settings, and it is a
> serious drag to parse postgresql logs for this info.

There's some disconnect here between what I think you want and what Theo's patch does. If I were running an 8.2 system, there's a whole lot of diagnostic information I'd like to have handy in order to optimize checkpoints, as you describe. But this patch doesn't help 8.2 users, because the whole pg_stat_bgwriter structure is 8.3 only--and it provides much of what is needed in this area.

In 8.3, there is very little background writer tuning to do. What tuning you can do is easy to figure out from what's in pg_stat_bgwriter. The new checkpoint_completion_target feature works better than tuning the 8.2 background writer to reduce checkpoint spikes ever did anyway, and that's the main thing that drives how long checkpoints take to process now.

So in the only situation where this patch could be applied, 8.3, I don't think the scenario you describe is likely to pop up. And the combination of two 8.3 features (log_checkpoints and the CSV logs; it's no coincidence I worked on both) should allow a good enough way to hack around this area when it does. Just make the logs rotate fairly regularly, and every time they do, import the last CSV section that's now finished into a table so you can compute statistics about how the checkpoints are doing there. I don't see why that sort of stuff should go into core when it's now easy to do outside of it. I have a whole stack of scripts in that area I plan to release over the summer.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Expose checkpoint start/finish times into SQL.
On Thu, 3 Apr 2008, Tom Lane wrote:

> As of PG 8.3, the bgwriter tries very hard to make the elapsed time of
> a checkpoint be just about checkpoint_timeout *
> checkpoint_completion_target, regardless of load factors.

In the cases where the timing on checkpoint writes is timeout driven, yes. When they're checkpoint_segments driven, the time is derived from that drumbeat instead. And those cases are the more interesting ones, because those are the ones that will speed up during really busy periods. I think it's completely relevant to track how often checkpoints are happening, because they do still vary even in 8.3. I'm just not sure whether the current methods available for that really aren't good enough, or whether it's just the case that not everyone is aware of all of them.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Expose checkpoint start/finish times into SQL.
On Thu, 3 Apr 2008, Tom Lane wrote:

> "the system stopped checkpointing" does not strike me as a routine
> occurrence that we should be making provisions for DBAs to watch for.
> What, pray tell, is the DBA supposed to do when and if he notices
> that?

Schedule downtime, rather than wait for it to happen unpredictably when the inevitable crash happens.

> I'd much rather be spending our time and effort on understanding what
> broke for you, and fixing the code so it doesn't happen again.

(Here I start laughing all over again as I recall Robert's talk, which we really need to get you in particular the video of.) Their situation had possible causes that included a bit flipping in bad memory, which is pretty hard to code around (unless you're a Core Wars veteran). I'm familiar with that part of the checkpoint code path, and everything I was able to think of when hearing the outline of events was already considered and rejected as not being a likely cause. This patch comes out of pragmatic acceptance that, sometimes, stuff will happen you can't easily explain, but that doesn't mean it's not worth keeping an eye on it anyway so it doesn't sneak up on you again.

Anyway, I think this whole thing would be better handled by a larger internals view that this whole codebase could use a dose of anyway. What I really want is an interface like this:

psql> select pg_internals('last_checkpoint_time');

and then start sprinkling exports of those probe points in some popular places people would like to look at (a rough sketch of what that might look like follows below). I will apologize now for suggesting this, followed by not having enough time to code it in the near future.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
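For the sake of discussion, a very rough sketch of what the backend side of such a function might look like--entirely hypothetical, including the internals_lookup() registry helper, since none of this exists yet:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_FUNCTION_INFO_V1(pg_internals);

/* hypothetical lookup into a registry of exported probe points */
extern const char *internals_lookup(const char *name);

Datum
pg_internals(PG_FUNCTION_ARGS)
{
    char       *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
    const char *value = internals_lookup(name);

    if (value == NULL)
        PG_RETURN_NULL();
    PG_RETURN_TEXT_P(cstring_to_text(value));
}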
Re: [PATCHES] Expose checkpoint start/finish times into SQL.
On Fri, 4 Apr 2008, Tom Lane wrote:

> (And you still didn't tell me what the actual failure case was.)

Database stops checkpointing. WAL files pile up. In the middle of a backup, the system finally dies, and when it starts recovery there's a bad record in the WAL files--of which there are now thousands to apply, and the bad one is 4 hours of replay in. Believe it or not, it goes downhill from there.

It's what kicked off the first step that's the big mystery. The only code path I thought of that can block checkpoints like this is when the archive_command isn't working anymore, and that wasn't being used. Given some of the other corruption found later and the bad memory issues discovered, a bit flipping in the "do I need to checkpoint now?" code or data seems just as likely as any other explanation.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Expose checkpoint start/finish times into SQL.
On Fri, 4 Apr 2008, Tom Lane wrote:

> The actual advice I'd give to a DBA faced with such a case is to kill
> -ABRT the bgwriter and send the stack trace to -hackers.

And that's a perfect example of where they're trying to get to. They didn't notice the problem until after the crash. The server didn't come back up (busy processing WALs) and that downtime was caught by a monitoring system. At that point it was too late to collect debugging information on what was wrong inside the server processes that might have given a clue what happened.

If they'd noticed it while the server was up, perhaps because the "last checkpoint" value hadn't changed in a long time (which seems like it might be available via stats even if, as you say, the background writer is out of its mind at that point), they could have done such a kill and collected some actually useful information here. That's the theory, at least.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Sorting writes during checkpoint
On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:

> 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

What is the disk controller in this system? I'm specifically curious about what write cache was involved, so I can get a better feel for the hardware your results came from. I'm busy rebuilding my performance testing systems right now; once that's done I can review this on a few platforms.

One thing that jumped out at me just reading the code is this happening inside BufferSync:

buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));

If shared_buffers (=NBuffers) is set to something big, this could cause some memory churn. And I think it's a bad idea to allocate something this large at checkpoint time, because what happens if that fails? Really not the time you want to discover there's no RAM left. Since you're always going to need this much memory for the system to operate, and the current model has the system running a checkpoint >50% of the time, the only thing that makes sense to me is to allocate it once at server start time and be done with it (a sketch of that is below). That should improve performance over the original patch as well.

BufAndTag is a relatively small structure (5 ints). Let's call it 40 bytes; even that's only a 0.5% overhead relative to the shared buffer allocation. If we can speed checkpoints significantly with that much overhead, it sounds like a good tradeoff to me.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
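Here's a sketch of the allocate-once approach I'm suggesting, reusing the patch's BufAndTag type but with an otherwise hypothetical startup hook (this assumes backend context, where palloc and NBuffers are available):

static BufAndTag *buf_to_write = NULL;

/* Run once when the background writer starts.  NBuffers is fixed for
 * the life of the server, so this single allocation covers every
 * future checkpoint, and a checkpoint can then never fail here for
 * lack of memory. */
static void
AllocateCheckpointSortBuffer(void)
{
    buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));
}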
Re: [PATCHES] Sorting writes during checkpoint
On Wed, 16 Apr 2008, ITAGAKI Takahiro wrote:

> Dirty region of database was probably larger than disk controller's
> cache.

Might be worthwhile to run with log_checkpoints on and collect some statistics there next time you're running these tests. It's a good habit to get other testers into regardless; it's nice to be able to say something like "during the 15 checkpoints encountered during this test, the largest dirty area was 516MB while the median was 175MB".

> Hmm, but I think we need to copy buffer tags into bgwriter's local
> memory in order to avoid locking tags many times in the sorting. Is it
> better to allocate sorting buffers the first time and keep and reuse
> them from then on?

That's what I was thinking: allocate the memory when the background writer starts and just always have it there; the allocation you're doing is always the same size. If it's in use 50% of the time anyway (which it is if you have checkpoint_completion_target at its default), why introduce the risk that an allocation will fail at checkpoint time? Just allocate it once and keep it around.

> It is 0.25% of shared buffers; when shared_buffers is set to 10GB, it
> takes 25MB of process local memory.

Your numbers are probably closer to correct. I was being pessimistic about the size of all the integers just to demonstrate that it's not really a significant amount of memory even if they're large.

> If we want to consume less memory for it, RelFileNode in BufferTag
> could be hashed and packed into an integer

I personally don't feel it's worth making the code any more complicated than it needs to be just to save a fraction of a percent of the total memory used by the buffer pool.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] configure option for XLOG_BLCKSZ
On Fri, 2 May 2008, Tom Lane wrote:

> The case for varying BLCKSZ is marginal already, and I've seen none at
> all for varying XLOG_BLCKSZ.

I recall someone on the performance list who felt it useful to increase XLOG_BLCKSZ to support a high-write environment with WAL shipping, just to make sending the files over the network more efficient. Can't seem to find a reference in the archives though.

If you look at things like the giant Sun system tests, there was significant tuning involved in getting all the block sizes to line up better with the underlying hardware. I would not be surprised to discover that sort of install gains a bit from slinging WAL files around in larger chunks as well. They're already using small values for commit_delay just to get the typical WAL write into larger blocks. As PostgreSQL makes its way into higher throughput environments, it wouldn't surprise me to discover more of these situations where switching WAL segments every 16MB turns into a bottleneck. Right now it may only be a few people in the world, but saying "that's big enough" for an allocation of anything usually turns out wrong if you wait long enough.

One real concern I have with making this easier to adjust is that I'd hate to let people pick any old block size with the default wal_sync_method, only to have them later discover they can't turn on any direct I/O write method because they botched the alignment restrictions.

> Another issue though is whether it makes sense for XLOG_BLCKSZ to be
> different from BLCKSZ at all, at least in the default case. They are
> both the unit of I/O and it's not clear why you'd want different
> units.

There are lots of people who use completely different physical or logical disk setups for the WAL disk than for the regular database. That's going to get even more varied moving forward as SSD starts getting used more, since those devices have a very different set of block size optimization characteristics compared with traditional RAID setups. They prefer smaller blocks that match the underlying flash better, and you don't pay as much of a penalty for writing that way because lining up with the spinning disk isn't important. Someone who put one of DB/WAL on SSD and the other on traditional disk might end up with very different DB/WAL block sizes to match.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Sorting writes during checkpoint
On Sun, 4 May 2008, Tom Lane wrote:

> However, I am completely unable to measure any performance improvement
> from it. Given the possible risk of out-of-memory failures, I think
> the patch should not be applied without some direct proof of
> performance benefits, and I don't see any.

Fair enough. There were some pgbench results attached to the original patch submission that gave me a good idea how to replicate the situation where there's some improvement. I expect I can take a shot at quantifying that independently near the end of this month if nobody else gets to it before then (I'm stuck sorting out a number of OS-level issues right now before my testing system is online again). I was planning to take a longer look at Greg Stark's prefetching work at that point as well.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Sorting writes during checkpoint
On Sun, 4 May 2008, Tom Lane wrote:

> Well, I tried a pgbench test similar to that one --- on smaller
> hardware than was reported, so it was a bit smaller test case, but it
> should have given similar results.

My pet theory on cases where sorting will help suggests you may need a write-caching controller for this patch to be useful. I expect we'll see the biggest improvement in situations where the total amount of dirty buffers is larger than the write cache and the cache becomes blocked. If you're not offloading to another device like that, the OS-level elevator sorting will handle sorting for you close enough to optimally that I doubt this will help much (and in fact it may just get in the way).

> Of course it's notoriously hard to get consistent numbers out of
> pgbench anyway, so I'd rather see some other test case ...

I have some tools to run pgbench many times and look for patterns, which work fairly well for the consistency part. pgbench will dirty a very high percentage of the buffer cache by checkpoint time relative to how much work it does, which makes it close to a best case for confirming there is a potential improvement here. I think a reasonable approach is to continue trying to quantify some improvement using pgbench, with an eye toward also doing DBT2 tests, which provoke similar behavior at checkpoint time. I suspect someone who already has a known-good DBT2 lab setup with caching controller hardware (EDB?) might be able to do a useful test of this patch without too much trouble on their part.

> Unless someone can volunteer to test sooner, I think we should drop
> this item from the current commitfest queue.

This patch took a good step forward toward being committed this round with your review, which is the important part from my perspective (as someone who would like this to be committed if it truly works). I expect that performance-related patches will often take more than one commitfest to pass through. From the perspective of keeping the committers' plates clean, a reasonable system for this situation might be for you to bounce this into the rejected pile as "Returned for testing" immediately, to clearly remove it from the main queue. A reasonable expectation there is that you might consider it again during May if someone gets back with said testing results before the 'fest ends.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Sorting writes during checkpoint
On Mon, 5 May 2008, Tom Lane wrote:

> It bothers me a bit that the patch forces writes to be done "all of
> file A in order, then all of file B in order, etc". We don't know
> enough about the disk layout of the files to be sure that that's good.
> (This might also mean that whether there is a win is going to be
> platform and filesystem dependent ...)

I think most platform and filesystem implementations have disk location correlated closely enough with block order that this particular issue isn't a large one. If the writes are mainly going to one logical area (a single partition or disk array), it should be a win as long as the sorting step itself isn't introducing a delay.

I am concerned that in a more complicated case than pgbench, where the writes are spread across multiple arrays say, forcing writes in order may slow things down. Example: let's say there are two tablespaces mapped to two arrays, A and B, that the data is being written to at checkpoint time. In the current case, that I/O might be AABAABABBBAB, which is going to keep both arrays busy writing. The sorted case would instead make that AAAAAABBBBBB, so only one array will be active at a time (see the comparator sketch below for why the sort produces exactly that ordering). It may very well be the case that the improvement from lowering seeks on the writes to A and B is less than the loss coming from not keeping both continuously busy. I think I can simulate this by using a modified pgbench script that works against an accounts1 and accounts2 with equal frequency, where 1 & 2 are actually on different tablespaces on two disks.

> Right, that's in the ground rules for commitfests: if the submitter
> can respond to complaints before the fest is over, we'll reconsider
> the patch.

The small optimization I was trying to suggest was that you just bounce this type of patch automatically to the "rejected for " section of the commitfest wiki page in cases like these. The standard practice on this sort of queue is to automatically reclassify when someone has made a pass over the patch, leaving it to the original source to re-open with more information. That keeps the unprocessed part of the queue always shrinking, and as long as people know they can get a patch reconsidered by submitting new results, it's not unfair to them.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
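To make the ordering concern concrete, here's a toy version of the sort involved (hypothetical struct; the real patch sorts buffer tags). Sorting by file first and block second is exactly what turns the interleaved AABAABABBBAB stream into AAAAAABBBBBB:

#include <stdlib.h>

typedef struct
{
    unsigned int file_id;   /* which relation file the buffer belongs to */
    unsigned int block_no;  /* block number within that file */
} WriteEntry;

static int
write_entry_cmp(const void *a, const void *b)
{
    const WriteEntry *wa = (const WriteEntry *) a;
    const WriteEntry *wb = (const WriteEntry *) b;

    if (wa->file_id != wb->file_id)
        return (wa->file_id < wb->file_id) ? -1 : 1;
    if (wa->block_no != wb->block_no)
        return (wa->block_no < wb->block_no) ? -1 : 1;
    return 0;
}

/* usage: qsort(entries, n, sizeof(WriteEntry), write_entry_cmp); */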
Re: [PATCHES] [GENERAL] pgbench not setting scale size correctly?
On Wed, 7 May 2008, Bruce Momjian wrote:

> Patch attached that issues a warning.

This doesn't take into account the -f case, and the warning isn't quite right because of that as well. When I get a break later today I'll create a new patch that handles that correctly; I see where it should go now that I look at this again.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] [GENERAL] pgbench not setting scale size correctly?
On Wed, 7 May 2008, Bruce Momjian wrote:

> Tom Lane wrote:
>> Greg Smith <[EMAIL PROTECTED]> writes:
>>> The way the option parsing code is done would make complaining in
>>> the case where your parameter is ignored a bit of a contortion.
>>
>> Yeah. But couldn't we have that part issue a warning if -s had been
>> set on the command line?
>
> Patch attached that issues a warning.

Turns out it wasn't so contorted. Updated patch attached that only warns in the exact cases where the setting is ignored, and the warning says how it's actually setting the scale. I tested all the run types and it correctly complains only when warranted; samples:

$ ./pgbench -s 200 -i pgbench
creating tables...
1 tuples done.
...

$ ./pgbench -s 100 pgbench
Scale setting ignored by standard tests, using database branch count
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 200
...

$ ./pgbench -s 100 -f select.sql pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 100
...

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD

Index: contrib/pgbench/pgbench.c
===================================================================
RCS file: /home/gsmith/cvsrepo/pgsql/contrib/pgbench/pgbench.c,v
retrieving revision 1.79
diff -u -r1.79 pgbench.c
--- contrib/pgbench/pgbench.c	19 Mar 2008 03:33:21 -0000	1.79
+++ contrib/pgbench/pgbench.c	9 May 2008 07:12:21 -0000
@@ -1645,6 +1645,9 @@
 		exit(0);
 	}
 
+	if (scale && (ttype != 3))
+		fprintf(stderr,"Scale setting ignored by standard tests, using database branch count\n");
+
 	remains = nclients;
 
 	if (getVariable(&state[0], "scale") == NULL)
Re: [PATCHES] posix advises ...
On Sun, 11 May 2008, Hans-Juergen Schoenig wrote:

> we also made some simple autoconf hack to check for broken
> posix_fadvise.

Because of how you did that, your patch is extremely difficult to even test. You really should at least scan the output from diff before submitting a patch, to make sure it makes sense--yours is over 30,000 lines long just to implement a small improvement. The reason for that is that you regenerated "configure" using a later version of Autoconf than the official distribution, and the diff for the result is gigantic. You even turned off the check in configure.in that specifically prevents you from making that mistake, so it's not like you weren't warned. You should just show the necessary modifications to configure.in in your patch--you certainly shouldn't submit a patch that subverts the checks there--and leave out the resulting configure file if you didn't use the same version of Autoconf.

I find the concept behind this patch very useful and I'd like to see a usable one re-submitted. I'm in the middle of setting up some new hardware this month and was planning to test the existing fadvise patches Greg Stark sent out as part of that. Having another one that accelerates multiple sequential scans would be extremely helpful to add to that testing, because then I think I can reuse some of the test cases Jeff Davis put together for the 8.3 improvements in that area to see how well it works. It wasn't as clear to me how to test the existing bitmap scan patch; yours seems a more straightforward patch to use as a testing ground for fadvise effectiveness.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] posix advises ...
On Thu, 19 Jun 2008, Zoltan Boszormenyi wrote:

> This patch (revisited and ported to current CVS HEAD) is indeed using
> Greg's original patch and also added another patch written by Mark
> Wong that helps evicting closed XLOGs from memory faster.

Great, that will save me some trouble. I've got a stack of Linux performance testing queued up (it got stuck behind a kernel bug impacting pgbench) for the next couple of weeks, and I'll include this in that testing. I think I've got a similar class of hardware to what you tested on for the initial evaluation--I'm getting around 200MB/s sequential I/O right now out of my small RAID setup.

I added your patch to the queue for next month's CommitFest and listed myself as the initial reviewer, but a commit that soon is unlikely. Performance tests like this usually take a while to converge, and since this is using a less popular API I expect a round of portability concerns, too.

Where did Mark's patch come from? I'd like to be able to separate out that change from the rest if necessary. Also, if you have any specific test cases you ran that I could start by trying to replicate a speedup on, those would be handy as well.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] pgbench minor fixes
On Sun, 6 Jul 2008, Russell Smith wrote:

> Simon Riggs wrote:
>> Minor patch on pgbench
>> 1. -i option should run vacuum analyze only on pgbench tables, not
>> *all* tables in database.
>
> How does this work with custom scripts?

That's the initialization step. It creates the 4 tables, populates them, then runs VACUUM ANALYZE. There is no notion of what script you'll end up executing yet. If you have a truly custom script that works against other data instead of the pgbench tables, you won't even be executing this initialization bit.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [PATCHES] Sorting writes during checkpoint
On Fri, 4 Jul 2008, Simon Riggs wrote:

> No action on this seen since last commitfest, but I think we should do
> something with it, rather than just ignore it.

Just no action worth reporting yet. Over the weekend I finally reached the point where I've got a system that should be capable of independently replicating the results improvement reported here, and I've started performance testing of the patch. Getting useful checkpoint test results from pgbench is really a pain.

> Sorting by file might have inadvertently shown benefit at the
> tablespace level on a larger server with spread out data whereas on
> Tom's test system I would guess just a single tablespace was used.

I doubt this has anything to do with it, if only because the pgbench schema doesn't split into tablespaces usefully. Almost all of the real action is on a single table, accounts.

My suspicion is that sorting only benefits in situations where you have a disk controller with a significant amount of RAM on it, but the server RAM is much larger. In that case the sorting horizon of the controller itself is smaller than what the server can do, and the sorting makes it less likely you'll end up with the controller filled with unsorted stuff that takes a long time to clear. In Tom's test, there was probably only 8 or 16MB worth of cache on the disk itself, so you can't get a large backlog of unsorted writes clogging the write pipeline. But most server systems have 256MB or more of RAM there, and if you get that filled with seek-heavy writes (which might only clear at a couple of MB a second) the delay for that cache to empty can be considerable. That said, I've got a 256MB controller here and a very similar disk setup to the one positive results were reported on, but so far I don't see any significant difference after applying the sorted-writes patch.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [HACKERS][PATCHES] odd output in restore mode
On Wed, 23 Jul 2008, Kevin Grittner wrote:

> In our scripts we handle this by copying to a temp directory on the
> same mount point as the archive directory and doing a mv to the
> archive location when the copy is successfully completed. I think
> that this even works on Windows. Could that just be documented as a
> strong recommendation for the archive script?

This is exactly what I always do. I think the way cp is shown in the examples promotes what's really a bad practice for lots of reasons, this particular problem being just one of them; a sketch of the safer pattern is below. I've been working on an improved archive_command shell script that I expect to submit for comments and potential inclusion in the documentation, as a better base for other people to build on. This is one of the options for how it can operate. It would be painful but not impossible to convert a subset of that script to run under Windows as well, at least enough to cover this particular issue.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
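For anyone who hasn't seen the pattern before, here's a minimal C sketch of the copy-then-rename idea (hypothetical paths and function name; a production archiver also wants an fsync and better error reporting). Because rename(2) is atomic within a single filesystem, a reader of the archive directory can never see a half-copied segment:

#include <stdio.h>

static int
archive_segment(const char *src, const char *tmp_path, const char *final_path)
{
    FILE   *in, *out;
    char    buf[8192];
    size_t  n;

    if ((in = fopen(src, "rb")) == NULL)
        return 1;
    if ((out = fopen(tmp_path, "wb")) == NULL)
    {
        fclose(in);
        return 1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        if (fwrite(buf, 1, n, out) != n)
        {
            fclose(in);
            fclose(out);
            return 1;
        }
    }
    fclose(in);
    if (fclose(out) != 0)       /* verify the copy fully flushed */
        return 1;
    /* tmp_path must be on the same mount point as final_path */
    return rename(tmp_path, final_path) != 0;
}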