Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-05 Thread Dimitri Fontaine
Josh Berkus j...@agliodbs.com writes:
 I thought that Dimitri had already implemented this using Fincore.  It's
 linux-only, but that should work well enough to test the general concept.

Actually, Cédric did, and I have a clone of his repository where I did
some debian packaging of it.

  http://villemain.org/projects/pgfincore
  http://git.postgresql.org/gitweb?p=pgfincore.git;a=summary
  http://git.postgresql.org/gitweb?p=pgfincore.git;a=tree

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-05 Thread Cédric Villemain
2011/5/4 Josh Berkus j...@agliodbs.com:
 All,

 I thought that Dimitri had already implemented this using Fincore.  It's
 linux-only, but that should work well enough to test the general concept.

Harald gave me some pointers at pgday in Stuttgart on making it work
with Windows but ... hmm, I don't have Windows and wasn't motivated
enough to make it work there if no one needs it.

I didn't check the different kernels recently, but any kernel
supporting mincore and posix_fadvise should work (so probably the
same set of kernels that supports our 'effective_io_concurrency').

Still waiting for (free)BSD support.
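For illustration, a minimal standalone sketch of the kind of residency check
involved, using mincore() on a memory-mapped file (this is not pgfincore's
code, just the underlying idea; on some BSDs the vector argument is char *
rather than unsigned char *, and error handling is mostly omitted):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct stat     st;
    long            pagesz = sysconf(_SC_PAGESIZE);
    int             fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
        fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    size_t          npages = (st.st_size + pagesz - 1) / pagesz;
    void           *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char  *vec = malloc(npages);

    if (map == MAP_FAILED || vec == NULL ||
        mincore(map, st.st_size, vec) < 0)
        return 1;

    size_t          resident = 0;

    for (size_t i = 0; i < npages; i++)
        if (vec[i] & 1)         /* low bit set = page is in core */
            resident++;

    printf("%zu of %zu pages resident in cache\n", resident, npages);
    return 0;
}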


-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-05 Thread Hitoshi Harada
2011/5/5 Tom Lane t...@sss.pgh.pa.us:
 Hitoshi Harada umi.tan...@gmail.com writes:
 I sometimes wonder if we could pull up aggregate query in the optimizer.

 I don't have time to look at this right now, but please add to the
 upcoming commitfest.

Done.
https://commitfest.postgresql.org/action/patch_view?id=548

I'll work further if I find time.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Backpatching of Teach the regular expression functions to do case-insensitive matching

2011-05-05 Thread Andres Freund
Hi,

In my opinion this is actually a bug in < 9.0. As it's (imo) a low-impact fix
that's constrained to two files, it seems sensible to backpatch it now that the
solution has proven itself in the field?

The issue is hard to find and has come up several times in the field. And it
has been slightly embarrassing more than once ;)

I am happy to provide patches for the supported releases < 9.0 if that helps
the issue.

Greetings,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-05 Thread Mitsuru IWASAKI
Hi, thanks for the good suggestions.

  Postgres usually starts with ZERO buffer cache.  By saving the buffer
  cache data structure into hibernation files just before shutdown, and
  loading them at startup, postgres can start operations with the saved
  buffer cache as the same condition as just before the last shutdown.
 
 This seems like a lot of complication for rather dubious gain.  What
 happens when the DBA changes the shared_buffers setting, for instance?

It was my first concern, actually.  The current implementation stops
reading the hibernation file when it detects a size mismatch between
shared_buffers and the hibernation file.  I think that is the safe way.
As Alvaro Herrera mentioned, it would be possible to adjust the copied
buffer blocks, but changing the shared_buffers setting is not that
frequent, I think.

 How do you protect against the cached buffers getting out-of-sync with
 the actual disk files (especially during recovery scenarios)?  What

Saving the DB buffer cache is done at shutdown after the bgwriter's
final checkpoint finishes, so dirty buffers should not exist, I believe.
For recovery scenarios, I still need to research this though...
Could you describe what needs to be considered?

 about crash-induced corruption in the cache file itself (consider the
 not-unlikely possibility that init will kill the database before it's
 had time to dump all the buffers during a system shutdown)?  Do you have

I think this is an important point.  I'll implement a validation
function for the hibernation file.
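Just to illustrate the direction (this is not the actual patch; the names and
layout here are hypothetical), the validation could check a small header plus
a checksum before trusting the file:

#include <stdbool.h>
#include <stdint.h>

#define HIBERNATION_MAGIC   0x42434801  /* arbitrary value, for illustration */
#define HIBERNATION_VERSION 1

typedef struct
{
    uint32_t    magic;          /* file identification */
    uint32_t    version;        /* format version */
    uint32_t    block_size;     /* must match BLCKSZ of this server */
    uint32_t    nbuffers;       /* must match shared_buffers (NBuffers) */
    uint32_t    crc;            /* checksum of the rest of the file */
} HibernationHeader;

static bool
hibernation_header_ok(const HibernationHeader *hdr,
                      uint32_t blcksz, uint32_t nbuffers, uint32_t actual_crc)
{
    if (hdr->magic != HIBERNATION_MAGIC || hdr->version != HIBERNATION_VERSION)
        return false;           /* not a hibernation file we understand */
    if (hdr->block_size != blcksz || hdr->nbuffers != nbuffers)
        return false;           /* BLCKSZ or shared_buffers changed */
    if (hdr->crc != actual_crc)
        return false;           /* truncated or corrupted dump */
    return true;
}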

 any proof that writing out a few GB of buffers and then reading them
 back in is actually much cheaper than letting the database re-read the
 data from the disk files?

I think this comes down to sequential reads vs. scattered reads.
The largest hibernation file is the one for buffer blocks, and a
sequential read from it should be much faster than scattered reads
from the database files via smgrread() block by block.
As Greg Stark suggested, re-reading from the database files based on
the buffer descriptors was one implementation candidate (it can reduce
the storage consumed for hibernation), but I chose to create a raw
image file of the buffer blocks and read that, for performance.
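To make the comparison concrete, a rough sketch of the two access patterns
(illustrative only, not from the patch; function names are made up):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* sequential: read the hibernation image front to back */
static ssize_t
read_sequential(int fd, char *buf, size_t nblocks)
{
    return read(fd, buf, nblocks * BLCKSZ);
}

/* scattered: fetch each block from its own offset, the way per-block
 * smgrread() calls against the database files would */
static ssize_t
read_scattered(int fd, char *buf, const off_t *offsets, size_t nblocks)
{
    ssize_t     total = 0;

    for (size_t i = 0; i < nblocks; i++)
    {
        ssize_t     n = pread(fd, buf + i * BLCKSZ, BLCKSZ, offsets[i]);

        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}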


Thanks

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-05 Thread Mitsuru IWASAKI
Hi,

 I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
 provides similar functionality.  Are you familiar with that?  If so,
 could you contrast your approach with that one?

I'm not familiar with PgFincore at all, sorry, but I got the source code
and documents and read through them just now.
# and I'm actually a novice on postgres...
The target of both is to reduce physical I/O, but their approaches and
gains are different.
My understanding is like this:

+---------------------+     +---------------------+
| Postgres(backend)   |     | Postgres            |
| +-----------------+ |     |                     |
| | DB Buffer Cache | |     |                     |
| | (shared buffers)| |     |                     |
| | *my target      | |     |                     |
| +-----------------+ |     |                     |
|   ^      ^          |     |                     |
|   |      |          |     |                     |
|   v      v          |     |                     |
| +-----------------+ |     | +-----------------+ |
| |  buffer manager | |     | |    pgfincore    | |
| +-----------------+ |     | +-----------------+ |
+---^------^----------+     +----------^----------+
    |      |smgrread()                 |posix_fadvise()
    |read()|                           |                 userland
==========================================================================
    |      |                           |                 kernel
    |      +-------------+-------------+
    |                    |
    |                    v
    |       +------------------------+
    |       | File System            |
    |       |   +-----------------+  |
    +------>|   | FS Buffer Cache |  |
            |   |*PgFincore target|  |
            |   +-----------------+  |
            |    ^       ^           |
            +----|-------|-----------+
                 |       |
==========================================================================
                 |       |                               hardware
       +---------|-------|-------------------------+
       |         |       v     Physical Disk       |
       |         |   +------------------+          |
       |         |   | base/16384/24598 |          |
       |         v   +------------------+          |
       | +------------------------------+          |
       | |Buffer Cache Hibernation Files|          |
       | +------------------------------+          |
       +--------------------------------------------+

In summary, PgFincore's target is the File System Buffer Cache; Buffer
Cache Hibernation's target is the DB Buffer Cache (shared buffers).

PgFincore tries to preload database files via posix_fadvise() into the
File System Buffer Cache, not into the DB Buffer Cache (shared buffers).
On query execution, the buffer manager will get DB buffer blocks via
smgrread() from the file system unless the necessary blocks already
exist in the DB Buffer Cache.  At this point, physical reads may not
happen because part of (or the entire) database file is already loaded
into the FS Buffer Cache.

The gain depends on the file system, especially the size of the File
System Buffer Cache.
Preloading a database file is, in short, equivalent to the following command:
$ cat base/16384/24598 > /dev/null

I think PgFincore is good for data warehouse applications.


Buffer Cache Hibernation, my approach, is simpler and more straightforward.
It tries to save/load the contents of the DB Buffer Cache (shared buffers)
using regular files (called Buffer Cache Hibernation Files).
At startup, the buffer manager loads DB buffer blocks into the DB Buffer
Cache from the Buffer Cache Hibernation Files saved at the last
shutdown.  Note that the database files will not be read, so they are not
cached in the File System Buffer Cache at all.  Only the contents of the
DB Buffer Cache are filled.  Therefore, the DB buffer cache miss penalty
would be larger than PgFincore's.
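For illustration only (this is not the actual patch; the structures and names
here are hypothetical), the hibernation file for buffer blocks could simply be
a header followed by one (buffer tag, raw 8kB page) pair per saved buffer:

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

/* hypothetical: identifies which relation block a saved page belongs to */
typedef struct
{
    uint32_t    spcnode;        /* tablespace */
    uint32_t    dbnode;         /* database   */
    uint32_t    relnode;        /* relation   */
    uint32_t    forknum;        /* fork       */
    uint32_t    blocknum;       /* block      */
} SavedBufferTag;

/* append one saved buffer: its tag, then the raw page image */
static int
save_buffer(FILE *fp, const SavedBufferTag *tag, const char *page)
{
    if (fwrite(tag, sizeof(SavedBufferTag), 1, fp) != 1)
        return -1;
    if (fwrite(page, BLCKSZ, 1, fp) != 1)
        return -1;
    return 0;
}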

The gain depends on the size of shared_buffers, and on how often
similar queries are executed before and after restarting.

Buffer Cache Hibernation is good for OLTP applications.


I think that PgFincore and Buffer Cache Hibernation are not mutually
exclusive; they can work together at different caching levels.



Sorry for my poor English skills, but I'm doing my best :)

Thanks

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2011: Fast GiST index build

2011-05-05 Thread Teodor Sigaev

The gistFindCorrectParent function has a branch with the following comment:
/*
* awful!!, we need search tree to find parent ... , but before we
* should release all old parent
*/
Can you provide me an example of case when this branch works?


See several lines above:
if (parent->blkno == InvalidBlockNumber)

/*
 * end of chain and still didn't found parent, It's very-very
 * rare situation when root splited
 */
break;


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-05 Thread Cédric Villemain
2011/5/5 Mitsuru IWASAKI iwas...@jp.freebsd.org:
 Hi,

 I think that PgFincore (http://pgfoundry.org/projects/pgfincore/)
 provides similar functionality.  Are you familiar with that?  If so,
 could you contrast your approach with that one?

 I'm not familiar with PgFincore at all, sorry, but I got the source code
 and documents and read through them just now.
 # and I'm actually a novice on postgres...
 The target of both is to reduce physical I/O, but their approaches and
 gains are different.
 My understanding is like this:

 +---------------------+     +---------------------+
 | Postgres(backend)   |     | Postgres            |
 | +-----------------+ |     |                     |
 | | DB Buffer Cache | |     |                     |
 | | (shared buffers)| |     |                     |
 | | *my target      | |     |                     |
 | +-----------------+ |     |                     |
 |   ^      ^          |     |                     |
 |   |      |          |     |                     |
 |   v      v          |     |                     |
 | +-----------------+ |     | +-----------------+ |
 | |  buffer manager | |     | |    pgfincore    | |
 | +-----------------+ |     | +-----------------+ |
 +---^------^----------+     +----------^----------+
     |      |smgrread()                 |posix_fadvise()
     |read()|                           |                 userland
 ==========================================================================
     |      |                           |                 kernel
     |      +-------------+-------------+
     |                    |
     |                    v
     |       +------------------------+
     |       | File System            |
     |       |   +-----------------+  |
     +------>|   | FS Buffer Cache |  |
             |   |*PgFincore target|  |
             |   +-----------------+  |
             |    ^       ^           |
             +----|-------|-----------+
                  |       |
 ==========================================================================
                  |       |                               hardware
        +---------|-------|-------------------------+
        |         |       v     Physical Disk       |
        |         |   +------------------+          |
        |         |   | base/16384/24598 |          |
        |         v   +------------------+          |
        | +------------------------------+          |
        | |Buffer Cache Hibernation Files|          |
        | +------------------------------+          |
        +--------------------------------------------+


A little detail: pgfincore stores its data per relation in a file, like
you do.  I rewrote that a bit, and it will store its data directly in
postgresql tables, as well as being able to restore the cache from a
raw bitstring.

 In summary, PgFincore's target is the File System Buffer Cache; Buffer
 Cache Hibernation's target is the DB Buffer Cache (shared buffers).

Correct. (btw I am very happy with your idea and that you got time to do it)


 PgFincore tries to preload database files via posix_fadvise() into the
 File System Buffer Cache, not into the DB Buffer Cache (shared buffers).
 On query execution, the buffer manager will get DB buffer blocks via
 smgrread() from the file system unless the necessary blocks already
 exist in the DB Buffer Cache.  At this point, physical reads may not
 happen because part of (or the entire) database file is already loaded
 into the FS Buffer Cache.

 The gain depends on the file system, especially the size of the File
 System Buffer Cache.
 Preloading a database file is, in short, equivalent to the following command:
 $ cat base/16384/24598 > /dev/null

Not exactly.

There are 2 calls:

 * pgfadv_WILLNEED
 * pgfadv_WILLNEED_snapshot

The former asks to load each segment of a relation, *but* the kernel can
decide not to do that, or to load only part of each segment (so it is not
as brutal as cat file > /dev/null).
The latter reads *exactly* the blocks required in each segment, not all
blocks, unless all were in cache while doing the snapshot. (This one is
part of the snapshot/restore combo.)
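To illustrate the difference (a rough sketch only, not pgfincore's actual
code; the structure here is hypothetical), the snapshot-based restore only
advises the blocks that were recorded as resident:

#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192             /* PostgreSQL block size */

/* hypothetical: one relation segment plus the residency bitmap that was
 * captured when the snapshot was taken */
typedef struct
{
    int             fd;         /* open segment file */
    unsigned        nblocks;    /* number of blocks in the segment */
    unsigned char  *resident;   /* one bit per block */
} SegmentSnapshot;

static void
restore_segment(const SegmentSnapshot *snap)
{
    for (unsigned blk = 0; blk < snap->nblocks; blk++)
    {
        /* only blocks that were in cache at snapshot time get prefetched */
        if (snap->resident[blk / 8] & (1 << (blk % 8)))
            (void) posix_fadvise(snap->fd, (off_t) blk * BLCKSZ, BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }
}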


 I think PgFincore is good for data warehouse applications.

Pgfincore with bitstring storage in a table allows streaming to
HotStandbys and getting better response in case of switch-over/fail-over,
by doing some house-keeping on the HotStandby to keep it really hot
;)

Even web applications have large databases today.

(there is more, but it is not the subject)



 Buffer Cache Hibernation, my approach, is simpler and more straightforward.
 It tries to save/load the contents of the DB Buffer Cache (shared buffers)
 using regular files (called Buffer Cache Hibernation Files).
 At startup, the buffer manager loads DB buffer blocks into the DB Buffer
 Cache from the Buffer Cache Hibernation Files saved at the last
 shutdown.  Note that the database files will not be read, so they are not
 cached in the File System Buffer Cache at all.  Only the contents of the
 DB Buffer Cache are filled.  Therefore, the DB buffer cache miss penalty
 would be larger than PgFincore's.

 The gain depends on the size of shared buffers, and how often the
 

Re: [HACKERS] GSoC 2011: Fast GiST index build

2011-05-05 Thread Alexander Korotkov
2011/5/5 Teodor Sigaev teo...@sigaev.ru

 See several lines above:
    if (parent->blkno == InvalidBlockNumber)

/*
 * end of chain and still didn't found parent, It's
 very-very
 * rare situation when root splited
 */
break;


As I understand it, it's because we can't move the root to another page.


With best regards,
Alexander Korotkov.


Re: [HACKERS] VARIANT / ANYTYPE datatype

2011-05-05 Thread Merlin Moncure
On Wed, May 4, 2011 at 8:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 As a followup idea there exists the desire to store records as records
 and not text representation of same (given differing record types, of
 course), for which it'd be more worthwhile.

 Maybe.  The conventional wisdom is that text representation of data is
 more compact than PG's internal representation by a significant factor
 --- our FAQ says up to 5x, in fact.  I know that that's including row
 overhead and indexes and so on, but I still don't find it to be a given
 that you're going to win on space with this sort of trick.

I've done a lot of testing of the text vs binary wire formats...not
exactly the same set of issues, but pretty close since you have to
send all the oids, lengths, etc.   Conventional wisdom is
correct although overstated for this topic.  Even in truly
pathological cases for text, for example in sending multiple levels of
redundant escaping in complex structures, the text format will almost
always be smaller.  For 'typical' data it can be significantly
smaller.  Two exceptions most people will run into are bytea obviously
and the timestamp family of types where binary style manipulation is a
huge win both in terms of space and performance.
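For what it's worth, a minimal libpq example of requesting binary-format
results (just an illustration; compile with -lpq, error handling mostly
omitted):

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* connection info from environment */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* last argument = 1 requests binary-format results */
    res = PQexecParams(conn, "SELECT 42::int4",
                       0, NULL, NULL, NULL, NULL, 1);

    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        uint32_t    net;

        /* an int4 arrives as 4 bytes in network byte order */
        memcpy(&net, PQgetvalue(res, 0, 0), sizeof(net));
        printf("value = %u, wire length = %d bytes\n",
               ntohl(net), PQgetlength(res, 0, 0));
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}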

For complex data (say 3+ levels of composites stacked in arrays),
binary type formats are much *faster*, albeit larger, as long as you
are not bandwidth constrained, and presumably they would
be as well for variants. Perhaps even more so, because some of the
manipulations made converting tuple storage to binary wire formats
don't have to happen.  That said, while there are use cases for
sending highly structured data over the wire, I can't think of any for
direct storage on a table in variant type scenarios, at least not yet
:-).

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM

2011-05-05 Thread Magnus Hagander
On Wed, May 4, 2011 at 22:42, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, May 4, 2011 at 3:47 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jaime Casanova ja...@2ndquadrant.com writes:
 I want to propose the addition of a new field in IDENTIFY_SYSTEM:
 xlogversion, which will carry XLOG_PAGE_MAGIC from primary.
 The idea of sending that info is to allow us to know if the xlog page
 version of two different major versions are compatible or not.
 Currently pg_upgrade requires the primary to be taken down,

 That's *intentional*.

 The notion of WAL-shipping-replication compatibility between two
 different major versions is insane on its face.  They will not have
 compatible system catalog contents.  You might get perfect replication
 of the master's catalogs, but the slave wouldn't be able to interpret
 them.

 That's exactly how hard in place upgrade was to begin with.

 Considering how valuable this would be, it seems worth it to pursue this.

 The reason we have XLOG_PAGE_MAGIC is really more the opposite: to
 prevent people from trying to recover across a minor version update in
 which we had to break XLOG compatibility.  I don't recall right now
 if that's ever actually happened, but it definitely could.

 If that is true, then allowing this patch will allow us to detect that
 incompatibility when the standby connects to the master, and explain
 the issue in a useful error message. Otherwise we will just barf on
 the magic value.

 Having access to these details might make it possible to upgrade. They
 could be inferred, but it would be better to have the full data so we
 can take an informed decision about whether or not it is possible.

 So even if people don't believe in the rationale behind the patch,
 would allowing it harm anything at this point?

Adding it for the sake of upgrades seems very far fetched.

Adding it for the sake of giving a better error message seems like a
very good idea. But in that case, the client side code to actually
give a better error message should be included from the start, IMHO.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM

2011-05-05 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 So even if people don't believe in the rationale behind the patch,
 would allowing it harm anything at this point?

 Adding it for the sake of upgrades seems very far fetched.

 Adding it for the sake of giving a better error message seems like a
 very good idea. But in that case, the client side code to actually
 give a better error message should be included from the start, IMHO.

What's not apparent to me is how we'll even get to this check; if
there's a mismatch, won't the database system identifier comparison
fail first in most scenarios?

I'm also wondering why send WAL version number and not, say, catalog
version number, if there's some idea that we need more tests than the
system identifier comparison.

Given reasonable answers to these questions, I'd not object to putting
in additional error testing.  I concur with Magnus that the patch should
actually provide those tests, and not just put in an unused field.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Patch to improve style of generate_history.pl perl script

2011-05-05 Thread Bruce Momjian
I have applied the attached patch to improve the style of our
generate_history.pl perl script.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/generate_history.pl b/doc/src/sgml/generate_history.pl
new file mode 100644
index 2b81569..a6c0bd7
*** a/doc/src/sgml/generate_history.pl
--- b/doc/src/sgml/generate_history.pl
***
*** 13,22 
  
  use strict;
  
! my($srcdir) = shift;
! defined($srcdir) || die "$0: missing required argument: srcdir\n";
! my($infile) = shift;
! defined($infile) || die "$0: missing required argument: inputfile\n";
  
  # Emit DOCTYPE header so that the output is a self-contained SGML document
  print "<!DOCTYPE appendix PUBLIC \"-//OASIS//DTD DocBook V4.2//EN\">\n";
--- 13,22 
  
  use strict;
  
! my $srcdir = shift;
! die "$0: missing required argument: srcdir\n" if !defined($srcdir);
! my $infile = shift;
! die "$0: missing required argument: inputfile\n" if !defined($infile);
  
  # Emit DOCTYPE header so that the output is a self-contained SGML document
  print "<!DOCTYPE appendix PUBLIC \"-//OASIS//DTD DocBook V4.2//EN\">\n";
*** process_file($infile);
*** 26,36 
  exit 0;
  
  sub process_file {
! my($filename) = @_;
  
  local *FILE;		# need a local filehandle so we can recurse
  
! my($f) = $srcdir . '/' . $filename;
  open(FILE, $f) || die "could not read $f: $!\n";
  
  while (<FILE>) {
--- 26,36 
  exit 0;
  
  sub process_file {
! my $filename = shift;
  
  local *FILE;		# need a local filehandle so we can recurse
  
! my $f = $srcdir . '/' . $filename;
  open(FILE, $f) || die "could not read $f: $!\n";
  
  while (<FILE>) {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] alpha5

2011-05-05 Thread Bruce Momjian
Robert Haas wrote:
 On Mon, Mar 28, 2011 at 9:26 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Sat, Mar 26, 2011 at 9:41 PM, Robert Haas robertmh...@gmail.com wrote:
  Per previous discussion, I'm going to wrap alpha5 Monday morning
  Eastern time, barring objections.
 
  It seems that the 'make distcheck' build is broken.  Apparently we
  don't have any automated testing of this?  Anyone know what to fix
  here?
 
  Bruce keeps trying to put cross-references where they mustn't go ...
 
 Actually those are all my fault.  Sorry, I'm still learning the ropes.
  I didn't realize xref couldn't be used in the release notes; it looks
 like Bruce used link rather than xref for the things I used xref
 for.

FYI, this is documented in release.sgml in an SGML comment:

For new features, add links to the documentation sections.  Use </link>
not just </> so that generate_history.pl can remove it, so HISTORY.html
can be created without links to the main documentation.  Don't use
<xref>.

<xref> can't be used because there is no text to keep, and </link> (not
</>) has to be used so we remove the right tag.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Visibility map and hint bits

2011-05-05 Thread Bruce Momjian
There has been a lot of recent discussion about the visibility map (for
index-only scans) and hint bits (trying to avoid double-writing a
table).

I wonder if we could fix both of these at the same time.  Once the
visibility map is reliable, can we use that to avoid updating the hint
bits on all rows on a page?

For bulk loads, all the pages are going to have the same xid and all be
visible, so instead of writing the entire table, we just write the
visibility map.

I think the problem is that we have the PD_ALL_VISIBLE page flag, which
requires a write of the page as well.  Could we get by with only the
visibility bits and remove PD_ALL_VISIBLE?

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VARIANT / ANYTYPE datatype

2011-05-05 Thread Jim Nasby
On May 4, 2011, at 6:24 PM, Andrew Dunstan wrote:
 I'm far from convinced that storing deltas per column rather than per record 
 is a win anyway. I don't have hard numbers to hand, but my vague recollection 
 is that my tests showed it to be a design that used more space.

It depends on how many fields you're changing in one go and how wide the table 
is. It's also a PITA to identify what fields actually changed if you're storing 
everything. In the case of logging, I'd say that what's really needed is a way 
to store a table record that has an indicator of what fields actually changed 
(and possibly not storing anything for fields that didn't change). That table 
record would need to also deal with changes to the underlying table structure.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve style of generate_history.pl perl script

2011-05-05 Thread Andrew Dunstan



On 05/05/2011 12:49 PM, Bruce Momjian wrote:

I have applied the attached patch to improve the style of our
generate_history.pl perl script.



Wow, really? There's nothing wrong with the changes but they seem pretty 
trivial and pointless to me. Maybe I've drunk the perl TIMTOWTDI koolaid 
too much.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FDW table hints

2011-05-05 Thread Magnus Hagander
On Tue, May 3, 2011 at 16:19, Tom Lane t...@sss.pgh.pa.us wrote:
 Dave Page dp...@pgadmin.org writes:
 On Tue, May 3, 2011 at 10:33 AM, Susanne Ebrecht
 susa...@2ndquadrant.com wrote:
 When we make such a hint for foreign tables then we should make a similar
 hint for views.

 A view really isn't a table, unlike a foreign table, so I don't think
 that argument holds.

 Well, from the implementation standpoint a foreign table is a lot more
 like a view than it is like a table.  I think the real point is that a
 hint for this on views would be a waste of translator manpower, because
 we've not heard of anyone making that mistake.

The *implementation* is, in this case, IMHO irrelevant. The relevant
part is what it looks like to the *user*, and to the user a foreign
table looks a lot more like a table than a view does.

Since I brought it up - a patch along this line?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ff84045..e3d902e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,10 +183,18 @@ DefineIndex(RangeVar *heapRelation,
 	/* Note: during bootstrap may see uncataloged relation */
 	if (rel->rd_rel->relkind != RELKIND_RELATION &&
 		rel->rd_rel->relkind != RELKIND_UNCATALOGED)
+	{
+		if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+			ereport(ERROR,
+					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					 errmsg("cannot create index on FOREIGN TABLE \"%s\"",
+							heapRelation->relname)));
+
 		ereport(ERROR,
 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				 errmsg("\"%s\" is not a table",
 						heapRelation->relname)));
+	}
 
 	/*
 	 * Don't try to CREATE INDEX on temp tables of other backends.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VARIANT / ANYTYPE datatype

2011-05-05 Thread Andrew Dunstan



On 05/05/2011 01:00 PM, Jim Nasby wrote:

On May 4, 2011, at 6:24 PM, Andrew Dunstan wrote:

I'm far from convinced that storing deltas per column rather than per record is 
a win anyway. I don't have hard numbers to hand, but my vague recollection is 
that my tests showed it to be a design that used more space.

It depends on how many fields you're changing in one go and how wide the table 
is. It's also a PITA to identify what fields actually changed if you're storing 
everything.


No it's not. Instead of storing OLD/NEW, store a base record and a delta 
record (an hstore with just the changed fields) for an update. This 
saves space and means you only have to calculate what changed once.




cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FDW table hints

2011-05-05 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Since I brought it up - a patch along this line?

Please don't capitalize foreign table there.

Also, I think an else before the other ereport would make the code
clearer, even though it's not strictly necessary.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FDW table hints

2011-05-05 Thread Magnus Hagander
On Thu, May 5, 2011 at 19:22, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Since I brought it up - a patch along this line?

 Please don't capitalize foreign table there.

Yeah, I was a bit split about that myself. Will change.


 Also, I think an else before the other ereport would make the code
 clearer, even though it's not strictly necessary.

Agreed.

Unless someone objects, I'll go put that in later tonight. (along with
a comment on why there are two different msgs)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve style of generate_history.pl perl script

2011-05-05 Thread Bruce Momjian
Andrew Dunstan wrote:
 
 
 On 05/05/2011 12:49 PM, Bruce Momjian wrote:
  I have applied the attached patch to improve the style of our
  generate_history.pl perl script.
 
 
 Wow, really? There's nothing wrong with the changes but they seem pretty 
 trivial and pointless to me. Maybe I've drunk the perl TIMTOWTDI koolaid 
 too much.

I wanted it to be consistent with our other Perl script.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: fix race in SSI's CheckTargetForConflictsIn

2011-05-05 Thread Kevin Grittner
Dan Ports d...@csail.mit.edu wrote:
 While running some benchmarks to test SSI performance, I found a
 race condition that's capable of causing a segfault. A patch is
 attached.
 
 The bug is in CheckTargetForConflictsIn, which scans the list of
 SIREAD locks on a lock target when it's modified. There's an
 optimization in there where the writing transaction will remove a
 SIREAD lock that it holds itself, because it's being replaced with
 a (stronger) write lock. To do that, it needs to drop its shared
 lwlocks and reacquire them in exclusive mode. The existing code
 deals with concurrent modifications in that interval by redoing
 checks. However, it misses the case where some other transaction
 removes all remaining locks on the target, and proceeds to remove
 the lock target itself.
 
 The attached patch fixes this by deferring the SIREAD lock removal
 until the end of the function. At that point, there isn't any need
 to worry about concurrent updates to the target's lock list. The
 resulting code is also simpler.
 
I just want to confirm that this addresses a real (although very
narrow) race condition.  It results from code used to implement a
valuable optimization described in Cahill's thesis: don't get or
keep an SIREAD lock on a row which has a write lock.  The write lock
is stronger and will cause a write/write conflict with any
overlapping transactions which would care about a read/write
conflict.  The pattern of reading a row and then updating or
deleting it is so common that this optimization does a lot to avoid
promotion of predicate locks to coarser granularity, and thereby
helps avoid false positives.
 
While the optimization is valuable, the code used to implement it
was pretty horrid.  (And that was me that wrote it.)  It has already
been the cause of several other fixes since the main patch went in. 
What Dan has done here is move the optimization out of the middle of
the loop which is doing the conflict detection, and in doing so has
reduced the number of lines of code needed, reduced the amount of
fiddling with LW locks, and all around made the code more robust and
more understandable.
 
I've reviewed the patch and it looks good to me.  Dan has beat up on
it with the same DBT-2 run which exposed the race condition without
seeing a problem.  Although a much smaller patch could address the
immediate problem, I strongly feel that Dan has taken the right
approach by refactoring this bit to something fundamentally cleaner
and less fragile.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map and hint bits

2011-05-05 Thread Merlin Moncure
On Thu, May 5, 2011 at 11:59 AM, Bruce Momjian br...@momjian.us wrote:
 There has been a lot of recent discussion about the visibility map (for
 index-only scans) and hint bits (trying to avoid double-writing a
 table).

I still think a small tqual.c maintained cache of hint bits will
effectively eliminate hint bit i/o issues surrounding bulk loads.  Tom
fired a shot across the bow regarding the general worthiness of that
technique though (see:
http://postgresql.1045698.n5.nabble.com/Process-local-hint-bit-cache-td4270229.html)
:(.  I can rig up a cleaned up version of the patch pretty
easily...it's a local change and fairly simple.

I don't think there is any way to remove the hint bits without
suffering some other problem.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map and hint bits

2011-05-05 Thread Bruce Momjian
Merlin Moncure wrote:
 On Thu, May 5, 2011 at 11:59 AM, Bruce Momjian br...@momjian.us wrote:
  There has been a lot of recent discussion about the visibility map (for
  index-only scans) and hint bits (trying to avoid double-writing a
  table).
 
 I still think a small tqual.c maintained cache of hint bits will
 effectively eliminate hint bit i/o issues surrounding bulk loads.  Tom
 fired a shot across the bow regarding the general worthiness of that
 technique though (see:
 http://postgresql.1045698.n5.nabble.com/Process-local-hint-bit-cache-td4270229.html)
 :(.  I can rig up a cleaned up version of the patch pretty
 easily...it's a local change and fairly simple.
 
 I don't think there is any way to remove the hint bits without
 suffering some other problem.

Was that the idea that the pages had to fit in the cache and be updated
with hint bits before being written to disk?  Restricting that to the
size of the buffer cache seemed very limiting.

One 8k visibility map page can hold bits for 1/2 gig of heap pages
(8192 bytes * 8 bits per byte = 65536 heap pages of 8kB each = 512MB),
so I thought that would be a better all-visible indicator and avoid many
all-visible page writes in bulk load cases.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comments on system tables and columns

2011-05-05 Thread Bruce Momjian
Alvaro Herrera wrote:
 Excerpts from Thom Brown's message of lun mar 28 08:14:07 -0300 2011:
  Hi,
  
  I notice that none of the system tables or columns thereof bear any
  comments.  Is this intentional, or an oversight?  I would have thought
  comments would be useful since the column names aren't exactly always
  self-explanatory.
 
 Bruce has been working on changes to have catalog objects (tables, views
 and columns) contain comments, but he deferred it to 9.2 because it
 involved nontrivial pieces of infrastructure (mainly to avoid
 duplication with the SGML catalog documentation).

Attached are diffs that change the Makefile and initdb, and a perl
script to pull the system view comments out of the SGML docs.  I need to
do more work to pull stuff for the system tables.  This does work in
testing.

I will work on this more for 9.2.  

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
new file mode 100644
index 3a83461..68ed5fe
*** a/src/backend/catalog/Makefile
--- b/src/backend/catalog/Makefile
*** OBJS = catalog.o dependency.o heap.o ind
*** 16,22 
 pg_operator.o pg_proc.o pg_db_role_setting.o pg_shdepend.o pg_type.o \
 storage.o toasting.o
  
! BKIFILES = postgres.bki postgres.description postgres.shdescription
  
  include $(top_srcdir)/src/backend/common.mk
  
--- 16,22 
 pg_operator.o pg_proc.o pg_db_role_setting.o pg_shdepend.o pg_type.o \
 storage.o toasting.o
  
! BKIFILES = postgres.bki postgres.description postgres.shdescription system_view_comments.sql
  
  include $(top_srcdir)/src/backend/common.mk
  
*** schemapg.h: postgres.bki ;
*** 59,70 
--- 59,74 
  postgres.bki: genbki.pl Catalog.pm $(POSTGRES_BKI_SRCS)
  	$(PERL) -I $(catalogdir) $< $(pg_includes) --set-version=$(MAJORVERSION) $(POSTGRES_BKI_SRCS)
  
+ system_view_comments.sql: $(top_builddir)/doc/src/sgml/catalogs.sgml gen_comments.pl
+ 	$(PERL) $(srcdir)/gen_comments.pl $< > $@
+ 
  .PHONY: install-data
  install-data: $(BKIFILES) installdirs
  	$(INSTALL_DATA) $(call vpathsearch,postgres.bki) '$(DESTDIR)$(datadir)/postgres.bki'
  	$(INSTALL_DATA) $(call vpathsearch,postgres.description) '$(DESTDIR)$(datadir)/postgres.description'
  	$(INSTALL_DATA) $(call vpathsearch,postgres.shdescription) '$(DESTDIR)$(datadir)/postgres.shdescription'
  	$(INSTALL_DATA) $(srcdir)/system_views.sql '$(DESTDIR)$(datadir)/system_views.sql'
+ 	$(INSTALL_DATA) $(srcdir)/system_view_comments.sql '$(DESTDIR)$(datadir)/system_view_comments.sql'
  	$(INSTALL_DATA) $(srcdir)/information_schema.sql '$(DESTDIR)$(datadir)/information_schema.sql'
  	$(INSTALL_DATA) $(srcdir)/sql_features.txt '$(DESTDIR)$(datadir)/sql_features.txt'
  
*** installdirs:
*** 73,79 
  
  .PHONY: uninstall-data
  uninstall-data:
! 	rm -f $(addprefix '$(DESTDIR)$(datadir)'/, $(BKIFILES) system_views.sql information_schema.sql sql_features.txt)
  
  # postgres.bki, postgres.description, postgres.shdescription, and schemapg.h
  # are in the distribution tarball, so they are not cleaned here.
--- 77,83 
  
  .PHONY: uninstall-data
  uninstall-data:
! 	rm -f $(addprefix '$(DESTDIR)$(datadir)'/, $(BKIFILES) system_views.sql system_view_comments.sql information_schema.sql sql_features.txt)
  
  # postgres.bki, postgres.description, postgres.shdescription, and schemapg.h
  # are in the distribution tarball, so they are not cleaned here.
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
new file mode 100644
index acd2514..f0d72e9
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
*** static char *dictionary_file;
*** 102,107 
--- 102,108 
  static char *info_schema_file;
  static char *features_file;
  static char *system_views_file;
+ static char *system_view_comments_file;
  static bool made_new_pgdata = false;
  static bool found_existing_pgdata = false;
  static bool made_new_xlogdir = false;
*** static void setup_auth(void);
*** 166,171 
--- 167,173 
  static void get_set_pwd(void);
  static void setup_depend(void);
  static void setup_sysviews(void);
+ static void append_sysviews(char **lines);
  static void setup_description(void);
  static void setup_collation(void);
  static void setup_conversion(void);
*** setup_depend(void)
*** 1419,1432 
  static void
  setup_sysviews(void)
  {
! 	PG_CMD_DECL;
! 	char	  **line;
! 	char	  **sysviews_setup;
  
  	fputs(_("creating system views ... "), stdout);
  	fflush(stdout);
  
! 	sysviews_setup = readfile(system_views_file);
  
  	/*
  	 * We use -j here to avoid backslashing stuff in system_views.sql
--- 1421,1452 
  static void
  setup_sysviews(void)
  {
! 	char	  **sysview_lines;
! 	char	  **sysview_comment_lines;
  
  	fputs(_("creating system views ... "), stdout);
  	

Re: [HACKERS] Visibility map and hint bits

2011-05-05 Thread Robert Haas
On Thu, May 5, 2011 at 12:59 PM, Bruce Momjian br...@momjian.us wrote:
 I wonder if we could fix both of these at the same time.  Once the
 visibility map is reliable, can we use that to avoid updating the hint
 bits on all rows on a page?

I don't think so.  There are two problems:

1. If there is a long-running transaction on the system, it will not
be possible to set PD_ALL_VISIBLE, but hint bits can still be set.  So
there could be a significant performance regression if we don't set
hint bits in that case.

2. Making the visibility map crash-safe will mean making setting hint
bits emit XLOG records, so it can't be done on Hot Standby servers at
all, and it's much more expensive than just setting a hint bit on the
master.

 For bulk loads, all the pages are going have the same xid and all be
 visible, so instead of writing the entire table, we just write the
 visibility map.

 I think the problem is that we have the PD_ALL_VISIBLE page flag, which
 requires a write of the page as well.  Could we get by with only the
 visibility bits and remove PD_ALL_VISIBLE?

In some ways, that would make things much simpler.  But to make that
work, every insert/update/delete to a page would have to pin the
visibility map page and clear PD_ALL_VISIBLE if appropriate, so it
might not be good from a performance standpoint, especially in
high-concurrency workloads.  Right now, if PD_ALL_VISIBLE isn't set,
we don't bother touching the visibility map page, which seems like a
possibly important optimization.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map and hint bits

2011-05-05 Thread Merlin Moncure
On Thu, May 5, 2011 at 1:34 PM, Bruce Momjian br...@momjian.us wrote:
 Merlin Moncure wrote:
 On Thu, May 5, 2011 at 11:59 AM, Bruce Momjian br...@momjian.us wrote:
  There has been a lot of recent discussion about the visibility map (for
  index-only scans) and hint bits (trying to avoid double-writing a
  table).

 I still think a small tqual.c maintained cache of hint bits will
 effectively eliminate hint bit i/o issues surrounding bulk loads.  Tom
 fired a shot across the bow regarding the general worthiness of that
 technique though (see:
 http://postgresql.1045698.n5.nabble.com/Process-local-hint-bit-cache-td4270229.html)
 :(.  I can rig up a cleaned up version of the patch pretty
 easily...it's a local change and fairly simple.

 I don't think there is any way to remove the hint bits without
 suffering some other problem.

 Was that the idea that the pages had to fit in the cache and be updated
 with hint bits before being written to disk?  Restricting that to the
 size of the buffer cache seemed very limiting.

 One 8k visibilty map page can hold bits for 1/2 gig of heap pages so I
 thought that would be a better all-visible indictor and avoid many
 all-visible page writes in bulk load cases.

no, that was my first idea -- check visibility when you evict.  that
helps a different problem but not bulk loads.  One way it could help
is for marking PD_ALL_VISIBLE.  This might also be a winner but there
is some valid skepticism that adding more work for bgwriter is really
a good idea.

The tqual cache idea is such that there is a small cache that
remembers the commit/cancel status of recently seen transactions.  If
you scan a tuple and the status is known via the cache, you set the bit
but don't mark the page dirty.  That way, if you are scanning a lot of
unhinted tuples with similar xids, you don't need to jam out i/o.  I
think the general concept is clean, but it might need some buy-in from
Tom and some performance testing for justification.
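To make the shape of it concrete, a minimal sketch (hypothetical layout,
not the actual patch) of what such a backend-local cache could look like:

#include <stdint.h>

typedef uint32_t TransactionId;

#define XID_CACHE_SLOTS 64      /* tiny: stays in a few cache lines */

typedef struct
{
    TransactionId   xid;
    char            status;     /* 'C' committed, 'A' aborted, 0 unknown */
} XidCacheEntry;

static XidCacheEntry xid_cache[XID_CACHE_SLOTS];

/* returns the cached status, or 0 if this xid isn't in the cache */
static char
xid_cache_lookup(TransactionId xid)
{
    XidCacheEntry *e = &xid_cache[xid % XID_CACHE_SLOTS];

    return (e->xid == xid) ? e->status : 0;
}

/* remember the outcome; only called once it is known to be durable */
static void
xid_cache_store(TransactionId xid, char status)
{
    XidCacheEntry *e = &xid_cache[xid % XID_CACHE_SLOTS];

    e->xid = xid;
    e->status = status;
}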

The alternate 'cleaner' approach of maintaining a larger transam.c
cache had some downsides I saw no simple workaround for.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map and hint bits

2011-05-05 Thread Kevin Grittner
Merlin Moncure mmonc...@gmail.com wrote:
 
 a small cache that remembers the commit/cancel status of recently
 seen transactions.
 
How is that different from the head of the clog SLRU?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] new clang report

2011-05-05 Thread Peter Eisentraut
On mån, 2011-05-02 at 14:45 -0400, Tom Lane wrote:
  So issue here is actually that clang has an option
 
 -fmath-errno
 Indicate that math functions should be treated as
 updating errno.
 
 Really?  It looks to me like the issue is that pow() is returning NaN
 instead of Inf for an out-of-range result.  That's a bug: the correct
 result is *not* ill-defined, it's simply too large to represent.
 If that has anything to do with errno, it's an implementation artifact
 that's unrelated to the claimed meaning of the switch.
 
 But I would also note that the Single Unix Spec is unequivocal about
 this case:
 
 If the correct value would cause overflow, +-HUGE_VAL is
 returned, and errno is set to [ERANGE].
 
 That's IS set, not may be set as in some other cases.  So this
 behavior should not depend on any such compiler switch anyway, unless
 the intent of the switch is ignore the standard and do whatever we
 feel like.

Well, the intent of the switch actually appears to be to follow the
standard and do what it says, with the default behavior being the
questionable one.  So we could easily settle on saying that you need to
use that switch, otherwise it's not supported.  (This is actually quite
similar to the old days when some systems had -ffast-math as the default.)

Btw., when you build a simple test program in the default mode, pow()
indeed returns Inf on overflow.  There appear to be some code generation
or optimization problems when it builds the postgres code, because the
problem goes away with either -O0 or by inserting an elog or something
like that after the pow() call.
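For reference, the kind of trivial test program I mean (per SUS, an
overflowing pow() should return +/-HUGE_VAL and set errno to ERANGE where
math_errhandling includes MATH_ERRNO; whether it actually does is exactly
what's in question here):

#include <errno.h>
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      r;

    errno = 0;
    r = pow(1e308, 2.0);        /* overflows a double */
    printf("result = %g, errno = %d (%s)\n",
           r, errno, errno == ERANGE ? "ERANGE" : "not ERANGE");
    return 0;
}

Building that with and without -fmath-errno or -O0 should show whether the
inlined code path is the one misbehaving.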



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map and hint bits

2011-05-05 Thread Merlin Moncure
On Thu, May 5, 2011 at 2:00 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Merlin Moncure mmonc...@gmail.com wrote:

 a small cache that remembers the commit/cancel status of recently
 seen transactions.

 How is that different from the head of the clog SLRU?

several things:
*) any slru access requires a lock (besides the lock itself, you are
spending cycles in the critical path)
*) cache access happens at a different stage of processing in
HeapTupleSatisfiesMVCC: both TransactionIdIsCurrentTransactionId and
TransactionIdIsInProgress have to be checked first.  Logically, it's an
extension of the hint bit check itself, not an expansion of the lower
levels of caching
*) in tqual.c you can sneak in some small optimizations like only
caching the bit if it's known good in the WAL (XlogNeedsFlush).  That
way you don't need to keep checking it over and over for the same
transaction
*) slru level accesses happen too late to give much benefit

I can't stress enough how tight HeapTupleSatisfiesMVCC is.  On my
workstation VM, each non-inline function call shows up measurably in
profiling.  I think anything you do here has to be inline, hand-rolled,
and very tight (you can forget anything around dynahash).  Delegating
the cache management to the transam or (even worse) slru level
penalizes some workloads non-trivially.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FDW table hints

2011-05-05 Thread Magnus Hagander
On Thu, May 5, 2011 at 19:26, Magnus Hagander mag...@hagander.net wrote:
 On Thu, May 5, 2011 at 19:22, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 Since I brought it up - a patch along this line?

 Please don't capitalize foreign table there.

 Yeah, I was a bit split about that myself. Will change.


 Also, I think an else before the other ereport would make the code
 clearer, even though it's not strictly necessary.

 Agreed.

 Unless someone objects, I'll go put that in later tonight. (along with
 a comment on why there are two different msgs)


Done.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Peter Geoghegan
There is a general need to have Postgres consume fewer CPU cycles and
less power when idle. Until something is done about this, shared
hosting providers, particularly those who want to deploy many VM
instances with databases, will continue to choose MySQL out of hand.

I have quantified the difference in the number of wake-ups when idle
between Postgres and MySQL using Intel's powertop utility on my
laptop, which runs Fedora 14. These figures are for a freshly initdb'd
database from git master, and mysql-server 5.1.56 from my system's
package manager.

*snip*
   2.7% ( 11.5)   [  ] postgres
   1.1% (  4.6)   [  1663] Xorg
   0.9% (  3.7)   [  1463] wpa_supplicant
   0.6% (  2.7)   [  ] [ahci] interrupt
   0.5% (  2.2)   [  ] mysqld
*snip*

Postgres consistently has 11.5 wakeups per second, while MySQL
consistently has 2.2 wakeups (averaged over the 5 second period that
each cycle of instrumentation lasts).

If I turn on archiving, the figure for Postgres naturally increases:

*snip*
   1.7% ( 12.5)   [  ] postgres
   1.6% ( 12.0)   [   808] phy0
   0.7% (  5.4)   [  1463] wpa_supplicant
   0.6% (  4.3)   [  ] [ahci] interrupt
   0.3% (  2.2)   [  ] mysqld
*snip*

It increases by exactly the amount that you'd expect after looking at
pgarch.c - one wakeup per second. This is because there is a loop
within the main event loop for the process that is a prime example of
what unix_latch.c describes as the common pattern of using
pg_usleep() or select() to wait until a signal arrives, where the
signal handler sets a global variable. The loop naps for one second
per iteration.
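For reference, the latch-based shape of that loop looks roughly like this
(a sketch only, assuming the current WaitLatch(latch, timeout-in-microseconds)
interface from unix_latch.c; exact signatures may differ, signal registration
is omitted, and the work function is a placeholder):

#include "postgres.h"
#include "storage/latch.h"

static volatile sig_atomic_t wakened = false;
static volatile Latch mainloop_latch;

extern void do_one_cycle(void);     /* placeholder for the real work */

/* signal handler: note the request, then end any WaitLatch() in progress */
static void
wakeup_handler(SIGNAL_ARGS)
{
	wakened = true;
	SetLatch(&mainloop_latch);
}

static void
main_loop(void)
{
	InitLatch(&mainloop_latch);

	for (;;)
	{
		ResetLatch(&mainloop_latch);

		if (wakened)
		{
			wakened = false;
			do_one_cycle();
		}

		/* sleep until signalled, or at most 60 seconds */
		WaitLatch(&mainloop_latch, 60 * 1000000L);
	}
}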

Attached is the first in what I hope will become a series of patches
for reducing power consumption when idle. It makes the archiver
process wake far less frequently, using a latch primitive,
specifically a non-shared latch. I'm not sure if I should have used a
shared latch, and have SetLatch() calls replace
SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) calls. Would that have
broken some implied notion of encapsulation? In any case, if I apply
the patch and rebuild, the difference is quite apparent:

***snip***
 3.9% ( 21.8)   [  1663] Xorg
   3.2% ( 17.9)   [  ] [ath9k] interrupt
   2.1% ( 11.9)   [   808] phy0
   2.1% ( 11.5)   [  ] postgres
   1.0% (  5.4)   [  1463] wpa_supplicant
   0.4% (  2.2)   [  ] mysqld
***snip***

The difference from not running the archiver at all appears to have
been completely eliminated (in fact, we still wake up every
PGARCH_AUTOWAKE_INTERVAL seconds, which is 60 seconds, but that
usually isn't apparent to powertop, which measures wakeups over 5
second periods).

If we could gain similar decreases in idle power consumption across
all Postgres ancillary processes, perhaps we'd see Postgres available
as an option for shared hosting plans more frequently. When these
differences are multiplied by thousands of VM instances, they really
matter. Unfortunately, there doesn't seem to be a way to get powertop
to display its instrumentation per-process to quickly get a detailed
overview of where those wake-ups occur across all pg processes.

I hope to work on reducing wakeups for PG ancillary processes in this
order (order of perceived difficulty), using shared latches to
eliminate the waiting pattern in each case:

* WALWriter
* BgWriter
* WALReceiver
* Startup process

I'll need to take a look at statistics, autovacuum and Logger
processes too, to see if they present more subtle opportunities for
reduced idle power consumption.

Do constants like PGARCH_AUTOWAKE_INTERVAL need to always be set at
their current, conservative levels? Perhaps these sorts of values
could be collectively controlled with a single GUC that represents a
trade-off between CPU cycles used when idle and safety/reliability.
On the other hand, there are GUCs that control
that per process in some cases already, such as wal_writer_delay, and
that suggestion could well be a bit woolly. It might be an enum value
that represented various levels of concern that would default to
something like 'conservative' (i.e. the current values).

Thoughts?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index b40375a..01e5350 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -44,6 +44,7 @@
 #include "storage/pmsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
+#include "storage/latch.h"
 
 
 /* --
@@ -87,6 +88,12 @@ static volatile sig_atomic_t got_SIGTERM = false;
 static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
+/*
+ * Latch that archiver loop waits on until it is awakened by 
+ * signals, each of which there is a handler for
+ */
+static volatile Latch mainloop_latch;
+
 /* --
  * Local function forward declarations
  * --
@@ -228,6 +235,8 @@ PgArchiverMain(int 

Re: [HACKERS] new clang report

2011-05-05 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 Btw., when you build a simple test program in the default mode, pow()
 indeed returns Inf on overflow.  There appear to be some code generation
 or optimization problems when it builds the postgres code, because the
 problem goes away with either -O0 or by inserting an elog or something
 like that after the pow() call.

Hmm.  Sounds to me like clang is trying to insert an inlined version of
pow() that gets this case wrong.  Any of -fmath-errno, -O0, or possibly
other things discourage it from doing that, and then the non-inline code
gets it right.  Bug for sure.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 Attached is the first in what I hope will become a series of patches
 for reducing power consumption when idle.

Cool.  This has been on my personal to-do list for awhile, but it keeps
on failing to get to the top, so I'm glad to see somebody else putting
time into it.

The major problem I'm aware of for getting rid of periodic wakeups is
the need for child processes to notice when the postmaster has died
unexpectedly.  Your patch appears to degrade the archiver's response
time for that really significantly, like from O(1 sec) to O(1 min),
which I don't think is acceptable.  We've occasionally kicked around
ideas for mechanisms that would solve this problem, but nothing's gotten
done.  It doesn't seem to be an easy problem to solve portably...

 +  * The caveat about signals invalidating the timeout of 
 +  * WaitLatch() on some platforms can be safely disregarded, 

Really?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Alvaro Herrera
Excerpts from Peter Geoghegan's message of Thu May 05 16:49:25 -0300 2011:

 I'll need to take a look at statistics, autovacuum and Logger
 processes too, to see if they present more subtle opportunities for
 reduced idle power consumption.

More subtle?  Autovacuum wakes up once per second and it could sleep a
lot longer if it weren't for the loop that checks for signals.  I think
that could be improved a lot.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread A.M.

On May 5, 2011, at 4:08 PM, Alvaro Herrera wrote:

 Excerpts from Peter Geoghegan's message of Thu May 05 16:49:25 -0300 2011:
 
 I'll need to take a look at statistics, autovacuum and Logger
 processes too, to see if they present more subtle opportunities for
 reduced idle power consumption.
 
 More subtle?  Autovacuum wakes up once per second and it could sleep a
 lot longer if it weren't for the loop that checks for signals.  I think
 that could be improved a lot.

Could kqueue be of use here? Non-kqueue-supporting platforms could always fall 
back to the existing select().
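
Something along these lines, perhaps -- an untested, BSD-only sketch that
waits for a process to exit via EVFILT_PROC/NOTE_EXIT rather than waking
up on a timer to poll for it:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	pid_t		parent = getppid();	/* stand-in for the postmaster pid */
	int			kq = kqueue();
	struct kevent kev;

	if (kq < 0)
		return 1;

	/* Register interest in the parent's exit */
	EV_SET(&kev, parent, EVFILT_PROC, EV_ADD, NOTE_EXIT, 0, 0);
	if (kevent(kq, &kev, 1, NULL, 0, NULL) < 0)
		return 1;

	/* Sleeps with no periodic wakeups until the watched process exits */
	if (kevent(kq, NULL, 0, &kev, 1, NULL) == 1)
		printf("process %d exited\n", (int) parent);

	close(kq);
	return 0;
}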

Cheers,
M
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Robert Haas
On Thu, May 5, 2011 at 4:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 +              * The caveat about signals invalidating the timeout of
 +              * WaitLatch() on some platforms can be safely disregarded,

 Really?

I'm a bit confused by the phrasing of this comment as well, but it
does seem to me that if all the relevant signal handlers set the
latch, then it ought not to be necessary to break the sleep down into
one-second intervals.
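
(For concreteness, the pattern being assumed is roughly the following;
this is an illustrative sketch modelled on pgarch.c's existing SIGUSR1
handler rather than lifted from the patch itself:)

static void
pgarch_waken(SIGNAL_ARGS)
{
	int			save_errno = errno;

	wakened = true;
	SetLatch(&mainloop_latch);	/* releases any WaitLatch() in progress */

	errno = save_errno;
}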

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [BUGS] BUG #5957: createdb with description and md5 auth forces to provide password twice

2011-05-05 Thread Bruce Momjian
Grzegorz Szpetkowski wrote:
 
 The following bug has been logged online:
 
 Bug reference:  5957
 Logged by:  Grzegorz Szpetkowski
 Email address:  gszpetkow...@gmail.com
 PostgreSQL version: 9.0.3
 Operating system:   Ubuntu 10.10
 Description:createdb with description and md5 auth forces to provide
 password twice
 Details: 
 
 How to reproduce the problem:
 
 1.Create new role with (encrypted password):
 createuser -SdRP user
 2.In PostgreSQL 9.0.3 I found pg_hba.conf with "local all all ident", so I
 changed it to "local all all md5"
 3.Restart/Reload used cluster
 4.Execute createdb -U user mydb "My DB Description"
 
 Output:
 
 Password: 
 Password: 
 
 The createdb command prompts for the password twice, and I think that's
 improper behaviour (and the documentation is silent about it).

Interesting.  This is happening because we are connecting to one
database to create the new database, and then connecting to the new
database to add the comment.

Prior to PG 8.2, this was necessary to put the comment on the database,
but now that we have the shared comment/description table
pg_shdescription, this is not necessary.

Do we need createdb to be able to create databases for pre-8.2 clusters?
If not, the attached patch fixes the double-prompting.

Also, why is this code used to create the new database?

conn = connectDatabase(strcmp(dbname, "postgres") == 0 ? "template1" : "postgres",
                       host, port, username, prompt_password, progname);

Do we assume more users can connect to the 'postgres' database, but we
want 'postgres' to connect to 'template1' in case it wants to drop the
'postgres' database?  Whatever the purpose, this code certainly needs a
comment.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
new file mode 100644
index 9b72eac..b1c417b
*** a/src/bin/scripts/createdb.c
--- b/src/bin/scripts/createdb.c
*** main(int argc, char *argv[])
*** 208,219 
  	}
  
  	PQclear(result);
- 	PQfinish(conn);
  
  	if (comment)
  	{
- 		conn = connectDatabase(dbname, host, port, username, prompt_password, progname);
- 
  		printfPQExpBuffer(&sql, "COMMENT ON DATABASE %s IS ", fmtId(dbname));
  		appendStringLiteralConn(&sql, comment, conn);
  		appendPQExpBuffer(&sql, ";\n");
--- 208,216 
*** main(int argc, char *argv[])
*** 231,239 
  		}
  
  		PQclear(result);
- 		PQfinish(conn);
  	}
  
  	exit(0);
  }
  
--- 228,237 
  		}
  
  		PQclear(result);
  	}
  
+ 	PQfinish(conn);
+ 	
  	exit(0);
  }
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, May 5, 2011 at 4:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 +  * The caveat about signals invalidating the timeout of
 +  * WaitLatch() on some platforms can be safely disregarded,

 Really?

 I'm a bit confused by the phrasing of this comment as well, but it
 does seem to me that if all the relevant signal handlers set the
 latch, then it ought not to be necessary to break the sleep down into
 one-second intervals.

[ reads code some more ... ]  Yeah, I think you are probably right,
which makes it just a badly phrased comment.  The important point here
is that the self-pipe trick in unix_latch.c fixes the problem, so long
as we are relying on latch release and NOT timeout-driven wakeup.

What that really means is that any WaitOnLatch call with a finite
timeout ought to be viewed with a jaundiced eye.  Ideally, we want them
all to be waiting for latch release and nothing else.  I'm concerned
that we're going to be moving towards some intermediate state where we
have WaitOnLatch calls with very long timeouts, because the longer the
timeout, the worse the problem gets on platforms that have the problem.
If you have say a 1-minute timeout, it's not difficult to believe that
you'll basically never wake up because of random signals resetting the
timeout.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [BUGS] BUG #5957: createdb with description and md5 auth forces to provide password twice

2011-05-05 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Prior to PG 8.2, this was necessary to put the comment on the database,
 but now that we have the shared comment/description table
 pg_shdescription, this is not necessary.

 Do we need createdb to be able to create databases for pre-8.2 clusters?
 If not, the attached patch fixes the double-prompting.

Well, if you're only going to change this in HEAD, that might be an
acceptable limitation, but if you intend to back-patch I think not.
Older versions of createdb are probably significantly more likely to
be used with even-older servers.

Seems like it wouldn't be that hard to test the server version and only
reconnect if it's pre-8.2.
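
Something like this in createdb.c, say -- just a sketch of the idea, using
libpq's PQserverVersion() and the existing connectDatabase() helper from
src/bin/scripts:

	if (comment)
	{
		/* Only pre-8.2 servers lack pg_shdescription and need the reconnect */
		if (PQserverVersion(conn) < 80200)
		{
			PQfinish(conn);
			conn = connectDatabase(dbname, host, port, username,
								   prompt_password, progname);
		}

		/* ... then issue COMMENT ON DATABASE over "conn" as before ... */
	}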

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [BUGS] BUG #5957: createdb with description and md5 auth forces to provide password twice

2011-05-05 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Prior to PG 8.2, this was necessary to put the comment on the database,
  but now that we have the shared comment/description table
  pg_shdescription, this is not necessary.
 
  Do we need createdb to be able to create databases for pre-8.2 clusters?
  If not, the attached patch fixes the double-prompting.
 
 Well, if you're only going to change this in HEAD, that might be an
 acceptable limitation, but if you intend to back-patch I think not.
 Older versions of createdb are probably significantly more likely to
 be used with even-older servers.

This code has been that way since pre-8.2 so I see no need to backpatch;
this is the first such complaint I have seen.

 Seems like it wouldn't be that hard to test the server version and only
 reconnect if it's pre-8.2.

I am not excited about adding more code for this so I am thinking
head-only.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backpatching of Teach the regular expression functions to do case-insensitive matching

2011-05-05 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 In my opinion this is actually a bug in < 9.0. As its a (imo) low impact fix 
 thats constrained to two files it seems sensible to backpatch it now that the 
 solution has proven itself in the field?

FWIW, I still don't trust that patch a lot (and I was the one who wrote it).
The question is whether you believe that every platform uses Unicode
code points directly as the wchar_t representation in UTF8-based locales.
Although we've not heard any complaints suggesting that that assumption
doesn't hold, I don't feel that 9.0 has been out long enough to prove
much.  The complaints about the old code were rare enough to make me
think people just don't exercise the case too much, or don't notice if
the behavior is wrong.  So it likely hasn't had that much exercise in
9.0 either.

In short, I'm pretty wary of back-patching it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Peter Geoghegan
On 5 May 2011 22:22, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, May 5, 2011 at 4:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 +              * The caveat about signals invalidating the timeout of
 +              * WaitLatch() on some platforms can be safely disregarded,

 Really?

 I'm a bit confused by the phrasing of this comment as well, but it
 does seem to me that if all the relevant signal handlers set the
 latch, then it ought not to be necessary to break the sleep down into
 one-second intervals.

 [ reads code some more ... ]  Yeah, I think you are probably right,
 which makes it just a badly phrased comment.  The important point here
 is that the self-pipe trick in unix_latch.c fixes the problem, so long
 as we are relying on latch release and NOT timeout-driven wakeup.

Why do you think that my comment is badly phrased?

 What that really means is that any WaitOnLatch call with a finite
 timeout ought to be viewed with a jaundiced eye.  Ideally, we want them
 all to be waiting for latch release and nothing else.  I'm concerned
 that we're going to be moving towards some intermediate state where we
 have WaitOnLatch calls with very long timeouts, because the longer the
 timeout, the worse the problem gets on platforms that have the problem.
 If you have say a 1-minute timeout, it's not difficult to believe that
 you'll basically never wake up because of random signals resetting the
 timeout.

Unless all signal handlers for signals that we expect call SetLatch()
anyway, as in this case.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Some surprising precedence behavior in PG's grammar

2011-05-05 Thread Tom Lane
I wrote:
 Andrew Dunstan and...@dunslane.net writes:
 If we do need a precedence setting for NULL_P, then I think it should 
 probably be on its own and not sharing one with IS.

 Yeah, I was thinking that too.  If we put %prec on the IS [NOT] NULL
 productions then there is no need for NULL_P to have exactly its current
 precedence; anything above POSTFIXOP would preserve the current behavior
 in the DEFAULT ... NULL case.  (And if we decided we wanted to flip that
 behavior, anything below POSTFIXOP would do that.)

On reflection I decided that the best quick-fix is to put NULL into the
list of keywords that are already precedence-grouped with IDENT.  That
at least makes sure that it has precedence behavior equivalent to any
plain old non-keyword.  If you can find a better fix, maybe we could
apply it to the other cases mentioned there as well.

 BTW, I wonder why NOTNULL and ISNULL have their own precedence levels,
 rather than being made to act exactly like IS [NOT] NULL ...

Is anybody up for changing that, or should we leave well enough alone?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Process wakeups when idle and power consumption

2011-05-05 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 On 5 May 2011 22:22, Tom Lane t...@sss.pgh.pa.us wrote:
 What that really means is that any WaitOnLatch call with a finite
 timeout ought to be viewed with a jaundiced eye.  Ideally, we want them
 all to be waiting for latch release and nothing else.  I'm concerned
 that we're going to be moving towards some intermediate state where we
 have WaitOnLatch calls with very long timeouts, because the longer the
 timeout, the worse the problem gets on platforms that have the problem.
 If you have say a 1-minute timeout, it's not difficult to believe that
 you'll basically never wake up because of random signals resetting the
 timeout.

 Unless all signal handlers for signals that we expect call SetLatch()
 anyway, as in this case.

It's signals that we don't expect that I'm a bit worried about here.

In any case, the bottom line is that having a timeout on WaitOnLatch
is a kludge, and we should try to avoid it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backpatching of Teach the regular expression functions to do case-insensitive matching

2011-05-05 Thread Robert Haas
On Thu, May 5, 2011 at 5:21 AM, Andres Freund and...@anarazel.de wrote:
 In my opinion this is actually a bug in < 9.0. As its a (imo) low impact fix
 thats constrained to two files it seems sensible to backpatch it now that the
 solution has proven itself in the field?

 The issue is hard to find and has come up several times in the field. And it 
 has
 been slightly embarassing more than once ;)

Can you share some more details about your experiences?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Why is RegisterPredicateLockingXid called while holding XidGenLock?

2011-05-05 Thread Tom Lane
Inquiring minds want to know ...

This seems like a pretty lousy place to do it, first because of the
contention hit from holding that high-traffic lock any longer than
necessary, and second because every added chance for error between
ExtendCLOG() and TransactionIdAdvance(ShmemVariableCache->nextXid)
gives us another way to fail in the way recently mentioned by Joe
Conway:
http://archives.postgresql.org/message-id/4dbe4e7d.80...@joeconway.com

Even if it's actually necessary to set up that data structure while
holding XidGenLock, I would *really* like the call to not be exactly
where it is.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog_redo causing very long recovery time

2011-05-05 Thread Tom Lane
Joseph Conway m...@joeconway.com writes:
 I'm working with a client that uses Postgres on what amounts to an
 appliance.

 The database is therefore subject to occasional torture such as, in this
 particular case, running out of disk space while performing a million
 plus queries (of mixed varieties, many using plpgsql with exception
 handling -- more on that later), and eventually being power-cycled. Upon
 restart, clog_redo was called approx 885000 times (CLOG_ZEROPAGE) during
 recovery, which took almost 2 hours on their hardware. I should note
 that this is on Postgres 8.3.x.

 After studying the source, I can only see one possible way that this
 could have occurred:

 In varsup.c:GetNewTransactionId(), ExtendCLOG() needs to succeed on a
 freshly zeroed clog page, and ExtendSUBTRANS() has to fail. Both of
 these calls can lead to a page being pushed out of shared buffers and to
 disk, so given a lack of disk space, sufficient clog buffers, but lack
 of subtrans buffers, this could happen. At that point the transaction id
 does not get advanced, so clog zeros the same page, ExtendSUBTRANS()
 fails again, rinse and repeat.

 I believe in the case above, subtrans buffers were exhausted due to the
 extensive use of plpgsql with exception handling.

Hmm, interesting.  I believe that it's not really a question of buffer
space or lack of it, but whether the OS will give us disk space when we
try to add a page to the current pg_subtrans file.  In any case, the
point is that a failure between ExtendCLOG and incrementing nextXid
is bad news.

 The attached fix-clogredo diff is my proposal for a fix for this.

That seems pretty grotty :-(

I think a more elegant fix might be to just swap the order of the 
ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
reason that would help is that pg_subtrans isn't WAL-logged, so if
we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
won't have written any XLOG entry, and thus repeated failures will not
result in repeated XLOG entries.  I seem to recall having considered
exactly that point when the clog WAL support was first done, but the
scenario evidently wasn't considered when subtransactions were stuck
in :-(.
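
That is, the tail end of GetNewTransactionId would look roughly like this
(a simplified sketch only; the real code is still under XidGenLock and has
to handle subtransactions as well):

	/* pg_subtrans first: not WAL-logged, so a failure here emits nothing */
	ExtendSUBTRANS(xid);
	/* clog last: this is the call that may write a CLOG_ZEROPAGE record */
	ExtendCLOG(xid);

	TransactionIdAdvance(ShmemVariableCache->nextXid);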

It would probably also help to put in a comment admonishing people
to not add stuff right there.  I see the SSI guys have fallen into
the same trap.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog_redo causing very long recovery time

2011-05-05 Thread Alvaro Herrera
Excerpts from Tom Lane's message of Fri May 06 00:22:43 -0300 2011:

 I think a more elegant fix might be to just swap the order of the 
 ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
 reason that would help is that pg_subtrans isn't WAL-logged, so if
 we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
 won't have written any XLOG entry, and thus repeated failures will not
 result in repeated XLOG entries.  I seem to recall having considered
 exactly that point when the clog WAL support was first done, but the
 scenario evidently wasn't considered when subtransactions were stuck
 in :-(.

I'm pretty sure it would have never occured to me to consider such a
scenario.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog_redo causing very long recovery time

2011-05-05 Thread Joe Conway
On 05/05/2011 08:22 PM, Tom Lane wrote:
 Joseph Conway m...@joeconway.com writes:
 The attached fix-clogredo diff is my proposal for a fix for this.
 
 That seems pretty grotty :-(
 
 I think a more elegant fix might be to just swap the order of the 
 ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId.  The
 reason that would help is that pg_subtrans isn't WAL-logged, so if
 we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
 won't have written any XLOG entry, and thus repeated failures will not
 result in repeated XLOG entries.  I seem to recall having considered
 exactly that point when the clog WAL support was first done, but the
 scenario evidently wasn't considered when subtransactions were stuck
 in :-(.

Yes, that does seem much nicer :-)

 It would probably also help to put in a comment admonishing people
 to not add stuff right there.  I see the SSI guys have fallen into
 the same trap.

Right -- I think another similar problem exists in GetNewMultiXactId
where ExtendMultiXactOffset could succeed and write an XLOG entry and
then  ExtendMultiXactMember could fail before advancing nextMXact. The
problem in this case is that they both write XLOG entries, so a simple
reversal doesn't help.

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting,  24x7 Support





Re: [HACKERS] clog_redo causing very long recovery time

2011-05-05 Thread Tom Lane
Joe Conway m...@joeconway.com writes:
 Right -- I think another similar problem exists in GetNewMultiXactId
 where ExtendMultiXactOffset could succeed and write an XLOG entry and
 then  ExtendMultiXactMember could fail before advancing nextMXact. The
 problem in this case is that they both write XLOG entries, so a simple
 reversal doesn't help.

Hmm.  Maybe we need a real fix then.  I was just sitting here
speculating about whether we'd ever decide we need to WAL-log
pg_subtrans --- because if we did, my solution would fail.

I still think that the right fix is to avoid emitting redundant
XLOG records in the first place, rather than hacking recovery
to not process them.  Possibly we could modify slru.c so that
it could be determined whether zeroing of the current page had
already happened.  In a quick look, it looks like noting whether
latest_page_number had already been advanced to that page might
do the trick.
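
Roughly along these lines in clog.c, say -- just a sketch, ignoring
page-number wraparound comparisons and locking, and assuming the existing
ZeroCLOGPage()/ClogCtl names:

static void
ExtendCLOGIfNeeded(int pageno)
{
	SlruShared	shared = ClogCtl->shared;

	/* If the SLRU has already advanced to this page, it was zeroed before */
	if (pageno <= shared->latest_page_number)
		return;					/* skip emitting another CLOG_ZEROPAGE record */

	(void) ZeroCLOGPage(pageno, true);	/* existing WAL-logged path */
}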

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why is RegisterPredicateLockingXid called while holding XidGenLock?

2011-05-05 Thread Dan Ports
On Thu, May 05, 2011 at 11:12:40PM -0400, Tom Lane wrote:
 Even if it's actually necessary to set up that data structure while
 holding XidGenLock, I would *really* like the call to not be exactly
 where it is.

Good question.

I don't believe it needs to be while XidGenLock is being held at all;
certainly not in this particular place. All that really matters is that
it happens before any backend starts seeing said xid in tuple headers.

I believe the call can be moved over to AssignTransactionId. I'd
probably put it at the end of that function, but it can go anywhere
between there and where it is now. Do you have any preference?

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why is RegisterPredicateLockingXid called while holding XidGenLock?

2011-05-05 Thread Kevin Grittner
Tom Lane  wrote:
 
 This seems like a pretty lousy place to do it, first because of the
 contention hit from holding that high-traffic lock any longer than
 necessary, and second because every added chance for error between
 ExtendCLOG() and TransactionIdAdvance(ShmemVariableCache->nextXid)
 gives us another way to fail in the way recently mentioned by Joe
 Conway:

http://archives.postgresql.org/message-id/4dbe4e7d.80...@joeconway.com
 
 Even if it's actually necessary to set up that data structure while
 holding XidGenLock, I would *really* like the call to not be
 exactly where it is.
 
On a quick scan, I can't see any reason this couldn't be moved down
to just above the return.  I think the reason I dropped it where I
did was simply that it was where we seemed to be letting other parts
of the system know about the new xid, so I was going for consistency.
 
I want to double-check this when I'm a little more awake, but I don't
think anything will be doing anything in the predicate locking area
where the xid would matter until after the return from this function.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why is RegisterPredicateLockingXid called while holding XidGenLock?

2011-05-05 Thread Tom Lane
Dan Ports d...@csail.mit.edu writes:
 On Thu, May 05, 2011 at 11:12:40PM -0400, Tom Lane wrote:
 Even if it's actually necessary to set up that data structure while
 holding XidGenLock, I would *really* like the call to not be exactly
 where it is.

 Good question.

 I don't believe it needs to be while XidGenLock is being held at all;
 certainly not in this particular place. All that really matters is that
 it happens before any backend starts seeing said xid in tuple headers.

 I believe the call can be moved over to AssignTransactionId. I'd
 probably put it at the end of that function, but it can go anywhere
 between there and where it is now. Do you have any preference?

Yeah, I was thinking that it'd be better to pull it out of
GetNewTransactionId and put it in a higher level.  No strong preference
about where in AssignTransactionId to put it.  Is there any chance that
it would be significant whether we do it before or after taking the lock
on the XID (XactLockTableInsert)?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why is RegisterPredicateLockingXid called while holding XidGenLock?

2011-05-05 Thread Kevin Grittner
Tom Lane  wrote:
 
 Yeah, I was thinking that it'd be better to pull it out of
 GetNewTransactionId and put it in a higher level.
 
As long as it is always called when an xid is assigned.  Since this
function appears to be on the only path to that, it should be fine.
 
 No strong preference about where in AssignTransactionId to put it.
 Is there any chance that it would be significant whether we do it
 before or after taking the lock on the XID (XactLockTableInsert)?
 
No, but since we need to do it only on a top level assignment, we
could save a couple cycles by putting it on an else on line 456.
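 
Roughly this, I mean (a simplified sketch of that part of
AssignTransactionId, not a tested patch):
 
	s->transactionId = GetNewTransactionId(isSubXact);
 
	if (isSubXact)
		SubTransSetParent(s->transactionId, s->parent->transactionId, false);
	else
		RegisterPredicateLockingXid(s->transactionId);	/* top-level xids only */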
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Debug contrib/cube code

2011-05-05 Thread Nick Raj
Hi,
I am using postgresql-8.4.6 and I want to debug the contrib/cube code. Is it
possible to debug the cube code? There is no configure file there to enable
debugging. Is there any way to change the makefile to enable debugging?

Thanks