Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-17 Thread Heikki Linnakangas

On 16.07.2012 22:01, Robert Haas wrote:

On Sat, Jul 14, 2012 at 7:54 PM, Josh Berkus j...@agliodbs.com wrote:

So, here's the core issue with degraded mode.  I'm not mentioning this
to block any patch anyone has, but rather out of a desire to see someone
address this core problem with some clever idea I've not thought of.
The problem in a nutshell is: indeterminacy.

Assume someone implements degraded mode.  Then:

1. Master has one synchronous standby, Standby1, and two asynchronous,
Standby2 and Standby3.

2. Standby1 develops a NIC problem and is in and out of contact with
Master.  As a result, it's flipping in and out of synchronous / degraded
mode.

3. Master fails catastrophically due to a RAID card meltdown.  All data
lost.

At this point, the DBA is in kind of a pickle, because he doesn't know:

(a) Was Standby1 in synchronous or degraded mode when Master died?  The
only log for that was on Master, which is now gone.

(b) Is Standby1 actually the most caught up standby, and thus the
appropriate new master for Standby2 and Standby3, or is it behind?

With the current functionality of Synchronous Replication, you don't
have either piece of indeterminacy, because some external management
process (hopefully located on another server) needs to disable
synchronous replication when Standby1 develops its problem.  That is, if
the master is accepting synchronous transactions at all, you know that
Standby1 is up-to-date, and no data is lost.

While you can answer (b) by checking all servers, (a) is particularly
pernicious, because unless you have the application log all "operating
in degraded mode" messages, there is no way to ever determine the truth.


Good explanation.

In brief, the problem here is that you can only rely on the
no-transaction-loss guarantee provided by synchronous replication if
you can be certain that you'll always be aware of it when synchronous
replication gets shut off.  Right now that is trivially true, because
it has to be shut off manually.  If we provide a facility that logs a
message and then shuts it off, we lose that certainty, because the log
message could get eaten en route by the same calamity that takes down
the master.  There is no way for the master to WAIT for the log
message to be delivered and only then degrade.

However, we could craft a mechanism that has this effect.  Suppose we
create a new GUC with a name like
synchronous_replication_status_change_command.  If we're thinking
about switching between synchronous replication and degraded mode
automatically, we first run this command.  If it returns 0, then we're
allowed to switch, but if it returns anything else, then we're not
allowed to switch (but can retry the command after a suitable
interval).  The user is responsible for supplying a command that
records the status change somewhere off-box in a fashion that's
sufficiently durable that the user has confidence that the
notification won't subsequently be lost.  For example, the
user-supplied command could SSH into three machines located in
geographically disparate data centers and create a file with a certain
name on each one, returning 0 only if it's able to reach at least two
of them and create the file on all the ones it can reach.  If the
master dies, but at least two of those three machines are still
alive, we can determine with confidence whether the master might have
been in degraded mode at the time of the crash.

More or less paranoid versions of this scheme are possible depending
on user preferences, but the key point is that for the
no-transaction-loss guarantee to be of any use, there has to be a way
to reliably know whether that guarantee was in effect at the time the
master died in a fire.   Logging isn't enough, but I think some more
sophisticated mechanism can get us there.
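
For concreteness, here is a minimal sketch of such a status-change
command in C, assuming three witness hosts reachable over
password-less ssh (the hostnames and marker-file path are
hypothetical); exit status 0 permits the mode switch, anything else
makes the master retry:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	const char *hosts[] = {"witness1", "witness2", "witness3"};
	int			reached = 0;
	int			i;

	for (i = 0; i < 3; i++)
	{
		char		cmd[256];

		/* Durably record the status change on each reachable witness. */
		snprintf(cmd, sizeof(cmd),
				 "ssh %s 'touch /var/lib/pg-sync-degraded && sync'",
				 hosts[i]);
		if (system(cmd) == 0)
			reached++;
	}

	/* Majority rule: a single dead witness must not block the switch. */
	return (reached >= 2) ? 0 : 1;
}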


Yeah, I think that's the right general approach. Not necessarily that 
exact GUC, but something like that. I don't want PostgreSQL to get more 
involved in determining the state of the standby, when to do failover, 
or when to fall back to degraded mode. That's a whole new territory with 
all kinds of problems, and there is plenty of software out there to 
handle that. Usually you have some external software to do monitoring 
and to initiate failovers anyway. What we need is a better API for 
co-operating with such software, to perform failover, and to switch 
replication between synchronous and asynchronous modes.


BTW, one little detail that I don't think has been mentioned in this 
thread before: Even though the master currently knows whether a standby 
is connected or not, and you could write a patch to act based on that, 
there are other failure scenarios where you would still not be happy. 
For example, imagine that the standby has a disk failure. It stays 
connected to the master, but fails to fsync anything to disk. Would you 
want to fall back to degraded mode and just do asynchronous replication 
in that case? How do you decide when to do that in the master? Or what 
if the standby keeps making progress, but becomes incredibly slow for 
some reason, like disk failure in a RAID array? I'd rather outsource 
all that logic to external monitoring software - software that you 
should be running anyway.

Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-17 Thread Daniel Farina
On Mon, Jul 16, 2012 at 10:58 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 BTW, one little detail that I don't think has been mentioned in this thread
 before: Even though the master currently knows whether a standby is
 connected or not, and you could write a patch to act based on that, there
 are other failure scenarios where you would still not be happy. For example,
 imagine that the standby has a disk failure. It stays connected to the
 master, but fails to fsync anything to disk. Would you want to fall back to
 degraded mode and just do asynchronous replication in that case? How do you
 decide when to do that in the master? Or what if the standby keeps making
 progress, but becomes incredibly slow for some reason, like disk failure in
 a RAID array? I'd rather outsource all that logic to external monitoring
 software - software that you should be running anyway.

I would like to express some support for the non-edge nature of this
case.  Outside of simple loss of availability of a server, losing
access to a block device is probably the second-most-common cause of
loss of availability for me.  It's especially insidious because simple
"select 1" checks may continue to return for quite some time, so
instead we rely on Linux diskstats parsing to see if write progress
hits zero for a while.
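
For illustration, a sketch of that kind of check in C (the device name
is an assumption, and the field layout follows the Linux
/proc/diskstats format, where the tenth field is sectors written):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the sectors-written counter for dev, or 0 if not found. */
static unsigned long long
sectors_written(const char *dev)
{
	FILE	   *f = fopen("/proc/diskstats", "r");
	char		line[512];
	unsigned long long wr_sec = 0;

	if (f == NULL)
		return 0;
	while (fgets(line, sizeof(line), f))
	{
		unsigned int major,
					minor;
		char		name[64];
		unsigned long long v[7];

		if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu %llu %llu",
				   &major, &minor, name,
				   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]) == 10 &&
			strcmp(name, dev) == 0)
		{
			wr_sec = v[6];		/* 10th field: sectors written */
			break;
		}
	}
	fclose(f);
	return wr_sec;
}

int
main(void)
{
	unsigned long long before = sectors_written("sda");

	sleep(10);
	if (sectors_written("sda") == before)
		fprintf(stderr, "no write progress in 10s -- possible stall\n");
	return 0;
}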

In cases like these, the overhead of a shell command to rapidly
consort with a decision-making process can be prohibitive -- it's
already a pretty big waster of time for me in WAL
archiving/dearchiving, where process startup and SSL negotiation and
lack of parallelization can be pretty slow.  This mechanism may
exhibit the same problem.

I would like to plead that whatever is done would be most useful being
controllable via non-GUCs in its entirety -- arguably that is already
the case, since one can write a replication protocol client to do the
job, by faking the standby status update messages, but perhaps there
is a more lucid way if one makes accommodation.  In particular, the
awkwardness of using pg_receivexlog[0] or a similar tool for replacing
archive_command is something that I feel should be addressed
eventually, so as not to be a second-class citizen.  Although that is
already being worked on[1]...the archive command has no backpressure
either, other than running out of disk.

The case of restore_command is even more sore: remastering or
archive-recovery via streaming protocol actions is kind of a pain at
the moment.  I haven't thoroughly explored this yet and I don't think
it is documented, but it can be hard for something that is dearchiving
from WAL segments stored somewhere to find exactly the right record to
start replaying at: the WAL record format is not stable, and it need
not be, if the server helps by ignoring records that predate what it
requires or can inform the process feeding WAL that it got things
wrong.  Maybe that is the case, but it is not documented.  I also
don't think there are any guarantees around the maximum size or
alignment of WAL shipped by the streaming protocol in XLogData
messages, and that's too bad.  Also, the endianness of WAL position
fields in XLogData is host-byte-order-dependent, which sucks if you
are forwarding WAL around but need to know what range is contained in
a message.  In practice many people can say "all I have is
little-endian", but it is somewhat unpleasant and not necessarily the
case.

Correct me if I'm wrong, I'd be glad for it.

[0]: see the notes section,
http://www.postgresql.org/docs/devel/static/app-pgreceivexlog.html
[1]: http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-17 Thread Daniel Farina
On Mon, Jul 16, 2012 at 5:29 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2012-07-10 at 11:50 -0400, Bruce Momjian wrote:
 I don't think we can assume that because pg_upgrade was run on the
 master and standby that they are binary identical, can we?  Technically
 the user files are identical, but the system catalogs and WAL might be
 different, hence my suggestion to run rsync before allowing the standby
 to rejoin the primary.

 Do you have plans to change that in the future?

 If we know that the user data files are identical between primary and
 replica, it would be nice if we could provide a robust way to avoid
 copying them.

How about this alternative that may sound crazy, but would lend itself
to some unification in archiving:

Could pg_upgrade emit WAL segment(s) to provide continuity of a
timeline?  So something like:

* Take down the writable primary for pg_upgrade
* Some WAL is emitted and possibly archived
* The old version, when reaching the special pg_upgrade WAL, could
exit or report its situation having paused replay (as clearly, it
cannot proceed). Unsure.
* Start up a new version of postgres on the same cluster at that
point, which plays the upgrade-WAL.

I see this being pretty mechanically intensive, but right now my hands
are completely tied as to achieving total continuity of my archives,
costing a base-backup's worth of risk window upon upgrade.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-07-17 Thread Kyotaro HORIGUCHI
Hello,

  I suppose that testing for the two cases and additional
  one case which runs pg_do_encoding_conversion(), say latin1,
  would be enough to confirm that encoding/decoding is properly
  done, since the concrete conversion scheme is not significant
  this case.
  
  So I recommend that we should add the test for latin1 and omit
  the test from other than sql_ascii, utf8 and latin1. This might
  be achieved by creating empty plperl_lc.sql and plperl_lc.out
  files for those encodings.
  
  What do you think about that?
 
 I think that's probably too much engineering for something that doesn't
 really warrant it.  A real solution to this problem could be to create
 yet another new test file containing just this function definition and
 the query that calls it, and have one expected file for each encoding;
 but that's too much work and too many files, I'm afraid.

I agree completely. The balance between the additional complexity
of the regression tests and what we would get from that complexity...

 I can see us supporting tests that require a small number of expected
 files.  No Make tricks with file copying, though.  If we can't get
 some easy way to test this without that, I submit we should just remove
 the test.

OK, I agree to do so.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] b-tree index search algorithms

2012-07-17 Thread Samuel Vogel

Am 17.07.12 05:21, schrieb Tom Lane:

Samuel Vogel s...@muel-vogel.de writes:

I'm currently on a university research project investigating whether
performance could be increased by substituting different inter-node
search algorithms for the currently used binary search.

Hm, what have you got in mind exactly?


At first I will try a simple interpolation search, but problems start 
there since I need to have a numerical representation of the index keys 
(or map them to one) to do the interpolation.
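
For reference, a minimal, self-contained sketch of the idea over a
plain sorted array of ints (in nbtree the keys are Datums compared
only through the scankey's support function, which is exactly the
difficulty with finding a numeric scale to interpolate on):

static int
interpolation_search(const int *a, int n, int key)
{
	int			lo = 0;
	int			hi = n - 1;

	while (lo <= hi && key >= a[lo] && key <= a[hi])
	{
		int			pos;

		if (a[hi] == a[lo])		/* avoid division by zero on equal keys */
			return (a[lo] == key) ? lo : -1;

		/* Probe where the key "should" be, rather than the midpoint. */
		pos = lo + (int) ((double) (key - a[lo]) * (hi - lo) /
						  (a[hi] - a[lo]));
		if (a[pos] == key)
			return pos;
		if (a[pos] < key)
			lo = pos + 1;
		else
			hi = pos - 1;
	}
	return -1;					/* not found */
}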



But I'm having trouble understanding how the general b-tree
implementation (nbtree.h) is used to represent for example a simple
primary key on an integer column. I've debug-printed the
'scankey->sk_argument' and all attributes of the index tuples on the
pages being traversed (simply ran 'DatumGetInt32' on both) but I never
see one of the integers actually appearing in my table being logged when
I do a select.

Not clear what you did wrong from this amount of detail, but integer
keys ought to be pretty obvious at the debugger level.


Okay, to be more specific: Printing 
'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50 
when I execute this query: SELECT * FROM simpletest WHERE id = 50;



This is why I assume that all column values are hashed before they are
pushed into the b-tree,

PG's b-trees do not hash anything.  If you're not seeing interpretable
key values then you're doing something wrong in your inspection
methodology.


Okay, how are indexes on char/text columns handled then? Are they hashed 
before being put into the b-tree, or is my assumption correct that in 
that case the Datum is only a link to where the actual data is stored 
and only 'scankey->sk_func' knows how to make use of it (compare it)?
In that case it would be extremely hard to get to a numeric 
representation which can be used for the interpolation.


Regards,
Samuel Vogel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Closing out the June commitfest

2012-07-17 Thread Simon Riggs
On 16 July 2012 01:16, Tom Lane t...@sss.pgh.pa.us wrote:
 We are now at the end of the originally scheduled one-month window for
 the June commitfest.  While the numbers look fairly bad:

 Needs Review: 17, Waiting on Author: 10, Ready for Committer: 3, Committed: 
 29, Returned with Feedback: 12, Rejected: 5. Total: 76.

 it's not quite a complete disaster, because almost all of the needs
 review patches did actually get some review and/or had new versions
 posted during the fest.  We did not get them to the point of being
 committable, but we did make progress.  I only see about three patches
 that seem to have received no attention whatsoever.

 At this point we could move the open items to the September fest and
 call this one good, or we could keep trying to close things out.
 Personally I'd like to do the former, because we really need to spend
 some effort on closing out the various open issues for 9.2, and the
 commitfest seems to have sucked up all the available time of those who
 might've been fixing those issues over the past month.

Sounds fine to me.

I've been unavailable for much of this CF, so my intention is to
continue with my parts of it. Meaning the reviews I was scheduled to
do won't be put off until Sept. But first, I'll review the 9.2 open
items list again.

That's a personal point, not trying to suggest everybody else should do that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Simon Riggs
On 28 June 2012 13:16, David E. Wheeler da...@justatheory.com wrote:

 Very interesting design document for SQLite 4:

   http://www.sqlite.org/src4/doc/trunk/www/design.wiki

 I'm particularly intrigued by covering indexes. For example:

 CREATE INDEX cover1 ON table1(a,b) COVERING(c,d);

 This allows the following query to do an index-only scan:

 SELECT c, d FROM table1 WHERE a=? AND b=?;

 Now that we have index-only scans in 9.2, I'm wondering if it would make 
 sense to add covering index support, too, where additional, unindexed columns 
 are stored alongside indexed columns.


Just to be clear, the ability to have covered indexes is already in
9.2, I updated the release notes to explain that a few months back.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread David E. Wheeler
On Jul 17, 2012, at 5:18 PM, Simon Riggs wrote:

 Now that we have index-only scans in 9.2, I'm wondering if it would make 
 sense to add covering index support, too, where additional, unindexed 
 columns are stored alongside indexed columns.
 
 Just to be clear, the ability to have covered indexes is already in
 9.2, I updated the release notes to explain that a few months back.

You mean this?

 Allow queries to retrieve data only from indexes, avoiding heap access 
 (Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane)
 
 This is often called "index-only scans" or "covering indexes". This is 
 possible for heap pages with exclusively all-visible tuples, as reported by 
 the visibility map. The visibility map was made crash-safe as a necessary 
 part of implementing this feature.

That’s not how SQLite is using the term “covering index.” What they mean is the 
ability to have additional, unindexed columns in an index, so that they can 
*also* be returned in the event of an index-only scan.

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Simon Riggs
On 17 July 2012 16:21, David E. Wheeler da...@justatheory.com wrote:
 On Jul 17, 2012, at 5:18 PM, Simon Riggs wrote:

 Now that we have index-only scans in 9.2, I'm wondering if it would make 
 sense to add covering index support, too, where additional, unindexed 
 columns are stored alongside indexed columns.

 Just to be clear, the ability to have covered indexes is already in
 9.2, I updated the release notes to explain that a few months back.

 You mean this?

 Allow queries to retrieve data only from indexes, avoiding heap access 
 (Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane)

  This is often called "index-only scans" or "covering indexes". This is 
 possible for heap pages with exclusively all-visible tuples, as reported by 
 the visibility map. The visibility map was made crash-safe as a necessary 
 part of implementing this feature.

 That’s not how SQLite is using the term “covering index.” What they mean is 
 the ability to have additional, unindexed columns in an index, so that they 
 can *also* be returned in the event of an index-only scan.

  CREATE INDEX ON foo (a, b, c, d);

allows

  SELECT c, d FROM foo WHERE a = ? AND b = ?

to use an index only scan.

The phrase "unindexed" seems misleading, since the data is clearly in
the index from the description on the URL you gave. And since the
index is non-unique, I don't see any gap between Postgres and
SQLite 4.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread David E. Wheeler
On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote:

  CREATE INDEX ON foo (a, b, c, d);
 
 allows
 
  SELECT c, d FROM foo WHERE a = ? AND b = ?
 
 to use an index only scan.
 
 The phrase "unindexed" seems misleading, since the data is clearly in
 the index from the description on the URL you gave. And since the
 index is non-unique, I don't see any gap between Postgres and
 SQLite 4.

Yeah, but that index is unnecessarily big if one will never use c or d in the 
search. The nice thing about covering indexes as described for SQLite 4 and 
implemented in MSSQL is that you can specify additional columns that just come 
along for the ride, but are not part of the indexed data:

CREATE INDEX cover1 ON table1(a,b) COVERING(c,d);

Yes, you can do that by also indexing c and d as of 9.2, but it might be nice 
to be able to include them in the index as additional row data without actually 
indexing them.

Best,

David
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Simon Riggs
On 17 July 2012 16:54, David E. Wheeler da...@justatheory.com wrote:
 On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote:

  CREATE INDEX ON foo (a, b, c, d);

 allows

  SELECT c, d FROM foo WHERE a = ? AND b = ?

 to use an index only scan.

 The phrase "unindexed" seems misleading, since the data is clearly in
 the index from the description on the URL you gave. And since the
 index is non-unique, I don't see any gap between Postgres and
 SQLite 4.

 Yeah, but that index is unnecessarily big if one will never use c or d in the 
 search. The nice thing about covering indexes as described for SQLite 4 and 
 implemented in MSSQL is that you can specify additional columns that just 
 come along for the ride, but are not part of the indexed data:

 CREATE INDEX cover1 ON table1(a,b) COVERING(c,d);

 Yes, you can do that by also indexing c and d as of 9.2, but it might be nice 
 to be able to include them in the index as additional row data without 
 actually indexing them.

Can you explain what you mean by "without actually indexing them"?
ISTM that it is a non-feature if the index is already non-unique, and
the difference is simply down to the amount of snake oil applied to
the descriptive text on the release notes.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Vik Reykja
On Tue, Jul 17, 2012 at 6:08 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 17 July 2012 16:54, David E. Wheeler da...@justatheory.com wrote:
  Yeah, but that index is unnecessarily big if one will never use c or d
 in the search. The nice thing about covering indexes as described for
 SQLite 4 and implemented in MSSQL is that you can specify additional
 columns that just come along for the ride, but are not part of the indexed
 data:
 
  CREATE INDEX cover1 ON table1(a,b) COVERING(c,d);
 
  Yes, you can do that by also indexing c and d as of 9.2, but it might be
 nice to be able to include them in the index as additional row data without
 actually indexing them.

 Can you explain what you mean by "without actually indexing them"?
 ISTM that it is a non-feature if the index is already non-unique, and
 the difference is simply down to the amount of snake oil applied to
 the descriptive text on the release notes.


It would be useful in non-unique indexes to store data without ordering
operators (like xml).


Re: [HACKERS] Covering Indexes

2012-07-17 Thread David Johnston
 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
 ow...@postgresql.org] On Behalf Of David E. Wheeler
 Sent: Tuesday, July 17, 2012 11:55 AM
 To: Simon Riggs
 Cc: Pg Hackers
 Subject: Re: [HACKERS] Covering Indexes
 
 On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote:
 
   CREATE INDEX ON foo (a, b, c, d);
 
  allows
 
   SELECT c, d FROM foo WHERE a = ? AND b = ?
 
  to use an index only scan.
 
  The phrase "unindexed" seems misleading, since the data is clearly in
  the index from the description on the URL you gave. And since the
  index is non-unique, I don't see any gap between Postgres and
  SQLite 4.
 
 Yeah, but that index is unnecessarily big if one will never use c or d in
the
 search. The nice thing about covering indexes as described for SQLite 4
and
 implemented in MSSQL is that you can specify additional columns that just
 come along for the ride, but are not part of the indexed data:
 
 CREATE INDEX cover1 ON table1(a,b) COVERING(c,d);
 
 Yes, you can do that by also indexing c and d as of 9.2, but it might be
nice to
 be able to include them in the index as additional row data without
actually
 indexing them.
 
 Best,
 
 David

Concretely, I would presume that the contents of a covering index could then
look like the following (a,b,c,d):

(2,1,2,A)
(2,1,5,A) -- the 5 is out of natural order but exists in the "covering" part
(2,1,3,A)

Whereas PostgreSQL would be forced to have the index ordered as such:

(2,1,2,A)
(2,1,3,A)
(2,1,5,A)

Either way the data in c and d are IN THE INDEX; otherwise in neither
case could the data values be returned while strictly querying the index.

So the question that needs to be asked is what kind of performance increase
can be had during DML (insert/update) statements and whether those gains are
worth pursuing.  Since these other engines appear to allow both cases you
should be able to get at least a partial idea of the performance gains
between index (a,b,c,d) and index (a,b) covering (c,d).

Vik's concurrent point regarding non-indexable values makes some sense but
the use case there seems specialized as I suspect that in the general case
values that are non-indexable (if there truly are any) are generally those
that would be too large to warrant sticking into an index in the first
place.  But, XML values do ring true in my mind (particularly frequently
used fragments that are generally quite small).  But again whether that is a
reasonable use case for a covering index I do not know.  It feels like
trying to solve the remaining 10% when it took a long while to even muster
up enough support and resources to solve the 90%.

David J.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Simon Riggs
On 17 July 2012 17:41, David Johnston pol...@yahoo.com wrote:

 Concretely, I would presume that the contents of a covering index could then
 look like the following (a,b,c,d):

 (2,1,2,A)
 (2,1,5,A) -- the 5 is out of natural order but exists in the "covering" part
 (2,1,3,A)

 Whereas PostgreSQL would be forced to have the index ordered as such:

 (2,1,2,A)
 (2,1,3,A)
 (2,1,5,A)

 Either way the data in c and d are IN THE INDEX; otherwise in neither
 case could the data values be returned while strictly querying the index.

 So the question that needs to be asked is what kind of performance increase
 can be had during DML (insert/update) statements and whether those gains are
 worth pursuing.  Since these other engines appear to allow both cases you
 should be able to get at least a partial idea of the performance gains
 between index (a,b,c,d) and index (a,b) covering (c,d).

There is a use case, already discussed, whereby that is useful for
   create unique index on foo (a,b) covering (c,d)

but there really isn't any functional difference between
   create index on foo (a,b) covering (c,d)

and
   create index on foo (a,b,c,d)

There is a potential performance impact. But as Tom says, that might
even be negative if it is actually measurable.


 Vik's concurrent point regarding non-indexable values makes some sense but
 the use case there seems specialized as I suspect that in the general case
 values that are non-indexable (if there truly are any) are generally those
 that would be too large to warrant sticking into an index in the first
 place.

I think it would be easy enough to add noop operators for sorts if you
wanted to do that.

 But, XML values do ring true in my mind (particularly frequently
 used fragments that are generally quite small).  But again whether that is a
 reasonable use case for a covering index I do not know.  It feels like
 trying to solve the remaining 10% when it took a long while to even muster
 up enough support and resources to solve the 90%.

The main thing is that we definitely already do have covering indexes
and we will be announcing we have that soon. The fact we have chosen
to implement that without adding new syntax strikes me as a selling
point as well, so all client tools still work.

So the feature we are talking about here needs to be called something
else, otherwise we will be confusing people. Unsorted trailing index
columns...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Andrew Dunstan


On 07/17/2012 12:41 PM, David Johnston wrote:


So the question that needs to be asked is what kind of performance increase
can be had during DML (insert/update) statements and whether those gains are
worth pursuing.  Since these other engines appear to allow both cases you
should be able to get at least a partial idea of the performance gains
between index (a,b,c,d) and index (a,b) covering (c,d).



Tom's recent answer to me on this point (as I understood it) was that he 
would expect performance to degrade, not improve, since the btree code 
is known not to perform well when there are many non-unique values.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-17 Thread Tom Lane
David E. Wheeler da...@justatheory.com writes:
 On Jul 17, 2012, at 5:32 PM, Simon Riggs wrote:
 The phrase unindexed seems misleading since the data is clearly in
 the index from the description on the URL you gave. And since the
 index is non-unique, I don't see any gap between Postgres and
 SQLliite4.

 Yeah, but that index is unnecessarily big if one will never use c or d
 in the search.

The data would still have to be stored in the leaf entries, at least.
Yeah, you could possibly omit the unindexed columns from upper tree
levels, but with typical btree fanout ratios in the hundreds, the
overall space savings would be negligible.  The idea of different index
tuple descriptors on different tree levels doesn't appeal to me, either.
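
To put a rough number on the fanout argument (illustrative figures,
not from the thread): with fanout f, the non-leaf levels hold about

	sum over k >= 1 of f^(-k)  =  1 / (f - 1)

of the leaf page count, i.e. roughly 0.4% of the index at f = 256, so
stripping the covered columns from every upper level would save well
under one percent.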

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CompactCheckpointerRequestQueue versus pad bytes

2012-07-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 9:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, I wonder whether the code that checks for relfilenode conflict
 when selecting a pg_class or relfilenode OID tries both file naming
 conventions?  If not, should we make it do so?

 I don't believe it does, nor do I see what we would gain by making it
 do so.

What we would gain is ensuring that we aren't using the same relfilenode
for both a regular table and a temp table.  Do you really want to assume
that such a conflict is 100% safe?  It sounds pretty scary to me, and
even if we were sure the backend would never get confused, what about
client-side code that's looking at relfilenode?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] b-tree index search algorithms

2012-07-17 Thread Tom Lane
Samuel Vogel s...@muel-vogel.de writes:
 Am 17.07.12 05:21, schrieb Tom Lane:
 Samuel Vogel s...@muel-vogel.de writes:
 I'm currently on a university research project investigating whether
 performance could be increased by substituting different inter-node
 search algorithms for the currently used binary search.

 Hm, what have you got in mind exactly?

 At first I will try a simple interpolation search, but problems start 
 there since I need to have a numerical representation of the index keys 
 (or map them to one) to do the interpolation.

Dunno about that.  btree knows nothing about the datatypes it's working
on except that they have comparison functions.  Converting the values
to some sort of numeric scale that you can interpolate on seems
logically dubious and fraught with practical difficulties.  Now, we do
have some code in selfuncs.c that tries to do that, for some data types,
but it's only for planner estimation purposes, and we don't rely very
heavily on its results even in that context.  Depending on it to be
right for search purposes sounds pretty scary.

 Not clear what you did wrong from this amount of detail, but integer
 keys ought to be pretty obvious at the debugger level.

 Okay, to be more specific: Printing 
 'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50 
 when I execute this query: SELECT * FROM simpletest WHERE id = 50;

Um, what does it show you?  DatumGetInt32 is a macro, and at least in
gdb that won't work at all:

(gdb) p DatumGetInt32(scankey->sk_argument)
No symbol "DatumGetInt32" in current context.

However, just looking at the value produces sane answers for me:

(gdb) p *scankey
$1 = {sk_flags = 196608, sk_attno = 1, sk_strategy = 0, sk_subtype = 23, 
  sk_collation = 0, sk_func = {fn_addr = 0x486ec0 <btint4cmp>, fn_oid = 351, 
fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000', 
fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x2b82fe8, fn_expr = 0x0}, 
  sk_argument = 50}
(gdb) p scankey->sk_argument 
$2 = 50

 PG's b-trees do not hash anything.  If you're not seeing interpretable
 key values then you're doing something wrong in your inspection
 methodology.

 Okay, how are indexes on char/text columns handled then?

The datum values will be pointers to strings.

 ... is my assumption correct, that in 
 that case the Datum is only a link to where the actual data is stored 
 and only 'scankey->sk_func' knows how to make use of it (compare it)?
 In that case it would be extremely hard to get to a numeric 
 representation which can be used for the interpolation.

The btree code is (or reasonably can be) aware that such values are
pass-by-reference, and how to get to the bits.  But the comparison
semantics of two different values are not something it knows about
except by asking the comparison function.  This can be quite a
nontrivial matter even for text, since we follow strcoll() comparison
rules.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] isolation check takes a long time

2012-07-17 Thread Alvaro Herrera

Excerpts from Andrew Dunstan's message of Sun Jul 15 16:42:22 -0400 2012:

 I'm looking into that. But given that the default is to set
 max_prepared_transactions to 0, shouldn't we just remove that test from the
 normal installcheck schedule?
 
 We could provide an alternative schedule that does include it.

That's a thought -- AFAIR we do provide a numeric_big test that's not
exercised by the regular regress schedule, for a precedent.

However, there's more work to do in isolation testing.  It'd be good to
have it being routinely run in serializable isolation level, for
example, not just in read committed.  I wouldn't want to overload the
slowest machines in the buildfarm (some of which are already barely
capable of running the tests on all branches in a 24h schedule, of which
Stefan Kaltenbrunner is so proud), but if we could have a few of the
fastest members running isolation and isolation-serializable, with
max_prepared_transactions set to a nonzero value, that'd be great.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] several problems in pg_receivexlog

2012-07-17 Thread Fujii Masao
On Fri, Jul 13, 2012 at 1:15 AM, Magnus Hagander mag...@hagander.net wrote:
 On Thu, Jul 12, 2012 at 6:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jul 12, 2012 at 8:39 PM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Jul 10, 2012 at 7:03 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jul 10, 2012 at 3:23 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 I found several problems in pg_receivexlog, e.g., memory leaks,
 file-descriptor leaks, etc. The attached patch fixes these problems.

 ISTM there are still some other problems in pg_receivexlog, so I'll
 read it deeply later.

 While pg_basebackup background process is streaming WAL records,
 if its replication connection is terminated (e.g., walsender in the server
 is accidentally terminated by SIGTERM signal), pg_basebackup ends
 up failing to include all required WAL files in the backup. The problem
 is that, in this case, pg_basebackup doesn't emit any error message at all.
 So a user might misunderstand that a base backup has been successfully
 taken even though it doesn't include all required WAL files.

 Ouch. That is definitely a bug if it behaves that way.


 To fix this problem, I think that, when the replication connection is
 terminated, ReceiveXlogStream() should check whether we've already
 reached the stop point by calling stream_stop() before returning TRUE.
 If we've not yet (this means that we've not received all required WAL
 files yet), ReceiveXlogStream() should return FALSE and
 pg_basebackup should emit an error message.  Comments?

 Doesn't it already return false because it detects the error of the
 connection? What's the codepath where we end up returning true even
 though we had a connection failure? Shouldn't that end up under the
 could not read copy data branch, which already returns false?

 You're right. If the error is detected, that function always returns false
 and the error message is emitted (but I think that current error message
 pg_basebackup: child process exited with error 1 is confusing),
 so it's OK. But if walsender in the server is terminated by SIGTERM,
 no error is detected and pg_basebackup background process gets out
 of the loop in ReceiveXlogStream() and returns true.

 Oh. Because the server does a graceful shutdown. D'uh, of course.

 Then yes, your suggested fix seems like a good one.

The attached patch adds the fix.

Also I found I had forgotten to set the file descriptor to -1 at the end of
ReceiveXlogStream() in my previously committed patch. The attached patch
fixes this problem.

Regards,

-- 
Fujii Masao


pgreceivexlog_check_stoppoint_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CompactCheckpointerRequestQueue versus pad bytes

2012-07-17 Thread Tom Lane
I wrote:
 I had thought that we might get a performance boost here by saving fsync
 queue traffic, but I see that md.c was already not calling
 register_dirty_segment for temp rels, so there's no joy there.

Actually, wait a second.  We were smart enough to not send fsync
requests in the first place for temp rels.  But we were not smart enough
to not call ForgetRelationFsyncRequests when deleting a temp rel,
which made for an entirely useless scan through the pending-fsyncs
table.  So there could be a win there, on top of not forwarding the actual
unlink operation.
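
A minimal sketch of that guard, assuming the 9.2-era md.c signatures
(RelFileNodeBackendIsTemp() and the rnode/forknum arguments as in that
tree):

	/* Skip the useless pending-fsyncs scan for temp relations. */
	if (!RelFileNodeBackendIsTemp(rnode))
		ForgetRelationFsyncRequests(rnode, forknum);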

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-07-17 Thread Alvaro Herrera

Excerpts from Kyotaro HORIGUCHI's message of Tue Jul 17 05:01:10 -0400 2012:

  I think that's probably too much engineering for something that doesn't
  really warrant it.  A real solution to this problem could be to create
  yet another new test file containing just this function definition and
  the query that calls it, and have one expected file for each encoding;
  but that's too much work and too many files, I'm afraid.
 
 I agree completely. The balance between the additional complexity
 of regress and the what we would get from the complexity...

I had to remove both that test and the one about the 0x80, because it
wasn't working for me in either SQL_ASCII or Latin1, I forget which.
I'm not sure I understand the reason for the failure -- I was getting a
false result instead of true, which was unexpected.  Maybe there's a
trivial explanation for this .. or maybe it really is broken.

In any case, maybe it'd be a good idea to have more tests related to
encodings, if we can write them in some reasonable manner.  But only in
HEAD, I guess, because having to backpatch stuff and test every branch
in at least three encodings is just too annoying.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Split contrib documentation into extensions and programs

2012-07-17 Thread Peter Eisentraut
On ons, 2012-05-09 at 14:44 -0400, Alvaro Herrera wrote:
 Excerpts from Peter Eisentraut's message of Wed May 09 13:54:53 -0400 2012:
  Split contrib documentation into extensions and programs
  
  Create separate appendixes for contrib extensions and other server
  plugins on the one hand, and utility programs on the other.  Recast
  the documentation of the latter as refentries, so that man pages are
  generated.
 
 I noticed that the pgupgrade manpage doesn't have the titles of each
 step in Usage.  Not sure this is a serious problem, but it's different
 from the HTML at any rate.

Fixed.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-17 Thread Jeff Davis
On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote:
 Could pg_upgrade emit WAL segment(s) to provide continuity of a
 timeline?  So something like:

By "segments" did you mean "records"?

 * Take down the writable primary for pg_upgrade
 * Some WAL is emitted and possibly archived
 * The old version, when reaching the special pg_upgrade WAL, could
 exit or report its situation having paused replay (as clearly, it
 cannot proceed). Unsure.

I don't really understand this step.

 * Start up a new version of postgres on the same cluster at that
 point, which plays the upgrade-WAL.
 
 I see this being pretty mechanically intensive, but right now my hands
 are completely tied as to achieving total continuity of my archives,
 costing a base-backup's worth of risk window upon upgrade.

Does "continuity of archives" mean "avoid downtime" or "maintain a
single WAL sequence"?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] isolation check takes a long time

2012-07-17 Thread Noah Misch
On Tue, Jul 17, 2012 at 01:56:19PM -0400, Alvaro Herrera wrote:
 Excerpts from Andrew Dunstan's message of Sun Jul 15 16:42:22 -0400 2012:
  I'm looking into that. But given that the default is to set
  max_prepared_transactions to 0, shouldn't we just remove that test from the
  normal installcheck schedule?

 That's a thought -- AFAIR we do provide a numeric_big test that's not
 exercised by the regular regress schedule, for a precedent.

It would be nice to have a pattern for adding tests run less often than every
commit but more often than whenever a human explicitly remembers the test
and invokes it manually.  Perhaps a schedule that the recommended buildfarm
configuration would somehow run every two weeks and before each release
(including betas and branch releases).

 However, there's more work to do in isolation testing.  It'd be good to
 have it being routinely run in serializable isolation level, for
 example, not just in read committed.

Except for the foreign key specs, isolation test specs request a specific
isolation level when starting their transactions.  Running such specs under
different default_transaction_isolation settings primarily confirms that
"BEGIN TRANSACTION ISOLATION LEVEL x" is indistinguishable from "BEGIN" under
default_transaction_isolation = x.  It might also discover transaction
isolation sensitivity in the setup/cleanup steps, which often omit explicit
transaction control.  I don't think such verification justifies regularly
running thousands of tests.  The foreign key tests, however, would benefit
from running under all three isolation levels.  Let's control it per-spec
instead of repeating the entire suite.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] New buildfarm client release

2012-07-17 Thread Andrew Dunstan
There is a new release of the PostgreSQL buildfarm client available at 
https://github.com/downloads/PGBuildFarm/client-code/build-farm-4_7.tgz


Most of the changes in the release are minor bug fixes. Enhancements 
include:


 * extra_config can now have a DEFAULT key, and these entries are
   prepended to any branch-specific entry. This means you can now set
   common extra config in one place instead of having to specify it
   for each branch. The sample entry now uses DEFAULT as its key instead
   of HEAD, and includes a line for fsync = off, which can improve the
   speed of buildfarm runs.
 * provide for include as well as exclude filters on file names for
   deciding if a run is needed. The old config setting of
   trigger_filter is still honored as if it were trigger_exclude, the
   new name.
 * allow skipping install steps as well as other steps.

Enjoy!

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-17 Thread Bruce Momjian
On Mon, Jul 16, 2012 at 05:29:26PM -0700, Jeff Davis wrote:
 On Tue, 2012-07-10 at 11:50 -0400, Bruce Momjian wrote:
  I don't think we can assume that because pg_upgrade was run on the
  master and standby that they are binary identical, can we?  Technically
  the user files are identical, but the system catalogs and WAL might be
  different, hence my suggestion to run rsync before allowing the standby
  to rejoin the primary.
 
 Do you have plans to change that in the future?
 
 If we know that the user data files are identical between primary and
 replica, it would be nice if we could provide a robust way to avoid
 copying them.

Well, rsync --checksum would work, but both systems have to be down for
that.  You could snapshot the down primary and rsync --checksum that
against the standby, but I am not sure how much that helps us.

I can't figure out how to make this work better without adding a whole
lot more code to pg_upgrade that might need adjustment for every minor
release, i.e. pg_upgrade knows nothing about the WAL file format, and I
want to keep it that way.

However, I have two ideas.  First, I don't know _why_ the
primary/standby would be any different after pg_upgrade, so I added the
documentation mention because I couldn't _guarantee_ they were the same.
Actually, if people can test this, we might be able to say this is safe.

Second, the user files (large) are certainly identical, it is only the
system tables (small) that _might_ be different, so rsync'ing just those
would add the guarantee, but I know of no easy way to rsync just the
system tables.

Does that help?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, while we are on the subject: hasn't this split completely broken
 the statistics about backend-initiated writes?

 Yes, it seems to have done just that.

So I went to fix this in the obvious way (attached), but while testing
it I found that the number of buffers_backend events reported during
a regression test run barely changed; which surprised the heck out of
me, so I dug deeper.  The cause turns out to be extremely scary:
ForwardFsyncRequest isn't getting called at all in the bgwriter process,
because the bgwriter process has a pendingOpsTable.  So it just queues
its fsync requests locally, and then never acts on them, since it never
runs any checkpoints anymore.

This implies that nobody has done pull-the-plug testing on either HEAD
or 9.2 since the checkpointer split went in (2011-11-01), because even
a modicum of such testing would surely have shown that we're failing to
fsync a significant fraction of our write traffic.

Furthermore, I would say that any performance testing done since then,
if it wasn't looking at purely read-only scenarios, isn't worth the
electrons it's written on.  In particular, any performance gain that
anybody might have attributed to the checkpointer splitup is very
probably hogwash.

This is not giving me a warm feeling about our testing practices.

As far as fixing the bug is concerned, the reason for the foulup
is that mdinit() looks to IsBootstrapProcessingMode() to decide
whether to create a pendingOpsTable.  That probably was all right
when it was coded, but what it means today is that *any* process
started via AuxiliaryProcessMain will have one; thus not only do
bgwriters have one, but so do walwriter and walreceiver processes;
which might not represent a bug today but it's pretty scary anyway.
I think we need to fix that so it's more directly dependent on the
auxiliary process type.  We can't use flags set by the respective
FooMain() functions, such as am_bg_writer, because mdinit is called
from BaseInit() which happens before reaching those functions.
My suggestion is that bootstrap.c ought to make the process's
AuxProcType value available and then mdinit should consult that to
decide what to do.  (Having done that, we might consider getting rid
of the retail process-type flags am_bg_writer etc.)
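
For illustration, a rough sketch of that direction (the exported
variable name and the helper are hypothetical; this is not a tested
patch):

/* In bootstrap.c (sketch): expose which auxiliary process this is. */
AuxProcType	MyAuxProcType;		/* set in AuxiliaryProcessMain() */

/* In md.c (sketch), replacing the IsBootstrapProcessingMode() test: */
static bool
md_needs_pending_ops(void)
{
	/*
	 * Only processes that absorb fsync requests at checkpoint time need
	 * a pendingOpsTable: the checkpointer, plus a standalone or
	 * bootstrap backend, which must do its own checkpoints.
	 */
	return !IsUnderPostmaster || MyAuxProcType == CheckpointerProcess;
}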

regards, tom lane

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5f93fccbfab1bbb8306f5de4ad228f3cb48b0862..41a7b2be4f680db08556d948eaaa002ed50119c5 100644
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
*************** BackgroundWriterMain(void)
*** 341,346 ****
--- 341,357 ----
  }
  
  
+ /*
+  * IsBackgroundWriterProcess
+  *		Return true if running in background writer process.
+  */
+ bool
+ IsBackgroundWriterProcess(void)
+ {
+ 	return am_bg_writer;
+ }
+ 
+ 
  /* --------------------------------
   *		signal handler routines
   * --------------------------------
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 92fd4276cd1b3be81d1ac741f9f6ea09d241ea52..bd1db4811d661d75e616bc77157c4a9e9b7e92fc 100644
*** a/src/backend/postmaster/checkpointer.c
--- b/src/backend/postmaster/checkpointer.c
*************** bool
*** 1124,1129 ****
--- 1124,1130 ----
  ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
  {
  	CheckpointerRequest *request;
+ 	bool		am_bg_writer;
  	bool		too_full;
  
  	if (!IsUnderPostmaster)
*************** ForwardFsyncRequest(RelFileNode rnode, F
*** 1131,1141 ****
  
  	if (am_checkpointer)
  		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
  
  	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
  
  	/* Count all backend writes regardless of if they fit in the queue */
! 	CheckpointerShmem->num_backend_writes++;
  
  	/*
  	 * If the checkpointer isn't running or the request queue is full, the
--- 1132,1144 ----
  
  	if (am_checkpointer)
  		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+ 	am_bg_writer = IsBackgroundWriterProcess();
  
  	LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
  
  	/* Count all backend writes regardless of if they fit in the queue */
! 	if (!am_bg_writer)
! 		CheckpointerShmem->num_backend_writes++;
  
  	/*
  	 * If the checkpointer isn't running or the request queue is full, the
*************** ForwardFsyncRequest(RelFileNode rnode, F
*** 1150,1156 ****
  		 * Count the subset of writes where backends have to do their own
  		 * fsync
  		 */
! 	CheckpointerShmem->num_backend_fsync++;
  		LWLockRelease(CheckpointerCommLock);
  		return false;
  	}
--- 1153,1160 ----
  		 * Count the subset of writes where backends have to do their own
  		 * fsync
  		 */
! 		if (!am_bg_writer)
! 			CheckpointerShmem->num_backend_fsync++;
  		LWLockRelease(CheckpointerCommLock);
  		return false;
  	}
diff 

Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Peter Geoghegan
On 17 July 2012 23:56, Tom Lane t...@sss.pgh.pa.us wrote:
 This implies that nobody has done pull-the-plug testing on either HEAD
 or 9.2 since the checkpointer split went in (2011-11-01), because even
 a modicum of such testing would surely have shown that we're failing to
 fsync a significant fraction of our write traffic.

 Furthermore, I would say that any performance testing done since then,
 if it wasn't looking at purely read-only scenarios, isn't worth the
 electrons it's written on.  In particular, any performance gain that
 anybody might have attributed to the checkpointer splitup is very
 probably hogwash.

 This is not giving me a warm feeling about our testing practices.

The checkpointer split-up was not justified as a performance
optimisation so much as a re-factoring effort that might have some
concomitant performance benefits. While I agree that it is regrettable
that this was allowed to go undetected for so long, I do not find it
especially surprising that some performance testing results post-split
didn't strike somebody as fool's gold. Much of the theory surrounding
checkpoint tuning, if followed, results in relatively little work
being done during the sync phase of a checkpoint, especially if an I/O
scheduler like deadline is used.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-17 Thread Daniel Farina
On Tue, Jul 17, 2012 at 11:55 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote:
 Could pg_upgrade emit WAL segment(s) to provide continuity of a
 timeline?  So something like:

 By segments did you mean records?

Yes.  It would be nicer not to have to tie it to the WAL segment file size.

 * Take down the writable primary for pg_upgrade
 * Some WAL is emitted and possibly archived
 * The old version, when reaching the special pg_upgrade WAL, could
 exit or report its situation having paused replay (as clearly, it
 cannot proceed). Unsure.

 I don't really understand this step.

"Some WAL is emitted and possibly archived" needs a subject in that fragment:

pg_upgrade somehow (directly or indirectly) emits and/or archives
WAL used to complete the binary upgrade.  That means that it should
appear in the WAL stream and be subject to archive_command, like any
other WAL.

The sticky part is what the standby should do when it encounters the
special wal-upgrade records.  It should probably pause replay to allow
some other program to stop the old postgres version and start the new
version with the same cluster.

 * Start up a new version of postgres on the same cluster at that
 point, which plays the upgrade-WAL.

 I see this being pretty mechanically intensive, but right now my hands
 are completely tied as to achieving total continuity of my archives,
 costing a base-backup's worth of risk window upon upgrade.

 Does "continuity of archives" mean "avoid downtime" or "maintain a
 single WAL sequence"?

The latter.

-- 
fdr



[HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Craig Ringer

On 07/18/2012 06:56 AM, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:

BTW, while we are on the subject: hasn't this split completely broken
the statistics about backend-initiated writes?

Yes, it seems to have done just that.

So I went to fix this in the obvious way (attached), but while testing
it I found that the number of buffers_backend events reported during
a regression test run barely changed; which surprised the heck out of
me, so I dug deeper.  The cause turns out to be extremely scary:
ForwardFsyncRequest isn't getting called at all in the bgwriter process,
because the bgwriter process has a pendingOpsTable.  So it just queues
its fsync requests locally, and then never acts on them, since it never
runs any checkpoints anymore.

This implies that nobody has done pull-the-plug testing on either HEAD
or 9.2 since the checkpointer split went in (2011-11-01)


That makes me wonder if on top of the buildfarm, extending some 
buildfarm machines into a "crashfarm" is needed:


- Keep kvm instances with copy-on-write snapshot disks and the build env 
on them

- Fire up the VM, do a build, and start the server
- From outside the vm have the test controller connect to the server and 
start a test run

- Hard-kill the OS instance at a random point in time.
- Start the OS instance back up
- Start Pg back up and connect to it again
- From the test controller, test the Pg install for possible corruption 
by reading the indexes and tables, doing some test UPDATEs, etc. (a rough 
controller sketch follows below).

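A minimal host-side sketch of that loop (assuming qemu-img overlay images, 
an SSH-reachable guest named pgtest-vm, and a hypothetical guest-side 
payload.py script; none of this is an existing buildfarm tool):

#!/usr/bin/env python
# Host-side crash-test controller, sketch only.
import os, random, signal, subprocess, time

def boot_vm(overlay):
    # cache=writethrough: whatever the guest flushes really reaches the
    # host disk, so the SIGKILL below acts like a power cut for the guest.
    return subprocess.Popen([
        "qemu-system-x86_64", "-enable-kvm", "-m", "2048",
        "-display", "none",
        "-drive", "file=%s,cache=writethrough" % overlay])

# Copy-on-write overlay over the pristine base image that holds the build
# env, so the base is never dirtied and every run starts from a known state.
subprocess.check_call(["qemu-img", "create", "-f", "qcow2",
                       "-b", "pg-base.qcow2", "run.qcow2"])

vm = boot_vm("run.qcow2")
time.sleep(60)                     # crude; real code would poll for sshd

# Build, start the server, and kick off the test run inside the guest.
subprocess.check_call(["ssh", "pgtest-vm",
                       "nohup payload.py before-crash >payload.log 2>&1 &"])

# Hard-kill the OS instance at a random point in time.
time.sleep(random.uniform(10, 600))
os.kill(vm.pid, signal.SIGKILL)
vm.wait()

# Start the same (still dirty) disk back up and check for corruption.
vm = boot_vm("run.qcow2")
time.sleep(60)
subprocess.check_call(["ssh", "pgtest-vm", "payload.py after-crash"])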

The main challenge would be coming up with suitable tests to run, ones 
that could then be checked to make sure nothing was broken. The test 
controller would know how far a test got before the  OS got killed and 
would know which test it was running, so it'd be able to check for 
expected data if provided with appropriate test metadata. Use of enable_ 
flags should permit scans of indexes and table heaps to be forced.
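
One possible probe of that sort (a sketch; it assumes psycopg2 and the 
regression database's tenk1 table, which has an index on unique1, and note 
that the enable_* flags only discourage a plan shape rather than strictly 
forbidding it):

import psycopg2

conn = psycopg2.connect(dbname="regression")
cur = conn.cursor()

def count(use_index):
    # Steer the planner toward one access path at a time.
    cur.execute("SET enable_seqscan = %s" % ("off" if use_index else "on"))
    cur.execute("SET enable_indexscan = %s" % ("on" if use_index else "off"))
    cur.execute("SET enable_bitmapscan = %s" % ("on" if use_index else "off"))
    # tenk1.unique1 is indexed, so this qual admits an index(-only) scan.
    cur.execute("SELECT count(*) FROM tenk1 WHERE unique1 >= 0")
    return cur.fetchone()[0]

# The heap and the index should tell the same story after recovery.
if count(True) != count(False):
    raise SystemExit("heap vs. index disagreement in tenk1")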


What else should be checked? The main thing that comes to mind for me is 
something I've worried about for a while: that Pg might not always 
handle out-of-disk-space anywhere near as gracefully as it's often 
claimed to. There's no automated testing for that, so it's hard to 
really know. A harnessed VM could be used to test that. Instead of 
virtual plug pull tests it could generate a virtual disk of constrained 
random size, run its tests until out-of-disk caused failure, stop Pg, 
expand the disk, restart Pg, and run its checks.


Variants where WAL was on a separate disk and only WAL or only the main 
non-WAL disk run out of space would also make sense and be easy to 
produce with such a harness.


I've written some automated kvm test harnesses, so I could have a play 
with this idea. I would probably need some help with the test design, 
though, and the guest OS would be Linux, Linux, or Linux at least to 
start with.


Opinions?

--
Craig Ringer



Re: [HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Tom Lane
Craig Ringer ring...@ringerc.id.au writes:
 On 07/18/2012 06:56 AM, Tom Lane wrote:
 This implies that nobody has done pull-the-plug testing on either HEAD
 or 9.2 since the checkpointer split went in (2011-11-01)

 That makes me wonder if on top of the buildfarm, extending some 
 buildfarm machines into a "crashfarm" is needed:

Not sure if we need a whole farm, but certainly having at least one
machine testing this sort of stuff on a regular basis would make me feel
a lot better.

 The main challenge would be coming up with suitable tests to run, ones 
 that could then be checked to make sure nothing was broken.

One fairly simple test scenario could go like this (sketched in code after 
the list):

* run the regression tests
* pg_dump the regression database
* run the regression tests again
* hard-kill immediately upon completion
* restart database, allow it to perform recovery
* pg_dump the regression database
* diff previous and new dumps; should be the same

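As a sketch, the guest-side payload for that scenario might look like this 
(paths and the phase argument are assumptions; the hard-kill itself has to 
come from outside the VM, since the point is to lose the OS write cache):

#!/usr/bin/env python
# Guest-side payload for the dump/crash/recover/dump scenario, sketch only.
import subprocess, sys

def run(cmd):
    subprocess.check_call(cmd, shell=True)

if sys.argv[1] == "before-crash":
    run("make -C ~/postgresql/src/test/regress installcheck")
    run("pg_dump -f /var/tmp/dump1.sql regression")
    run("sync")          # make sure dump1 itself survives the coming crash
    run("make -C ~/postgresql/src/test/regress installcheck")
    print("READY")       # the controller sees this and hard-kills the VM
else:
    # After reboot: start the server so it performs crash recovery.
    run("pg_ctl -D $PGDATA -w start")
    run("pg_dump -f /var/tmp/dump2.sql regression")
    # A complete regression run always leaves the database in the same end
    # state, so if recovery worked the two dumps should be identical.
    run("diff -u /var/tmp/dump1.sql /var/tmp/dump2.sql")
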
The main thing this wouldn't cover is discrepancies in user indexes,
since pg_dump doesn't do anything that's likely to result in indexscans
on user tables.  It ought to be enough to detect the sort of system-wide
problem we're talking about here, though.

In general I think the hard part is automated reproduction of an
OS-crash scenario, but your ideas about how to do that sound promising.
Once we have that going, it shouldn't be hard to come up with tests
of the form "do X, hard-crash, recover, check X still looks sane".

 What else should be checked? The main thing that comes to mind for me is 
 something I've worried about for a while: that Pg might not always 
 handle out-of-disk-space anywhere near as gracefully as it's often 
 claimed to.

+1

regards, tom lane



[HACKERS] During Xlog replaying, is there maybe emitted xlog?

2012-07-17 Thread xu2002261
Hi all,

 I reviewed the source code, and saw the following calling path:

 StartupXLOG() -> StartupDatabase() -> RmgrTable[rmid].rm_cleanup() -> 
btree_xlog_cleanup() -> _bt_insert_parent() -> _bt_insertonpg() -> XLogInsert()

As we can see, XLog may be emitted during xlog replay. So is there some 
*ISSUE* in the above calling stack?


thanks.

Date: 2012-07-18 



xu2002261 


Re: [HACKERS] During Xlog replaying, is there maybe emitted xlog?

2012-07-17 Thread Tom Lane
xu2002261 xu2002...@163.com writes:
 Hi all,
  I reviewed the source code, and saw the following calling path:

  StartupXLOG() -> StartupDatabase() -> RmgrTable[rmid].rm_cleanup() -> 
 btree_xlog_cleanup() -> _bt_insert_parent() -> _bt_insertonpg() -> XLogInsert()

 As we can see, XLog may be emitted during xlog replay. So is there 
 some *ISSUE* in the above calling stack?

No, it's entirely correct.  That path isn't during replay, it's upon
completion of replay, where we're cleaning up anything that failed to be
completed before the crash.  Emitting more WAL is allowed then.  Note
the comment a few lines above the rm_cleanup calls:

	/*
	 * Resource managers might need to write WAL records, eg, to record
	 * index cleanup actions.  So temporarily enable XLogInsertAllowed in
	 * this process only.
	 */

regards, tom lane



Re: [HACKERS] During Xlog replaying, is there maybe emitted xlog?

2012-07-17 Thread xu2002261
Thanks a lot. 

Oops, indeed, the cleanup stage is not part of XLog replay, so there is no 
problem.


2012-07-18 



xu2002261 



From: Tom Lane 
Sent: 2012-07-18 10:05:26 
To: xu2002261 
Cc: pgsql-hackers 
Subject: Re: [HACKERS] During Xlog replaying, is there maybe emitted xlog? 
 
xu2002261 xu2002...@163.com writes:
 Hi all,
  I reviewed the source code, and saw the following calling path:
  StartupXLOG() -> StartupDatabase() -> RmgrTable[rmid].rm_cleanup() -> 
 btree_xlog_cleanup() -> _bt_insert_parent() -> _bt_insertonpg() -> XLogInsert()
 As we can see, XLog may be emitted during xlog replay. So is there 
 some *ISSUE* in the above calling stack?
No, it's entirely correct.  That path isn't during replay, it's upon
completion of replay, where we're cleaning up anything that failed to be
completed before the crash.  Emitting more WAL is allowed then.  Note
the comment a few lines above the rm_cleanup calls:
/*
 * Resource managers might need to write WAL records, eg, to record
 * index cleanup actions.  So temporarily enable XLogInsertAllowed in
 * this process only.
 */
regards, tom lane


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-17 Thread Greg Smith

On 07/16/2012 02:39 PM, Robert Haas wrote:

Unfortunately, there are lots of important operations (like bulk
loading, SELECT * FROM bigtable, and VACUUM notverybigtable) that
inevitably end up writing out their own dirty buffers.  And even when
the background writer does write something, it's not always clear that
this is a positive thing.  Here's Greg Smith commenting on the
more-is-worse phenomenon:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00564.php


You can add crash recovery to the list of things where the interaction 
with the OS write cache matters a lot too, something I just took a 
beating and learned from recently.  Since the recovery process is 
essentially one giant unified backend, how effectively the background 
writer and/or checkpointer move writes from recovery to themselves is 
really important.  It's a bit easier to characterize than a complicated 
mixed set of clients, which has given me a couple of ideas to chase down.


What I've been doing for much of the last month (instead of my original 
plan of reviewing patches) is digging to the bottom of characterizing 
that behavior under high pressure.  It provides an even easier way to compare 
multiple write strategies at the OS level than regular pgbench-like 
benchmarks.  Recovery playback with a different tuning becomes as simple 
as rolling back to a simple base backup and replaying all the WAL, 
possibly including some number of bulk operations that showed up.  You 
can measure that speed instead of transaction-level throughput.  I'm 
seeing the same ~100% difference in performance between various Linux 
tunings on recovery as I was getting on VACUUM tests, and it's a whole 
lot easier to set up and (ahem) replicate the results.  I'm putting 
together a playback time benchmark based on this observation.

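A sketch of that playback measurement (assuming a 9.x-style recovery.conf 
pointing restore_command at an archive of the WAL; all paths are 
illustrative):

#!/usr/bin/env python
# Time how long recovery takes to replay a fixed WAL archive, sketch only.
import shutil, subprocess, time

DATADIR = "/srv/pgdata"
BASEBACKUP = "/srv/base-backup"      # untarred result of a base backup
WAL_ARCHIVE = "/srv/wal-archive"

# Roll back to the pristine base backup.
shutil.rmtree(DATADIR, ignore_errors=True)
shutil.copytree(BASEBACKUP, DATADIR)

# Point recovery at the archived WAL; startup replays all of it.
with open(DATADIR + "/recovery.conf", "w") as f:
    f.write("restore_command = 'cp %s/%%f %%p'\n" % WAL_ARCHIVE)

start = time.time()
# -w waits until the server accepts connections, i.e. replay is finished,
# so the elapsed time is (nearly) pure recovery playback.
subprocess.check_call(["pg_ctl", "-D", DATADIR, "-w", "-t", "86400", "start"])
print("replay took %.1f seconds" % (time.time() - start))
subprocess.check_call(["pg_ctl", "-D", DATADIR, "-m", "fast", "stop"])

Re-running that after each change to the OS writeback knobs gives a direct 
comparison of write strategies without a client workload in the way.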

The fact that I have servers all over the place now with 64GB worth of 
RAM has turned the topic of how much dirty memory should be used for 
write caching into a hot item for me again in general too.  If I live 
through 9.3 development, I expect to see a lot more ideas about how to 
deal with this whole area play out in the upcoming months.  I could 
really use a cool day to sit outside thinking about it right now.



Jeff Janes and I came up with what I believe to be a plausible
explanation for the problem:

http://archives.postgresql.org/pgsql-hackers/2012-03/msg00356.php

I kinda think we ought to be looking at fixing that for 9.2, and
perhaps even back-patching further, but nobody else seemed terribly
excited about it.


FYI, I never rejected any of that thinking, I just haven't chewed on 
what you two were proposing.  If that's still something you think should 
be revisited for 9.2, I'll take a longer look at it.  My feeling on this 
so far has really been that the write blocking issues are much larger 
than the exact logic used by the background writer in the code path you 
were highlighting, which I always saw as more active/important during 
idle periods.  This whole area needs to get a complete overhaul during 
9.3 though, especially since there are plenty of people who want to fit 
checksum writes into that path too.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Greg Smith

On 07/17/2012 06:56 PM, Tom Lane wrote:

So I went to fix this in the obvious way (attached), but while testing
it I found that the number of buffers_backend events reported during
a regression test run barely changed; which surprised the heck out of
me, so I dug deeper.  The cause turns out to be extremely scary:
ForwardFsyncRequest isn't getting called at all in the bgwriter process,
because the bgwriter process has a pendingOpsTable.


When I did my testing early this year to look at checkpointer 
performance (among other 9.2 write changes like group commit), I did see 
some cases where buffers_backend was dramatically different on 9.2 vs. 
9.1.  There were plenty of cases where the totals across a 10 minute 
pgbench were almost identical though, so this issue didn't stick out 
then.  That's a very different workload than the regression tests though.



This implies that nobody has done pull-the-plug testing on either HEAD
or 9.2 since the checkpointer split went in (2011-11-01), because even
a modicum of such testing would surely have shown that we're failing to
fsync a significant fraction of our write traffic.


Ugh.  Most of my pull the plug testing the last six months has been 
focused on SSD tests with older versions.  I want to duplicate this (and 
any potential fix) now that you've highlighted it.



Furthermore, I would say that any performance testing done since then,
if it wasn't looking at purely read-only scenarios, isn't worth the
electrons it's written on.  In particular, any performance gain that
anybody might have attributed to the checkpointer splitup is very
probably hogwash.


There hasn't been any performance testing that suggested the 
checkpointer splitup was justified.  The stuff I did showed it being 
flat out negative for a subset of pgbench oriented cases, which didn't 
seem real-world enough to disprove it as the right thing to do though.


I thought there were two valid justifications for the checkpointer split 
(which is not a feature I have any corporate attachment to--I'm as 
isolated from how it was developed as you are).  The first is that it 
seems like the right architecture to allow reworking checkpoints and 
background writes for future write path optimization.  A good chunk of 
the time when I've tried to improve one of those (like my spread sync 
stuff from last year), the code was complicated by the background writer 
needing to follow the drum of checkpoint timing, and vice-versa.  Being 
able to hack on those independently got a sigh of relief from me.  And 
while this adds some code duplication in things like the process setup, 
I thought the result would be cleaner for people reading the code to 
follow too.  This problem is terrible, but I think part of how it crept 
in is that the single checkpoint+background writer process was doing way 
too many things to even follow all of them some days.


The second justification for the split was that it seems easier to get a 
low power result from, which I believe was the angle Peter Geoghegan was 
working when this popped up originally.  The checkpointer has to run 
sometimes, but only at a 50% duty cycle as it's tuned out of the box.  
It seems nice to be able to approach that in a way that's power 
efficient without coupling it to whatever heartbeat the BGW is running 
at.  I could even see people changing the frequencies for each 
independently depending on expected system load.  Tune for lower power 
when you don't expect many users, that sort of thing.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-17 Thread Bruce Momjian
On Tue, Jul 17, 2012 at 04:49:39PM -0700, Daniel Farina wrote:
 On Tue, Jul 17, 2012 at 11:55 AM, Jeff Davis pg...@j-davis.com wrote:
  On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote:
  Could pg_upgrade emit WAL segment(s) to provide continuity of a
  timeline?  So something like:
 
  By "segments" did you mean "records"?
 
 Yes.  It would be nicer not to have to tie it to the WAL segment file size.
 
  * Take down the writable primary for pg_upgrade
  * Some WAL is emitted and possibly archived
  * The old version, when reaching the special pg_upgrade WAL, could
  exit or report its situation having paused replay (as clearly, it
  cannot proceed). Unsure.
 
  I don't really understand this step.
 
 "Some WAL is emitted and possibly archived" needs a subject in that fragment:
 
 pg_upgrade somehow (directly, or indirectly) emits and/or archives
 WAL used to complete binary-upgrade.  That means that it should
 appear in the WAL stream and be subject to archive_command, like any
 other WAL.
 
 The sticky part is what the standby should do when it encounters the
 special wal-upgrade records.  It should probably pause replay to allow
 some other program to stop the old postgres version and start the new
 version with the same cluster.

WAL is not guaranteed to be the same between PG major versions, so doing
anything with WAL is pretty much a no-go.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Craig Ringer

On 07/18/2012 12:00 PM, Greg Smith wrote:

The second justification for the split was that it seems easier to get 
a low power result from, which I believe was the angle Peter Geoghegan 
was working when this popped up originally.  The checkpointer has to 
run sometimes, but only at a 50% duty cycle as it's tuned out of the 
box.  It seems nice to be able to approach that in a way that's power 
efficient without coupling it to whatever heartbeat the BGW is running 
at.  I could even see people changing the frequencies for each 
independently depending on expected system load.  Tune for lower power 
when you don't expect many users, that sort of thing.


Yeah - I'm already seeing benefits from that on my laptop, with much 
less need to stop Pg when I'm not using it.


--
Craig Ringer





Re: [HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Craig Ringer

On 07/18/2012 08:31 AM, Tom Lane wrote:

Not sure if we need a whole farm, but certainly having at least one
machine testing this sort of stuff on a regular basis would make me feel
a lot better.


OK. That's something I can actually be useful for.

My current qemu/kvm test harness control code is in Python since that's 
what all the other tooling for the project I was using it for is in. Is 
it likely to be useful for me to adapt that code for use for a Pg 
crash-test harness, or will you need a particular tool/language to be 
used? If so, which/what? I'll do pretty much anything except Perl. I'll 
have a result for you more quickly working in Python, though I'm happy 
enough to write it in C (or Java, but I'm guessing that won't get any 
enthusiasm around here).



One fairly simple test scenario could go like this:

* run the regression tests
* pg_dump the regression database
* run the regression tests again
* hard-kill immediately upon completion
* restart database, allow it to perform recovery
* pg_dump the regression database
* diff previous and new dumps; should be the same

The main thing this wouldn't cover is discrepancies in user indexes,
since pg_dump doesn't do anything that's likely to result in indexscans
on user tables.  It ought to be enough to detect the sort of system-wide
problem we're talking about here, though.


It also won't detect issues that only occur during certain points in 
execution, under concurrent load, etc. Still, a start, and I could look 
at extending it into some kind of crash fuzzing once the basics were 
working.



In general I think the hard part is automated reproduction of an
OS-crash scenario, but your ideas about how to do that sound promising.


It's worked well for other testing I've done. Any writes that're still 
in the guest OS's memory, write queues, etc are lost when kvm is killed, 
just like a hard crash. Anything the kvm guest has flushed to disk is 
on the host and preserved - either on the host's disks 
(cache=writethrough) or at least in dirty writeback buffers in ram 
(cache=writeback).


kvm can even do a decent job of simulating a BBU-equipped write-through 
volume by allowing the host OS to do write-back caching of KVM's backing 
device/files. You don't get to set a max write-back cache size directly, 
but Linux I/O writeback settings provide some control.


My favourite thing about kvm is that it's just another command. It can 
be run headless and controlled via virtual serial console and/or its 
monitor socket. It doesn't require special privileges and can operate on 
ordinary files. It's very well suited for hooking into test harnesses.

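For instance (a sketch; the image and socket names are illustrative), those 
pieces are all plain command-line options:

#!/usr/bin/env python
# Launch a headless kvm guest on an ordinary file and talk to its monitor.
import socket, subprocess, time

cache = "writethrough"    # or "writeback" to emulate a volatile write cache
vm = subprocess.Popen([
    "qemu-system-x86_64", "-enable-kvm", "-m", "1024",
    "-display", "none",                             # headless
    "-drive", "file=run.qcow2,cache=%s" % cache,    # plain file, no root needed
    "-monitor", "unix:/tmp/pg-vm.sock,server,nowait"])

time.sleep(5)                                       # let the socket appear
mon = socket.socket(socket.AF_UNIX)
mon.connect("/tmp/pg-vm.sock")
mon.sendall("system_reset\n")                       # clean hard reboot...
# ...whereas a plug-pull is still just: os.kill(vm.pid, signal.SIGKILL)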

The only challenge with using kvm/qemu is that there have been some 
breaking changes and a couple of annoying bugs that mean I won't be able 
to support anything except pretty much the latest versions initially. 
kvm is easy to compile and has limited dependencies, so I don't expect 
that to be an issue, but thought it was worth raising.


--
Craig Ringer



Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-17 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 On 07/17/2012 06:56 PM, Tom Lane wrote:
 Furthermore, I would say that any performance testing done since then,
 if it wasn't looking at purely read-only scenarios, isn't worth the
 electrons it's written on.  In particular, any performance gain that
 anybody might have attributed to the checkpointer splitup is very
 probably hogwash.

 There hasn't been any performance testing that suggested the 
 checkpointer splitup was justified.  The stuff I did showed it being 
 flat out negative for a subset of pgbench oriented cases, which didn't 
 seem real-world enough to disprove it as the right thing to do though.

Just to clarify, I'm not saying that this means we should revert the
checkpointer split.  What I *am* worried about is that we may have been
hacking other things on the basis of faulty performance tests.

regards, tom lane
