Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Simon Riggs
On Mon, Oct 24, 2011 at 7:13 AM, Florian Pflug  wrote:

> I think Simon's theory that we're starting recovery from the wrong place,
> i.e. should start with an earlier WAL location, is probably correct. The
> question is, why?

Err, that's not what I said, and I don't mean that. Having said that,
what I said about pg_control being invalid would imply it, so that was
wrong as well.

We are starting recovery at the right place, but we are initialising
the clog and subtrans incorrectly. Specifically, the oldestActiveXid is
being derived later than it should be, which can cause problems if
this then means that whole pages are uninitialised in subtrans. The bug
only shows up if you do enough transactions (2048 is always enough) to
move to the next subtrans page between the redo pointer and the
checkpoint record, while at the same time we do not have a long-running
transaction that spans those two points. That's just enough to happen
reasonably frequently on busy systems and yet just enough to have
slipped through testing.

We must either:

1. Derive oldestActiveXid before we derive the redo location during
CreateCheckpoint(), or

2. Change the way subtrans pages are initialized during recovery so we
don't rely on oldestActiveXid.

I need to think some more before making up my mind on this, but I lean
towards doing (1) as a longer-term fix and doing (2) as a short-term
fix for existing releases. I expect to have a fix later today.
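
For illustration, option (1) would look roughly like this inside
CreateCheckpoint() -- a sketch of the ordering change only, not a patch.
GetRedoLocationForCheckpoint() is a placeholder name for the existing
redo-location logic, while GetOldestActiveTransactionId() is the real
procarray function:

    CheckPoint  checkPoint;

    /* derive the xid horizon first ... */
    checkPoint.oldestActiveXid = GetOldestActiveTransactionId();

    /*
     * ... and only then fix the redo location.  Any transaction that
     * starts between these two points has an xid >= oldestActiveXid,
     * so its subtrans page falls inside the range that recovery
     * initializes.
     */
    checkPoint.redo = GetRedoLocationForCheckpoint();   /* placeholder */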

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] debug query execution

2011-10-25 Thread vadym nikolaiev
Hi.

I would like to ask which source files are responsible for executing queries
in PostgreSQL? (I would like to run a simple query and debug its execution on
the PostgreSQL server to understand how PostgreSQL does query processing
internally.)

regards,
Vadym
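
For reference, a rough map of the path a simple query takes through the
backend (a sketch based on the source layout; these functions exist and
make good first breakpoints when stepping through with a debugger):

    /*
     * src/backend/tcop/postgres.c
     *     PostgresMain()               -- per-backend main loop
     *     exec_simple_query()          -- entry for a simple query message
     *         pg_parse_query()         -- raw parser (src/backend/parser/)
     *         pg_analyze_and_rewrite() -- parse analysis and rewriting
     *         pg_plan_queries()        -- planner (src/backend/optimizer/)
     *         PortalRun()              -- executor (src/backend/executor/)
     */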


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 3:44 PM, Heikki Linnakangas
 wrote:
 +
 +      Again connect to the database as a superuser, and execute
 +      pg_stop_backup. This terminates the backup mode, but does not
 +      perform a switch to the next WAL segment, create a backup history
 +      file, or wait for all required WAL segments to be archived,
 +      unlike during normal processing.
 +
 +
>>>
>>> How do you ensure that all the required WAL segments have been archived,
>>> then?
>>
>> The patch doesn't provide any capability to ensure that, IOW it assumes
>> that's a user responsibility. If a user wants to ensure that, he/she needs
>> to calculate the backup start and end WAL files from the result of
>> pg_start_backup() and pg_stop_backup() respectively, and needs to wait
>> until those files have appeared in the archive. Also if the required WAL
>> file has not been archived yet, a user might need to execute
>> pg_switch_xlog() in the master.
>
> Frankly, I think this whole thing is too fragile. The procedure is
> superficially similar to what you do on master: run pg_start_backup(), rsync
> data directory, run pg_stop_backup(), but is actually subtly different and
> more complicated. If you don't know that, and don't follow the full
> procedure, you get a corrupt backup. And the backup might look ok, and might
> even sometimes work, which means that you won't notice in quick testing.
> That's a *huge* foot-gun.
>
> I think we need to step back and find a way to make this:
> a) less complicated, or at least
> b) more robust, so that if you don't follow the procedure, you get an error.

One idea to make this more robust is to change PostgreSQL so that
it writes buffer pages to a temporary space instead of the database files
during a backup. This means that there are no torn pages in the database files
of the backup. After the backup, the data blocks are written back to the
database files over time. When recovery starts from that backup (i.e.,
backup_label is found), it first clears the temporary space in the backup and
then continues recovery using the database files, which contain no torn pages.
OTOH, in crash recovery (i.e., backup_label is not found), recovery is
performed using both the database files and the temporary space. This whole
approach would make the standby-only backup available even if FPW is disabled
in the master, and you wouldn't need to care about the order in which the
control file is backed up.

But this idea looks like overkill. It seems very complicated to implement, and
likely to invite other bugs. I don't have any other good, simple idea for now.

> With pg_basebackup, we have a fighting chance of getting this right, because
> we have more control over how the backup is made. For example, we can
> co-operate with the buffer manager to avoid torn-pages, eliminating the need
> for full_page_writes=on, and we can include a control file with the correct
> end-of-backup location automatically, without requiring user intervention.
> pg_basebackup is less flexible than the pg_start/stop_backup method, and
> unfortunately you're more likely to need the flexibility in a more
> complicated setup with a hot standby server and all, but making the generic
> pg_start/stop_backup method work seems infeasible at the moment.

Yes, so should we give up supporting the manual procedure, and extend
pg_basebackup for the standby-only backup first? I can live with this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Simon Riggs
On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs  wrote:

> We are starting recovery at the right place, but we are initialising
> the clog and subtrans incorrectly. Specifically, the oldestActiveXid is
> being derived later than it should be, which can cause problems if
> this then means that whole pages are uninitialised in subtrans. The bug
> only shows up if you do enough transactions (2048 is always enough) to
> move to the next subtrans page between the redo pointer and the
> checkpoint record, while at the same time we do not have a long-running
> transaction that spans those two points. That's just enough to happen
> reasonably frequently on busy systems and yet just enough to have
> slipped through testing.
>
> We must either:
>
> 1. Derive oldestActiveXid before we derive the redo location during
> CreateCheckpoint(), or
>
> 2. Change the way subtrans pages are initialized during recovery so we
> don't rely on oldestActiveXid.
>
> I need to think some more before making up my mind on this, but I lean
> towards doing (1) as a longer-term fix and doing (2) as a short-term
> fix for existing releases. I expect to have a fix later today.

(1) looks the best way forwards in all cases.

Patch attached. Will be backpatched to 9.0.

I think it is possible to avoid taking XidGenLock during
GetRunningTransactions() now, but I haven't included that change in
this patch.

Any other comments before commit?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


oldestActiveXid_fixed.v1.patch
Description: Binary data



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Magnus Hagander
2011/10/25 Shigeru Hanada :
> I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
> contrib module.  I think that this module would be the basis of further
> SQL/MED development for core, e.g. join-push-down and ANALYZE support.

I have not looked at the code itself, but I wonder if we shouldn't
consider making this a part of core-proper, not just a contrib module.
The fact that it isn't *already* available in core surprises a lot of
people...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 10:50, Fujii Masao  wrote:
> On Tue, Oct 25, 2011 at 3:44 PM, Heikki Linnakangas
>  wrote:
> +
> +      Again connect to the database as a superuser, and execute
> +      pg_stop_backup. This terminates the backup mode, but does not
> +      perform a switch to the next WAL segment, create a backup history
> +      file, or wait for all required WAL segments to be archived,
> +      unlike during normal processing.
> +
> +

>>>> How do you ensure that all the required WAL segments have been archived,
>>>> then?
>>>
>>> The patch doesn't provide any capability to ensure that, IOW it assumes
>>> that's a user responsibility. If a user wants to ensure that, he/she needs
>>> to calculate the backup start and end WAL files from the result of
>>> pg_start_backup() and pg_stop_backup() respectively, and needs to wait
>>> until those files have appeared in the archive. Also if the required WAL
>>> file has not been archived yet, a user might need to execute
>>> pg_switch_xlog() in the master.
>>
>> Frankly, I think this whole thing is too fragile. The procedure is
>> superficially similar to what you do on master: run pg_start_backup(), rsync
>> data directory, run pg_stop_backup(), but is actually subtly different and
>> more complicated. If you don't know that, and don't follow the full
>> procedure, you get a corrupt backup. And the backup might look ok, and might
>> even sometimes work, which means that you won't notice in quick testing.
>> That's a *huge* foot-gun.
>>
>> I think we need to step back and find a way to make this:
>> a) less complicated, or at least
>> b) more robust, so that if you don't follow the procedure, you get an error.
>
> One idea to make this more robust is to change PostgreSQL so that
> it writes buffer pages to a temporary space instead of the database files
> during a backup. This means that there are no torn pages in the database
> files of the backup. After the backup, the data blocks are written back to
> the database files over time. When recovery starts from that backup (i.e.,
> backup_label is found), it first clears the temporary space in the backup
> and then continues recovery using the database files, which contain no torn
> pages. OTOH, in crash recovery (i.e., backup_label is not found), recovery
> is performed using both the database files and the temporary space. This
> whole approach would make the standby-only backup available even if FPW is
> disabled in the master, and you wouldn't need to care about the order in
> which the control file is backed up.
>
> But this idea looks like overkill. It seems very complicated to implement,
> and likely to invite other bugs. I don't have any other good, simple idea
> for now.
>
>> With pg_basebackup, we have a fighting chance of getting this right, because
>> we have more control over how the backup is made. For example, we can
>> co-operate with the buffer manager to avoid torn-pages, eliminating the need
>> for full_page_writes=on, and we can include a control file with the correct
>> end-of-backup location automatically, without requiring user intervention.
>> pg_basebackup is less flexible than the pg_start/stop_backup method, and
>> unfortunately you're more likely to need the flexibility in a more
>> complicated setup with a hot standby server and all, but making the generic
>> pg_start/stop_backup method work seems infeasible at the moment.
>
> Yes, so should we give up supporting the manual procedure, and extend
> pg_basebackup for the standby-only backup first? I can live with this.

I don't think we should necessarily give up completely. But doing a
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to "get right", given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Florian Pflug
On Oct25, 2011, at 11:13 , Simon Riggs wrote:
> On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs  wrote:
>> We are starting recovery at the right place, but we are initialising
>> the clog and subtrans incorrectly. Specifically, the oldestActiveXid is
>> being derived later than it should be, which can cause problems if
>> this then means that whole pages are uninitialised in subtrans. The bug
>> only shows up if you do enough transactions (2048 is always enough) to
>> move to the next subtrans page between the redo pointer and the
>> checkpoint record, while at the same time we do not have a long-running
>> transaction that spans those two points. That's just enough to happen
>> reasonably frequently on busy systems and yet just enough to have
>> slipped through testing.
>> 
>> We must either:
>>
>> 1. Derive oldestActiveXid before we derive the redo location during
>> CreateCheckpoint()

> (1) looks the best way forwards in all cases.

Let me see if I understand this.

The problem seems to be that we currently derive oldestActiveXid at the end of
the checkpoint, just before writing the checkpoint record. Since we use
oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before
that checkpoint record (but after the REDO location, of course) may very well
contain XIDs earlier than that wrongly derived oldestActiveXid, and if we
attempt to touch these XIDs' SUBTRANS state, we error out.

Your patch seems sensible, because the checkpoint "logically" occurs at the
REDO location, not the checkpoint record's location, so we ought to log an
oldestActiveXID corresponding to that location.

What I don't understand is how this affects the CLOG. How does oldestActiveXID
factor into CLOG initialization?

best regards,
Florian Pflug




Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 00:00, Alexander Korotkov  wrote:
> On Tue, Oct 25, 2011 at 1:46 AM, Kevin Grittner
>  wrote:
>>
>> Alexander Korotkov  wrote:
>>
>> > Couldn't the predicate locking implementation in SSI be used for
>> > in-database cache invalidation?
>>
>> It would not necessarily be limited to *in-database* caches.  The
>> main thing would be to design a good API to the predicate locking
>> portion of SSI, which I think is about 80% of the SSI code.  Dan and
>> I both have an interest in such further use, and there have been
>> others who have talked about potential uses for the non-blocking
>> predicate locking.  I think the API would need to be based around a
>> listen/notify model.

IIRC, I discussed this with Dan Ports at pgcon, as it was one of the
things he had been looking into as well. You might want to talk to him
about it.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander  wrote:
> I don't think we should necessarily give up completely. But doing a
> pg_basebackup way *first* seems reasonable - because it's going to be
> the easiest one to "get right", given that we have more control there.
> Doesn't mean we shouldn't extend it in the future...

Agreed. The question is -- how far should we change pg_basebackup to
"get it right"? I think it's not difficult to change it so that it backs up
the control file at the end. But eliminating the need for full_page_writes=on
seems not easy, no? So I'm not inclined to do that in the first commit, at
least. Otherwise, I'm afraid the patch would become huge.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 13:54, Fujii Masao  wrote:
> On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander  wrote:
>> I don't think we should necessarily give up completely. But doing a
>> pg_basebackup way *first* seems reasonable - because it's going to be
>> the easiest one to "get right", given that we have more control there.
>> Doesn't mean we shouldn't extend it in the future...
>
> Agreed. The question is -- how far should we change pg_basebackup to
> "get it right"? I think it's not difficult to change it so that it backs up
> the control file at the end. But eliminating the need for
> full_page_writes=on seems not easy, no? So I'm not inclined to do that in
> the first commit, at least. Otherwise, I'm afraid the patch would become
> huge.

It's more the server side of base backups than the actual pg_basebackup
tool, of course, but I'm sure that's what we're all referring to here.

Personally, I'd see the fpw stuff as part of the infrastructure
needed. Meaning that the fpw stuff should go in *first*, and the
pg_basebackup stuff later.

If we want something to go in early, that could be as simple as a
version of pg_basebackup that runs against the slave but only if
full_page_writes=on on the master. If it's not, it throws an error.
Then we can improve upon that by adding handling of fpw=off, first by
infrastructure, then by tool.

Doing it piece by piece like that is probably a good idea, since as
you say, all at once will be pretty huge.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Tom Lane
Magnus Hagander  writes:
> 2011/10/25 Shigeru Hanada :
>> I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
>> contrib module.  I think that this module would be the basis of further
>> SQL/MED development for core, e.g. join-push-down and ANALYZE support.

> I have not looked at the code itself, but I wonder if we shouldn't
> consider making this a part of core-proper, not just a contrib module.
> The fact that it isn't *already* available in core surprises a lot of
> people...

We've just spent a whole lot of blood and sweat on making the extension
mechanism work nicely.  I don't understand this urge to not use it.

ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
Once we do that its release schedule will get locked to core's ---
wouldn't it be better to keep flexibility for now, while it's in such
active development?

regards, tom lane



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Magnus Hagander
On Tue, Oct 25, 2011 at 14:08, Tom Lane  wrote:
> Magnus Hagander  writes:
>> 2011/10/25 Shigeru Hanada :
>>> I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
>>> contrib module.  I think that this module would be the basis of further
>>> SQL/MED development for core, e.g. join-push-down and ANALYZE support.
>
>> I have not looked at the code itself, but I wonder if we shouldn't
>> consider making this a part of core-proper, not just a contrib module.
>> The fact that it isn't *already* available in core surprises a lot of
>> people...
>
> We've just spent a whole lot of blood and sweat on making the extension
> mechanism work nicely.  I don't understand this urge to not use it.

We're back to the old discussion, I guess... I'm happy to see it as an
extension, but I think it should be included with the standard
installation. Like we do with, for example, pl/pgsql (which I realize
has a dependency on the backend anyway, so it can't easily be done
another way) and pl/perl (which doesn't, AFAIK, so it's a better
example).


> ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
> Once we do that its release schedule will get locked to core's ---
> wouldn't it be better to keep flexibility for now, while it's in such
> active development?

I would be happy to keep it outside, and integrate it in the final CF
for example :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Kohei KaiGai
>> ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
>> Once we do that its release schedule will get locked to core's ---
>> wouldn't it be better to keep flexibility for now, while it's in such
>> active development?
>
> I would be happy to keep it outside, and integrate it in the final CF
> for example :)
>
Right now, file_fdw is the only FDW module that we have in core;
however, it is inadequate for proving the new FDW concepts that
utilize an external RDBMS, such as join push-down of foreign tables.

I think the pgsql_fdw module should also be included in the core
distribution as a basis for future enhancement, unless we think no
working module is needed when an enhancement of FDW is proposed.

Thanks,
-- 
KaiGai Kohei 



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Stephen Frost
* Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
> Right now, file_fdw is the only FDW module that we have in core;

Erm, guess I'm a bit confused why we've got that in core while not
putting pgsql_fdw in core.  This all gets back to previous discussions
around 'recommended' contrib modules (which should really be installed
by default on the filesystem through the distros, à la Debian's
"recommends:" approach) and 'other' contrib modules.

I'm in favor of making that distinction.  I would still have pgsql_fdw,
file_fdw, etc, be packaged more-or-less the same way and still use the
CREATE EXTENSION framework, of course.

It would be nice if we didn't have to lock the release schedule of those
recommended modules to the core release schedule, or even to each other,
but that's a separate issue, imv.

Thanks,

Stephen




Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Simon Riggs
On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug  wrote:

> What I don't understand is how this affects the CLOG. How does oldestActiveXID
> factor into CLOG initialization?

It is an entirely different error.

Chris' clog error was caused by a file read error. The file was
opened, we did a seek within the file, and then the call to read()
failed to return a complete page from the file.

The xid shown is 22811359, which is the nextXid in the control file.

So StartupClog() must have failed trying to read the clog page from disk.

That isn't a Hot Standby problem or a recovery problem, nor is it
certain that it's a PostgreSQL problem.

OTOH, SlruPhysicalReadPage() does cope gracefully with missing clog
files during recovery, so maybe we can think of a way to make recovery
cope gracefully with a SLRU_READ_FAILED error as well. Any ideas?
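
One possible direction, sketched below: extend the grace that
SlruPhysicalReadPage() already gives to missing files during recovery
(ENOENT is read as zeroes) to short reads as well. A hypothetical
fragment for discussion, not tested code:

    /* inside SlruPhysicalReadPage(), after seeking to the page offset */
    errno = 0;
    if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
    {
        if (InRecovery)
        {
            /*
             * Mirror the existing ENOENT handling: assume WAL replay
             * will recreate the page contents, so zero-fill and go on.
             */
            ereport(LOG,
                    (errmsg("short read of file \"%s\", reading page as zeroes",
                            path)));
            MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
            close(fd);
            return true;
        }
        slru_errcause = SLRU_READ_FAILED;
        slru_errno = errno;
        close(fd);
        return false;
    }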

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Steve Singer

On 11-10-25 02:44 AM, Heikki Linnakangas wrote:
> With pg_basebackup, we have a fighting chance of getting this right,
> because we have more control over how the backup is made. For example,
> we can co-operate with the buffer manager to avoid torn pages,
> eliminating the need for full_page_writes=on, and we can include a
> control file with the correct end-of-backup location automatically,
> without requiring user intervention. pg_basebackup is less flexible
> than the pg_start/stop_backup method, and unfortunately you're more
> likely to need the flexibility in a more complicated setup with a hot
> standby server and all, but making the generic pg_start/stop_backup
> method work seems infeasible at the moment.

Would pg_basebackup be able to work with the buffer manager on the slave
to avoid full_page_writes=on needing to be set on the master? (The point
of this is to be able to take the base backup without having the backup
program contact the master.) If so, could pg_start_backup() not just put
the buffer manager into the same state?







Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Heikki Linnakangas

On 25.10.2011 15:56, Steve Singer wrote:

> On 11-10-25 02:44 AM, Heikki Linnakangas wrote:
>> With pg_basebackup, we have a fighting chance of getting this right,
>> because we have more control over how the backup is made. For example,
>> we can co-operate with the buffer manager to avoid torn pages,
>> eliminating the need for full_page_writes=on, and we can include a
>> control file with the correct end-of-backup location automatically,
>> without requiring user intervention. pg_basebackup is less flexible
>> than the pg_start/stop_backup method, and unfortunately you're more
>> likely to need the flexibility in a more complicated setup with a hot
>> standby server and all, but making the generic pg_start/stop_backup
>> method work seems infeasible at the moment.
>
> Would pg_basebackup be able to work with the buffer manager on the slave
> to avoid full_page_writes=on needing to be set on the master? (The point
> of this is to be able to take the base backup without having the backup
> program contact the master.)


In theory, yes. I'm not sure how difficult it would be in practice.
Currently, the walsender process just scans and copies everything in the
data directory at the filesystem level. It would have to go through the
buffer manager instead, to avoid reading a page at the same time that
the buffer manager is writing it out.



> If so, could pg_start_backup() not just put the buffer manager into the
> same state?


No. The trick that pg_basebackup (= walsender) can do is to co-operate
with the buffer manager when reading each page. An external program
cannot do that.
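
To make the idea concrete, here is a minimal sketch (an assumption about
how it could be structured, not existing walsender code) of copying one
block through shared buffers. A share content lock excludes concurrent
modifiers, so the copied image can never be torn:

    #include "storage/bufmgr.h"

    /* hypothetical helper: copy one block of a relation via shared buffers */
    static void
    copy_block_consistently(Relation rel, BlockNumber blkno, char *dst)
    {
        Buffer      buf;

        /* pulls the page into shared buffers (or finds it there) */
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, NULL);

        /* page contents cannot change under a share content lock */
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        memcpy(dst, BufferGetPage(buf), BLCKSZ);
        UnlockReleaseBuffer(buf);
    }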


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2011-10-25 Thread Marko Kreen
On Tue, Oct 25, 2011 at 3:08 PM, Tom Lane  wrote:
> Magnus Hagander  writes:
>> 2011/10/25 Shigeru Hanada :
>>> I'd like to propose pgsql_fdw, FDW for external PostgreSQL server, as a
>>> contrib module.  I think that this module would be the basis of further
>>> SQL/MED development for core, e.g. join-push-down and ANALYZE support.
>
>> I have not looked at the code itself, but I wonder if we shouldn't
>> consider making this a part of core-proper, not just a contrib module.
>> The fact that it isn't *already* available in core surprises a lot of
>> people...
>
> We've just spent a whole lot of blood and sweat on making the extension
> mechanism work nicely.  I don't understand this urge to not use it.
>
> ATM I'm not sure it's even a good idea to push pgsql_fdw into contrib.
> Once we do that its release schedule will get locked to core's ---
> wouldn't it be better to keep flexibility for now, while it's in such
> active development?

Simple question - do FDW internals need work?

-- 
marko



Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-25 Thread Gurjeet Singh
On Tue, Oct 25, 2011 at 2:33 AM, Fujii Masao  wrote:

> On Tue, Oct 25, 2011 at 2:41 AM, Gurjeet Singh  wrote:
> > On Mon, Oct 24, 2011 at 1:08 PM, Alvaro Herrera
> > <alvhe...@commandprompt.com> wrote:
> >>
> >> Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011:
> >>
> >> > Starting from line 89 of terminal1.txt we see this snippet:
> >> >
> >> > postgres=# rollback;
> >> > ROLLBACK
> >> > postgres=# begin TRANSACTION ;
> >> > BEGIN
> >> > postgres=# set transaction snapshot '02C8-1';
> >> > ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
> >> > postgres=# rollback;
> >> > ROLLBACK
> >> >
> >> >
> >> > As you can see the SET TRANSACTION SNAPSHOT was the first statement
> >> > in that transaction, and yet the ERROR message says that it is not.
> >>
> >> Maybe the tab-completion feature issued a query before the set
> >> transaction command.
> >
> > I have tried reproducing the bug starting from 1 and 2 transactions
> > before the one shown in the snippet, and I used tab-completion to get
> > the same screen output as terminal1.txt, and yet it's not reproducible.
>
> I could reproduce it when I typed TAB just after typing "set" in "set
> transaction snapshot".
> As Tom and Alvaro pointed out, the tab-completion issues a query, which
> prevents the "set transaction snapshot" command.
>

Great! That settles it then. Reproducible, but not a bug.

Thanks,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


[HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me. 

db=# select to_tsquery('english', 's:*');
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
 to_tsquery 
------------
 
(1 row)

db=# select to_tsquery('simple', 's:*');
 to_tsquery 
------------
 's':*
(1 row)


I also think that this is a mistake. It should only be highlighting "s".
db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
  ts_headline   
----------------
 <b>sushant</b>


Thanks,
Sushant.




Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Florian Pflug
On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
> I am currently using the prefix search feature in text search. I find
> that the prefix characters are treated the same as a normal lexeme and
> passed through stemming and stopword dictionaries. This seems like a bug
> to me.

Hm, I don't think so. If they don't pass through stopword dictionaries,
then queries containing stopwords will fail to find any rows - which is
probably not what one would expect.

Here's an example:

Query for records containing the* and car*. The @@-operator returns true,
because the stopword is removed from both the tsvector and the tsquery
(the 'english' dictionary drops 'these' as a stopword and stems 'cars' to
'car'. Both the tsvector and the query end up being just 'car'.)

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('english', 
'the:* & car:*');
 ?column? 
----------
 t
(1 row)

Here's what happens when stopwords aren't removed from the query
(now the tsvector ends up being 'car', but the query is 'the:* & car:*'):

postgres=# select to_tsvector('english', 'these cars') @@ to_tsquery('simple', 
'the:* & car:*');
 ?column? 
----------
 f
(1 row)

best regards,
Florian Pflug




[HACKERS] isolationtester's "dry run" mode

2011-10-25 Thread Alvaro Herrera
This patch adds a -n option to isolationtester.  With this option,
instead of running the tests, it prints the permutations that would be
run.  This is very useful for generating test specs which need to remove
permutations that are not valid (because they would run a command on a
blocked session).

Objections?

Sample run:

$ ./isolationtester -n < ~-/specs/classroom-scheduling.spec
permutation "rx1" "wy1" "c1" "ry2" "wx2" "c2"
permutation "rx1" "wy1" "ry2" "c1" "wx2" "c2"
permutation "rx1" "wy1" "ry2" "wx2" "c1" "c2"
permutation "rx1" "wy1" "ry2" "wx2" "c2" "c1"
permutation "rx1" "ry2" "wy1" "c1" "wx2" "c2"
permutation "rx1" "ry2" "wy1" "wx2" "c1" "c2"
permutation "rx1" "ry2" "wy1" "wx2" "c2" "c1"
permutation "rx1" "ry2" "wx2" "wy1" "c1" "c2"
permutation "rx1" "ry2" "wx2" "wy1" "c2" "c1"
permutation "rx1" "ry2" "wx2" "c2" "wy1" "c1"
permutation "ry2" "rx1" "wy1" "c1" "wx2" "c2"
permutation "ry2" "rx1" "wy1" "wx2" "c1" "c2"
permutation "ry2" "rx1" "wy1" "wx2" "c2" "c1"
permutation "ry2" "rx1" "wx2" "wy1" "c1" "c2"
permutation "ry2" "rx1" "wx2" "wy1" "c2" "c1"
permutation "ry2" "rx1" "wx2" "c2" "wy1" "c1"
permutation "ry2" "wx2" "rx1" "wy1" "c1" "c2"
permutation "ry2" "wx2" "rx1" "wy1" "c2" "c1"
permutation "ry2" "wx2" "rx1" "c2" "wy1" "c1"
permutation "ry2" "wx2" "c2" "rx1" "wy1" "c1"


-- 
Álvaro Herrera 


isotester-dryrun.patch
Description: Binary data



Re: [HACKERS] isolationtester's "dry run" mode

2011-10-25 Thread Alvaro Herrera

Excerpts from Alvaro Herrera's message of mar oct 25 13:32:31 -0300 2011:
> This patch adds a -n option to isolationtester.  With this option,
> instead of running the tests, it prints the permutations that would be
> run.  This is very useful for generating test specs which need to remove
> permutations that are not valid (because they would run a command on a
> blocked session).

Forgot to credit the author: this patch was developed by Alexander
Shulgin.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Range Types - typo + NULL string constructor

2011-10-25 Thread Jeff Davis
On Mon, 2011-10-24 at 13:15 +0300, Heikki Linnakangas wrote:
> Hmm, I don't think that's safe. After Oid wraparound, a range type oid 
> might get reused for some other range type, and the cache would return 
> stale values. Extremely unlikely to happen by accident, but could be 
> exploited by an attacker.
> 

Any ideas on how to remedy that? I don't have another plan for making it
perform well. Plugging it into the cache invalidation mechanism seems
like overkill, but I suppose that would solve the problem.

Aren't there a few other cases like this floating around the code? I
know the single-xid cache is potentially vulnerable to xid wraparound
for the same reason.
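
FWIW, plugging into the invalidation mechanism could be fairly small if
the policy is simply "flush the whole cache on any pg_type change". A
sketch, with hypothetical cache names and the syscache-callback
signature from utils/inval.h:

    #include "utils/hsearch.h"
    #include "utils/inval.h"
    #include "utils/syscache.h"

    static HTAB *range_type_cache = NULL;   /* hypothetical cache */

    /* called whenever any pg_type syscache entry is invalidated */
    static void
    range_type_cache_inval(Datum arg, int cacheid, uint32 hashvalue)
    {
        /* crude but safe: forget everything; rebuild entries on demand */
        if (range_type_cache != NULL)
        {
            hash_destroy(range_type_cache);
            range_type_cache = NULL;
        }
    }

    /* registered once per backend, e.g. when the cache is first built: */
    CacheRegisterSyscacheCallback(TYPEOID, range_type_cache_inval, (Datum) 0);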

Regards,
Jeff Davis




Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
> > I am currently using the prefix search feature in text search. I find
> > that the prefix characters are treated the same as a normal lexeme and
> > passed through stemming and stopword dictionaries. This seems like a bug
> > to me.
> 
> Hm, I don't think so. If they don't pass through stopword dictionaries,
> then queries containing stopwords will fail to find any rows - which is
> probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure
that when someone says to_tsquery('english', 's:*') one is looking for
an entry that has a *non-stopword* word that starts with 's'. And
especially so in a text search configuration that eliminates stop words.

Does it even make sense to stem, abbreviate, or apply synonyms to a few
letters? It would be so unpredictable.

-Sushant.




Re: [HACKERS] patch for distinguishing PG instances in event log v2

2011-10-25 Thread Magnus Hagander
2011/7/16 MauMau :
> Hello,
>
> The attached file is a revised patch which reflects all review comments by
> Magnus in:
>
> http://archives.postgresql.org/pgsql-hackers/2011-07/msg00839.php
>
> I made sure the previous tests (both custom and default "PostgreSQL" event
> source) succeeded.
>
> I'm submitting this to the currently open CommitFest 2011-9 shortly. Please
> review it again.

I have applied this patch after another round of rather extensive modifications.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2011-10-25 Thread Alexander Korotkov
On Mon, Oct 24, 2011 at 3:05 PM, Alexander Korotkov wrote:

> If we allow a user to specify their own gist_penalty function, then such a
> function should deal with:
> 1) GiST-specific data structures such as GISTENTRY.
> 2) Decomposing ranges using range_deserialize.
> 3) Infinities, which we could handle in general penalty functions.
> That's why I prefer to implement subtype_diff.
>
I forgot another argument for having subtype_diff:
4) In my picksplit algorithm it would be more natural to use subtype_diff
for measuring overlap than to use the penalty function.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations

2011-10-25 Thread Erik Rijkers
On Tue, October 25, 2011 19:49, Kerem Kat wrote:
> On Mon, Oct 24, 2011 at 20:52, Erik Rijkers  wrote:
>> On Wed, October 19, 2011 15:01, Kerem Kat wrote:
>>> Adding CORRESPONDING to Set Operations
> I have corrected the ordering of the targetlists of the subqueries, and
> added 12 regression tests for column list ordering. Can you confirm that
> the order has changed for you?
>

Yes, this one is OK.

thanks,

Erik Rijkers





[HACKERS] Firing order of RI triggers

2011-10-25 Thread Tom Lane
I've looked into the cause of bug #6268,
http://archives.postgresql.org/pgsql-bugs/2011-10/msg00223.php

It's fairly simple: we're firing RI triggers in the wrong order.

What's happening is that we update the tuple and queue RI_FKey_check_upd
and RI_FKey_cascade_upd events for the update action, *in that order*.
When RI_FKey_check_upd runs, it checks things and quite properly
complains that there's no matching PK, since the row that used to have
the matching PK is now obsolete.

Had RI_FKey_cascade_upd fired first, all would have been well,
because when RI_FKey_check_upd fired for this particular update, it
would've seen the new tuple is already obsolete and done nothing.

The reason they fire in the wrong order is that triggers for a single
event are fired in name order, and the names being used are things like
RI_ConstraintTrigger_53569.  Most of the time, the trigger with higher
OID is going to sort last ... and createForeignKeyTriggers creates the
check triggers before the action triggers.

You might wonder why this doesn't mean that all self-referential foreign
key situations are broken all the time.  Well, the answer is that the
problem is usually masked by the optimization that avoids firing a check
trigger at all if the referencing field didn't change --- see
AfterTriggerSaveEvent.  In the test case given in the bug, the first
UPDATE within the transaction doesn't see the problem because of this.
But in the second UPDATE of the same row, that optimization is disabled,
so the check trigger fires and fails.

As far as I can see, the only practical way to fix this is to change the
names given to RI triggers so that cascade actions will fire before
check triggers.  Just changing the order of creation would fix it 99.99%
of the time, but fail on the times when the first trigger had OID 99999
and the second OID 100000, for example.  And I definitely don't think
we want to mess with the general rule that triggers fire in name order.

I'm thinking we could do "RI_ConstraintTrigger_a_<oid>" for "action"
triggers and "RI_ConstraintTrigger_c_<oid>" for "checking" triggers,
and then the names would be guaranteed to sort correctly.
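
Concretely, the change in the trigger-creation code could be as small as
this sketch ('a' sorts before 'c', so action triggers always fire first;
trigger_oid stands in for whatever the real code uses):

    char    trigname[NAMEDATALEN];

    /* action (cascade/set null/etc.) triggers */
    snprintf(trigname, sizeof(trigname),
             "RI_ConstraintTrigger_a_%u", trigger_oid);

    /* check triggers */
    snprintf(trigname, sizeof(trigname),
             "RI_ConstraintTrigger_c_%u", trigger_oid);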

I'm not sure if this is something we can back-patch --- I don't see any
dependencies in our own code on what names RI triggers have, but I'm
afraid there is client-side code out there that knows it.  In any case,
changing the name assignments would not fix things for existing
triggers; but if we did back-patch then any affected users could just
drop and re-create the problematic FK constraint.  Or maybe we could
back-patch a change in creation order and rely on that usually working.
Given the lack of prior complaints that might be good enough.

Comments?

regards, tom lane



Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Florian Pflug
On Oct25, 2011, at 18:47 , Sushant Sinha wrote:
> On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
>> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
>>> I am currently using the prefix search feature in text search. I find
>>> that the prefix characters are treated the same as a normal lexeme and
>>> passed through stemming and stopword dictionaries. This seems like a bug
>>> to me.
>> 
>> Hm, I don't think so. If they don't pass through stopword dictionaries,
>> then queries containing stopwords will fail to find any rows - which is
>> probably not what one would expect.
> 
> I think what you are calling a feature is really a bug. I am fairly sure
> that when someone says to_tsquery('english', 's:*') one is looking for
> an entry that has a *non-stopword* word that starts with 's'. And
> especially so in a text search configuration that eliminates stop words.

But the whole idea of removing stopwords from the query is that users
*don't* need to be aware of the precise list of stopwords. The way I see
it, stopwords are simply an optimization that helps reduce the size of
your fulltext index.

Assume, for example, that the postgres mailing list archive search used
tsearch (which I think it does, but I'm not sure). It'd then probably make
sense to add "postgres" to the list of stopwords, because it's bound to
appear in nearly every mail. But would you want searches which include
'postgres*' to turn up empty? Quite certainly not.

> Does it even make sense to stem, abbreviate, or apply synonyms to a few
> letters? It would be so unpredictable.

That depends on the language. In German (my native tongue), one can
concatenate nouns to form new nouns. It's thus not entirely unreasonable
that one would want the prefix to be stemmed to its singular form before
being matched.

Also, suppose you're using a dictionary which corrects common typos. Who
says you wouldn't want that to be applied to prefix queries?

best regards,
Florian Pflug




Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations

2011-10-25 Thread Kerem Kat
On Mon, Oct 24, 2011 at 20:52, Erik Rijkers  wrote:
> On Wed, October 19, 2011 15:01, Kerem Kat wrote:
>> Adding CORRESPONDING to Set Operations
>> Initial patch, filename: corresponding_clause_v2.patch
>
> I had a quick look at the behaviour of this patch.
>
> Btw, the examples in your email were typoed (one select is missing):
>
>> SELECT 1 a, 2 b, 3 c UNION CORRESPONDING 4 b, 5 d, 6 c, 7 f;
> should be:
>  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING select 4 b, 5 d, 6 c, 7 f;
>
> and
>
>> SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) 4 b, 5 d, 6 c, 7 f;
> should be:
>  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) select 4 b, 5 d, 6 c, 7 f;
>>

Yes you are correct, mea culpa.

>
>
>
> But there is also a small bug, I think: the order in the CORRESPONDING BY
> list should be followed, according to the standard (foundation, p. 408):
>
> "2) If <corresponding column list> is specified, then let SL be a <select
> list> of those <column name>s explicitly appearing in the <corresponding
> column list> in the order that these <column name>s appear in the
> <corresponding column list>. Every <column name> in the <corresponding
> column list> shall be a <column name> of both T1 and T2."
>
> That would make this wrong, I think:
>
> SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(c,b) select 5 d, 6 c, 7 f, 4 b ;
>
>  b | c
> ---+---
>  2 | 3
>  4 | 6
> (2 rows)
>
> i.e., I think it should show columns in the order c, b (and not b, c); the 
> order of the
> CORRESPONDING BY phrase.
>
> (but maybe I'm misreading the text of the standard; I find it often difficult 
> to follow)
>

It wasn't a misread; I checked the draft, and in my version the same
explanation is at p. 410.
I have corrected the ordering of the targetlists of the subqueries, and
added 12 regression tests for column list ordering. Can you confirm that
the order has changed for you?


>
> Thanks,
>
>
> Erik Rijkers
>
>

Regards,

Kerem KAT
[Inline patch attachment, truncated and garbled in the archive. The
readable parts add CORRESPONDING [BY (column list)] to the set-operation
syntax and documentation in doc/src/sgml/queries.sgml and
doc/src/sgml/sql.sgml, and add correspondingClause/correspondingColumns
fields to the node copy/compare support in src/backend/nodes/copyfuncs.c
and src/backend/nodes/equalfuncs.c.]

Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Chris Redekop
> Chris, can you rearrange the backup so you copy the pg_control file as
> the first act after the pg_start_backup?

I tried this and it doesn't seem to make any difference.  I also tried the
patch, and I can no longer reproduce the subtrans error; however, instead it
now starts up but never gets to the point where it'll accept connections.
It starts up, but if I try to do anything I always get "FATAL: the database
system is starting up"...even if the load is removed from the primary, the
standby still never finishes "starting up".  Attached below is a log of one
of these startup attempts.  In my testing with the patch applied, approx 3
in 10 attempts start up successfully and 7 in 10 attempts go into the "db
is starting up" state...the pg_clog error is still there, but seems much
harder to reproduce now...I've seen it only once since applying the patch
(out of probably 50 or 60 under-load startup attempts).  It does seem to be
"moody" like that tho...it will be very difficult to reproduce for a while,
and then it will happen damn-near every time for a while...weirdness.

On a bit of a side note, I've been thinking of changing my scripts so that
they perform an initial rsync prior to doing the
startbackup-rsync-stopbackup, just so that the second rsync will be
faster...so that the backup is in progress for a shorter period of time,
as while it is running it will stop other standbys from starting up...this
shouldn't cause any issues, eh?


2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG:  database system was
interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG:  creating missing WAL
directory "pg_xlog/archive_status"
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG:  entering standby mode
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 2/CF00
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG:  streaming replication
successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG:  checkpoint record is at
2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG:  redo record is at
2/CF20; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG:  next transaction ID:
0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG:  next MultiXactId: 839;
next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG:  oldest unfrozen
transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG:  transaction ID wrap limit
is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG:  resetting unlogged
relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG:  initializing for hot
standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG:  consistent recovery state
reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG:  redo starts at 2/CF20
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG:  consistent state delayed
because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT:  xlog redo  running
xacts:
nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130
xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344
4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354
4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364
4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517
4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527
4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537
4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388
4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398
4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408
4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418
4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586
4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596
4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606
4634607;
 subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL:  the database system is
starting up
DEBUG:  standby "sync_rep_test" has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL:  the database system is
starting up
and so on...


On Tue, Oct 25, 2011 at 6:51 AM, Simon Riggs  wrote:

> On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug  wrote:
>
> > What I don't understand is how this affects the CLOG. How does
> > oldestActiveXID factor into CLOG initialization?

Re: [HACKERS] Firing order of RI triggers

2011-10-25 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mar oct 25 14:57:43 -0300 2011:

> I'm not sure if this is something we can back-patch --- I don't see any
> dependencies in our own code on what names RI triggers have, but I'm
> afraid there is client-side code out there that knows it.

Yeah, sounds possible.

> In any case,
> changing the name assignments would not fix things for existing
> triggers; but if we did back-patch then any affected users could just
> drop and re-create the problematic FK constraint.  Or maybe we could
> back-patch a change in creation order and rely on that usually working.
> Given the lack of prior complaints that might be good enough.

The latter looks reasonable ... particularly if the symptoms of a
botched order would be immediately visible -- the user could just drop
and reload the constraints to fix the order in the very unlikely case
that they are reversed.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Firing order of RI triggers

2011-10-25 Thread Tom Lane
Alvaro Herrera  writes:
> Excerpts from Tom Lane's message of mar oct 25 14:57:43 -0300 2011:
>> ...  Or maybe we could
>> back-patch a change in creation order and rely on that usually working.
>> Given the lack of prior complaints that might be good enough.

> The latter looks reasonable ... particularly if the symptoms of a
> botched order would be immediately visible -- the user could just drop
> and reload the constraints to fix the order in the very unlikely case
> that they are reversed.

Well, the symptoms would probably be just like in the bug report: you'd
get unexpected failures from double updates of a self-referential row in
a single transaction.  That's a sufficiently weird corner case that most
people probably wouldn't exercise it right away.  But given that this
problem has been there from day one and nobody noticed before, I'm not
too concerned about the intersection of people who have an issue and
people who are unlucky enough to get an end-of-decade trigger OID.
I think a 100% solution in HEAD and a 99.99% solution in the back
branches should be good enough.
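
To see why an end-of-decade trigger OID matters: triggers at the same
firing point fire in alphabetical name order, and RI trigger names embed
the OID as text, so the name comparison disagrees with numeric OID order
exactly when the OID crosses a power-of-ten boundary. A minimal sketch
(the names below are illustrative):

   SELECT 'RI_ConstraintTrigger_99999' < 'RI_ConstraintTrigger_100000';
   -- false under any common collation: text comparison sees '9' > '1'
   SELECT 99999 < 100000;
   -- true: the numeric order that creation order follows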

regards, tom lane



Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:

> Assume, for example, that the postgres mailing list archive search used
> tsearch (which I think it does, but I'm not sure). It'd then probably make
> sense to add "postgres" to the list of stopwords, because it's bound to 
> appear in nearly every mail. But wouldn't you want searched which include
> 'postgres*' to turn up empty? Quite certainly not.

That improves recall for the "postgres:*" query, but it certainly doesn't
help other queries like "post:*". More importantly, it affects precision
for all queries like "a:*", "an:*", "and:*", "s:*", "t:*", "the:*", etc.
(and when such a term is the only one searched, it also affects recall,
since no row matches an empty tsquery). Since stopwords are typically
short, prefix search on just a few characters becomes meaningless -- and
I would argue that is exactly when prefix search matters most: when you
only know the first few characters.
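
A minimal sketch of the empty-tsquery effect, assuming the default
'english' configuration:

   SELECT to_tsquery('english', 'the:*');
   -- emits a notice that the query contains only stop words and
   -- yields an empty tsquery
   SELECT to_tsvector('english', 'theory of types')
          @@ to_tsquery('english', 'the:*');
   -- false, although the prefix 'the' would otherwise match the
   -- lexeme 'theori'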


-Sushant.








Re: [HACKERS] Update on documentation builds on OSX w/ macports

2011-10-25 Thread Florian Pflug
On Oct20, 2011, at 05:07 , Dan Ports wrote:
> On Thu, Oct 20, 2011 at 02:02:09AM +0200, Florian Pflug wrote:
>> I've patched the ports for openjade, iso8879 and docbook-dsssl,
>> and added a new port for docbook-sgml-4.2. These patches are sitting
>> in the macports trac now, waiting to be applied.
> 
> I'll try to take a look at them in the next couple days (with my
> MacPorts hat on), unless someone beats me to it.

And an effective hat that was! Thanks for cleaning up and applying
all of those patches, Dan!

Should someone want to try this out, beware that clang (at least
the version shipped with Xcode 4.2) doesn't like openjade. Also,
universal builds of openjade seem to be broken. On my machine, I
thus need to say

  port install openjade -universal configure.compiler=llvm-gcc-4.2

instead of just

  port install openjade

I've submitted a patch for that, so hopefully that problem too will
go away in the near future.

Anyway, after having installed openjade, a simple

  port install docbook-sgml-4.2

should be enough to make our documentation build work.

best regards,
Florian Pflug




[HACKERS] TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

2011-10-25 Thread Tom Lane
I believe I have reproduced the behavior described by Andrew Hammond in
http://archives.postgresql.org/pgsql-general/2011-10/msg00928.php

This is using the regression database:

1. In session 1, do
set default_statistics_target TO 10000;
analyze tenk1;
(We need the large stats target to ensure that tenk1's pg_statistic
entries require toasting.)

2. Attach to session 1 with a debugger and set a breakpoint at
CommitTransaction's call to CallXactCallbacks (or anyplace after
ProcArrayEndTransaction and before AtEOXact_Inval).

3. In session 2, do

select count(*) from tenk1 where fivethous < 2500;

(This loads up session 2's syscaches with toasted pg_statistic entries.)

4. In session 1, again do

analyze tenk1;

and wait for it to stop at the breakpoint.

5. In session 3 (or you can use session 2 for this), do
 vacuum verbose pg_statistic;
You should see it removing toast entries that were generated in step 1
and obsoleted in step 4.

6. In session 2, again do

select count(*) from tenk1 where fivethous < 2500;

and voila:

ERROR:  missing chunk number 0 for toast value 53668 in pg_toast_2619
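
As a sanity check on the setup, one can verify that the statistics rows
really are toasted; an illustrative query (2619 is pg_statistic's OID, as
the error message above shows):

   SELECT count(*) FROM pg_toast.pg_toast_2619;
   -- nonzero after the ANALYZE in step 1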

What has happened here is that the second ANALYZE has marked itself
committed in pg_clog and no longer running in the ProcArray, so VACUUM
feels entitled to remove toast tuples that the ANALYZE deleted.  However,
the ANALYZE has not yet sent out the sinval messages that would inform
session 2 that its syscache entries are obsolete.  In Andrew's report,
presumably the machine was under enough load to slow down ANALYZE at
just this point, and there was a concurrent autovacuum that would have
done the rest of the deed.  The problem could only be seen for a short
interval, which squares with his report, and with a similar one from
Tim Uckun back in September.

Ordinarily, sending out sinval messages post-commit is okay because we
don't release locks until after that, and we suppose that our locks
prevent any other transactions from getting to the point of using
syscache entries that might have been invalidated by our transaction.
However, *we have carefully hacked on ANALYZE until it doesn't take any
locks that would block concurrent queries on the analyzed table.*  So
the normal protection against stale syscache entries simply doesn't
work for pg_statistic fetches.

I'm not sure about a good way to fix this.  When we last dealt with a
similar failure, Heikki suggested that we forcibly detoast all fields in
a tuple that we're putting into the syscaches:
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00661.php
I don't much like that, though, as it seems expensive, and I'm worried
about possible circularity of needing to know about all toastable fields
while making a syscache entry, and anyway it's papering over a symptom
rather than solving the actual problem that we're relying on a stale
syscache entry.

We could fix it by not using a syscache anymore for pg_statistic
entries, but that's probably not acceptable from a performance
standpoint.

A clean fix would be to add locking that blocks would-be users of
pg_statistic entries when an ANALYZE is about to commit.  This isn't
much fun from a performance standpoint either, but at least it should be
relatively cheap most of the time.

Thoughts?

regards, tom lane



[HACKERS] isolationtester and invalid permutations

2011-10-25 Thread Alvaro Herrera
Instead of simply aborting a spec that specifies running commands on
blocked sessions (what we call an invalid permutation), it seems more
useful to report the problem, clean up the sessions, and continue with
the next permutation.

This, in conjunction with the dry-run patch I submitted earlier, makes
it easier to determine a working spec: dry-run the spec; copy the
so-generated permutation lines into the spec; run the spec normally,
which reports the invalid permutations; comment out the invalid
permutations from the spec; done.

The attached patch, again from Alexander Shulgin (with some tweaks from
me), does that.

Comments?

-- 
Álvaro Herrera 


isotester-allow-invalid-perms.patch
Description: Binary data



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Chris Redekop
>
> That isn't a Hot Standby problem, a recovery problem, nor is it certain
> it's a PostgreSQL problem.
>
Do you have any theories on this that I could help investigate?  It happens
even when using pg_basebackup, and it persists until another sync is
performed, so the files must be in some state that it can't recover
from. Without understanding the internals, just viewing from an
outside perspective, I don't really see how this could not be a PostgreSQL
problem.


Re: [HACKERS] Online base backup from the hot-standby

2011-10-25 Thread Fujii Masao
On Tue, Oct 25, 2011 at 9:03 PM, Magnus Hagander  wrote:
> On Tue, Oct 25, 2011 at 13:54, Fujii Masao  wrote:
>> On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander  wrote:
>>> I don't think we should necessarily give up completely. But doing a
>>> pg_basebackup way *first* seems reasonable - because it's going to be
>>> the easiest one to "get right", given that we have more control there.
>>> Doesn't mean we shouldn't extend it in the future...
>>
>> Agreed. The question is -- how far should we change pg_basebackup to
>> "get right"? I think it's not difficult to change it so that it backs up
>> the control file at the end. But eliminating the need for full_page_writes=on
>> seems not easy. No? So I'm not inclined to do that in at least first commit.
>> Otherwise, I'm afraid the patch would become huge.
>
> It's more server side of base backups than the actual pg_basebackup
> tool of course, but I'm sure that's what we're all referring to here.
>
> Personally, I'd see the fpw stuff as part of the infrastructure
> needed. Meaning that the fpw stuff should go in *first*, and the
> pg_basebackup stuff later.

Agreed. I'll extract the FPW stuff from the patch that I submitted and
revise it as an infrastructure patch.

The changes to pg_start_backup() etc. that Ishiduka-san made are also
server-side infrastructure. I will extract them as another infrastructure
patch.

Ishiduka-san, if you have time, feel free to try the above, barring objection.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

2011-10-25 Thread Robert Haas
On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane  wrote:
> What has happened here is that the second ANALYZE has marked itself
> committed in pg_clog and no longer running in the ProcArray, so VACUUM
> feels entitled to remove toast tuples that the ANALYZE deleted.  However,
> the ANALYZE has not yet sent out the sinval messages that would inform
> session 2 that its syscache entries are obsolete.  In Andrew's report,
> presumably the machine was under enough load to slow down ANALYZE at
> just this point, and there was a concurrent autovacuum that would have
> done the rest of the deed.  The problem could only be seen for a short
> interval, which squares with his report, and with a similar one from
> Tim Uckun back in September.
>
> Ordinarily, sending out sinval messages post-commit is okay because we
> don't release locks until after that, and we suppose that our locks
> prevent any other transactions from getting to the point of using
> syscache entries that might have been invalidated by our transaction.
> However, *we have carefully hacked on ANALYZE until it doesn't take any
> locks that would block concurrent queries on the analyzed table.*  So
> the normal protection against stale syscache entries simply doesn't
> work for pg_statistic fetches.

This is very similar to one of the issues that reared its ugly head in
regards to Simon's now-reverted patch to lower DDL locking strength.
You identified some other issues there as well, but *one* of the
issues was that, as in this case, the sinval mechanism fails to
provide the necessary synchronization guarantees unless the lock
required to reread the updated data conflicts with the lock required
to change the data.  In that case, "the data" meant "the pg_class
entry" or "the pg_attribute" entry whereas here it means "the
pg_statistic entry", but I believe the principle is the same.  And
there as here, (1) there is a fundamental conflict between what the
sinval mechanism requires for correctness and what is actually
desirable in terms of lock levels from a user experience point of view
and (2) it is relatively easy to write code that looks superficially
safe but which actually contains subtle race conditions.  IIRC, you
never thought Simon's patch looked safe, but I'm guessing that this
pg_statistic bug has been around for a long time.

So I'm wondering if we ought to rethink our position that users of the
sinval machinery must provide their own external synchronization
through heavyweight locking, and instead build the synchronization
into the sinval mechanism itself.  One idea I had was to include the
XID of the sending transaction in every sinval message,
and to force clients receiving a message to do XactLockTableWait() for
each such XID.  That would force the backend reloading its cache to
wait until the committing transaction reaches the lock-release phase.
If we sent out sinval messages just before removing ourselves from the
ProcArray, I think that would more-or-less fix this bug (although
maybe I'm missing some reason why it's not practical to send them that
early) except that I don't see any way to handle the sinval-reset
case, which seems to more or less kill this idea in its tracks.

But maybe there's some other mechanism whereby we could combine
sending the sinval messages slightly earlier (before
ProcArrayEndTransaction) with blocking anyone who processes those
messages until after the committing backend finishes
ProcArrayEndTransaction.  For example, you could add an additional
LWLock, which has to be held in exclusive mode by a committing
transaction that sends any sinval messages.  It must be acquired
before sending the sinval messages and can't be released until after
ProcArrayEndTransaction() is complete.  Anyone processing a sinval
message must acquire and release the lock in shared mode before
reloading their caches, so that we guarantee that at the time you
reread the catalogs, any transactions involved in sending those
messages are visible.

That's actually a bit coarse-grained; there's probably a better
mechanism, but I'm just throwing this out to see if the basic idea has
any legs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

2011-10-25 Thread Tom Lane
Robert Haas  writes:
> On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane  wrote:
>> Ordinarily, sending out sinval messages post-commit is okay because we
>> don't release locks until after that, and we suppose that our locks
>> prevent any other transactions from getting to the point of using
>> syscache entries that might have been invalidated by our transaction.
>> However, *we have carefully hacked on ANALYZE until it doesn't take any
>> locks that would block concurrent queries on the analyzed table.*  So
>> the normal protection against stale syscache entries simply doesn't
>> work for pg_statistic fetches.

> This is very similar to one of the issues that reared its ugly head in
> regards to Simon's now-reverted patch to lower DDL locking strength.
> You identified some other issues there as well, but *one* of the
> issues was that, as in this case, the sinval mechanism fails to
> provide the necessary synchronization guarantees unless the lock
> required to reread the updated data conflicts with the lock required
> to change the data.

Right.  We may take as little as AccessShareLock on a relation before
examining its pg_statistic entries, and ANALYZE isn't taking anything
that would block that.

> So I'm wondering if we ought to rethink our position that users of the
> sinval machinery must provide their own external synchronization
> through heavyweight locking, and instead build the synchronization
> into the sinval mechanism itself.

Yeah, it's starting to feel like we need a basic redesign of sinval
... although I'd not care to back-patch that, so we also need to think
of a sane solution for the back branches.

> If we sent out sinval messages just before removing ourselves from the
> ProcArray, I think that would more-or-less fix this bug (although
> maybe I'm missing some reason why it's not practical to send them that
> early) except that I don't see any way to handle the sinval-reset
> case, which seems to more or less kill this idea in its tracks.

The other reason that doesn't work is there's a race condition: someone
might load their cache entry immediately after the sinval message went
past, but before the updating transaction commits.

> But maybe there's some other mechanism whereby we could combine
> sending the sinval messages slightly earlier (before
> ProcArrayEndTransaction) with blocking anyone who processes those
> messages until after the committing backend finishes
> ProcArrayEndTransaction.  For example, you could add an additional
> LWLock, which has to be held in exclusive mode by a committing
> transaction that sends any sinval messages.

Doesn't sound very scalable :-(.

Even given your recent changes to reduce the overhead of checking for
sinval messages, I'm not sure that it'd be practical to move the sinval
message processing to just-before-we-look-up-a-cache-entry.  Right now,
we do AcceptInvalidationMessages basically once per table per query
(or maybe it's twice or so, but anyway a very small multiplier on that).
If we try to do it every time through SearchSysCache, we are probably
talking two to three orders of magnitude more checks, which ISTM is
certain to push the sinval queue back up to the top of the heap for
contention.

But in any case, this isn't the core of the problem.  The real point
here is that we need a guarantee that a syscache entry we're going to
use is/was valid as of some suitable time point later than the start of
our current transaction.  (Once we have taken a snapshot, VACUUM will
know that it can't remove any tuples that were deleted after the time of
that snapshot; so even for SnapshotNow fetches, it's important to have
an MVCC snapshot to protect toast-table dereferences.)  Perhaps rather
than tying the problem into SearchSysCache, we should attach the
overhead to GetTransactionSnapshot, which is called appealingly few
times per query.  But right offhand it seems like that only protects us
against the toast-tuple-deletion problem, not against the more general
one of getting a stale view of the status of some relation.

regards, tom lane
