Re: [HACKERS] GiST insert algorithm rewrite

2011-01-09 Thread Heikki Linnakangas

On 09.01.2011 07:05, Tom Lane wrote:

I just found out that the benchmark test script in
contrib/intarray/bench/ crashes HEAD in gistdoinsert() --- it looks like
it's trying to pop to a stack entry that isn't there.

Run it per the instructions in the intarray documentation:

$ createdb TEST
$ psql TEST < ../_int.sql
...
$ ./create_test.pl | psql TEST
CREATE TABLE
CREATE TABLE
CREATE INDEX
CREATE INDEX
server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
connection to server was lost

The script generates randomized data, so possibly it won't fail every
time, but it failed three out of three times for me.  The changes I'm
about to commit in intarray don't seem to make any difference.


Thanks, fixed. Apparently my testing never hit the case where an update, 
rather than an insert, into the root page causes it to split.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST insert algorithm rewrite

2011-01-08 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 21.12.2010 20:00, Heikki Linnakangas wrote:
 One final version, with a bug fix wrt. root page split and some cleanup.
 I'm planning to commit this before Christmas. It's a big patch, so
 review would be much appreciated.

 Committed. Phew! Review & testing is of course still appreciated, given 
 how big and complicated this was.

I just found out that the benchmark test script in
contrib/intarray/bench/ crashes HEAD in gistdoinsert() --- it looks like
it's trying to pop to a stack entry that isn't there.

Run it per the instructions in the intarray documentation:

$ createdb TEST
$ psql TEST < ../_int.sql
...
$ ./create_test.pl | psql TEST
CREATE TABLE
CREATE TABLE
CREATE INDEX
CREATE INDEX
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

The script generates randomized data, so possibly it won't fail every
time, but it failed three out of three times for me.  The changes I'm
about to commit in intarray don't seem to make any difference.

regards, tom lane



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-23 Thread Heikki Linnakangas

On 21.12.2010 20:00, Heikki Linnakangas wrote:

One final version, with a bug fix wrt. root page split and some cleanup.
I'm planning to commit this before Christmas. It's a big patch, so
review would be much appreciated.


Committed. Phew! Review & testing is of course still appreciated, given 
how big and complicated this was.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Robert Haas
On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Attached is an updated patch, but that issue with limited number of backup
 blocks needs to be resolved. The straightforward way would be to change the
 WAL format to increase the limit.

Eh, is that going to bloat WAL?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Heikki Linnakangas

On 13.12.2010 15:04, Robert Haas wrote:

On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Attached is an updated patch, but that issue with limited number of backup
blocks needs to be resolved. The straightforward way would be to change the
WAL format to increase the limit.


Eh, is that going to bloat WAL?


Nah. The way a split now works is:

1. Split the page. Write a WAL record with the contents of the page halves.

2. Insert the downlink pointers in the parent, and set the NSN and clear 
F_FOLLOW_RIGHT flags on the child pages. A 2nd WAL record is written for 
this.


In this new patch version, at step 2, the 2nd WAL record updates the 
LSNs and takes full-page-images of the child pages if necessary. 
Previous patches overlooked that. Usually a full page image won't be 
necessary, because we just wrote the page-split WAL record at step 1 for 
those pages. It's only needed if a checkpoint started in the small 
window between steps 1 and 2.


So this should have no effect on performance, but it is necessary for 
correctness.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Increasing max # of backup blocks (was Re: [HACKERS] GiST insert algorithm rewrite)

2010-12-13 Thread Heikki Linnakangas

On 13.12.2010 15:20, Heikki Linnakangas wrote:

On 13.12.2010 15:04, Robert Haas wrote:

On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

Attached is an updated patch, but that issue with limited number of
backup
blocks needs to be resolved. The straightforward way would be to
change the
WAL format to increase the limit.


Eh, is that going to bloat WAL?


Nah.


Or did you mean, if changing the WAL format to raise the limit would 
bloat WAL? Depends on how it's done, of course. Assuming MAXALIGN=4, 
there's 2 wasted bytes in XLogRecord at the moment that we could take 
into use without making the header bigger.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 But that creates a new problem: There's a maximum of three backup blocks 
 per WAL record, but a GiST page can be split into any number of child 
 pages as one operation. You might run out of backup block slots.

 Attached is an updated patch, but that issue with limited number of 
 backup blocks needs to be resolved. The straightforward way would be to 
 change the WAL format to increase the limit.

I don't think you can fix it that way.  If a page can be split into any
number of child pages, then no fixed number of pages in a WAL record
will be enough.  Even if you were willing to suppose that ~16 would be
enough, it would be a bad idea because of the extra overhead added into
XLogInsert, which'd be paid by *every* WAL insert operation.

I think you need to refactor the operation so that there's one WAL
record per child page, or something along that line.  I concede this
might be difficult :-(

regards, tom lane



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Greg Stark
On Mon, Dec 13, 2010 at 3:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think you need to refactor the operation so that there's one WAL
 record per child page, or something along that line.  I concede this
 might be difficult :-(

If it's only the backup blocks that matter couldn't you generate noop
WAL records with just the full page image in them. Once all those are
generated then generate the actual split operation and since all the
pages have been written to wal since the last checkpoint they won't
need any backup block slots.

This would require suppressing any checkpoints between writing the
first backup block and the final operation record. That might be
pretty hard to do cleanly.


-- 
greg



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Heikki Linnakangas

On 13.12.2010 19:19, Greg Stark wrote:

On Mon, Dec 13, 2010 at 3:14 PM, Tom Lanet...@sss.pgh.pa.us  wrote:

I think you need to refactor the operation so that there's one WAL
record per child page, or something along that line.  I concede this
might be difficult :-(


If it's only the backup blocks that matter couldn't you generate noop
WAL records with just the full page image in them. Once all those are
generated then generate the actual split operation and since all the
pages have been written to wal since the last checkpoint they won't
need any backup block slots.

This would require suppressing any checkpoints between writing the
first backup block and the final operation record. That might be
pretty hard to do cleanly.


That would work, but it brings us back to square one 
(http://archives.postgresql.org/message-id/4ccfee61.2090...@enterprisedb.com). 
It's not necessarily a bad idea, though. A capability to hold off 
checkpoints might be the easiest way to do this, and could be useful for 
other things in the future.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 13.12.2010 19:19, Greg Stark wrote:
 If it's only the backup blocks that matter couldn't you generate noop
 WAL records with just the full page image in them. Once all those are
 generated then generate the actual split operation and since all the
 pages have been written to wal since the last checkpoint they won't
 need any backup block slots.
 
 This would require suppressing any checkpoints between writing the
 first backup block and the final operation record. That might be
 pretty hard to do cleanly.

 That would work, but it brings us back to square one 

Yeah.  Wouldn't the original page-split record have been carrying full
page images already?  (And if so, why didn't we have this problem in the
previous implementation?)

regards, tom lane



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Heikki Linnakangas

On 13.12.2010 19:48, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

On 13.12.2010 19:19, Greg Stark wrote:

If it's only the backup blocks that matter couldn't you generate noop
WAL records with just the full page image in them. Once all those are
generated then generate the actual split operation and since all the
pages have been written to wal since the last checkpoint they won't
need any backup block slots.

This would require suppressing any checkpoints between writing the
first backup block and the final operation record. That might be
pretty hard to do cleanly.



That would work, but it brings us back to square one


Yeah.  Wouldn't the original page-split record have been carrying full
page images already?


Yes.

BTW, the original split record doesn't run into the limit because it 
doesn't use the backup-block mechanism, it contains all the tuples for 
all the pages in the main payload.



 (And if so, why didn't we have this problem in the
previous implementation?)


In the previous implementation, the NSN was updated immediately in the 
page split record, and there was no follow-right flag to clear. So the 
child pages didn't need to be updated when the downlinks were inserted.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 13.12.2010 19:48, Tom Lane wrote:
 Yeah.  Wouldn't the original page-split record have been carrying full
 page images already?

 Yes.

 BTW, the original split record doesn't run into the limit because it 
 doesn't use the backup-block mechanism, it contains all the tuples for 
 all the pages in the main payload.

I see.

 (And if so, why didn't we have this problem in the
 previous implementation?)

 In the previous implementation, the NSN was updated immediately in the 
 page split record, and there was no follow-right flag to clear. So the 
 child pages didn't need to be updated when the downlinks were inserted.

Can we fix it so that each child page is updated, and its downlink
inserted, as a separate atomic action?  That'd require each intermediate
state to be consistent and crash-safe, but I think you really need the
intermediate states to be consistent anyway because of concurrent scans.

regards, tom lane



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-13 Thread Heikki Linnakangas

On 13.12.2010 20:30, Tom Lane wrote:

Can we fix it so that each child page is updated, and its downlink
inserted, as a separate atomic action?  That'd require each intermediate
state to be consistent and crash-safe, but I think you really need the
intermediate states to be consistent anyway because of concurrent scans.


Yes, all the intermediate states are consistent. I'm looking at that 
approach as we speak. The logic to track what we've done and what needs 
to be done as the changes are propagated gets quite hairy, but in 
principle it should work.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-03 Thread Robert Haas
On Dec 3, 2010, at 4:54 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:
 Here's an updated patch.

How carefully have you perf-tested this?

 On closer look, supporting the invalid tuples in scans was trivial, so I kept 
 that after all. So you can still query an index with invalid tuples. If an 
 insert encounters one, you get an error, and VACUUM emits a LOG message on 
 any such tuples.

Cool.

 There's one bug remaining that I found during testing. If you crash, leaving 
 an incomplete split behind, and then vacuum the table removing all the 
 aborted tuples from the pages, it's possible that you end up with a 
 completely empty page that has no downlink yet. The code to complete 
 incomplete splits doesn't cope with that at the moment - it doesn't know how 
 to construct a parent key for a child that has no tuples.

I think we can live with this.
 


...Robert


Re: [HACKERS] GiST insert algorithm rewrite

2010-12-01 Thread Heikki Linnakangas

On 01.12.2010 04:10, Robert Haas wrote:

On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Does the current code cope with the corruption?


It's not corruption, but intended degradation. Yes, the current code copes
with it, that's how GiST survives a crash. However, even with the current
code, VACUUM will nag if it finds any invalid tuples with this message:

ereport(NOTICE,
(errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
recovery",

That's harmless, in the sense that all scans and inserts work fine, but
scans might need to do more work than if the invalid tuple wasn't there.

I don't think we need to go out of our way to support such degraded indexes
in 9.1. If you see such notices in your logs, you should REINDEX anyway,
before or after pg_upgrade. Let's just make sure that you get a reasonable
error message in 9.1 if a scan or insert encounters such a tuple.


I just don't want to take a risk of giving people unexpected wrong
answers.  It's not clear to me whether that's a risk here or not.


You'll get an error if a scan encounters an invalid tuple.

In the patch I posted, I just ripped out everything related to invalid 
tuples altogether. But we should add a check and ereport for that before 
commit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-12-01 Thread Robert Haas
On Wed, Dec 1, 2010 at 4:00 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 01.12.2010 04:10, Robert Haas wrote:

 On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 Does the current code cope with the corruption?

 It's not corruption, but intended degradation. Yes, the current code
 copes
 with it, that's how GiST survives a crash. However, even with the current
 code, VACUUM will nag if it finds any invalid tuples with this message:

 ereport(NOTICE,
        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
 recovery",

 That's harmless, in the sense that all scans and inserts work fine, but
 scans might need to do more work than if the invalid tuple wasn't there.

 I don't think we need to go out of our way to support such degraded
 indexes
 in 9.1. If you see such notices in your logs, you should REINDEX anyway,
 before or after pg_upgrade. Let's just make sure that you get a
 reasonable
 error message in 9.1 if a scan or insert encounters such a tuple.

 I just don't want to take a risk of giving people unexpected wrong
 answers.  It's not clear to me whether that's a risk here or not.

 You'll get an error if a scan encounters an invalid tuple.

 In the patch I posted, I just ripped out everything related to invalid
 tuples altogether. But we should add a check and ereport for that before
 commit.

All right, that seems like a reasonable backstop, if we're fairly sure
this won't be a common scenario.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any invalid keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4.  Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older 
versions, but given the lack of real-life reports and the scale of the 
changes required it doesn't seem wise to backport.



You are saying it would have to be run before the upgrade.  Can it not
be run after?


After would work too.


I can output a script to VACUUM all such indexes, and tell users to
manually REINDEX any index that generates a warning message.  I don't
have any way to automate an optional REINDEX step.


That seems good enough.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 11:55, Heikki Linnakangas wrote:

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any invalid keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.


Oh sorry, I read your question as 9.0 *or* 9.1.

Only GiST indexes that have any invalid tuples in them, as a result 
of a crash, need to be reindexed. That's very rare in practice, so we 
shouldn't invalidate all GiST indexes. I don't think there's any simple 
way to check whether reindex is required, so I think we have to just 
document this.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 30.11.2010 11:55, Heikki Linnakangas wrote:

 On 27.11.2010 21:31, Bruce Momjian wrote:

 Heikki Linnakangas wrote:

 There's no on-disk format changes, except for the additional flag in the
 page headers, so this does not affect pg_upgrade. However, if there's
 any invalid keys in the old index because of an incomplete insertion,
 the new code will not understand that. So you should run vacuum to
 ensure that there's no such invalid keys in the index before upgrading.
 Vacuum will print a message in the log if it finds any, and you will
 have to reindex. But that's what it suggests you to do anyway.

 OK, pg_upgrade has code to report invalid gin and hash indexes because
 of changes between PG 8.3 and 8.4. Is this something we would do for
 9.0 to 9.1?

 9.1. The problem that started this whole thing is there in older
 versions, but given the lack of real-life reports and the scale of the
 changes required it doesn't seem wise to backport.

 Oh sorry, I read your question as 9.0 *or* 9.1.

 Only GiST indexes that have any invalid tuples in them, as a result of a
 crash, need to be reindexed. That's very rare in practice, so we shouldn't
 invalidate all GiST indexes. I don't think there's any simple way to check
 whether reindex is required, so I think we have to just document this.

It seems odd to say, "the indexes are corrupted, but they're probably
not, so let's not worry about it."

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Heikki Linnakangas

On 30.11.2010 16:23, Robert Haas wrote:

On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 30.11.2010 11:55, Heikki Linnakangas wrote:


On 27.11.2010 21:31, Bruce Momjian wrote:


Heikki Linnakangas wrote:


There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any invalid keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.


OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?


9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.


Oh sorry, I read your question as 9.0 *or* 9.1.

Only GiST indexes that have any invalid tuples in them, as a result of a
crash, need to be reindexed. That's very rare in practice, so we shouldn't
invalidate all GiST indexes. I don't think there's any simple way to check
whether reindex is required, so I think we have to just document this.


It seems odd to say, "the indexes are corrupted, but they're probably
not, so let's not worry about it."

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?


It's not corruption, but intended degradation. Yes, the current code 
copes with it, that's how GiST survives a crash. However, even with the 
current code, VACUUM will nag if it finds any invalid tuples with this 
message:


ereport(NOTICE,
	(errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash 
recovery",


That's harmless, in the sense that all scans and inserts work fine, but 
scans might need to do more work than if the invalid tuple wasn't there.


I don't think we need to go out of our way to support such degraded 
indexes in 9.1. If you see such notices in your logs, you should REINDEX 
anyway, before or after pg_upgrade. Let's just make sure that you get a 
reasonable error message in 9.1 if a scan or insert encounters such a tuple.


There is a section on this in the docs, BTW: 
http://www.postgresql.org/docs/9.0/static/gist-recovery.html


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Bruce Momjian
Heikki Linnakangas wrote:
  Does the current code cope with the corruption?
 
 It's not corruption, but intended degradation. Yes, the current code 
 copes with it, that's how GiST survives a crash. However, even with the 
 current code, VACUUM will nag if it finds any invalid tuples with this 
 message:
 
 ereport(NOTICE,
   (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash 
 recovery",
 
 That's harmless, in the sense that all scans and inserts work fine, but 
 scans might need to do more work than if the invalid tuple wasn't there.
 
 I don't think we need to go out of our way to support such degraded 
 indexes in 9.1. If you see such notices in your logs, you should REINDEX 
 anyway, before or after pg_upgrade. Let's just make sure that you get a 
 reasonable error message in 9.1 if a scan or insert encounters such a tuple.
 
 There is a section on this in the docs, BTW: 
 http://www.postgresql.org/docs/9.0/static/gist-recovery.html

OK, administrators will be prompted during normal operation --- seems
there is nothing extra for pg_upgrade to do here.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-30 Thread Robert Haas
On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Does the current code cope with the corruption?

 It's not corruption, but intended degradation. Yes, the current code copes
 with it, that's how GiST survives a crash. However, even with the current
 code, VACUUM will nag if it finds any invalid tuples with this message:

 ereport(NOTICE,
        (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
 recovery",

 That's harmless, in the sense that all scans and inserts work fine, but
 scans might need to do more work than if the invalid tuple wasn't there.

 I don't think we need to go out of our way to support such degraded indexes
 in 9.1. If you see such notices in your logs, you should REINDEX anyway,
 before or after pg_upgrade. Let's just make sure that you get a reasonable
 error message in 9.1 if a scan or insert encounters such a tuple.

I just don't want to take a risk of giving people unexpected wrong
answers.  It's not clear to me whether that's a risk here or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-27 Thread Bruce Momjian
Heikki Linnakangas wrote:
 There's no on-disk format changes, except for the additional flag in the 
 page headers, so this does not affect pg_upgrade. However, if there's 
 any invalid keys in the old index because of an incomplete insertion, 
 the new code will not understand that. So you should run vacuum to 
 ensure that there's no such invalid keys in the index before upgrading. 
 Vacuum will print a message in the log if it finds any, and you will 
 have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4.  Is this something we would do for
9.0 to 9.1?

You are saying it would have to be run before the upgrade.  Can it not
be run after?

I can output a script to VACUUM all such indexes, and tell users to
manually REINDEX any index that generates a warning message.  I don't
have any way to automate an optional REINDEX step.


-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-18 Thread Heikki Linnakangas

On 17.11.2010 19:36, Teodor Sigaev wrote:

Hmm, will have to do some benchmarking on that. I'm using the Consistent
function when walking down to check if the downlink needs to be updated,
and assumed that it would be insignificant compared to the cost of
calling Penalty on all the keys on the page.

Why Consistent?! It's impossible: you don't know the right strategy number,
and an index whose storage type differs from the indexed type may not
accept the same type as the query. An index over tsvector is an example.


Sorry, I was confused. I'm calling the gistgetadjusted() function, which 
uses the Union function. Ie. I'm doing the same we did before when 
propagating the changes up the tree. I'm just doing it on the way down 
instead.


I ran some quick performance tests on my laptop, and couldn't see any 
measurable difference with the patch. So I think we're good on 
performance. I used the attached scripts, with \timing.


Have you had a chance to look at the patch yet? I'm hesitant to commit 
before you take a look at it, though I still have to proofread it myself 
as well.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
DROP TABLE IF EXISTS tt;
CREATE TABLE tt (a int4);
CREATE INDEX i_tt ON tt USING gist (a);

checkpoint;
INSERT INTO tt
SELECT i FROM generate_series(1, 20) i;
DROP TABLE IF EXISTS tt;
CREATE TABLE tt (a tsvector);
CREATE INDEX i_tt ON tt USING gist (a);

checkpoint;
INSERT INTO tt
SELECT (chr(97+(i%27)) || chr(97+(i%29)) || repeat(chr(97+(i%23)), (i%11) + 1))::tsvector AS a
FROM generate_series(1, 20) i;



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-17 Thread Teodor Sigaev

Hmm, will have to do some benchmarking on that. I'm using the Consistent
function when walking down to check if the downlink needs to be updated,
and assumed that it would be insignificant compared to the cost of
calling Penalty on all the keys on the page.
Why consistent?! It's impossible: you don't know the right strategy number, and an 
index whose storage type differs from the indexed type may not accept the same type 
as the query. An index over tsvector is an example.



There should be no difference in performance here AFAICS. The children
need to be updated a second time to clear the flag, but we don't release
the locks on them in the middle, and we're only talking about setting a
single flag, so it should make no difference.


Agree with that
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-17 Thread Teodor Sigaev
Sorry, I missed the beginning of the discussion on GiST, so I read it in the web 
mail archive.


You wrote:
http://archives.postgresql.org/pgsql-hackers/2010-11/msg00939.php
[skip]
0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and new 
buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original downlink is 
updated to reflect only the keys that stayed on the left page). While keeping 
the child pages locked, the NSN field on the children is updated with the new 
LSN of the parent page.

...
The scan checks that by comparing the LSN it saw on the parent page with the NSN 
on the child page. If parent LSN < NSN, we saw the parent before the downlink 
was inserted.


Now, the problem with crash recovery is that the above algorithm depends on the 
split keeping the parent and child locked until the downlink is inserted in the 
parent. If you crash between steps 2 and 3, the locks are gone. If a later 
insert then updates the parent page, because of a split on some unrelated child 
page, that will bump the LSN of the parent above the NSN on the child. Scans 
will see that the parent LSN > child NSN, and will no longer follow the rightlink.
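The interlock being described can be illustrated with a toy model (plain Python, not PostgreSQL code; the page fields and the scan loop are simplifications of the real GiST structures):

```python
# Toy model of the GiST LSN/NSN interlock (not PostgreSQL code).
# A scan remembers the LSN of the parent page when it reads a downlink.
# If the child's NSN is newer than that remembered LSN, the child was
# split after the scan read the parent, so the scan must also follow
# the child's rightlink to find tuples moved to the new right sibling.

class Page:
    def __init__(self, items):
        self.items = list(items)
        self.lsn = 0         # last WAL record that touched this page
        self.nsn = 0         # LSN recorded on the left half at split time
        self.rightlink = None

def scan(parent, child_of, match):
    """Collect matching items, following rightlinks when the NSN shows
    a split that happened after we read the parent."""
    results = []
    for downlink in parent.items:
        parent_lsn = parent.lsn          # remember parent LSN at read time
        page = child_of[downlink]
        while page is not None:
            results.extend(i for i in page.items if match(i))
            # parent_lsn < page.nsn: the right sibling's downlink was not
            # yet on the parent version we saw, so move right.
            if page.rightlink is not None and parent_lsn < page.nsn:
                page = page.rightlink
            else:
                page = None
    return results
```

With parent LSN 3 < NSN 10 the scan follows the rightlink and finds everything; once an unrelated update bumps the parent LSN past the child's NSN, the check fails and the tuples on the right half are silently missed, which is exactly the crash-recovery hazard described above.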

[skip]


I disagree with that opinion: if we crash between 2 and 3, why would somebody 
update the parent before WAL replay? The WAL replay process in this case should 
complete the child split by inserting an invalid pointer, and the tree becomes 
correct again, although it later needs to repair the invalid pointers. The same 
situation exists with b-tree: WAL replay repairs incomplete splits before any 
other processing.


Or do I miss something important?



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-17 Thread Heikki Linnakangas

On 17.11.2010 19:46, Teodor Sigaev wrote:

I disagree with that opinion: if we crash between 2 and 3, why would
somebody update the parent before WAL replay? The WAL replay process in
this case should complete the child split by inserting an invalid
pointer, and the tree becomes correct again, although it later needs to
repair the invalid pointers. The same situation exists with b-tree: WAL
replay repairs incomplete splits before any other processing.

Or do I miss something important?


Yeah, see the thread that started this:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00052.php
http://archives.postgresql.org/message-id/12375.1289429...@sss.pgh.pa.us

The code currently relies on the end-of-recovery processing to finish 
the incomplete splits, but I'm trying to get rid of that end-of-recovery 
processing altogether.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 There are two key changes to the insert algorithm:

 1. When we walk down the tree searching for a suitable leaf page to 
 insert the new tuple to, we update the nodes on the way down so that 
 they are consistent with the new key we're inserting. The old GiST 
 algorithm adjusted all the parent pages only after inserting the tuple 
 on the leaf. Updating them on the way down ensures that the tree is 
 self-consistent at all times, even if we crash just after inserting the 
 new tuple on the leaf page.

 2. When a page is split, we mark the new left page with a flag to 
 indicate that the downlink for the page to the right hasn't been 
 inserted yet. When the downlink is inserted, the flag is cleared. Again 
 the purpose is to ensure that the tree is self-consistent at all times. 
 If we crash just after a page split, before the downlink is inserted, 
 scans will find the tuples on the right page by following the rightlink. 
 It's slightly less performant, but correct.

The one thought that comes to mind is how does the flag business work
after multiple splittings?  That is, assume we have a page that has the
flag set because of a previous crash.  If we have to split either that
page or its right sibling, what do we do with the flags?  I think that
it can be made to work, so long as searches keep moving right as long
as the flag is set.  But this needs to be thought through, and
documented in the README file.  I'm particularly worried whether the
after-the-fact fixup that you mention in README might fail in such
scenarios.

regards, tom lane



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Heikki Linnakangas

On 16.11.2010 20:01, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
the purpose is to ensure that the tree is self-consistent at all times.
If we crash just after a page split, before the downlink is inserted,
scans will find the tuples on the right page by following the rightlink.
It's slightly less performant, but correct.


The one thought that comes to mind is how does the flag business work
after multiple splittings?  That is, assume we have a page that has the
flag set because of a previous crash.  If we have to split either that
page or its right sibling, what do we do with the flags?


As I mentioned in the README, the insertion algorithm finishes any 
incomplete splits it sees before proceeding. AFAICS that should ensure 
that the situation never arises where you try to split a page that 
already has the flag set. Or its right sibling; such a page can only be 
reached via the rightlink so you would see the page with the flag set first.


Hmm, there is one corner-case that I didn't think of before: One 
backend splits a leaf page, and another backend concurrently splits the 
parent of that leaf page. If for some reason the backend operating on 
the parent page dies, releasing the locks, the other backend will see 
the incomplete split when it walks up to insert the downlink. Although 
it should be possible to handle that, I think we can simply give up on 
inserting the downlink in that case, and leave that split incomplete as 
well. It's still correct, and the next insert that comes along will complete 
the splits, from top to bottom.
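A rough sketch of that "complete incomplete splits on the way down" behaviour (illustrative Python; `FOLLOW_RIGHT` and the helper names are my inventions, not the identifiers used in the patch):

```python
# Illustrative sketch: while descending, if the chosen child is the left
# half of a split whose downlink was never inserted (FOLLOW_RIGHT set),
# finish that old split first, then continue the descent.

FOLLOW_RIGHT = 0x01

class GPage:
    def __init__(self, is_leaf, items=None):
        self.is_leaf = is_leaf
        self.items = items if items is not None else []  # tuples or child pages
        self.flags = 0
        self.rightlink = None

def finish_split(parent, left):
    """Insert the missing downlink for left's right sibling, clear the flag."""
    parent.items.append(left.rightlink)
    left.flags &= ~FOLLOW_RIGHT

def gist_insert(root, tup, choose_subtree):
    page = root
    while not page.is_leaf:
        child = choose_subtree(page, tup)
        if child.flags & FOLLOW_RIGHT:
            finish_split(page, child)   # complete the old split before going on
        page = child
    page.items.append(tup)
```

Because any flagged page is always encountered from its parent (or via a rightlink from a flagged sibling), the insert never has to split a page whose own flag is still set.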


BTW, I don't try to fix incomplete splits during vacuum in the patch. 
That's perhaps a bit surprising, and probably would be easy to add, but 
I left it out for now as it's not strictly necessary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Robert Haas
On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 BTW, I don't try to fix incomplete splits during vacuum in the patch. That's
 perhaps a bit surprising, and probably would be easy to add, but I left it
 out for now as it's not strictly necessary.

Seems like it would be good to have this; otherwise, the split might
stay incomplete indefinitely?  Would that be bad?

If we start to enlarge the bounding boxes on the higher levels of the
tree and then crash before inserting the key, is there any mechanism
for getting them back down to the minimal size?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Heikki Linnakangas

On 16.11.2010 20:46, Robert Haas wrote:

On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

BTW, I don't try to fix incomplete splits during vacuum in the patch. That's
perhaps a bit surprising, and probably would be easy to add, but I left it
out for now as it's not strictly necessary.


Seems like it would be good to have this; otherwise, the split might
stay incompletely indefinitely?  Would that be bad?


Nothing bad should happen. Scans that need to traverse the incompletely 
split page would just be marginally slower.



If we start to enlarge the bounding boxes on the higher levels of the
tree and then crash before inserting the key, is there any mechanism
for getting them back down to the minimal size?


No. There's also no mechanism for trimming the bounding boxes if a tuple 
is deleted.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Robert Haas
On Tue, Nov 16, 2010 at 1:50 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 If we start to enlarge the bounding boxes on the higher levels of the
 tree and then crash before inserting the key, is there any mechanism
 for getting them back down to the minimal size?

 No. There's also no mechanism for trimming the bounding boxes if a tuple is
 deleted.

Oh.  So do the indexes just degrade over time until they eventually
need to be REINDEX'd?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Teodor Sigaev

they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is
Hm, performance? While you traverse down to the leaf page, on each inner page you will 
need to call the unionFn/equalFn methods to decide whether to update the key on the inner 
page. Currently we stop doing that after the first positive result of equalFn while 
walking up. Next, when a child page splits you need to update the parent twice: first 
while walking down, and a second time while walking up.


As I see it, you are trying to implement an algorithm very close to the original 
one, but that was rather slow.

http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3





self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
Again, the new children are written twice (and there could be several of them, 
because of the user-defined picksplit method implementation).




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Oh.  So do the indexes just degrade over time until they eventually
 need to be REINDEX'd?

At some point you might reach a state where a reindex would be helpful.
But the same is true of btrees.  I don't think this is a serious
objection, at least not unless backed by evidence that the tree often
degrades rapidly.  Anyway fixing it would be material for a different
patch.

regards, tom lane



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Heikki Linnakangas

On 16.11.2010 21:20, Teodor Sigaev wrote:

they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is

Hm, performance? While you traverse down to the leaf page, on each inner page you
will need to call the unionFn/equalFn methods to decide whether to update the key
on the inner page. Currently we stop doing that after the first positive result of
equalFn while walking up. Next, when a child page splits you need to update the
parent twice: first while walking down, and a second time while walking up.


Hmm, will have to do some benchmarking on that. I'm using the Consistent 
function when walking down to check if the downlink needs to be updated, 
and assumed that it would be insignificant compared to the cost of 
calling Penalty on all the keys on the page.
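For illustration, the downlink adjustment amounts to this (a toy Python model with 1-D ranges standing in for GiST keys; the real code forms the union with the opclass Union method and compares with Same, via gistgetadjusted()):

```python
# Toy model of the gistgetadjusted() idea: replace a downlink key with
# the union of itself and the new key, but only if the union actually
# differs. Real GiST keys are opclass-specific; (lo, hi) ranges stand
# in here for bounding boxes.

def union(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def adjusted_downlink(downlink_key, new_key):
    """Return the adjusted key, or None if the downlink already covers new_key."""
    u = union(downlink_key, new_key)
    return None if u == downlink_key else u
```

On the way down, a None result means the inner page need not be dirtied at all; only when the new key extends the downlink's bounding box is the inner page updated before descending.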



As I see it, you are trying to implement an algorithm very close to the original
one, but it was rather slow.
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3


Thanks for that, I'll have to read that through as well.


self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again

Again, the new children are written twice (and there could be several of them,
because of the user-defined picksplit method implementation).


There should be no difference in performance here AFAICS. The children 
need to be updated a second time to clear the flag, but we don't release 
the locks on them in the middle, and we're only talking about setting a 
single flag, so it should make no difference.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] GiST insert algorithm rewrite

2010-11-16 Thread Robert Haas
On Tue, Nov 16, 2010 at 3:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Oh.  So do the indexes just degrade over time until they eventually
 need to be REINDEX'd?

 At some point you might reach a state where a reindex would be helpful.
 But the same is true of btrees.  I don't think this is a serious
 objection, at least not unless backed by evidence that the tree often
 degrades rapidly.  Anyway fixing it would be material for a different
 patch.

Oh, I agree it's not for this patch to fix it, if it's already that
way.  I was just curious.  I think index maintenance is going to be a
problem we have to devote some cycles to down the road, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
