On 22.10.2013 21:25, Andres Freund wrote:
On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
Splitting a B-tree page is a two-stage process: First, the page is split,
and then a downlink for the new right page is inserted into the parent
(which might recurse to split the parent page, too). What happens if
inserting the downlink fails for some reason? I tried that out, and it turns
out that it's not nice.

I used this to cause a failure:

--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
                        _bt_relbuf(rel, pbuf);
                }

+               elog(ERROR, "fail!");
+
                /* get high key from left page == lowest key on new right page */
                ritem = (IndexTuple) PageGetItem(page,
                                                 PageGetItemId(page, P_HIKEY));

postgres=# create table foo (i int4 primary key);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 10000);
ERROR:  fail!

That's not surprising. But after I removed that elog again and restarted the
server, I still couldn't insert. The index is permanently broken:

postgres=# insert into foo select generate_series(1, 10000);
ERROR:  failed to re-find parent key in index "foo_pkey" for split pages 4/5

In real life, you would get a failure like this e.g. if you run out of memory
or disk space while inserting the downlink to the parent. Although rare in
practice, it's no fun if it happens.
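
To make the failure mode concrete, here is a minimal toy model in C of the two-stage split described above. It is only a sketch, not nbtree code: ToyIndex, toy_split_leaf, toy_insert_downlink and the other names are hypothetical, and the "index" is a single parent page holding downlinks to numbered leaf pages. It assumes that the later "failed to re-find parent key" error comes from splitting a page that never got its own downlink, which is one plausible reading of the output above.

```c
#include <stdbool.h>

#define MAX_PAGES 16
#define NO_PAGE   (-1)

/* Toy model only: one parent page with downlinks to numbered leaf pages. */
typedef struct
{
    int downlinks[MAX_PAGES];   /* leaf pages the parent knows about */
    int ndownlinks;
    int right_link[MAX_PAGES];  /* leaf -> right sibling, or NO_PAGE */
    int npages;                 /* leaf pages allocated so far */
} ToyIndex;

void toy_init(ToyIndex *idx)
{
    idx->downlinks[0] = 0;      /* the parent starts out pointing at leaf 0 */
    idx->ndownlinks = 1;
    idx->npages = 1;
    for (int i = 0; i < MAX_PAGES; i++)
        idx->right_link[i] = NO_PAGE;
}

/* Position of the downlink to 'page' in the parent, or -1 if there is none. */
int toy_find_downlink(const ToyIndex *idx, int page)
{
    for (int i = 0; i < idx->ndownlinks; i++)
        if (idx->downlinks[i] == page)
            return i;
    return -1;
}

/* Stage 1: create the right sibling and chain it in via the right-link. */
int toy_split_leaf(ToyIndex *idx, int left)
{
    if (idx->npages >= MAX_PAGES)
        return NO_PAGE;
    int right = idx->npages++;
    idx->right_link[right] = idx->right_link[left];
    idx->right_link[left] = right;
    return right;
}

/* Stage 2: insert the downlink for 'right' just after the one for 'left'. */
bool toy_insert_downlink(ToyIndex *idx, int left, int right)
{
    int pos = toy_find_downlink(idx, left);
    if (pos < 0)
        return false;           /* analogue of "failed to re-find parent key" */
    for (int i = idx->ndownlinks; i > pos + 1; i--)
        idx->downlinks[i] = idx->downlinks[i - 1];
    idx->downlinks[pos + 1] = right;
    idx->ndownlinks++;
    return true;
}

/* Whole split; fail_between_stages stands in for the injected elog(ERROR). */
bool toy_split(ToyIndex *idx, int left, bool fail_between_stages)
{
    int right = toy_split_leaf(idx, left);
    if (right == NO_PAGE)
        return false;
    if (fail_between_stages)
        return false;           /* stage 1 has happened and is not undone */
    return toy_insert_downlink(idx, left, right);
}
```

In this model, calling toy_split(&idx, 0, true) leaves page 1 reachable only through the right-link of page 0. A later toy_split(&idx, 1, false) then fails in stage 2 because the parent has no downlink for page 1, and it keeps failing no matter how often it is retried.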

Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary and a checkpoint happened just
before your elog(ERROR) above?

Because there's no recovery involved. The failure I injected (or an out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC, just an ERROR that rolls back the current transaction, nothing more.

We could put a critical section around the whole recursion that inserts the downlinks, so that you would get a PANIC and the incomplete split mechanism would fix it at recovery. But that would hardly be an improvement.
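
The distinction between ERROR and PANIC can be sketched with a small toy model in C. PostgreSQL promotes an ERROR raised inside a critical section to PANIC, and only a PANIC forces a restart with WAL recovery. Everything named toy_* below is hypothetical and only mimics that rule; it is not the elog machinery itself.

```c
#include <stdbool.h>

/* Toy severity levels, named after PostgreSQL's but not the real values. */
typedef enum { TOY_ERROR, TOY_PANIC } ToyLevel;

/* Nesting depth of critical sections (toy analogue of a per-backend counter). */
static int toy_crit_section_count = 0;

void toy_start_crit_section(void) { toy_crit_section_count++; }
void toy_end_crit_section(void)   { toy_crit_section_count--; }

/* An ERROR raised inside a critical section is promoted to PANIC. */
ToyLevel toy_effective_level(ToyLevel requested)
{
    if (requested == TOY_ERROR && toy_crit_section_count > 0)
        return TOY_PANIC;
    return requested;
}

/* Only a PANIC leads to a restart, and therefore to crash recovery. */
bool toy_triggers_recovery(ToyLevel requested)
{
    return toy_effective_level(requested) == TOY_PANIC;
}
```

Outside a critical section, toy_triggers_recovery(TOY_ERROR) is false, which corresponds to the injected failure: the transaction rolls back and recovery never gets a chance to finish the split. Between toy_start_crit_section() and toy_end_crit_section() the same call returns true, at the price of taking the whole server down.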

- Heikki

