Simon Riggs <[EMAIL PROTECTED]> writes:
> Thinking about this some more, I ask myself: why is it we log index
> inserts at all? We log heap inserts, which contain all the information
> we need to replay all index inserts also, so why bother?
(1) We can't run user-defined functions during log replay. Quite aside
from any risk of nondeterminism, the normal transaction infrastructure
isn't functioning in that environment.

(2) Some of the index code is itself deliberately nondeterministic. I'm
thinking in particular of the move-right-or-not choice in
_bt_insertonpg() when there are many equal keys, but randomization is in
general a useful algorithmic technique that we'd have to forswear.

(3) In the presence of concurrency, the sequence of heap-insert WAL
records isn't enough info, because it doesn't tell you what order the
index inserts occurred in. The btree code, at least, is sufficiently
concurrent that even knowing the sequence of leaf-key insertions isn't
full information --- it's not hard to imagine cases where decisions about
where to split upper-level pages depend on which process manages to
obtain a lock on a page first.

There are probably some other reasons that I've forgotten. Check the
archives; this point has been debated before.

Basically the problem here is that you can't mix logged and non-logged
operations --- if you're going to WAL-log any operations on an index,
then you have to be sure that replay will regenerate exactly the same
series of index states that happened the first time. So none of this is
an argument against "rebuild the index at end of replay"; but I don't
see any workable half measures.

			regards, tom lane