On Thu, Oct 8, 2009 at 1:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Another approach that was discussed earlier was to divvy the rows into
>>> batches. Say every thousand rows you sub-commit and start a new
>>> subtransaction. Up to that point you save aside the good rows somewhere
>>> (maybe a tuplestore). If you get a failure partway through a batch,
>>> you start a new subtransaction and re-insert the batch's rows up to the
>>> bad row. This could be pretty awful in the worst case, but most of the
>>> time it'd probably perform well. You could imagine dynamically adapting
>>> the batch size depending on how often errors occur ...
>
>> Yeah, I think that's promising. There is of course the possibility
>> that a row which previously succeeded could fail the next time around,
>> but most of the time that shouldn't happen, and it should be possible
>> to code it so that it still behaves somewhat sanely if it does.
>
> Actually, my thought was that failure to reinsert a previously good
> tuple should cause us to abort the COPY altogether. This is a
> cheap-and-easy way of avoiding sorcerer's apprentice syndrome.
> Suppose the failures are coming from something like out of disk space,
> transaction timeout, whatever ... a COPY that keeps on grinding no
> matter what is *not* ideal.
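
[Editor's note: to make the quoted scheme concrete, here is a minimal client-side
sketch that simulates the backend subtransactions being discussed with SQL
savepoints through psycopg2. The table name ("target"), the two-column row
shape, and the helpers load_rows/insert_row are assumptions for illustration,
not anything from an actual COPY patch.]

    import psycopg2

    BATCH_SIZE = 1000   # rows per subtransaction; could be adapted dynamically


    def insert_row(cur, row):
        # Illustrative stand-in for the per-row insert COPY would perform.
        cur.execute("INSERT INTO target VALUES (%s, %s)", row)


    def load_rows(conn, rows):
        cur = conn.cursor()
        batch = []                          # good rows saved aside for the
                                            # current subtransaction
        cur.execute("SAVEPOINT copy_batch")
        for row in rows:
            try:
                insert_row(cur, row)
                batch.append(row)
            except psycopg2.Error:
                # The failed row aborts the subtransaction, taking the batch's
                # earlier good rows with it, so roll back ...
                cur.execute("ROLLBACK TO SAVEPOINT copy_batch")
                # ... and re-insert the saved good rows up to the bad row.
                # As discussed above, a failure here simply propagates and
                # aborts the whole load.  The bad row itself is skipped.
                for good in batch:
                    insert_row(cur, good)
            if len(batch) >= BATCH_SIZE:
                cur.execute("RELEASE SAVEPOINT copy_batch")   # "sub-commit"
                cur.execute("SAVEPOINT copy_batch")
                batch = []
        cur.execute("RELEASE SAVEPOINT copy_batch")
        conn.commit()

[Adapting the batch size on the fly, as suggested in the quote, would just mean
shrinking BATCH_SIZE when the error branch fires often and growing it again
when batches go through cleanly.]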
I think you handle that by putting a cap on the total number of errors
you're willing to accept (and in any event you'll always skip the row
that failed, so forward progress can't cease altogether).

For out of disk space or transaction timeout, sure, but you might also
have things like a serialization error that occurs on the reinsert that
didn't occur on the original. You don't want that to kill the whole
bulk load, I would think.

...Robert
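
[Editor's note: to make the cap-and-skip idea concrete, here is a hypothetical
variant of the re-insert path from the previous sketch. A failure while
re-inserting a previously good row (say, a serialization error) is charged
against an overall error budget and that row is skipped, rather than aborting
the load outright. MAX_ERRORS, reinsert_batch, and the per-row savepoint are
invented for illustration, not from any proposed patch.]

    import psycopg2

    MAX_ERRORS = 100     # arbitrary cap on the total number of rows dropped


    def insert_row(cur, row):
        # Same illustrative helper as in the previous sketch.
        cur.execute("INSERT INTO target VALUES (%s, %s)", row)


    def reinsert_batch(cur, batch, error_count):
        """Re-insert previously good rows one by one, each in its own savepoint."""
        survivors = []
        for good in batch:
            cur.execute("SAVEPOINT reinsert_row")
            try:
                insert_row(cur, good)
                cur.execute("RELEASE SAVEPOINT reinsert_row")
                survivors.append(good)
            except psycopg2.Error:
                # Undo just this row and charge it against the error budget.
                cur.execute("ROLLBACK TO SAVEPOINT reinsert_row")
                cur.execute("RELEASE SAVEPOINT reinsert_row")
                error_count += 1                 # the row is skipped, so the
                if error_count > MAX_ERRORS:     # load still makes progress
                    raise                        # budget exhausted: give up
        return survivors, error_count

[Wrapping each re-inserted row in its own savepoint is what lets a single
transient failure drop just that row instead of the whole recovered batch,
while the cap still bounds how long a genuinely sick load (out of disk space,
transaction timeouts) can keep grinding.]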