It seems that commit 01ec2563 has a subtle bug, which stems from the
fact that logtape.c no longer follows the rule described above
ltsGetFreeBlock():

/*
 * Select a currently unused block for writing to.
 *
 * NB: should only be called when writer is ready to write immediately,
 * to ensure that first write pass is sequential.
 */
static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
    ...
}

Previously, all calls to ltsGetFreeBlock() were immediately followed
by a corresponding call to ltsWriteBlock(); we wrote out the
newly-allocated-in-first-pass block there and then. It's a good idea
for a corresponding ltsWriteBlock() call to happen immediately, and
it's *essential* for it to happen before any later block is written
during the first write pass (when tuples are initially dumped out to
form runs), since, as noted above ltsWriteBlock():

/*
 * Write a block-sized buffer to the specified block of the underlying file.
 *
 * NB: should not attempt to write beyond current end of file (ie, create
 * "holes" in file), since BufFile doesn't allow that.  The first write pass
 * must write blocks sequentially.
...

However, a "write beyond current end of file" now seems to actually be
attempted at times, resulting in an arcane error during sorting. This
is more or less because LogicalTapeWrite() doesn't heed the warning
from ltsGetFreeBlock(), as seen here:

    while (size > 0)
    {
        if (lt->pos >= TapeBlockPayloadSize)
        {
            ...

            /*
             * First allocate the next block, so that we can store it in the
             * 'next' pointer of this block.
             */
            nextBlockNumber = ltsGetFreeBlock(lts);

            /* set the next-pointer and dump the current block. */
            TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
            ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
            ...
        }
        ...

    }

Note that LogicalTapeWrite() now allocates each block (calls
ltsGetFreeBlock()) one block in advance of current block, immediately
before *current* block is written (not the next block/block just
allocated). So, the corresponding ltsWriteBlock() call *must* happen
at an unspecified time in the future, typically the next time control
ends up at exactly the same point, where the block that was "next"
becomes "current".

I'm not about to argue that we should go back to following this
ltsGetFreeBlock() rule, though; I can see why Heikki refactored
LogicalTapeWrite() to allocate blocks a block in advance. Still, we
need to be more careful in avoiding the underlying problem that the
ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
has logtape.c be sure to write out a tape's buffer every time
tuplesort ends a run.

I have a test case. Build Postgres with only 0002-* applied, which
makes MAX_PHYSICAL_FILESIZE far smaller than its current value
(current value is 2^30, while this reduces it to BLCKSZ). Then:

postgres=# set replacement_sort_tuples = 0; set work_mem = 64;
SET
SET
postgres=# with a as (select repeat('a', 7000) i from
generate_series(1, 7)) select * from a order by i;
ERROR:  XX000: could not write block 5 of temporary file: Success
LOCATION:  ltsWriteBlock, logtape.c:204

This "block 5" corresponds to an fd.c file/BufFile segment that does
not in fact exist when this error is raised. Since BufFiles multiplex
BLCKSZ-sized segment files in this build of Postgres, it's quite
likely, in general, that writes marking the end of a run will straddle
"segment boundaries", which is what it takes for buffile.c/logtape.c
to throw this error. (The BufFile contract does not allow "holes" in
files, but segment boundaries must be crossed as the critical point
(just before a tape switch) to actually see this error -- you'd have
to be very unlucky to have things line up in exactly the wrong way
with 1GiB BufFile segments, but it can still happen.)

-- 
Peter Geoghegan
From d4a611e94a3b4504f034fdb31a8ab4a72955b887 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 5 Jan 2017 11:29:53 -0800
Subject: [PATCH 2/2] Reduce MAX_PHYSICAL_FILESIZE value to BLCKSZ

This constant controls the size of underlying fd.c managed files that
the BufFile abstraction multiplexes for its callers.  A very low value
for the constant can be useful for flushing out bugs.
---
 src/backend/storage/file/buffile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 7ebd636..ac3e400 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -47,7 +47,7 @@
  * The reason is that we'd like large temporary BufFiles to be spread across
  * multiple tablespaces when available.
  */
-#define MAX_PHYSICAL_FILESIZE	0x40000000
+#define MAX_PHYSICAL_FILESIZE	BLCKSZ
 #define BUFFILE_SEG_SIZE		(MAX_PHYSICAL_FILESIZE / BLCKSZ)
 
 /*
-- 
2.7.4

From 389859e1a31fab7fe906176d33660583c12e4277 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 5 Jan 2017 11:58:24 -0800
Subject: [PATCH 1/2] Fix bug in logical tapeset free block management

Make sure that the first write pass always writes out blocks in
sequential order.  The simplifications made to the logtape block format
in commit 01ec2563 did not tie down all the cases in which
non-sequential writes of blocks in the first write pass could happen.
When non-sequential writes are performed in the first write pass,
BufFile seeking indicates a problem, making logtape.c raise an error.

The extra steps taken within this commit to prevent problems are applied
more broadly than strictly necessary; they're only necessary in the
first write pass, and not at the end of every run (round of writing to a
logical tape).  It doesn't seem worth bothering tuplesort.c with that
detail, though.
---
 src/backend/utils/sort/logtape.c   | 54 ++++++++++++++++++++++++++++++++------
 src/backend/utils/sort/tuplesort.c |  1 +
 src/include/utils/logtape.h        |  1 +
 3 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index f6a9dba..e07e328 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -282,8 +282,10 @@ freeBlocks_cmp(const void *a, const void *b)
 /*
  * Select a currently unused block for writing to.
  *
- * NB: should only be called when writer is ready to write immediately,
- * to ensure that first write pass is sequential.
+ * NB: blocks returned from here must be written out in the order of their
+ * initial allocation during the first write pass.  This is necessary
+ * because ltsWriteBlock() does not support writing beyond current end of
+ * file.
  */
 static long
 ltsGetFreeBlock(LogicalTapeSet *lts)
@@ -426,6 +428,37 @@ LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts)
 }
 
 /*
+ * Indicate that round of writing to logical tape is over.
+ *
+ * When tuplesort marks the end of a run on tape, it must call here
+ * immediately afterwards in order to ensure that pending dirty data is
+ * written immediately.
+ */
+void
+LogicalTapeEndWriting(LogicalTapeSet *lts, int tapenum)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+	Assert(lt->writing);
+	/* Must have written to tape by now */
+	Assert(lt->curBlockNumber != -1);
+
+	/*
+	 * Write eagerly to avoid out-of-block-order writes in first write pass.
+	 * See comments within LogicalTapeWrite().  Note that we don't unset the
+	 * dirty flag here to avoid bothering LogicalTapeWrite() with needing to
+	 * consider partial block writes.
+	 */
+	if (lt->dirty)
+	{
+		TapeBlockSetNBytes(lt->buffer, lt->nbytes);
+		ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
+	}
+}
+
+/*
  * Write to a logical tape.
  *
  * There are no error returns; we ereport() on failure.
@@ -473,8 +506,12 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 			}
 
 			/*
-			 * First allocate the next block, so that we can store it in the
-			 * 'next' pointer of this block.
+			 * Allocate the next block, so that we can store it in the
+			 * 'next' pointer of this block.  Care must be taken to ensure
+			 * that the allocated block is written out before another call
+			 * to ltsGetFreeBlock(), which is why caller is required to call
+			 * LogicalTapeEndWriting() later (this writes out the block
+			 * allocated on last call here for tape's round of writes).
 			 */
 			nextBlockNumber = ltsGetFreeBlock(lts);
 
@@ -557,6 +594,7 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
 		{
 			TapeBlockSetNBytes(lt->buffer, lt->nbytes);
 			ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
+			lt->dirty = false;
 		}
 		lt->writing = false;
 	}
@@ -591,9 +629,9 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
  * Rewind logical tape and switch from reading to writing.
  *
  * NOTE: we assume the caller has read the tape to the end; otherwise
- * untouched data and indirect blocks will not have been freed. We
- * could add more code to free any unread blocks, but in current usage
- * of this module it'd be useless code.
+ * untouched data will not have been freed. We could add more code to
+ * free any unread blocks, but in current usage of this module it'd be
+ * useless code.
  */
 void
 LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum)
@@ -686,7 +724,7 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	{
 		TapeBlockSetNBytes(lt->buffer, lt->nbytes);
 		ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
-		lt->writing = false;
+		lt->dirty = false;
 	}
 	lt->writing = false;
 	lt->frozen = true;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cbaf009..c166e5d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3542,6 +3542,7 @@ markrunend(Tuplesortstate *state, int tapenum)
 	unsigned int len = 0;
 
 	LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
+	LogicalTapeEndWriting(state->tapeset, tapenum);
 }
 
 /*
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index e802c4b..54e3542 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -29,6 +29,7 @@ extern void LogicalTapeSetClose(LogicalTapeSet *lts);
 extern void LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts);
 extern size_t LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 				void *ptr, size_t size);
+extern void LogicalTapeEndWriting(LogicalTapeSet *lts, int tapenum);
 extern void LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 				 void *ptr, size_t size);
 extern void LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum,
-- 
2.7.4

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to