Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

Peter Geoghegan Tue, 20 Jan 2015 17:40:01 -0800

On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I was assuming we were going to fix this by undoing the abbreviation
> (as in the abort case) when we spill to disk, and not bothering with
> it thereafter.


The spill-to-disk case is at least as compelling at the internal sort
case. The overhead of comparisons is much higher for tapesort.

Attached patch serializes keys. On reflection, I'm inclined to go with
this approach. Even if the CPU overhead of reconstructing strxfrm()
blobs is acceptable for text, it might be much more expensive for
other types. I'm loathe to throw away those abbreviated keys
unnecessarily.

We don't have to worry about having aborted abbreviation, since once
we spill to disk we've effectively committed to abbreviation. This
patch formalizes the idea that there is strictly a pass-by-value
representation required for such cases (but not that the original
Datums must be of a pass-by-reference, which is another thing
entirely). I've tested it some, obviously with Andrew's testcase and
the regression tests, but also with my B-Tree verification tool.
Please review it.

Sorry about this.
-- 
Peter Geoghegan

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6d3aa88..a442617 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -413,10 +413,13 @@ struct Tuplesortstate
  *
  * If state->randomAccess is true, then the stored representation of the
  * tuple must be followed by another "unsigned int" that is a copy of the
- * length --- so the total tape space used is actually sizeof(unsigned int)
- * more than the stored length value.  This allows read-backwards.  When
- * randomAccess is not true, the write/read routines may omit the extra
- * length word.
+ * length (less any abbreviated key that is stored, which is also possible) ---
+ * so the total tape space used is actually sizeof(unsigned int) more than the
+ * stored length value, as well as another sizeof(Datum) overhead for storing
+ * abbreviated keys.  This allows read-backwards, and avoids the need to
+ * re-compute abbreviated keys.  When randomAccess is not true, the write/read
+ * routines may omit the extra length word.  Also, when abbreviation is not in
+ * play, abbreviated keys are not stored.
  *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
@@ -3124,11 +3127,15 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;
 
-	/* total on-disk footprint: */
+	/* total on-disk footprint for tuple: */
 	unsigned int tuplen = tupbodylen + sizeof(int);
 
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 (void *) &tuplen, sizeof(tuplen));
+	/* Store abbreviated key, if any (not accounted for by tuplen) */
+	if (state->sortKeys->abbrev_converter)
+		LogicalTapeWrite(state->tapeset, tapenum,
+						 (void *) &stup->datum1, sizeof(Datum));
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 (void *) tupbody, tupbodylen);
 	if (state->randomAccess)	/* need trailing length word? */
@@ -3150,6 +3157,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	HeapTupleData htup;
 
 	USEMEM(state, GetMemoryChunkSpace(tuple));
+	/* read in the tuple abbreviated key, if any */
+	if (state->sortKeys->abbrev_converter)
+		LogicalTapeReadExact(state->tapeset, tapenum, (void *) &stup->datum1,
+							 sizeof(Datum));
 	/* read in the tuple proper */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
@@ -3161,10 +3172,12 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	/* set up first-column key value */
 	htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
 	htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
-	stup->datum1 = heap_getattr(&htup,
-								state->sortKeys[0].ssup_attno,
-								state->tupDesc,
-								&stup->isnull1);
+	/* set up first-column key value (for non-abbreviated case) */
+	if (!state->sortKeys->abbrev_converter)
+		stup->datum1 = heap_getattr(&htup,
+									state->sortKeys[0].ssup_attno,
+									state->tupDesc,
+									&stup->isnull1);
 }
 
 /*
@@ -3355,12 +3368,20 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 {
 	HeapTuple	tuple = (HeapTuple) stup->tuple;
 	unsigned int tuplen = tuple->t_len + sizeof(ItemPointerData) + sizeof(int);
+	AttrNumber	leading = state->indexInfo->ii_KeyAttrNumbers[0];
 
-	/* We need to store t_self, but not other fields of HeapTupleData */
+	/*
+	 * We need to store t_self, but not other fields of HeapTupleData (although
+	 * possibly abbreviated key value)
+	 */
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 &tuplen, sizeof(tuplen));
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 &tuple->t_self, sizeof(ItemPointerData));
+	/* Store abbreviated key, if any (not accounted for by tuplen) */
+	if (state->sortKeys->abbrev_converter && leading != 0)
+		LogicalTapeWrite(state->tapeset, tapenum,
+						 (void *) &stup->datum1, sizeof(Datum));
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 tuple->t_data, tuple->t_len);
 	if (state->randomAccess)	/* need trailing length word? */
@@ -3377,6 +3398,7 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
+	AttrNumber	leading = state->indexInfo->ii_KeyAttrNumbers[0];
 
 	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* Reconstruct the HeapTupleData header */
@@ -3386,6 +3408,10 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 						 &tuple->t_self, sizeof(ItemPointerData));
 	/* We don't currently bother to reconstruct t_tableOid */
 	tuple->t_tableOid = InvalidOid;
+	/* read in the tuple abbreviated key, if any */
+	if (state->sortKeys->abbrev_converter && leading != 0)
+		LogicalTapeReadExact(state->tapeset, tapenum, (void *) &stup->datum1,
+							 sizeof(Datum));
 	/* Read in the tuple body */
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple->t_data, tuple->t_len);
@@ -3393,8 +3419,11 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 &tuplen, sizeof(tuplen));
 	stup->tuple = (void *) tuple;
-	/* set up first-column key value, if it's a simple column */
-	if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
+	/*
+	 * set up first-column key value, if it's a simple column, and this wasn't
+	 * already read in as an abbreviated key
+	 */
+	if (!state->sortKeys->abbrev_converter && leading != 0)
 		stup->datum1 = heap_getattr(tuple,
 									state->indexInfo->ii_KeyAttrNumbers[0],
 									state->tupDesc,
@@ -3655,9 +3684,14 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	IndexTuple	tuple = (IndexTuple) stup->tuple;
 	unsigned int tuplen;
 
+	/* tuplen does not include abbreviated key (if any) */
 	tuplen = IndexTupleSize(tuple) + sizeof(tuplen);
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 (void *) &tuplen, sizeof(tuplen));
+	/* Store abbreviated key, if any (not accounted for by tuplen) */
+	if (state->sortKeys->abbrev_converter)
+		LogicalTapeWrite(state->tapeset, tapenum,
+						 (void *) &stup->datum1, sizeof(Datum));
 	LogicalTapeWrite(state->tapeset, tapenum,
 					 (void *) tuple, IndexTupleSize(tuple));
 	if (state->randomAccess)	/* need trailing length word? */
@@ -3676,17 +3710,22 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 	IndexTuple	tuple = (IndexTuple) palloc(tuplen);
 
 	USEMEM(state, GetMemoryChunkSpace(tuple));
+	/* read in the tuple abbreviated key, if any */
+	if (state->sortKeys->abbrev_converter)
+		LogicalTapeReadExact(state->tapeset, tapenum, (void *) &stup->datum1,
+							 sizeof(Datum));
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
 	if (state->randomAccess)	/* need trailing length word? */
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 &tuplen, sizeof(tuplen));
 	stup->tuple = (void *) tuple;
-	/* set up first-column key value */
-	stup->datum1 = index_getattr(tuple,
-								 1,
-								 RelationGetDescr(state->indexRel),
-								 &stup->isnull1);
+	/* set up first-column key value (for non-abbreviated case) */
+	if (!state->sortKeys->abbrev_converter)
+		stup->datum1 = index_getattr(tuple,
+									 1,
+									 RelationGetDescr(state->indexRel),
+									 &stup->isnull1);
 }
 
 /*
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 62fedfa..b295fcd 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -117,17 +117,18 @@ typedef struct SortSupportData
 	 * This allows opclass authors to supply a conversion routine, used to
 	 * create an alternative representation of the underlying type (an
 	 * "abbreviated key").  Typically, this representation is an ad-hoc,
-	 * pass-by-value Datum format that only the opclass has knowledge of.  An
-	 * alternative comparator, used only with this alternative representation
-	 * must also be provided (which is assigned to "comparator").  This
-	 * representation is a simple approximation of the original Datum.  It must
-	 * be possible to compare datums of this representation with each other
-	 * using the supplied alternative comparator, and have any non-zero return
-	 * value be a reliable proxy for what a proper comparison would indicate.
-	 * Returning zero from the alternative comparator does not indicate
-	 * equality, as with a conventional support routine 1, though -- it
-	 * indicates that it wasn't possible to determine how the two abbreviated
-	 * values compared.  A proper comparison, using "abbrev_full_comparator"/
+	 * pass-by-value Datum format that only the opclass has knowledge of (the
+	 * representation must be pass-by-value).  An alternative comparator, used
+	 * only with this alternative representation must also be provided (which
+	 * is assigned to "comparator").  This representation is a simple
+	 * approximation of the original Datum.  It must be possible to compare
+	 * datums of this representation with each other using the supplied
+	 * alternative comparator, and have any non-zero return value be a reliable
+	 * proxy for what a proper comparison would indicate.  Returning zero from
+	 * the alternative comparator does not indicate equality, as with a
+	 * conventional support routine 1, though -- it indicates that it wasn't
+	 * possible to determine how the two abbreviated values compared.  A proper
+	 * comparison, using "abbrev_full_comparator"/
 	 * ApplySortAbbrevFullComparator() is therefore required.  In many cases
 	 * this results in most or all comparisons only using the cheap alternative
 	 * comparison func, which is typically implemented as code that compiles to

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

Reply via email to