On 09/15/2014 06:28 PM, Alexander Korotkov wrote:
Hackers,

some GIN opclasses uses collation-aware comparisons while they don't need
to do especially collation-aware comparison. Examples are text[] and hstore
opclasses.

Hmm. It would be nice to use the index for inequality searches, at least on text[]. We don't support that currently, but it would require collation-awareness.

Depending on collation this may make them a much slower.

See example.

# show lc_collate ;
  lc_collate
─────────────
  ru_RU.UTF-8
(1 row)

# create table test as (select array_agg(i::text) from
generate_series(1,1000000) i group by (i-1)/10);
SELECT 100000

# create index test_idx on test using gin(array_agg);
CREATE INDEX
Time: *26930,423 ms*

# create index test_idx2 on test using gin(array_agg collate "C");
CREATE INDEX
Time: *5143,682 ms*

Index creation with collation "ru_RU.UTF-8" is 5 times slower while
collation has absolutely no effect on index functionality.

It occurs to me that practically all of those comparisons happen when we populate the red-black Tree, during the index build. The purpose of the red-black tree is to collect identical keys together, but there is actually no requirement that the order of the red-black tree matches the order of the index. It also isn't strictly required that it recognizes equal keys as equal. The only requirement is that it doesn't incorrectly put two keys that are equal according to the compare-function, into two different nodes.

We could therefore use plain memcmp() to compare the Datums while building the red-black tree. Keys that are bit-wise equal are surely considered as equal by the compare-function. That makes the index build a lot faster. With the attached quick patch:

postgres=# create index test_idx on test using gin(array_agg );
CREATE INDEX
Time: 880.620 ms

This is on my laptop. Without the patch, that takes about 4.7 seconds with the C locale, so this is much faster than even using the C locale.

- Heikki
diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index 3af0271..02d8600 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -22,6 +22,13 @@
 #define DEF_NENTRY	2048		/* GinEntryAccumulator allocation quantum */
 #define DEF_NPTR	5			/* ItemPointer initial allocation quantum */
 
+static int fastCompareEntries(GinState *ginstate, OffsetNumber attnum,
+				  Datum a, GinNullCategory categorya,
+				  Datum b, GinNullCategory categoryb);
+static int fastCompareAttEntries(GinState *ginstate,
+					 OffsetNumber attnuma, Datum a, GinNullCategory categorya,
+					 OffsetNumber attnumb, Datum b, GinNullCategory categoryb);
+static int fastCmpDatums(int16 attlen, bool attbyval, Datum a, Datum b);
 
 /* Combiner function for rbtree.c */
 static void
@@ -67,9 +74,104 @@ cmpEntryAccumulator(const RBNode *a, const RBNode *b, void *arg)
 	const GinEntryAccumulator *eb = (const GinEntryAccumulator *) b;
 	BuildAccumulator *accum = (BuildAccumulator *) arg;
 
-	return ginCompareAttEntries(accum->ginstate,
-								ea->attnum, ea->key, ea->category,
-								eb->attnum, eb->key, eb->category);
+	return fastCompareAttEntries(accum->ginstate,
+								 ea->attnum, ea->key, ea->category,
+								 eb->attnum, eb->key, eb->category);
+}
+
+
+/*
+ * Compare two keys of the same index column.
+ *
+ * This is like ginCompareEntries, but uses memcmp() for the comparison instead
+ * of the real compare function.
+ */
+static int
+fastCompareEntries(GinState *ginstate, OffsetNumber attnum,
+				  Datum a, GinNullCategory categorya,
+				  Datum b, GinNullCategory categoryb)
+{
+	/* if not of same null category, sort by that first */
+	if (categorya != categoryb)
+		return (categorya < categoryb) ? -1 : 1;
+
+	/* all null items in same category are equal */
+	if (categorya != GIN_CAT_NORM_KEY)
+		return 0;
+
+	/* both not null, so compare the Datums */
+	return fastCmpDatums(ginstate->origTupdesc->attrs[attnum - 1]->attlen,
+						 ginstate->origTupdesc->attrs[attnum - 1]->attbyval,
+						 a, b);
+}
+
+
+/*
+ * Compare two keys of possibly different index columns.
+ *
+ * This is like ginCompareAttEntries, but uses memcmp() for the comparison
+ * instead of the real compare function.
+ */
+static int
+fastCompareAttEntries(GinState *ginstate,
+					 OffsetNumber attnuma, Datum a, GinNullCategory categorya,
+					 OffsetNumber attnumb, Datum b, GinNullCategory categoryb)
+{
+	/* attribute number is the first sort key */
+	if (attnuma != attnumb)
+		return (attnuma < attnumb) ? -1 : 1;
+
+	return fastCompareEntries(ginstate, attnuma, a, categorya, b, categoryb);
+}
+
+/*
+ * Compare two Datums. This is like datumIsEqual, but returns -1, 0, or 1,
+ * rather than just a boolean.
+ *
+ * XXX: This assumes that the arguments are not toasted. Is that safe?
+ */
+static int
+fastCmpDatums(int16 typLen, bool typByVal, Datum value1, Datum value2)
+{
+	int			res;
+
+	if (typByVal)
+	{
+		/*
+		 * just compare the two datums. NOTE: just comparing "len" bytes will
+		 * not do the work, because we do not know how these bytes are aligned
+		 * inside the "Datum".  We assume instead that any given datatype is
+		 * consistent about how it fills extraneous bits in the Datum.
+		 */
+		if (value1 > value2)
+			return 1;
+		else if (value1 < value2)
+			return -1;
+		else
+			return 0;
+	}
+	else
+	{
+		Size		size1,
+					size2;
+		char	   *s1,
+				   *s2;
+
+		/*
+		 * Compare the bytes pointed by the pointers stored in the datums.
+		 */
+		size1 = datumGetSize(value1, typByVal, typLen);
+		size2 = datumGetSize(value2, typByVal, typLen);
+		if (size1 > size2)
+			return 1;
+		else if (size1 < size2)
+			return -1;
+
+		s1 = (char *) DatumGetPointer(value1);
+		s2 = (char *) DatumGetPointer(value2);
+		res = memcmp(s1, s2, size1);
+	}
+	return res;
 }
 
 /* Allocator function for rbtree.c */
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to