On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> My test cases are:

Note that my text caching and unsigned integer comparison patches have
moved the baseline down quite noticeably. I think that my mobile
processor out-performs the Xeon you used for this, which seems a
little odd even taking the change in baseline performance into account.

> set work_mem ='1GB';
> create table t1 as select md5(random()::text) from
> generate_series(1,10000000);
>
> Times are in milliseconds. Median and average over 10 runs.
>
> Test 1
> select count(distinct md5) from t1;
>
>               Master     Patched
> Median     10,965.77   10,986.30 (99.81%)
> Average    10,983.63   11,013.55 (99.73%)

> Are you seeing any speedup from any of these on your hardware?

I gather that 10,965.77 here means 10,965 milliseconds, since that
roughly matches what I get.

For the sake of simplicity, I will focus on your test 1 as a baseline.
Note that I ran VACUUM FREEZE before any testing was performed, just
in case.

On my laptop:

postgres=# \dt+ t1
                  List of relations
 Schema | Name | Type  | Owner |  Size  | Description
--------+------+-------+-------+--------+-------------
 public | t1   | table | pg    | 651 MB |
(1 row)

Times for "Test 1", "select count(distinct md5) from t1":

Patch:

Time: 10076.870 ms
Time: 10094.873 ms
Time: 10125.253 ms  <-- median
Time: 10222.042 ms
Time: 10269.247 ms

Master:

Time: 10641.142 ms
Time: 10706.181 ms
Time: 10708.860 ms  <-- median
Time: 10725.426 ms
Time: 10781.398 ms

So, to answer your question: Yes, I can see a benefit for this query
on my test hardware (laptop), although it is not spectacular. It may
still be quite worthwhile.

I attach a revised version of the patch tested here, following
feedback from Andres. This should not make any difference to the
performance.

It's worth considering that, for some (possibly legitimate) reason,
the built-in function call is being ignored by your compiler; GCC has
license to do that. You might try this on both master and patched
builds:

~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
-ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt

Then diff the prefetch_disassembly.txt files produced for each build
to see what the differences are in practice. Consider an extract of
the output on my system:

...
   0x00000000000028ee <+926>: callq  0x28f3 <tuplesort_gettuple_common+931>
   0x00000000000028f3 <+931>: nopl   0x0(%rax,%rax,1)
   0x00000000000028f8 <+936>: sub    $0x1,%eax
   0x00000000000028fb <+939>: test   %eax,%eax
   0x00000000000028fd <+941>: mov    %eax,0xd0(%rdi)
   0x0000000000002903 <+947>: jne    0x25ce <tuplesort_gettuple_common+126>
   0x0000000000002909 <+953>: jmpq   0x2710 <tuplesort_gettuple_common+448>
   0x000000000000290e <+958>: xchg   %ax,%ax
   0x0000000000002910 <+960>: mov    0x58(%rdi),%rsi
   0x0000000000002914 <+964>: lea    (%rax,%rax,2),%rax
   0x0000000000002918 <+968>: lea    (%rsi,%rax,8),%rax
   0x000000000000291c <+972>: mov    0x30(%rax),%rax
   0x0000000000002920 <+976>: prefetchnta (%rax)
   0x0000000000002923 <+979>: mov    $0x1,%eax
   0x0000000000002928 <+984>: jmpq   0x2712 <tuplesort_gettuple_common+450>
   0x000000000000292d <+989>: nopl   (%rax)
...

Notably, there is a prefetchnta instruction here.

Note that I'm going away on vacation in about a week. I wanted to give
people feedback on various things before then, since it was overdue.
FYI, after Thursday I will be very unlikely to answer e-mail for a
couple of weeks.

-- 
Peter Geoghegan
From ed1cf0675470c3c198ff140f71f3a40adcd4d02a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sorting intense
operations considerably.  For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(i.e.  The operation following actually "performing" the sort, but
before a tuplesort_end() call is made as a B-Tree spool is destroyed.)

Similarly, ordered set aggregates (all cases except the datumsort case
with a pass-by-value type), and regular heap tuplesorts benefit to about
the same degree.  The optimization is only used when sorts fit in
memory, though.

Also, prefetch a few places ahead within the analogous "fetching" point
in tuplestore.c.  This appears to offer similar benefits in certain
cases.  For example, queries involving large common table expressions
significantly benefit.
---
 config/c-compiler.m4                | 17 +++++++++++++++++
 configure                           | 31 +++++++++++++++++++++++++++++++
 configure.in                        |  1 +
 src/backend/utils/sort/tuplesort.c  | 21 +++++++++++++++++++++
 src/backend/utils/sort/tuplestore.c | 13 +++++++++++++
 src/include/c.h                     | 14 ++++++++++++++
 src/include/pg_config.h.in          |  3 +++
 src/include/pg_config.h.win32       |  3 +++
 src/include/pg_config_manual.h      | 10 ++++++++++
 9 files changed, 113 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 550d034..8be2122 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,6 +271,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index b771a83..420eb2c 100755
--- a/configure
+++ b/configure
@@ -11338,6 +11338,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index b5868b0..0e25b89 100644
--- a/configure.in
+++ b/configure.in
@@ -1320,6 +1320,7 @@ PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d532e87..7ed0d1f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1670,6 +1670,27 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				if (state->current < state->memtupcount)
 				{
 					*stup = state->memtuples[state->current++];
+
+					/*
+					 * Perform memory prefetch of "tuple proper" of the
+					 * SortTuple that's three places ahead of current
+					 * (which is returned to caller).  Testing shows that
+					 * this significantly boosts the performance for
+					 * TSS_SORTEDINMEM "forward" callers by hiding memory
+					 * latency behind their processing of returned tuples.
+					 *
+					 * Don't do this for pass-by-value datum sorts;  even
+					 * though hinting a NULL address does not affect
+					 * correctness, it would have a noticeable overhead
+					 * here.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (stup->tuple != NULL &&
+						state->current + 2 < state->memtupcount)
+						pg_read_prefetch(
+							state->memtuples[state->current + 2].tuple);
+#endif
+
 					return true;
 				}
 				state->eof_reached = true;
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 51f474d..e9cb599 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -902,6 +902,19 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					return NULL;
 				if (readptr->current < state->memtupcount)
 				{
+					/*
+					 * Perform memory prefetch of tuple that's three places
+					 * ahead of current (which is returned to caller).
+					 * Testing shows that this significantly boosts the
+					 * performance for TSS_INMEM "forward" callers by
+					 * hiding memory latency behind their processing of
+					 * returned tuples.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (readptr->current + 3 < state->memtupcount)
+						pg_read_prefetch(state->memtuples[readptr->current + 3]);
+#endif
+
 					/* We have another tuple, so return it */
 					return state->memtuples[readptr->current++];
 				}
diff --git a/src/include/c.h b/src/include/c.h
index 8163b00..3d05ff8 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -927,6 +927,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_read_prefetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup with this;  any use of the intrinsic
+ * should be carefully tested.  It's okay to pass it an invalid or NULL
+ * address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_read_prefetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index a20e0cd..ad1d17a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -666,6 +666,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 8566065..990c3c3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -514,6 +514,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching.  Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms.  If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation.  (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1
