Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

Peter Geoghegan Sat, 13 Sep 2014 13:29:58 -0700

On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Based on discussion thus far it seems that there's a possibility that
> the trade-off may be different for short strings vs. long strings.  If
> the string is small enough to fit in the L1 CPU cache, then it may be
> that memcmp() followed by strcoll() is not much more expensive than
> strcoll().  That should be easy to figure out: write a standalone C
> program that creates a bunch of arbitrary, fairly-short strings, say
> 32 bytes, in a big array.  Make the strings different near the end,
> but not at the beginning.  Then have the program either do strcoll()
> on every pair (O(n^2)) or, with a #define, memcmp() followed by
> strcoll() on every pair.  It should be easy to see whether the
> memcmp() adds 1% or 25% or 100%.


I get where you're coming from now (I think). It seems like you're
interested in getting a sense of what it would cost to do an
opportunistic memcmp() in a universe where memory latency isn't by far
the dominant cost (which it clearly is, as evidenced by my most recent
benchmark where a variant of Heikki's highly unsympathetic SQL test
case showed a ~1% regression). You've described a case with totally
predictable access patterns, perfectly suited to prefetching, and with
no other memory access bottlenecks. As I've said, this seems
reasonable (at least with those caveats in mind).

The answer to your high level question appears to be: the
implementation (the totally opportunistic "memcmp() == 0" thing)
benefits from the fact that we're always bottlenecked on memory, and
to a fairly appreciable degree. I highly doubt that this is something
that can fail to work out with real SQL queries, but anything that
furthers our understanding of the optimization is a good thing. Of
course, within tuplesort we're not really getting the totally
opportunistic memcmp()s "for free" - rather, we're using a resource
that we would not otherwise be able to use at all.

This graph illustrates the historic trends of CPU and memory performance:

http://www.cs.virginia.edu/stream/stream_logo.gif

I find this imbalance quite remarkable - no wonder researchers are
finding ways to make the best of the situation. To make matters worse,
the per-core trends for memory bandwidth are now actually *negative
growth*. We may actually be going backwards, if we assume that that's
the bottleneck, and that we cannot parallelize things.

Anyway, attached rough test program implements what you outline. This
is for 30,000 32 byte strings (where just the final two bytes differ).
On my laptop, output looks like this (edited to only show median
duration in each case):

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 13.445506 seconds

duration of comparisons with useless memcmp()s: 17.720501 seconds
"""

It looks like about a 30% increase in run time when we always have
useless memcmps().

You can change the constants around easily - let's consider 64 KiB
strings now (by changing STRING_SIZE to 65536). In order to make the
program not take too long, I also reduce the number of strings
(N_STRINGS) from 30,000 to 1,000. If I do so, this is what I see as
output:

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 11.205683 seconds

duration of comparisons with useless memcmp()s: 14.542997 seconds
"""

It looks like the overhead here is surprisingly consistent with the
first run - again, about a 30% increase in run time.

As for 1MiB strings (this time, with an N_STRINGS of 300):

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 23.624183 seconds

duration of comparisons with useless memcmp()s: 35.332728 seconds
"""

So, at this scale, the overhead gets quite a bit worse, but the case
also becomes quite a bit less representative (if that's even
possible). I suspect that the call stack's stupidly large size may be
a problem here, but I didn't try and fix that.

Does this answer your question? Are you intent on extrapolating across
different platforms based on this test program, rather than looking at
real SQL queries? While a certain amount of that makes sense, I think
we should focus on queries that have some chance of being seen in real
production PostgreSQL instances. Failing that, actual SQL queries.

-- 
Peter Geoghegan

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* STRING_SIZE does not include NUL byte */
#define STRING_SIZE		32
#define N_STRINGS		30000
#define LAST_N_DIFFER	2

#define INSTR_TIME_SUBTRACT(x,y) \
	do { \
		(x).tv_sec -= (y).tv_sec; \
		(x).tv_usec -= (y).tv_usec; \
		/* Normalize */ \
		while ((x).tv_usec < 0) \
		{ \
			(x).tv_usec += 1000000; \
			(x).tv_sec--; \
		} \
	} while (0)

#define INSTR_TIME_GET_DOUBLE(t) \
	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 1000000.0)

#define Min(x, y)		((x) < (y) ? (x) : (y))

/*
 * Generate single random alphabetical ASCII char.  Uses an unseeded rand()
 * call, to ensure inter-run determinism.
 */
static inline char
generate_prandom_printable_char(void)
{
	for (;;)
	{
		char crand = rand() & 0x7F;

		if ((crand >= 'A' && crand <= 'Z') || (crand >= 'a' && crand <= 'z'))
			return crand;
	}
}

/*
 * Generate a random "payload" string.  The returned string is pre-calcuated
 * once, to form the bulk of each distinct string.
 */
static inline char *
generate_prandom_payload()
{
	char   *payload = malloc(STRING_SIZE + 1);
	int		i;

	/*
	 * The final LAST_N_DIFFER bytes will ultimately be clobbered with distinct
	 * bytes, but that occurs separately, per string.
	 */
	for (i = 0; i < STRING_SIZE; i++)
		payload[i] = generate_prandom_printable_char();

	/*
	 * Add NUL here, even though complete strings are not NULL terminated (or,
	 * rather, are copied into a buffer on the stack in order to add a NUL once
	 * per comparison)
	 */
	payload[STRING_SIZE] = '\0';

	return payload;
}

int
main(int argc, const char *argv[])
{
	char  **strings = malloc(N_STRINGS * sizeof(char *));
	char   *payload = generate_prandom_payload();
	int		i, j;

	/* Initialize strings */
	for (i = 0; i < N_STRINGS; i++)
	{
		int		j;

		strings[i] = malloc(STRING_SIZE);

		memcpy(strings[i], payload, STRING_SIZE);

		/* Last LAST_N_DIFFER characters randomly vary */
		for (j = LAST_N_DIFFER; j > 0; j--)
		{
			char 	n_last = generate_prandom_printable_char();

			strings[i][STRING_SIZE - j] = n_last;
		}

		/* Don't terminate -- no room in buffer */
	}

	printf("Strings generated - beginning tests\n");

	for (;;)
	{
		struct timeval before, after;

		/*
		 ******************************************************************
		 * Baseline -- no wasted memcmp().
		 *
		 * Compare each string to each other string using only strcoll()
		 ******************************************************************
		 */
		gettimeofday(&before, NULL);
		for (i = 0; i < N_STRINGS; i++)
		{
			char   *str = strings[i];

			/* Quadratic string comparison */
			for (j = 0; j < N_STRINGS; j++)
			{
				int		cmp;
				char	buf1[STRING_SIZE + 1];
				char 	buf2[STRING_SIZE + 1];
				char   *other;

				other = strings[j];

				memcpy(buf1, str, STRING_SIZE);
				memcpy(buf2, other, STRING_SIZE);

				buf1[STRING_SIZE] = '\0';
				buf2[STRING_SIZE] = '\0';

				cmp = strcoll(buf1, buf2);

				/*
				 * memcmp() tie-breaker (equivalent to the one that currently
				 * appears within varstr_cmp()) isn't considered.  It's only
				 * relevant to a small number of locales, like hu_HU, where
				 * strcoll() might (rarely) indicate equality in respect of a
				 * pair of non-identical strings.  If strcoll() did return 0,
				 * then for most locales it's certain that the first memcmp()
				 * would have worked out.
				 */
			}
		}
		/* Report no-wasted-memcmp case duration */
		gettimeofday(&after, NULL);
		INSTR_TIME_SUBTRACT(after, before);
		printf("(baseline) duration of comparisons without useless memcmp()s: %f seconds\n\n",
			   INSTR_TIME_GET_DOUBLE(after));

		/*
		 ******************************************************************
		 * Compare each string to each other string using useless memcmp(),
		 * followed by tie-breaker strcoll()
		 ******************************************************************
		 */
		gettimeofday(&before, NULL);
		for (i = 0; i < N_STRINGS; i++)
		{
			char   *str = strings[i];

			/* Quadratic string comparison */
			for (j = 0; j < N_STRINGS; j++)
			{
				int		cmp;
				char	buf1[STRING_SIZE + 1];
				char 	buf2[STRING_SIZE + 1];
				char   *other;

				other = strings[j];

				/*
				 * waste some cycles.
				 *
				 * Don't give the implementation the benefit of sometimes
				 * getting away with a memcmp(), even though ordinarily that's
				 * perfectly representative/legitimate.
				 *
				 * Need to NUL terminate each buffer.
				 */
				cmp = memcmp(str, other, STRING_SIZE);

				memcpy(buf1, str, STRING_SIZE);
				memcpy(buf2, other, STRING_SIZE);

				buf1[STRING_SIZE] = '\0';
				buf2[STRING_SIZE] = '\0';

				cmp = strcoll(buf1, buf2);
			}
		}
		/* Report wasted-memcmp case duration */
		gettimeofday(&after, NULL);
		INSTR_TIME_SUBTRACT(after, before);
		printf("duration of comparisons with useless memcmp()s: %f seconds\n\n",
			   INSTR_TIME_GET_DOUBLE(after));

	}	/* Loop forever */

	return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

Reply via email to