Re: [GENERAL] varchar as primary key

2007-05-04 Thread Jeff Davis
On Fri, 2007-05-04 at 23:45 +0200, Martijn van Oosterhout wrote:
> On Fri, May 04, 2007 at 03:42:45PM -0400, Tom Lane wrote:
> > It's hardly credible that you could do either strcmp or strcoll in 2 nsec
> > on any run-of-the-mill hardware.  What I think is happening is that the
> > compiler is aware that these are side-effect-free functions and is
> > removing the calls entirely, or at least moving them out of the loops;
> > these times would be credible for loops consisting only of an increment,
> > test, and branch.
> 
> It's not the compiler, it's the C library. strcmp and strcoll are
> defined as:
> 
>   extern int strcoll (__const char *__s1, __const char *__s2)
>  __THROW __attribute_pure__ __nonnull ((1, 2));
> 
> In this context "pure" is essentially what IMMUTABLE is in postgres.
> 
> Which doesn't change the fact that strcoll is expensive.
> 

Thanks for clearing that up. I should have done a sanity check on those
numbers to begin with.

By the way, I didn't see the warning Tom mentioned using -Wall on either
system.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] varchar as primary key

2007-05-04 Thread Martijn van Oosterhout
On Fri, May 04, 2007 at 03:42:45PM -0400, Tom Lane wrote:
> It's hardly credible that you could do either strcmp or strcoll in 2 nsec
> on any run-of-the-mill hardware.  What I think is happening is that the
> compiler is aware that these are side-effect-free functions and is
> removing the calls entirely, or at least moving them out of the loops;
> these times would be credible for loops consisting only of an increment,
> test, and branch.

It's not the compiler, it's the C library. strcmp and strcoll are
defined as:

  extern int strcoll (__const char *__s1, __const char *__s2)
 __THROW __attribute_pure__ __nonnull ((1, 2));

In this context "pure" is essentially what IMMUTABLE is in postgres.

Which doesn't change the fact that strcoll is expensive.

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.


signature.asc
Description: Digital signature


Re: [GENERAL] varchar as primary key

2007-05-04 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> $ ./cmp
> locale set to: en_US.UTF-8
> strcmp time elapsed:  2034183 us
> strcoll time elapsed: 2019880 us

It's hardly credible that you could do either strcmp or strcoll in 2 nsec
on any run-of-the-mill hardware.  What I think is happening is that the
compiler is aware that these are side-effect-free functions and is
removing the calls entirely, or at least moving them out of the loops;
these times would be credible for loops consisting only of an increment,
test, and branch.

Integer overflow in your elapsed-time calculation is probably a risk
as well --- do the reports add up to something like the actual elapsed
time?

I tried a modified form of your program (attached) on an FC6 machine
and found that at any optimization level above -O0, that compiler
optimizes the strcmp() case into oblivion, even with code added as below
to try to make it look like a real operation.  The strcoll() call without
any following test, as you had, draws a warning about "statement with
no effect" which is pretty suspicious too.  With the modified program
I get

$ gcc -O1 -Wall cmptest.c
$ time ./a.out
locale set to: en_US.UTF-8
strcmp time elapsed:  0 us
strcoll time elapsed: 67756363 us

real1m7.758s
user1m7.746s
sys 0m0.006s

$ gcc -O0 -Wall cmptest.c
$ time ./a.out
locale set to: en_US.UTF-8
strcmp time elapsed:  4825504 us
strcoll time elapsed: 68864890 us

real1m13.692s
user1m13.676s
sys 0m0.010s

So as best I can tell, strcoll() is pretty dang expensive on Linux too.

regards, tom lane

#include 
#include 
#include 
#include 
#include 

#define ITERATIONS 1
#define THE_LOCALE "en_US.UTF-8"

int main(int argc, char *argv[]) {
int i;
char *str1 = "abcdefghijklmnop1";
char *str2 = "abcdefghijklmnop2";
char *newlocale;
struct timeval t1,t2,t3;
double elapsed_strcmp,elapsed_strcoll;

if( (newlocale = setlocale(LC_ALL,THE_LOCALE)) == NULL ) {
printf("error setting locale!\n");
exit(1);
} 
else {
printf("locale set to: %s\n",newlocale);
}

gettimeofday(&t1,NULL);
for(i=0; i < ITERATIONS; i++) {
if (strcmp(str1,str2) == 0)
printf("unexpected equality\n");
}
gettimeofday(&t2,NULL);
for(i=0; i < ITERATIONS; i++) {
if (strcoll(str1,str2) == 0)
printf("unexpected equality\n");
}
gettimeofday(&t3,NULL);
elapsed_strcmp = (t2.tv_sec * 100.0 + t2.tv_usec) - (t1.tv_sec * 
100.0 + t1.tv_usec);
elapsed_strcoll = (t3.tv_sec * 100.0 + t3.tv_usec) - (t2.tv_sec * 
100.0 + t2.tv_usec);
printf("strcmp time elapsed:  %.0f us\n",elapsed_strcmp);
printf("strcoll time elapsed: %.0f us\n",elapsed_strcoll);

return 0;
}

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] varchar as primary key

2007-05-04 Thread Jeff Davis
On Fri, 2007-05-04 at 13:52 -0400, Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> > I used strcmp() and strcoll() in a tight loop, and the result was
> > indistinguishable.
> 
> That's not particularly credible ... were you testing this in a
> standalone test program?  If so, did you remember to do setlocale()
> first?  Without that, you'll be in C locale regardless of environment
> contents.

I have attached a revised cmp.c that includes some extra checks. It
looks like the locale is being set correctly and still I don't see a
difference.


$ gcc --version
gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
$ uname -a
_ 2.6.9-34.ELsmp #1 SMP Wed Mar 8 00:27:03 CST 2006 i686
i686 i386 GNU/Linux
$ ./cmp
locale set to: en_US.UTF-8
strcmp time elapsed:  2034183 us
strcoll time elapsed: 2019880 us


If I had to guess, I'd say maybe strcoll() optimizes the simple cases
somehow.

[ checks FreeBSD ... ]
On FreeBSD, it's a different story. strcoll() really hurts there
(painfully so). I'm glad you pointed that out, because I have my
production boxes on FreeBSD.

Regards,
Jeff Davis
#include 
#include 
#include 
#include 

#define ITERATIONS 10
#define THE_LOCALE "en_US.UTF-8"

int main(int argc, char *argv[]) {
	int i;
	char buff11[256];
	char buff12[256];
	char *buff21;
	char *buff22;
	char *str1 = "abcdefghijklmnop1";
	char *str2 = "abcdefghijklmnop2";
	char *newlocale;
	struct timeval t1,t2,t3;
	int elapsed_strcmp,elapsed_strcoll;

	int len1 = strlen(str1);
	int len2 = strlen(str2);
	if( (newlocale = setlocale(LC_ALL,THE_LOCALE)) == NULL ) {
		printf("error setting locale!\n");
		exit(1);
	} 
	else {
		printf("locale set to: %s\n",newlocale);
	}
	
	gettimeofday(&t1,NULL);
	for(i=0; i < ITERATIONS; i++) {
		strcmp(str1,str2);
	}
	gettimeofday(&t2,NULL);
	for(i=0; i < ITERATIONS; i++) {
		strcoll(str1,str2);
	}
	gettimeofday(&t3,NULL);
	elapsed_strcmp = (t2.tv_sec * 100 + t2.tv_usec) - (t1.tv_sec * 100 + t1.tv_usec);
	elapsed_strcoll = (t3.tv_sec * 100 + t3.tv_usec) - (t2.tv_sec * 100 + t2.tv_usec);
	printf("strcmp time elapsed:  %d us\n",elapsed_strcmp);
	printf("strcoll time elapsed: %d us\n",elapsed_strcoll);

}

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] varchar as primary key

2007-05-04 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> I used strcmp() and strcoll() in a tight loop, and the result was
> indistinguishable.

That's not particularly credible ... were you testing this in a
standalone test program?  If so, did you remember to do setlocale()
first?  Without that, you'll be in C locale regardless of environment
contents.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] varchar as primary key

2007-05-04 Thread Jeff Davis
On Thu, 2007-05-03 at 23:08 -0400, Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> > If you're using a non-C locale, it's slower than strcmp() too.
> > PostgreSQL has to do an extra memcpy() in order to use strcoll(),
> > because strings in postgresql aren't necessarily NULL-terminated and
> > there's no such thing as strncoll(), unfortunately (a comment in the
> > code points this out).
> 
> The memcpy is the least of the problem --- in many non-C locales,
> strcoll() is simply a dog, because the collation rules are ridiculously
> complex.
> 

I was going by:

src/backend/utils/adt/varlena.c

/*
 * Unfortunately, there is no strncoll(), so in the non-C locale case we
 * have to do some memory copying.  This turns out to be significantly
 * slower, so we optimize the case where LC_COLLATE is C.  We also try
to
 * optimize relatively-short strings by avoiding palloc/pfree overhead.
 */

I tried with some simple C code, and it looks like the memcpy() does
account for a significant slowdown -- at least in my case (en_US.UTF-8).

In my tests, I was just comparing two strings:
"abcdefghijklmnop1"
"abcdefghijklmnop2"
(which seem to be "normal" length for a string that might be sorted)

I used strcmp() and strcoll() in a tight loop, and the result was
indistinguishable. However, when I added in two memcpy()s -- which are
necessary for any non-C locale -- it slowed down drastically (an order
of magnitude).

I also made two test data directories on my workstation, one C and one
UTF-8, and then I made a table in each consisting of 1M records of md5
(random()). The one with locale C took about 9 seconds to sort, and the
one with en_US.UTF-8 took about 16 seconds to sort.

So, I think for some locales memcpy() is the problem and in others the
memcpy() overhead may be overshadowed by strcoll().

When memcpy() is the problem it is disappointing because it would be
completely unnecessary if only there existed a strncoll()  :(

Regards,
Jeff Davis



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] varchar as primary key

2007-05-04 Thread Merlin Moncure

On 5/3/07, Alexander Staubo <[EMAIL PROTECTED]> wrote:

PostgreSQL uses B-trees for its indexes, insertion time is logarithmic
regardless of the type of the key, but strings have a larger overhead
since they involve character comparisons; (i - j) is a lot faster than
strcmp(i, j). If you do go for strings, I would suggest that the
beginning of the key be statistically distributed as widely as
possible; ie., avoid common prefixes.


I think the performance benefits of i - j over strcmp(i,j) are mostly
irrelevant, locale issues aside.  The main reason why integer keys can
be faster is because the index is smaller and puts less pressure on
cache.  This has to stacked up against the fact you are often hitting
the varchar index anyways for sorting and filtering purposes (swapping
a int for text index is only a guaranteed win if you can drop the text
index completely).  So, by using integers from performance perspective
we are mostly trying to prevent a cache miss (during which time a
computer might perform 100k strcmp operations).  If there is also a
varchar index, and it is used for various queries, it may actually be
faster to drop the integer index altogether because it is competing
with cache resources with the integer index.  Unfortunately, this is
more often the case than not in my experience.

As solid state technologies continue to mature and near zero latency
storage systems become widespread, this advantage will lessen as the
penalty for a cache miss becomes much less.

merlin

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] varchar as primary key

2007-05-03 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> If you're using a non-C locale, it's slower than strcmp() too.
> PostgreSQL has to do an extra memcpy() in order to use strcoll(),
> because strings in postgresql aren't necessarily NULL-terminated and
> there's no such thing as strncoll(), unfortunately (a comment in the
> code points this out).

The memcpy is the least of the problem --- in many non-C locales,
strcoll() is simply a dog, because the collation rules are ridiculously
complex.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] varchar as primary key

2007-05-03 Thread Jeff Davis
On Thu, 2007-05-03 at 22:32 +0200, Alexander Staubo wrote:
> On 5/3/07, Matthew Hixson <[EMAIL PROTECTED]> wrote:
> >Is there a significant performance difference between using int
> > primary keys and string primary keys in Postgres?
> 
> PostgreSQL uses B-trees for its indexes, insertion time is logarithmic
> regardless of the type of the key, but strings have a larger overhead
> since they involve character comparisons; (i - j) is a lot faster than
> strcmp(i, j). If you do go for strings, I would suggest that the

If you're using a non-C locale, it's slower than strcmp() too.
PostgreSQL has to do an extra memcpy() in order to use strcoll(),
because strings in postgresql aren't necessarily NULL-terminated and
there's no such thing as strncoll(), unfortunately (a comment in the
code points this out).

This mostly matters in sorting. If your B-tree is in memory and you do a
simple lookup (what else would you do with UUIDs?), it's unlikely to
take very long at all.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] varchar as primary key

2007-05-03 Thread Dawid Kuroczko

On 5/3/07, Jeff Davis <[EMAIL PROTECTED]> wrote:

On Thu, 2007-05-03 at 08:58 -0700, Matthew Hixson wrote:
> I'm investigating the usage of a UUID primary key generator using
> Hibernate and Postgres.  The reason for using a UUID is that we will
> have an application hosted at different sites in different
> databases.  We will need to aggregate the data back into a single
> database from time to time and we want to avoid PK collisions.
>Is there a significant performance difference between using int
> primary keys and string primary keys in Postgres?

If the only thing you need to do is avoid primary key collisions, why
not just store an extra int that represents the "site ID" and have a
primary key on (the_key, site_id)?

That way you're *sure* no collisions happen. A UUID has a lot of bits,
but if the random generator is flawed a collision can still happen.


Also, why not simply use sequences?
You can declare sequence to START WITH 1 and INCREMENT BY 10,
and on the other site START WITH 2 INCREMENT BY 10.
There is no chance these will collide (unless human intervenes ;)),
and you can safely add 8 more similar servers, each with similar
sequence.

 Regards,
  Dawid

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] varchar as primary key

2007-05-03 Thread Alexander Staubo

On 5/3/07, Matthew Hixson <[EMAIL PROTECTED]> wrote:

   Is there a significant performance difference between using int
primary keys and string primary keys in Postgres?


PostgreSQL uses B-trees for its indexes, insertion time is logarithmic
regardless of the type of the key, but strings have a larger overhead
since they involve character comparisons; (i - j) is a lot faster than
strcmp(i, j). If you do go for strings, I would suggest that the
beginning of the key be statistically distributed as widely as
possible; ie., avoid common prefixes.

Alexander.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] varchar as primary key

2007-05-03 Thread Jeff Davis
On Thu, 2007-05-03 at 08:58 -0700, Matthew Hixson wrote:
> I'm investigating the usage of a UUID primary key generator using  
> Hibernate and Postgres.  The reason for using a UUID is that we will  
> have an application hosted at different sites in different  
> databases.  We will need to aggregate the data back into a single  
> database from time to time and we want to avoid PK collisions.
>Is there a significant performance difference between using int  
> primary keys and string primary keys in Postgres?

If the only thing you need to do is avoid primary key collisions, why
not just store an extra int that represents the "site ID" and have a
primary key on (the_key, site_id)?

That way you're *sure* no collisions happen. A UUID has a lot of bits,
but if the random generator is flawed a collision can still happen.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] varchar as primary key

2007-05-03 Thread Alvaro Herrera
William Garrison wrote:
> I don't recommend it.  There are better ways to store UUIDs:
> 
> char(32)<-- Easy to work with, fixed length, inefficient
> varchar(32) <-- 4 bytes larger due to variable size
> bytea() <-- 20 bytes, variable length
> bit(128)<-- 16 bytes, optimal
> 
> I don't like char() or varchar() because of case-senstivity and 
> inefficiency.
> 
> We used bytea, and created a small function byte2guid() and guid2byte() 
> to handle converting to/from strings when working at a SQL prompt.  But 
> the production code doesn't use those.  In retrospect, I would like to 
> have tried BIT(128) since I think fixed-length columns perform better 
> than variable-length columns.

FWIW, in Postgres, all those types are stored with the word length in
front of each datum.  We don't use the column maximum length as a cue
for the storage of each individual datum.  So the two first items you
list above are stored identically.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] varchar as primary key

2007-05-03 Thread William Garrison

I don't recommend it.  There are better ways to store UUIDs:

char(32)<-- Easy to work with, fixed length, inefficient
varchar(32) <-- 4 bytes larger due to variable size
bytea() <-- 20 bytes, variable length
bit(128)<-- 16 bytes, optimal

I don't like char() or varchar() because of case-senstivity and 
inefficiency.


We used bytea, and created a small function byte2guid() and guid2byte() 
to handle converting to/from strings when working at a SQL prompt.  But 
the production code doesn't use those.  In retrospect, I would like to 
have tried BIT(128) since I think fixed-length columns perform better 
than variable-length columns.


Matthew Hixson wrote:
I'm investigating the usage of a UUID primary key generator using 
Hibernate and Postgres.  The reason for using a UUID is that we will 
have an application hosted at different sites in different databases.  
We will need to aggregate the data back into a single database from time 
to time and we want to avoid PK collisions.
  Is there a significant performance difference between using int 
primary keys and string primary keys in Postgres?

  Thanks,
   -M@

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq




---(end of broadcast)---
TIP 6: explain analyze is your friend


[GENERAL] varchar as primary key

2007-05-03 Thread Matthew Hixson
I'm investigating the usage of a UUID primary key generator using  
Hibernate and Postgres.  The reason for using a UUID is that we will  
have an application hosted at different sites in different  
databases.  We will need to aggregate the data back into a single  
database from time to time and we want to avoid PK collisions.
  Is there a significant performance difference between using int  
primary keys and string primary keys in Postgres?

  Thanks,
   -M@

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq