[SQL] Table design question

2006-06-01 Thread David Clarke

I'm reading Joe Celko's book SQL Programming Style for the second time
and although I've been an OO developer for quite a few years I'm
fairly green wrt SQL. Joe is obviously something of a curmudgeon and I
would fall squarely into his newbie OO developer ordinal scale and I'm
trying to avoid the slide into stupid newbie OO developer.

So I'm designing a table and I'm looking for an appropriate key. The
natural key is a string from a few characters up to a maximum of
perhaps 100. Joe gets quite fierce about avoiding the use of a serial
id column as a key. The string is unique in the table and fits the
criteria for a key. So should I follow Joe's advice and use my natural
key as the primary key? It sounds reasonable but it will mean at least
one other table will have the string as a foreign key. My postgres
intro book has id columns all over the place but is it really that big
an issue these days to have a 100 character primary key? Are there
postgres-specific implications for either approach?

Thanks
Dave

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [SQL] Table design question

2006-06-01 Thread David Clarke

On 6/2/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

On Thu, 01 Jun 2006, Chris Browne wrote:

> Celko is decidedly *NOT* promoting the notion that you should use a
> 100 byte long "natural key."
>
> Jamie's comments of "Orthodox versus Reform" seem reasonably
> appropriate in outlining something of the difference between the
> positions.

Just to be clear, that was all I was trying to do. I probably should
have mentioned that any attempt to use such an attribute as a PK should
be met with a baseball bat or other shillelagh-ish implement, but was
interrupted several times during that email drafting.

> I may not care for doing this; you may not either; a company that
> builds auto parts that they want to sell into the automotive industry
> may care about standardizing their part IDs quite a lot.

This is another important point. In some situations, a rigid data model
can be a godsend to coders. If you happen to sit in such an enviable
position, I would encourage you to take advantage of it. (This doesn't
mean picking bad keys, of course.)

None of this should be taken as bashing Celko - he's a smart man and an
excellent source of advice.

-j



Thanks everyone who replied (and also for the insightful and measured
responses, not every news group is so lucky). I had progressed down
the path of the serial id column but re-reading Celko's book - he
spends some pages railing against "proprietary auto-numbering
features" - I wanted to feel confident I was making the right choice.

Thanks again
Dave

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[SQL] Alternative to serial primary key

2006-07-06 Thread David Clarke

I posted a couple of weeks back a question regarding the use of a 100
char column as a primary key and the responses uniformily advised the
use of a serial column. My concern is that the key is effectively
abstract and I want to use the column as a foreign key in other
tables. It occurred to me that if I used a hash function on insert to
generate another column and used that column as the primary key then I
have a value that meets a lot of the requirements for a good key,
including that I can regenerate the exact value from my data,
something that is impossible with a serial id. I also don't have to
index the 100 char column in order to search on the table, I just need
to calculate the hash value and check that against the calculated
column. It does violate the rule that a table shouldn't contain a
column that is calculated from another column in the table but I think
it would still be more effective than a serial id.

Is this a reasonable/normal thing to do? I know postgres contains an
md5() hash function, is this likely to be fast enough to make this an
effective choice? Are there other options? Am I just a noob barking up
the wrong tree? It is getting kind of late and my brain is starting to
hurt.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [SQL] Alternative to serial primary key

2006-07-06 Thread David Clarke

On 7/6/06, Markus Schaber <[EMAIL PROTECTED]> wrote:

This is a good idea if you want to have taller indices, but you still
need to re-check the "real" key due to hash collisions.


I am aware there are collisions with md5 but without any actual proof
I believe the risk to be very low with the data I'm storing which is a
kind of scrubbed free form residential address.



If you've plenty of time to spend, you could also bring the hash index
type back to life, which is currently deprecated according to the
PostgreSQL docs...


Unfortunately with the meagre time I have available, the only dent I'm
likely to make is in the wall with my head.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [SQL] Alternative to serial primary key

2006-07-06 Thread David Clarke

On 7/6/06, Andrew Sullivan <[EMAIL PROTECTED]> wrote:

That sort of undermines the value of the calculated primary key,
though, doesn't it?  He'd need the unique index for FK references,
which was the point, I thought.



Yes, that occurred to me as well. Frankly I believe the md5 collision
generation is more of a practical issue for crypto where for my
purposes the potential for two residential street addresses to
generate the same md5 hash value is effectively zero. And the md5
function is a builtin which I would hope is faster than anything I
could write in pgsql. Could be wrong, I have been before.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [SQL] Alternative to serial primary key

2006-07-06 Thread David Clarke

On 7/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

i agree.  all my primary keys are abstract - even
though some don't have to be.  iow, i'm comfortable
using serials as my primary key even when i don't
absolutely need to.


Yes I had in fact already created my table using a serial as the
primary key but I've been reading Celko's SQL Programming Style and
the use of a hash on the address column as the primary key (and for
use in FK's) meets a number of the requirements for a good key. The
address column itself is the natural primary key but it doesn't make
for a good FK. Plus I feel I would be remiss in not exploring an
alternative to the serial key.

To recap, yes there is only a single column, yes it is varchar. I need
to do a lookup on the address column which is unique and use it as a
foreign key in other tables. Using a serial id would obviously work
and has been recommended. But having a hash function over the address
column as the primary key means I can always regenerate my primary key
from the data which is impossible with a serial key. I believe the
risk of collision using md5 is effectively zero on this data and I can
put a unique index over it.

I'm kind of new to sql so apologies if this is a naive approach.
Thanks to all for responses.

Dave

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [SQL] Alternative to serial primary key

2006-07-07 Thread David Clarke

On 7/7/06, D'Arcy J.M. Cain  wrote:

Are you sure?  I have a hard time imagining a situation where that

Absolutely.


Also, you need to get into a lot more coding to handle the fact that
"521 Main Avenue" is the same address as "521 Main Av." and "521 Main
Ave" and even "521 Main."

Actually that is being done for me and you're correct, it is a lot of
effort but there are a variety of services out there and I'm not
trying to reinvent the wheel.


And even given all of that, I would probably still use serial.

Because?


Danger, Will Robinson.  The phrase "regenerate my primary key"
immediately raises the hairs on the back of my neck.  If the primary
key can ever change, you have a broken schema.


Perhaps my choice of words was somewhat hasty. A serial is totally
divorced from the data it represents whereas a md5 hash is (for my
purposes) unique, stable, verifiable, and simple.

Dave

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [SQL] Alternative to serial primary key

2006-07-07 Thread David Clarke

On 7/7/06, Scott Marlowe <[EMAIL PROTECTED]> wrote:

On Thu, 2006-07-06 at 16:43, Aaron Bono wrote:
I'll repeat my previous statement that this is premature optimization,
and the hash is kind the wrong direction.

If you store an int and the 1 to 100 characters in a varchar, you'll
have about 4 to 8 bytes for the int (6 I think, but it's been a while)
plus 1 to 200 or possibly more for the characters in the address.

If you use C local with ASCII encoding, you can get single byte.

If you switch to an md5 hash, you'll need ~50 bytes (average address
about 1/2 max length, just a guess) plus 32 bytes, plus the extra bytes
to keep track of the length of the fields.

The table now becomes wider itself, and the md5 is generally about as
big as the address, or fairly close to it.

And you've got the possibility of md5 collisions to deal with.

I'd say just FK off of the address field.  It's a natural key, fairly
small (100 bytes ain't really that big) and your primary key never needs
any kind of regenerating or anything, because it's already there.

Just set it up with cascading updates and deletes in case you need to
edit it in the future.

The first rule of optimization:  Don't


Yep, this was pretty much where I started from and I totally agree
with you regarding premature optimisation. I would point out that md5
hash is 128 bits or 16 bytes and not 32 so the difference between the
hash value and the source data is somewhat larger than you've
suggested. My original post a few weeks back was really about the use
of a natural key of varchar(100) as a foreign key in other tables. The
response was to not do it and instead use a serial which is basically
how I was progressing. Celko's comments re the use of autonumbering
schemes have obviously been gnawing away at me. That is why I asked
the question about the hashing approach which I saw as a way to
fulfill a lot of the criteria he has suggested for a primary key and
still have a reasonable value to use as a foreign key. Please ignore
the regeneration comment, it was written in haste and not because I
really anticipate any need to regenerate my primary key at any stage.

The question remains regarding the use of a string value as a primary
key for the table and as a foreign key in other tables. If I use the
address column as a foreign key in a differrent table will postgres
physically duplicate the data or will it simply attach the appropriate
index magically behind the scenes? I agree that the address column is
fairly small and I've heard of others using the likes of GUIDs as key
values which are not a lot smaller than I would expect my average
address to be.

Thanks
Dave

Dave

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [SQL] Alternative to serial primary key

2006-07-07 Thread David Clarke

On 7/7/06, Andrew Sullivan <[EMAIL PROTECTED]> wrote:

On Thu, Jul 06, 2006 at 05:18:16PM -0400, D'Arcy J.M. Cain wrote:
> owns/resides there in a situation where the address can never be
> changed, e.g. "521 Main" splitting into "521A Main" and "521B Main."

And anyone who has looked at an even moderately old city map will
tell you that even this "impossible case" is by no means impossible.



Actually for  my purposes it doesn't matter, I'm using a freeform
address string to index a lat/long. I fully expect to have multiple
unique addresses with a duplicate lat/long, i.e. the location doesn't
change - only the name of it. I'm personally hoping not to be around
long enough for geological events to change the physical location but
living in the shaky isles anything is possible.

Thanks again to all who have responded. I guess I'm probably labouring
the point but it's been instructive at least. Being fairly new to
database design I wanted to have enough info to be confident with my
approach.

Dave

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly