[SQL] Table design question
I'm reading Joe Celko's book SQL Programming Style for the second time and although I've been an OO developer for quite a few years I'm fairly green wrt SQL. Joe is obviously something of a curmudgeon and I would fall squarely into his newbie OO developer ordinal scale and I'm trying to avoid the slide into stupid newbie OO developer. So I'm designing a table and I'm looking for an appropriate key. The natural key is a string from a few characters up to a maximum of perhaps 100. Joe gets quite fierce about avoiding the use of a serial id column as a key. The string is unique in the table and fits the criteria for a key. So should I follow Joe's advice and use my natural key as the primary key? It sounds reasonable but it will mean at least one other table will have the string as a foreign key. My postgres intro book has id columns all over the place but is it really that big an issue these days to have a 100 character primary key? Are there postgres-specific implications for either approach? Thanks Dave ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [SQL] Table design question
On 6/2/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: On Thu, 01 Jun 2006, Chris Browne wrote: > Celko is decidedly *NOT* promoting the notion that you should use a > 100 byte long "natural key." > > Jamie's comments of "Orthodox versus Reform" seem reasonably > appropriate in outlining something of the difference between the > positions. Just to be clear, that was all I was trying to do. I probably should have mentioned that any attempt to use such an attribute as a PK should be met with a baseball bat or other shillelagh-ish implement, but was interrupted several times during that email drafting. > I may not care for doing this; you may not either; a company that > builds auto parts that they want to sell into the automotive industry > may care about standardizing their part IDs quite a lot. This is another important point. In some situations, a rigid data model can be a godsend to coders. If you happen to sit in such an enviable position, I would encourage you to take advantage of it. (This doesn't mean picking bad keys, of course.) None of this should be taken as bashing Celko - he's a smart man and an excellent source of advice. -j Thanks everyone who replied (and also for the insightful and measured responses, not every news group is so lucky). I had progressed down the path of the serial id column but re-reading Celko's book - he spends some pages railing against "proprietary auto-numbering features" - I wanted to feel confident I was making the right choice. Thanks again Dave ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[SQL] Alternative to serial primary key
I posted a couple of weeks back a question regarding the use of a 100 char column as a primary key and the responses uniformily advised the use of a serial column. My concern is that the key is effectively abstract and I want to use the column as a foreign key in other tables. It occurred to me that if I used a hash function on insert to generate another column and used that column as the primary key then I have a value that meets a lot of the requirements for a good key, including that I can regenerate the exact value from my data, something that is impossible with a serial id. I also don't have to index the 100 char column in order to search on the table, I just need to calculate the hash value and check that against the calculated column. It does violate the rule that a table shouldn't contain a column that is calculated from another column in the table but I think it would still be more effective than a serial id. Is this a reasonable/normal thing to do? I know postgres contains an md5() hash function, is this likely to be fast enough to make this an effective choice? Are there other options? Am I just a noob barking up the wrong tree? It is getting kind of late and my brain is starting to hurt. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [SQL] Alternative to serial primary key
On 7/6/06, Markus Schaber <[EMAIL PROTECTED]> wrote: This is a good idea if you want to have taller indices, but you still need to re-check the "real" key due to hash collisions. I am aware there are collisions with md5 but without any actual proof I believe the risk to be very low with the data I'm storing which is a kind of scrubbed free form residential address. If you've plenty of time to spend, you could also bring the hash index type back to life, which is currently deprecated according to the PostgreSQL docs... Unfortunately with the meagre time I have available, the only dent I'm likely to make is in the wall with my head. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [SQL] Alternative to serial primary key
On 7/6/06, Andrew Sullivan <[EMAIL PROTECTED]> wrote: That sort of undermines the value of the calculated primary key, though, doesn't it? He'd need the unique index for FK references, which was the point, I thought. Yes, that occurred to me as well. Frankly I believe the md5 collision generation is more of a practical issue for crypto where for my purposes the potential for two residential street addresses to generate the same md5 hash value is effectively zero. And the md5 function is a builtin which I would hope is faster than anything I could write in pgsql. Could be wrong, I have been before. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [SQL] Alternative to serial primary key
On 7/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: i agree. all my primary keys are abstract - even though some don't have to be. iow, i'm comfortable using serials as my primary key even when i don't absolutely need to. Yes I had in fact already created my table using a serial as the primary key but I've been reading Celko's SQL Programming Style and the use of a hash on the address column as the primary key (and for use in FK's) meets a number of the requirements for a good key. The address column itself is the natural primary key but it doesn't make for a good FK. Plus I feel I would be remiss in not exploring an alternative to the serial key. To recap, yes there is only a single column, yes it is varchar. I need to do a lookup on the address column which is unique and use it as a foreign key in other tables. Using a serial id would obviously work and has been recommended. But having a hash function over the address column as the primary key means I can always regenerate my primary key from the data which is impossible with a serial key. I believe the risk of collision using md5 is effectively zero on this data and I can put a unique index over it. I'm kind of new to sql so apologies if this is a naive approach. Thanks to all for responses. Dave ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [SQL] Alternative to serial primary key
On 7/7/06, D'Arcy J.M. Cain wrote: Are you sure? I have a hard time imagining a situation where that Absolutely. Also, you need to get into a lot more coding to handle the fact that "521 Main Avenue" is the same address as "521 Main Av." and "521 Main Ave" and even "521 Main." Actually that is being done for me and you're correct, it is a lot of effort but there are a variety of services out there and I'm not trying to reinvent the wheel. And even given all of that, I would probably still use serial. Because? Danger, Will Robinson. The phrase "regenerate my primary key" immediately raises the hairs on the back of my neck. If the primary key can ever change, you have a broken schema. Perhaps my choice of words was somewhat hasty. A serial is totally divorced from the data it represents whereas a md5 hash is (for my purposes) unique, stable, verifiable, and simple. Dave ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [SQL] Alternative to serial primary key
On 7/7/06, Scott Marlowe <[EMAIL PROTECTED]> wrote: On Thu, 2006-07-06 at 16:43, Aaron Bono wrote: I'll repeat my previous statement that this is premature optimization, and the hash is kind the wrong direction. If you store an int and the 1 to 100 characters in a varchar, you'll have about 4 to 8 bytes for the int (6 I think, but it's been a while) plus 1 to 200 or possibly more for the characters in the address. If you use C local with ASCII encoding, you can get single byte. If you switch to an md5 hash, you'll need ~50 bytes (average address about 1/2 max length, just a guess) plus 32 bytes, plus the extra bytes to keep track of the length of the fields. The table now becomes wider itself, and the md5 is generally about as big as the address, or fairly close to it. And you've got the possibility of md5 collisions to deal with. I'd say just FK off of the address field. It's a natural key, fairly small (100 bytes ain't really that big) and your primary key never needs any kind of regenerating or anything, because it's already there. Just set it up with cascading updates and deletes in case you need to edit it in the future. The first rule of optimization: Don't Yep, this was pretty much where I started from and I totally agree with you regarding premature optimisation. I would point out that md5 hash is 128 bits or 16 bytes and not 32 so the difference between the hash value and the source data is somewhat larger than you've suggested. My original post a few weeks back was really about the use of a natural key of varchar(100) as a foreign key in other tables. The response was to not do it and instead use a serial which is basically how I was progressing. Celko's comments re the use of autonumbering schemes have obviously been gnawing away at me. That is why I asked the question about the hashing approach which I saw as a way to fulfill a lot of the criteria he has suggested for a primary key and still have a reasonable value to use as a foreign key. Please ignore the regeneration comment, it was written in haste and not because I really anticipate any need to regenerate my primary key at any stage. The question remains regarding the use of a string value as a primary key for the table and as a foreign key in other tables. If I use the address column as a foreign key in a differrent table will postgres physically duplicate the data or will it simply attach the appropriate index magically behind the scenes? I agree that the address column is fairly small and I've heard of others using the likes of GUIDs as key values which are not a lot smaller than I would expect my average address to be. Thanks Dave Dave ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [SQL] Alternative to serial primary key
On 7/7/06, Andrew Sullivan <[EMAIL PROTECTED]> wrote: On Thu, Jul 06, 2006 at 05:18:16PM -0400, D'Arcy J.M. Cain wrote: > owns/resides there in a situation where the address can never be > changed, e.g. "521 Main" splitting into "521A Main" and "521B Main." And anyone who has looked at an even moderately old city map will tell you that even this "impossible case" is by no means impossible. Actually for my purposes it doesn't matter, I'm using a freeform address string to index a lat/long. I fully expect to have multiple unique addresses with a duplicate lat/long, i.e. the location doesn't change - only the name of it. I'm personally hoping not to be around long enough for geological events to change the physical location but living in the shaky isles anything is possible. Thanks again to all who have responded. I guess I'm probably labouring the point but it's been instructive at least. Being fairly new to database design I wanted to have enough info to be confident with my approach. Dave ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly