Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Merlin Moncure

On 4/18/07, Dave Dutcher <[EMAIL PROTECTED]> wrote:

> I think a database with all natural keys is unrealistic.  For example if you
> have a table that refers to people, are you going to use their name as a
> primary key?  Names change all the time due to things like marriage,
> divorce, or trouble with the law.  We have tables with 20 million rows which
> reference back to a table of people, and if I used the person's name as key,
> it would be a major pain when somebody's name changes.  Even if there is
> referential integrity, one person might be referred to by 25% of the 20
> million rows, so the update would take quite a long time.  Also the table
> will be filled with dead rows and the indexes will likely be bloated.  If I
> want to clean that up, it will take a vacuum full or a cluster which will
> lock the whole table and run for hours.  If I use a surrogate key, I can
> change their name in one row and be done with it.


That's perfectly reasonable (I mentioned this upthread)...there are a
couple of corner cases where RI costs too much.  Exchanging a surrogate
for a natural key is a valid performance consideration.  Usually, the
performance win is marginal at best (and your example suggests
possible normalization issues in the child table), and sometimes there is
no alternative...updating 5 million rows is obviously nasty.  That
said -- if the cost of update was zero, would you still do it that
way?  I'm trying to separate performance-related issues, which are
reasonable and valid depending on the situation, from good design
principles.

merlin



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Jeff Davis
On Tue, 2007-04-17 at 21:06 -0700, Craig A. James wrote:
> Merlin Moncure wrote:
> > In the context of this debate, I see this argument all the time, with
> > the implied suffix: 'If only we used integer keys we would not have
> > had this problem...'.  Either the customer identifies parts with a
> > part number or they don't...and if they do identify parts with a
> > number and recycle the numbers, you have a problem...period.
> 
> On the contrary.  You create a new record with the same part number.  You 
> mark the old part number "obsolete".  Everything else (the part's 
> description, and all the relationships that it's in, such as order history, 
> catalog inclusion, revision history, etc.) is unaffected.  New orders are 
> placed against the new part number's DB record; for safety the old part 
> number can have a trigger that prevents new orders from being placed.
> 
> Since the part number is NOT the primary key, duplicate part numbers are not 
> a problem.  If you had instead used the part number as the primary key, you'd 
> be dead in the water.
> 
> You can argue that the customer is making a dumb decision by reusing catalog 
> numbers, and I'd agree.  But they do it, and as database designers we have to 
> handle it.  In my particular system, we aggregate information from several 
> hundred companies, and this exact scenario happens frequently.  Since we're 
> only aggregating information, we have no control over the data that these 
> companies provide.  If we'd used catalog numbers for primary keys, we'd have 
> big problems.
> 

Storing data is easy.

The difficulty lies in storing data in such a way that your assumptions
about the data remain valid and your queries still answer the questions
that you think they answer.

Because an internal ID field has no meaning outside of the database
(some auto-generated keys do have meaning outside the database, but I'm
not talking about those), you can't effectively query based on an
internal id any more than you can query by the ctid. So what do you
query by then? You query by natural keys anyway. Internal id fields are
an implementation detail related to performance (the real topic of this
discussion).

If you have two parts with the same part id, what does that mean? Sure,
you can store the data, but then the queries that assumed that data was
unique no longer hold. Sometimes you need two parts with the same part
id, but you have to know the meaning in order to query based on that
data.
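
For example (a hypothetical sketch; the table and part number are invented):

SELECT description FROM part WHERE part_number = '1234-A';

Under the uniqueness assumption this returns at most one row; once
duplicate part numbers are stored it can return several, and nothing in
the query itself tells you its meaning has changed.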

Let me ask these questions:
 - Do you think that all of your relations have an internal id? 
 - Do you think that all the internal ids you use are unique in the
relations in which they appear?

If you answer "yes" to either question, consider that every query on
that data is also a relation and so are subselects and intermediate
results. Do those all have an id? If not, why not? How do you join a
virtual relation to a physical relation if the virtual relation has no
internal id? Is the id field still unique in the result of a join or
Cartesian product?
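
To make that last question concrete (a sketch with invented tables):

-- part_id appears once per warehouse row in the product:
SELECT p.part_id FROM part p CROSS JOIN warehouse w;

The result is a perfectly good relation, but the "unique" internal id is
no longer unique in it, and the virtual relation has no internal id of
its own.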

Regards,
Jeff Davis





Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Dave Dutcher
I think a database with all natural keys is unrealistic.  For example if you
have a table that refers to people, are you going to use their name as a
primary key?  Names change all the time due to things like marriage,
divorce, or trouble with the law.  We have tables with 20 million rows which
reference back to a table of people, and if I used the person's name as key,
it would be a major pain when somebody's name changes.  Even if there is
referential integrity, one person might be referred to by 25% of the 20
million rows, so the update would take quite a long time.  Also the table
will be filled with dead rows and the indexes will likely be bloated.  If I
want to clean that up, it will take a vacuum full or a cluster which will
lock the whole table and run for hours.  If I use a surrogate key, I can
change their name in one row and be done with it.  
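
To sketch the difference (table and column names are invented here):

-- Natural key: renaming someone rewrites every referencing row.
CREATE TABLE person_nat (
 name text PRIMARY KEY
);
CREATE TABLE event_nat (
 event_id serial PRIMARY KEY,
 person_name text REFERENCES person_nat(name) ON UPDATE CASCADE
);

-- Surrogate key: the same rename touches exactly one row.
CREATE TABLE person_sur (
 person_id serial PRIMARY KEY,
 name text NOT NULL
);
CREATE TABLE event_sur (
 event_id serial PRIMARY KEY,
 person_id int REFERENCES person_sur(person_id)
);

UPDATE person_sur SET name = 'New Name' WHERE person_id = 42;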

Just my 2 cents.

Dave




Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Craig A. James

Merlin Moncure wrote:
>> Since the part number is NOT the primary key, duplicate part numbers
>> are not a problem.  If you had instead used the part number as the
>> primary key, you'd be dead in the water.
>
> You are redefining the primary key to be (part_number,
> obsoletion_date).  Now, if you had not anticipated that in the
> original design (likely enough), you do have to refactor queries that
> join on the table...so what?  If that's too much work, you can use a
> view to take care of the problem (which may be a good idea anyways).
> *you have to refactor the system anyways because you are now allowing
> duplicate part numbers where previously (from the perspective of the
> user), they were unique*.
>
> The hidden advantage of pushing the full key through the database is
> it tends to expose holes in the application/business logic.  Chances
> are some query is not properly distinguishing obsoleted parts and now
> the real problems come...surrogate keys do not remove complexity, they
> simply sweep it under the rug.


This really boils down to an object-oriented perspective.  I have an object, a 
customer's catalog entry.  It has properties such as catalog number, 
description, etc, and whether it's obsolete or not.  Management of the object 
(its relation to other objects, its history, etc.) should NOT depend on the 
object's specific definition.

This is true whether the object is represented in Lisp, C++, Perl, or (in this 
case) an SQL schema.  Good object-oriented design abstracts the object and its 
behavior from management of the object.  In C++, Perl, etc., we manage objects 
via a pointer or object reference.  In SQL, we reference objects by an 
*arbitrary* integer that is effectively a pointer to the object.

What you're suggesting is that I should break the object-oriented encapsulation 
by pulling out specific fields of the object, exposing those internal object 
details to the applications, and spreading those details across the whole 
schema. And I argue that this is wrong, because it breaks encapsulation.  By 
exposing the details of the object, if the details change, *all* of your 
relationships break, and all of your applications have to change.  And I've 
never seen a system where breaking object-oriented encapsulation was a good 
long-term solution.  Systems change, and object-oriented techniques were 
invented to help manage change.

This is one of the reasons the Postgres project was started way back when: To 
bring object-oriented techniques to the relational-database world.

Craig



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Merlin Moncure

On 4/18/07, Craig A. James <[EMAIL PROTECTED]> wrote:

> Merlin Moncure wrote:
>> In the context of this debate, I see this argument all the time, with
>> the implied suffix: 'If only we used integer keys we would not have
>> had this problem...'.  Either the customer identifies parts with a
>> part number or they don't...and if they do identify parts with a
>> number and recycle the numbers, you have a problem...period.
>
> On the contrary.  You create a new record with the same part number.  You mark the old
> part number "obsolete".  Everything else (the part's description, and all the
> relationships that it's in, such as order history, catalog inclusion, revision history,
> etc.) is unaffected.  New orders are placed against the new part number's DB record; for
> safety the old part number can have a trigger that prevents new orders from being placed.
>
> Since the part number is NOT the primary key, duplicate part numbers are not a
> problem.  If you had instead used the part number as the primary key, you'd be
> dead in the water.


You are redefining the primary key to be (part_number,
obsoletion_date).  Now, if you had not anticipated that in the
original design (likely enough), you do have to refactor queries that
join on the table...so what?  If that's too much work, you can use a
view to take care of the problem (which may be a good idea anyways).
*you have to refactor the system anyways because you are now allowing
duplicate part numbers where previously (from the perspective of the
user), they were unique*.
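
In PostgreSQL that might look like this (a sketch; since obsoletion_date
is NULL for current parts and NULLs can't appear in a primary key, a
partial unique index stands in for the uniqueness rule):

create table part (
 part_number text not null,
 obsoletion_date date,  -- null means the part is current
 description text
);

-- at most one current row per part number
create unique index part_current_idx on part (part_number)
 where obsoletion_date is null;

-- existing queries can be pointed at this view
create view current_part as
 select * from part where obsoletion_date is null;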

The hidden advantage of pushing the full key through the database is
it tends to expose holes in the application/business logic.  Chances
are some query is not properly distinguishing obsoleted parts and now
the real problems come...surrogate keys do not remove complexity, they
simply sweep it under the rug.

merlin



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Greg Smith

On Wed, 18 Apr 2007, Richard Huxton wrote:

> And at the end of the day that's the root of the problem. It's easy to be
> lulled into "well it looks like a primary key" rather than being able to
> guarantee it.


In some of these cases it is guaranteed to be a primary key given all 
available information at the time.  The funny thing about unexpected 
changes to a business model is that you never expect them.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-18 Thread Richard Huxton

Craig A. James wrote:

> Since we're only aggregating information, we have
> no control over the data that these companies provide.


And at the end of the day that's the root of the problem. It's easy to 
be lulled into "well it looks like a primary key" rather than being able 
to guarantee it.


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-17 Thread Craig A. James

Merlin Moncure wrote:

> In the context of this debate, I see this argument all the time, with
> the implied suffix: 'If only we used integer keys we would not have
> had this problem...'.  Either the customer identifies parts with a
> part number or they don't...and if they do identify parts with a
> number and recycle the numbers, you have a problem...period.


On the contrary.  You create a new record with the same part number.  You mark the old 
part number "obsolete".  Everything else (the part's description, and all the 
relationships that it's in, such as order history, catalog inclusion, revision history, 
etc.) is unaffected.  New orders are placed against the new part number's DB record; for 
safety the old part number can have a trigger that prevents new orders from being placed.
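
A minimal sketch of that arrangement (names are invented; the trigger
syntax is PostgreSQL's):

CREATE TABLE part (
 part_id serial PRIMARY KEY,   -- information-free surrogate key
 part_number text NOT NULL,    -- may repeat across revisions
 obsolete boolean NOT NULL DEFAULT false
);

CREATE TABLE part_order (
 order_id serial PRIMARY KEY,
 part_id int NOT NULL REFERENCES part(part_id)
);

CREATE FUNCTION reject_obsolete_part() RETURNS trigger AS $$
BEGIN
 IF (SELECT obsolete FROM part WHERE part_id = NEW.part_id) THEN
   RAISE EXCEPTION 'part % is obsolete', NEW.part_id;
 END IF;
 RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER part_order_check BEFORE INSERT ON part_order
 FOR EACH ROW EXECUTE PROCEDURE reject_obsolete_part();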

Since the part number is NOT the primary key, duplicate part numbers are not a 
problem.  If you had instead used the part number as the primary key, you'd be 
dead in the water.

You can argue that the customer is making a dumb decision by reusing catalog 
numbers, and I'd agree.  But they do it, and as database designers we have to 
handle it.  In my particular system, we aggregate information from several 
hundred companies, and this exact scenario happens frequently.  Since we're 
only aggregating information, we have no control over the data that these 
companies provide.  If we'd used catalog numbers for primary keys, we'd have 
big problems.

Craig








Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-17 Thread Merlin Moncure

On 4/16/07, Greg Smith <[EMAIL PROTECTED]> wrote:

> I think the point Craig was trying to make is that what you refer to here
> as "extraordinary cases" are, in fact, rather common.  I've never seen a
> database built on natural keys that didn't at some point turn ugly when
> some internal or external business need suddenly invalidated the believed
> uniqueness of that key.


I don't think it's so terrible to add a field to a key...I too have
worked on an ERP system based on natural keys and was quite amazed at
how well organized the database was.  When the company decided to
re-number all the items in the database, it was a minor pain.
Extending a critical key would be a project any way you organize the
database IMO.  Natural keys are most common in manufacturing and
accounting systems because of the COBOL heritage, when natural keys
were the only way to realistically do it.  Unfortunately SQL really
missed the boat on keys...otherwise they would behave more like a
composite type.


> The last really bad one I saw was a manufacturing database that used a
> combination of the customer code and the customer's part number as the
> key.  Surely if the customer changes their part number, we should switch
> ours to match so the orders are easy to process, right?  When this got fun
> was when one large customer who released products on a yearly schedule
> decided to change the bill of material for many of their parts for the new
> year, but re-used the same part number; oh, and they were still ordering
> the old parts as well.  Hilarity ensued.


In the context of this debate, I see this argument all the time, with
the implied suffix: 'If only we used integer keys we would not have
had this problem...'.  Either the customer identifies parts with a
part number or they don't...and if they do identify parts with a
number and recycle the numbers, you have a problem...period.  Adding an
integer key only moves the confusion to a separate place, unless it is
used by the user to identify the part number and then *becomes* the
key, or a part of it.  If you hide the id from the user, then I claim
the data model is pretty much busted.

merlin



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-16 Thread Greg Smith

On Mon, 16 Apr 2007, Merlin Moncure wrote:

> extraordinary cases do happen, like a company overhauling its numbering
> systems, but such cases can be dealt with by a number of methods
> including letting RI do its thing.


I think the point Craig was trying to make is that what you refer to here 
as "extraordinary cases" are, in fact, rather common.  I've never seen a 
database built on natural keys that didn't at some point turn ugly when 
some internal or external business need suddenly invalidated the believed 
uniqueness of that key.


The last really bad one I saw was a manufacturing database that used a 
combination of the customer code and the customer's part number as the 
key.  Surely if the customer changes their part number, we should switch 
ours to match so the orders are easy to process, right?  When this got fun 
was when one large customer who released products on a yearly schedule 
decided to change the bill of material for many of their parts for the new 
year, but re-used the same part number; oh, and they were still ordering 
the old parts as well.  Hilarity ensued.


> it is especially unclear how adding an integer to a table will somehow
> magically solve these problems.


If the key is an integer, it's always possible to figure out a trivial map 
that renumbers the entire database programmatically in order to merge two 
sets of data.
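
For example (a sketch assuming serial keys; child tables would get the
same offset applied before their foreign keys are checked):

-- shift the incoming ids past the existing range, then load
UPDATE incoming_part
   SET part_id = part_id + (SELECT max(part_id) FROM part);

INSERT INTO part SELECT * FROM incoming_part;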


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-16 Thread Ron Mayer
Craig A. James wrote:
> Merlin Moncure wrote:
>> Using surrogate keys is dangerous and can lead to very bad design
>> habits that are unfortunately so prevalent in the software industry
>> they are virtually taught in schools.  ...  While there is
>> nothing wrong with them in principle (you are exchanging one key for
>> another as a performance optimization), they make it all too easy to
>> create denormalized designs and tables with no real identifying
>> criteria, etc,...
> 
> Wow, that's the opposite of everything I've ever been taught, and all my
> experience in the last few decades.
> 
> ...chemistry...two companies using the same "natural"
> keys had as much as 10% differences in their multi-million-compound
> databases.  These errors led to six-month to year-long delays, as each
> conflicting chemical record had to be examined by hand by a PhD
> chemist to reclassify it.

That sounds almost like a feature, not a bug - giving information
about which of the assumptions that went into the "natural key" need
to be reconsidered.

And I don't see how it would have been improved by adding a surrogate
key - except that the data would have been just as messed up though
harder to see where the messups were.

> We've always recommended to our customers that all primary keys be
> completely information-free.  They should not be based on any
> information or combination of information from the data records.  Every
> time the customer has not followed this advice, they've later regretted it.

Hmm... but then do you put a unique index on the
otherwise-would-have-been-natural-primary-key columns?

If not, you tend to get into the odd situation of multiple
rows that only vary in their surrogate key -- and it seems
the surrogate key is redundant.
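
Something like (a sketch; the columns are invented):

CREATE TABLE compound (
 compound_id serial PRIMARY KEY,      -- information-free surrogate
 company_code text NOT NULL,
 registry_no text NOT NULL,
 UNIQUE (company_code, registry_no)   -- the would-have-been natural key
);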

> I'm sure there are situations where a natural key is appropriate, but I
> haven't seen it in my work.

I've seen both - and indeed usually use surrogate keys for convenience;
but also find that having to fix incorrect assumptions in natural primary
keys tends to raise business issues that are worth addressing anyway.



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-16 Thread Merlin Moncure

On 4/16/07, Craig A. James <[EMAIL PROTECTED]> wrote:

> Merlin Moncure wrote:
>> Using surrogate keys is dangerous and can lead to very bad design
>> habits that are unfortunately so prevalent in the software industry
>> they are virtually taught in schools.  ...  While there is
>> nothing wrong with them in principle (you are exchanging one key for
>> another as a performance optimization), they make it all too easy to
>> create denormalized designs and tables with no real identifying
>> criteria, etc,...
>
> I can't recall ever seeing a "natural" key that was immutable.  In my business
> (chemistry), we've seen several disastrous situations where companies picked keys they thought were
> natural and immutable, and years down the road they discovered (for example) that chemical
> compounds they thought were pure were in fact isotopic mixtures, or simply the wrong molecule (as
> analytical techniques improved).  Or during a corporate takeover, they discovered that two
> companies using the same "natural" keys had as much as 10% differences in their
> multi-million-compound databases.  These errors led to six-month to year-long delays, as each
> conflicting chemical record had to be examined by hand by a PhD chemist to reclassify it.


while your example might be a good case study in proper
classification, it has nothing to do with key selection.  it is
especially unclear how adding an integer to a table will somehow
magically solve these problems.  are you claiming that a primary key
can't be changed?

mutability is strictly a performance argument.  since RI handles
cascading primary key changes, it's simply a matter of if you are
willing to wait for RI to do its work or not (if not, swap in the id
key as in my example).  the performance argument really only applies
to specific cases, and can be considered an attribute of certain
tables.  extraordinary cases do happen, like a company overhauling its
numbering systems, but such cases can be dealt with by a number of
methods including letting RI do its thing.
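
e.g., letting RI do the work (a sketch):

create table part (
 part_number text primary key
);

create table order_line (
 order_no int,
 part_number text references part(part_number) on update cascade,
 primary key (order_no, part_number)
);

-- renumbering is one statement; RI rewrites the child rows
update part set part_number = 'B-100' where part_number = 'A-100';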

merlin



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-16 Thread Craig A. James

Merlin Moncure wrote:

Using surrogate keys is dangerous and can lead to very bad design
habits that are unfortunately so prevalent in the software industry
they are virtually taught in schools.  ...  While there is
nothing wrong with them in principle (you are exchanging one key for
another as a performance optimization), they make it all too easy to
create denormalized designs and tables with no real identifying
criteria, etc,...


Wow, that's the opposite of everything I've ever been taught, and all my 
experience in the last few decades.

I can't recall ever seeing a "natural" key that was immutable.  In my business
(chemistry), we've seen several disastrous situations where companies picked keys they thought were
natural and immutable, and years down the road they discovered (for example) that chemical
compounds they thought were pure were in fact isotopic mixtures, or simply the wrong molecule (as
analytical techniques improved).  Or during a corporate takeover, they discovered that two
companies using the same "natural" keys had as much as 10% differences in their
multi-million-compound databases.  These errors led to six-month to year-long delays, as each
conflicting chemical record had to be examined by hand by a PhD chemist to reclassify it.

In other businesses, almost any natural identifier you pick is subject to 
simple typographical errors.  When you discover the errors in a field you've 
used as a primary key, it can be quite hard to fix, particularly if you have 
distributed data across several systems and schemas.

We've always recommended to our customers that all primary keys be completely
information-free.  They should not be based on any information or combination
of information from the data records.  Every time the customer has not followed
this advice, they've later regretted it.

I'm sure there are situations where a natural key is appropriate, but I haven't 
seen it in my work.

Craig 




Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-16 Thread Merlin Moncure

On 4/14/07, Bill Moran <[EMAIL PROTECTED]> wrote:

> In response to "Kynn Jones" <[EMAIL PROTECTED]>:
>> The two situations are semantically identical: each record in table bar
>> refers to a record in table foo.  The difference is that in the first
>> schema, this referencing is done through an "artificial" serial-integer
>> primary key, while in the second schema this reference is done through a
>> data field that happens to be unique and not null, so it can serve as
>> primary key.
>
> I had a discussion about this recently on the Drupal mailing lists, at the
> end of which I promised to do some benchmarking to determine whether or
> not text keys really do hurt performance of indexes.  Unfortunately, I
> still haven't followed through on that promise -- maybe I'll get to it
> tomorrow.



The main reason why integer indexes are faster than their natural
counterparts is that the index is smaller and puts less pressure on
cache.  This is, however, offset by removing joins here and there, and
you usually just end up indexing the data anyways.  Performance is
kind of tangential to the argument though -- I've seen databases using
all natural keys and found them to be very clean and performant.
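
the size difference is easy to measure (a sketch; pg_relation_size and
generate_series are assumed available, as in 8.1 and later):

create table k (
 id serial primary key,
 part_number text not null unique
);

insert into k (part_number)
 select 'PART-' || n from generate_series(1, 1000000) n;

-- k_pkey and k_part_number_key are the default index names
select pg_relation_size('k_pkey') as integer_index,
       pg_relation_size('k_part_number_key') as text_index;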

Using surrogate keys is dangerous and can lead to very bad design
habits that are unfortunately so prevalent in the software industry
they are virtually taught in schools.  Many software frameworks assume
you use them and refuse to work without them (avoid!).  While there is
nothing wrong with them in principle (you are exchanging one key for
another as a performance optimization), they make it all too easy to
create denormalized designs and tables with no real identifying
criteria, etc., and the resultant stinky queries to put it all back
together again (full of unions, self-joins, extraneous groups, case
statements, etc.).

A good compromise in your designs is to identify your natural key but
use the surrogate if you have valid performance reasons:

CREATE TABLE foo (
 frobnitz_id int unique,
 frobnitz character varying(100) PRIMARY KEY
 [...]
);

frobnitz_id is of course optional and not necessary in all tables.  It
may be a pain to relate a large table with a four- or five-part key, and
judicious use of surrogates may be justified for performance or even
just to keep your queries smaller:

create table order_line_item_discount
(
 company_name text,
 order_no int,
 line_item_seq_no int,
 discount_code text,
 primary key(company_name, order_no, line_item_seq_no, discount_code)
)

becomes

create table order_line_item_discount
(
 order_line_item_id int,
 discount_code text,
 primary key (order_line_item_id, discount_code)
)
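
where order_line_item_id would be the optional surrogate on the implied
parent table, along the lines of (a sketch):

create table order_line_item
(
 order_line_item_id int unique,
 company_name text,
 order_no int,
 line_item_seq_no int,
 primary key (company_name, order_no, line_item_seq_no)
);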

merlin



Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-14 Thread Bill Moran
In response to "Kynn Jones" <[EMAIL PROTECTED]>:

> Consider these two very similar schemas:
> 
> Schema 1:
> 
> 
> CREATE TABLE foo (
>   id serial PRIMARY KEY,
>   frobnitz character varying(100) NOT NULL UNIQUE
> );
> 
> 
> CREATE TABLE bar (
>   id serial PRIMARY KEY,
>   foo_id int REFERENCES foo(id)
> )
> 
> 
> Schema 2:
> 
> 
> CREATE TABLE foo (
>   frobnitz character varying(100) PRIMARY KEY
> );
> 
> 
> CREATE TABLE bar (
>   id serial PRIMARY KEY,
>   frobnitz character varying(100) REFERENCES foo(frobnitz)
> )
> 
> 
> 
> 
> The two situations are semantically identical: each record in table bar
> refers to a record in table foo.  The difference is that in the first
> schema, this referencing is done through an "artificial" serial-integer
> primary key, while in the second schema this reference is done through a
> data field that happens to be unique and not null, so it can serve as
> primary key.

The first case is called a "surrogate key".  A little googling on that term
will turn up a wealth of discussion -- both for and against.

> I find Schema 1 awkward and unnatural; more specifically, foo.id seems
> unnecessary in light of the non-null uniqueness of foo.frobnitz.  But I
> remember once reading that "long" fields like foo.frobnitz did not make good
> primary keys.

I had a discussion about this recently on the Drupal mailing lists, at the
end of which I promised to do some benchmarking to determine whether or
not text keys really do hurt performance of indexes.  Unfortunately, I
still haven't followed through on that promise -- maybe I'll get to it
tomorrow.

> Is the field foo.id in Schema 1 superfluous?  For example, wouldn't the
> referencing from bar to foo really be done "behind the scenes" through some
> hidden field (oid?) instead of through the frobnitz text field?  Which of
> the two schemas would give better performance?

-- 
Bill Moran
Collaborative Fusion Inc.

[EMAIL PROTECTED]
Phone: 412-422-3463x4023






[PERFORM] Basic Q on superfluous primary keys

2007-04-14 Thread Kynn Jones

Consider these two very similar schemas:

Schema 1:


CREATE TABLE foo (
 id serial PRIMARY KEY,
 frobnitz character varying(100) NOT NULL UNIQUE
);


CREATE TABLE bar (
 id serial PRIMARY KEY,
 foo_id int REFERENCES foo(id)
);


Schema 2:


CREATE TABLE foo (
 frobnitz character varying(100) PRIMARY KEY
);


CREATE TABLE bar (
 id serial PRIMARY KEY,
 frobnitz character varying(100) REFERENCES foo(frobnitz)
);




The two situations are semantically identical: each record in table bar
refers to a record in table foo.  The difference is that in the first
schema, this referencing is done through an "artificial" serial-integer
primary key, while in the second schema this reference is done through a
data field that happens to be unique and not null, so it can serve as
primary key.


I find Schema 1 awkward and unnatural; more specifically, foo.id seems
unnecessary in light of the non-null uniqueness of foo.frobnitz.  But I
remember once reading that "long" fields like foo.frobnitz did not make good
primary keys.


Is the field foo.id in Schema 1 superfluous?  For example, wouldn't the
referencing from bar to foo really be done "behind the scenes" through some
hidden field (oid?) instead of through the frobnitz text field?  Which of
the two schemas would give better performance?


Thanks!


kj