Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-10 Thread Vincent Veyron
Le jeudi 09 février 2012 à 16:30 -0600, Merlin Moncure a écrit :

 natural/surrogate is a performance/usability debate with various
 tradeoffs.  but using surrogate to 'create' uniqueness is a logical
 design error; maybe a very forgivable one for various reasons, but the
 point stands.

Please consider the following case :

I record insurance claims in the table below, where id_evenement,
id_agent and date_origine define a unique event.

However, records sometimes have to be canceled (set annule=true), and
re-recorded the same way. They're normally canceled once, but
occasionnally twice, or more (for various reasons).

What would you use for a primary key?

CREATE TABLE tbldossier (
id_evenement text NOT NULL,
id_agent integer NOT NULL,
date_origine date NOT NULL,
annule boolean DEFAULT false NOT NULL);


-- 
Vincent Veyron
http://marica.fr/
Logiciel de gestion des sinistres et des contentieux pour le service juridique


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-10 Thread David Johnston
On Feb 10, 2012, at 10:49, Vincent Veyron vv.li...@wanadoo.fr wrote:

 Le jeudi 09 février 2012 à 16:30 -0600, Merlin Moncure a écrit :
 
 natural/surrogate is a performance/usability debate with various
 tradeoffs.  but using surrogate to 'create' uniqueness is a logical
 design error; maybe a very forgivable one for various reasons, but the
 point stands.
 
 Please consider the following case :
 
 I record insurance claims in the table below, where id_evenement,
 id_agent and date_origine define a unique event.
 
 However, records sometimes have to be canceled (set annule=true), and
 re-recorded the same way. They're normally canceled once, but
 occasionnally twice, or more (for various reasons).
 
 What would you use for a primary key?
 
 CREATE TABLE tbldossier (
id_evenement text NOT NULL,
id_agent integer NOT NULL,
date_origine date NOT NULL,
annule boolean DEFAULT false NOT NULL);
 
 

One possibility is to add a version field (integer) and combine evenement and 
version to create the unique.  I'd also create a partial unique on 
evenement/annule to ensure you do not make more than one active version.

David J.
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-10 Thread Vincent Veyron
Le vendredi 10 février 2012 à 14:19 -0500, David Johnston a écrit :

 One possibility is to add a version field (integer) and combine evenement 
 and version to create the unique.  I'd also create a partial unique on 
 evenement/annule to ensure you do not make more than one active version.
 

Hi David,


I don't find this solution very much different from using a surrogate
key?

Except that you then have to code it yourself, instead of using the
built-in serial.

-- 
Vincent Veyron
http://marica.fr/
Logiciel de gestion des sinistres et des contentieux pour le service juridique


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-10 Thread David Johnston

On Feb 10, 2012, at 14:56, Vincent Veyron vv.li...@wanadoo.fr wrote:

 Le vendredi 10 février 2012 à 14:19 -0500, David Johnston a écrit :
 
 One possibility is to add a version field (integer) and combine evenement 
 and version to create the unique.  I'd also create a partial unique on 
 evenement/annule to ensure you do not make more than one active version.
 
 
 Hi David,
 
 
 I don't find this solution very much different from using a surrogate
 key?
 
 Except that you then have to code it yourself, instead of using the
 built-in serial.
 
 

The version field has semantic meaning that a true sequential value does not.  
If you are creating  new data then you will ultimately always end up with some 
form of artificial identifier. Since you do not appear to want to assign a new 
case number when you annul and reopen an event you need some supplemental 
information to distinguish the two cases.

I would argue you should remove annul from the table, optionally replacing it 
with a current status field, and then use some kind of logging table to track 
changes in status.  In that case each event only ever exists once (no versions) 
and you gain flexibility in handling different statuses (besides just 
open/annulled).

David J.
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-10 Thread Chris Travers
On Fri, Feb 10, 2012 at 7:49 AM, Vincent Veyron vv.li...@wanadoo.fr wrote:

 Le jeudi 09 février 2012 à 16:30 -0600, Merlin Moncure a écrit :

  natural/surrogate is a performance/usability debate with various
  tradeoffs.  but using surrogate to 'create' uniqueness is a logical
  design error; maybe a very forgivable one for various reasons, but the
  point stands.

 Please consider the following case :

 I record insurance claims in the table below, where id_evenement,
 id_agent and date_origine define a unique event.

 However, records sometimes have to be canceled (set annule=true), and
 re-recorded the same way. They're normally canceled once, but
 occasionnally twice, or more (for various reasons).

 What would you use for a primary key?

 CREATE TABLE tbldossier (
id_evenement text NOT NULL,
id_agent integer NOT NULL,
date_origine date NOT NULL,
annule boolean DEFAULT false NOT NULL);

 First, a surrogate key will make joins more robust and so it should be
there.  Also a partial unique index could be used.

We've had a similar issue with LedgerSMB and while our solution might not
apply to you it's worth mentioning.

We had an issue of storing sales tax rates which may change or expire at
some point, so something like:

CREATE TABLE tax (
id serial not null unique,
account_id int not null,
rate numeric not null,
valid_to date,
unique (valid_to, account_id)
);

Initially we created a partial unique index on account_id where valid_to is
null.

Later we changed valid_to to a timestamp and defaulted it to infinity.
 This allowed us to declare account_id, valid_to as the primary key.

Best Wishes,
Chris Travers


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Merlin Moncure
On Thu, Feb 9, 2012 at 10:49 AM, Rich Shepard rshep...@appl-ecosys.com wrote:
  I have a lot of data currently in .pdf files. I can extract the relevant
 data to plain text and format it to create a large text file of INSERT INTO
 ... rows. I need a unique ID for each row and there are no columns that
 would make a natural key so the serial data type would be appropriate.

The record should be logically unique as well as physically unique (of
if it isn't, why bother making a unique constraint at all?).
Sometimes you *have* to force a surrogate, for example if certain
(broken) client tools need a primary key to work, but aside from that
you shouldn't rely on a surrogate to generate uniqueness.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Rich Shepard

On Thu, 9 Feb 2012, Merlin Moncure wrote:


The record should be logically unique as well as physically unique (of if
it isn't, why bother making a unique constraint at all?). Sometimes you
*have* to force a surrogate, for example if certain (broken) client tools
need a primary key to work, but aside from that you shouldn't rely on a
surrogate to generate uniqueness.


merlin,

  I have reports containing macroinvertebrate collection data for several
hundred (or several thousand) of taxa. There is no natural key since there
are multiple rows for each site/date pair. Years ago Joe Celko taught me to
seek natural keys whenever they might exist. They don't here. That's why I
specifically mentioned that in my message.

  The only 'broken client tools' are their consistent uses of Microsoft
Excel to store data or providing text reports in pdf with other data.

Rich



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Andy Colson

On 2/9/2012 10:49 AM, Rich Shepard wrote:

I have a lot of data currently in .pdf files. I can extract the relevant
data to plain text and format it to create a large text file of INSERT
INTO
... rows. I need a unique ID for each row and there are no columns that
would make a natural key so the serial data type would be appropriate.

When I prepare the text file I can start each row with the delimiter (',')
to indicate there's a table column preceding. If I define the primary key
as serial type on that first position in the file, will postgres
automagically fill it in as each row is read into the table?

If not, or if there's a better way of approaching this task, please clue
me in to that.

TIA,

Rich





If you create a serial column, dont put the column name or a value into 
your insert statement.


create table junk (id serial, stuff text);
insert into junk(stuff) values ('my stuff');

or, and I've never done this, I think you can use the default keyword:

insert into junk(id, stuff) values (default, 'my stuff');


-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Scott Marlowe
On Thu, Feb 9, 2012 at 9:49 AM, Rich Shepard rshep...@appl-ecosys.com wrote:
  I have a lot of data currently in .pdf files. I can extract the relevant
 data to plain text and format it to create a large text file of INSERT INTO
 ... rows. I need a unique ID for each row and there are no columns that
 would make a natural key so the serial data type would be appropriate.

  When I prepare the text file I can start each row with the delimiter (',')
 to indicate there's a table column preceding. If I define the primary key
 as serial type on that first position in the file, will postgres
 automagically fill it in as each row is read into the table?

  If not, or if there's a better way of approaching this task, please clue
 me in to that.

If you format your copy statement with a column list that leaves out
the serial key the db will insert that for you.

file: /tmp/input.sql
copy test (i1) from stdin;
10
20
30
40
50
\.

create table test (id serial primary key, i1 int);
\i /tmp/input.sql
select * from test
 id | i1
+
  1 | 10
  2 | 20
  3 | 30
  4 | 40
  5 | 50
(5 rows)

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Rich Shepard

On Thu, 9 Feb 2012, Andy Colson wrote:

If you create a serial column, dont put the column name or a value into your 
insert statement.


create table junk (id serial, stuff text);
insert into junk(stuff) values ('my stuff');


Andy,

  That's what I assumed would work but did not know for sure.

Thanks,

Rich


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread David Salisbury



On 2/9/12 10:08 AM, Rich Shepard wrote:

   I have reports containing macroinvertebrate collection data for several
hundred (or several thousand) of taxa. There is no natural key since there
are multiple rows for each site/date pair. Years ago Joe Celko taught me to
seek natural keys whenever they might exist. They don't here. That's why I
specifically mentioned that in my message.



Interesting.  I used to think natural keys were okay, but have since decided
that surrogates are the way to go.  That second layer of abstraction allows
for much easier data modifications when needed.  What would be an example
of a natural key that would be good to use, and why would it be preferable??

I'd think the key value must never change, and even say kingdom values in a
taxa table could possibly change.. might discover something new and do a
little reordering. :)  Also natural keys might be strings, which I'm thinking
would not be as efficient as integers for an index.

-ds

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Andy Colson

On 2/9/2012 4:10 PM, David Salisbury wrote:



On 2/9/12 10:08 AM, Rich Shepard wrote:

I have reports containing macroinvertebrate collection data for several
hundred (or several thousand) of taxa. There is no natural key since
there
are multiple rows for each site/date pair. Years ago Joe Celko taught
me to
seek natural keys whenever they might exist. They don't here. That's
why I
specifically mentioned that in my message.



Interesting. I used to think natural keys were okay, but have since decided
that surrogates are the way to go. That second layer of abstraction allows
for much easier data modifications when needed. What would be an example
of a natural key that would be good to use, and why would it be
preferable??

I'd think the key value must never change, and even say kingdom values in a
taxa table could possibly change.. might discover something new and do a
little reordering. :) Also natural keys might be strings, which I'm
thinking
would not be as efficient as integers for an index.

-ds




Yeah, this is a Vim vs Emacs war.  (Vim, :-) )

I prefer surrogates like you.  Its way to easy to pick something that 
one day has to change.


Within the last year I remember a long thread about this same thing.

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Merlin Moncure
On Thu, Feb 9, 2012 at 4:20 PM, Andy Colson a...@squeakycode.net wrote:
 On 2/9/2012 4:10 PM, David Salisbury wrote:



 On 2/9/12 10:08 AM, Rich Shepard wrote:

 I have reports containing macroinvertebrate collection data for several
 hundred (or several thousand) of taxa. There is no natural key since
 there
 are multiple rows for each site/date pair. Years ago Joe Celko taught
 me to
 seek natural keys whenever they might exist. They don't here. That's
 why I
 specifically mentioned that in my message.



 Interesting. I used to think natural keys were okay, but have since
 decided
 that surrogates are the way to go. That second layer of abstraction allows
 for much easier data modifications when needed. What would be an example
 of a natural key that would be good to use, and why would it be
 preferable??

 I'd think the key value must never change, and even say kingdom values in
 a
 taxa table could possibly change.. might discover something new and do a
 little reordering. :) Also natural keys might be strings, which I'm
 thinking
 would not be as efficient as integers for an index.

 -ds



 Yeah, this is a Vim vs Emacs war.  (Vim, :-) )

 I prefer surrogates like you.  Its way to easy to pick something that one
 day has to change.

 Within the last year I remember a long thread about this same thing.

Sure, you can use surrogates, but you should still define or at least
be aware of a natural key if there is one. If you can't (which happens
on various type of data), then the surrogate is giving the illusion of
row uniqueness when there isn't one.   This is really a design error:
other keys could depend on this table's primary key which is a
provably ambiguous relationship.   Since your rows are not
informationally distinct from each other, why do you need to be able
to point at a specific one?

natural/surrogate is a performance/usability debate with various
tradeoffs.  but using surrogate to 'create' uniqueness is a logical
design error; maybe a very forgivable one for various reasons, but the
point stands.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Andy Colson

On 2/9/2012 4:20 PM, Andy Colson wrote:

On 2/9/2012 4:10 PM, David Salisbury wrote:



On 2/9/12 10:08 AM, Rich Shepard wrote:

I have reports containing macroinvertebrate collection data for several
hundred (or several thousand) of taxa. There is no natural key since
there
are multiple rows for each site/date pair. Years ago Joe Celko taught
me to
seek natural keys whenever they might exist. They don't here. That's
why I
specifically mentioned that in my message.



Interesting. I used to think natural keys were okay, but have since
decided
that surrogates are the way to go. That second layer of abstraction
allows
for much easier data modifications when needed. What would be an example
of a natural key that would be good to use, and why would it be
preferable??

I'd think the key value must never change, and even say kingdom values
in a
taxa table could possibly change.. might discover something new and do a
little reordering. :) Also natural keys might be strings, which I'm
thinking
would not be as efficient as integers for an index.

-ds




Yeah, this is a Vim vs Emacs war. (Vim, :-) )

I prefer surrogates like you. Its way to easy to pick something that one
day has to change.

Within the last year I remember a long thread about this same thing.

-Andy




Ah, here it is:

http://archives.postgresql.org/pgsql-general/2011-04/msg00996.php

-Andy



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Rich Shepard

On Thu, 9 Feb 2012, David Salisbury wrote:


Interesting.  I used to think natural keys were okay, but have since decided
that surrogates are the way to go.  That second layer of abstraction allows
for much easier data modifications when needed.  What would be an example
of a natural key that would be good to use, and why would it be preferable??


  For water quality data the primary key is (site, date, param) since
there's only one value for a given parameter collected at a specific site on
a single day. No surrogate key needed.


I'd think the key value must never change, and even say kingdom values in
a taxa table could possibly change.. might discover something new and do a
little reordering. :) Also natural keys might be strings, which I'm
thinking would not be as efficient as integers for an index.


  The problem with real world data is that different taxonomic levels are
used. Not all organisms can be identified to species; some (such as the
round worms, or nematodes) are at the level of order. That means there is no
combination of columns that are consistently not NULL. Sigh.

Rich


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Rich Shepard

On Thu, 9 Feb 2012, Merlin Moncure wrote:


If you can't (which happens on various type of data), then the surrogate
is giving the illusion of row uniqueness when there isn't one.


  Ah, but each row is unique. However, there is no consisten set of non NULL
values that can consistently define a unique key for each row.

Rich


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread David Salisbury


On 2/9/12 5:25 PM, Rich Shepard wrote:

For water quality data the primary key is (site, date, param) since
there's only one value for a given parameter collected at a specific
site on
a single day. No surrogate key needed.


Yea.  I was wondering if the surrogate key debate really boils down to the
composite primary key debate.  Seems so in my mind, though one could
maybe come up with a combination.  Basically aliases of values and
composite those.  Perhaps that's the ultimate methodology. :)


The problem with real world data is that different taxonomic levels are
used. Not all organisms can be identified to species; some (such as the
round worms, or nematodes) are at the level of order. That means there
is no
combination of columns that are consistently not NULL. Sigh.


I didn't know that about worms.  I did know grasses only went to the genus.
You could make a tall skinny self referential table though, and nothing
would be null and everything would be unique ( I think, unless certain
taxon values can appear under different higher order taxon values ).

Thanks for the view points out there.

Cheers,

-ds

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Adrian Klaver
On Thursday, February 09, 2012 5:18:19 pm David Salisbury wrote:
 On 2/9/12 5:25 PM, Rich Shepard wrote:
  For water quality data the primary key is (site, date, param) since
  there's only one value for a given parameter collected at a specific
  site on
  a single day. No surrogate key needed.
 
 Yea.  I was wondering if the surrogate key debate really boils down to the
 composite primary key debate.  Seems so in my mind, though one could
 maybe come up with a combination.  Basically aliases of values and
 composite those.  Perhaps that's the ultimate methodology. :)
 
  The problem with real world data is that different taxonomic levels are
  used. Not all organisms can be identified to species; some (such as the
  round worms, or nematodes) are at the level of order. That means there
  is no
  combination of columns that are consistently not NULL. Sigh.
 
 I didn't know that about worms.  I did know grasses only went to the genus.
 You could make a tall skinny self referential table though, and nothing
 would be null and everything would be unique ( I think, unless certain
 taxon values can appear under different higher order taxon values ).

OT. Alright, now I have to ask. When you say grasses(or for that matter round 
worms) cannot be identified to species are you talking about the data you are 
receiving or in general.  Because as far as I know there are many species 
identified for both. They are difficult to id but species do exist.

 
 Thanks for the view points out there.
 
 Cheers,
 
 -ds

-- 
Adrian Klaver
adrian.kla...@gmail.com

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Strategy for Primary Key Generation When Populating Table

2012-02-09 Thread Chris Travers
On Thu, Feb 9, 2012 at 2:10 PM, David Salisbury salisb...@globe.gov wrote:



 On 2/9/12 10:08 AM, Rich Shepard wrote:

   I have reports containing macroinvertebrate collection data for several
 hundred (or several thousand) of taxa. There is no natural key since there
 are multiple rows for each site/date pair. Years ago Joe Celko taught me
 to
 seek natural keys whenever they might exist. They don't here. That's why I
 specifically mentioned that in my message.



 Interesting.  I used to think natural keys were okay, but have since
 decided
 that surrogates are the way to go.  That second layer of abstraction allows
 for much easier data modifications when needed.  What would be an example
 of a natural key that would be good to use, and why would it be
 preferable??

 I'd think the key value must never change, and even say kingdom values in a
 taxa table could possibly change.. might discover something new and do a
 little reordering. :)  Also natural keys might be strings, which I'm
 thinking
 would not be as efficient as integers for an index.

 Well, here is the approach we have taken with LedgerSMB:

Every table has a defined primary key, and where possible this is the
natural key.  There are cases where there is no natural key however and we
use a surrogate key.  However every table also has at least one single
column key whether it is the natural primary key or a surrogate one.

All joins are done on surrogate keys.

This has a few very specific advantages as the db schema changes:  if
criteria for the natural key must change because of evolving requirements,
the join conditions need not change.  Moreover joins don't require intimate
knowledge of natural keys between tables, making joins simpler and more
predictable, and easier to read.

So I don't think this is an either/or proposition.  I think there is a
great deal of benefit to the use of both.

Best Wishes,
Chris Travers