Re: [GENERAL] Primary keys for companies and people

2006-02-07 Thread John D. Burger

Leif B. Kristensen wrote:


Still, I'm struggling with the basic concept of /identity/, eg. is the
William Smith born to John Smith and Jane Doe in 1733, the same William
Smith who marries Mary Jones in the same parish in 1758? You may never
really know. Still, collecting such disparate facts under the same ID
number, thus taking the identity more or less for granted, is the modus
operandi of computer genealogy. Thus, one of the major objectives of
genealogy research, the assertion of identity, becomes totally hidden
the moment that you decide to cluster disparate evidence about what may
actually have been totally different persons, under a single ID number.


We have a similar issue in a database in which we are integrating 
multiple geographic gazetteers, e.g., USGS, NGA, Wordnet.  We cannot be 
sure that source A's Foobar City is the same as source B's.  Our 
approach is to =always= import them as separate entities, and use a 
table of equivalences that gets filled out using various heuristics.  
For example, if each source indicates that its Foobar City is in Baz 
County, and we decide to equate the counties, we may equate the cities.


The alternative is of course to collect each cluster of evidence under 
a

separate ID, but then the handling of a person becomes a programmer's
nightmare.


Our intent is to have views and/or special versions of the database 
that collapse equivalent entities, but I must confess that we have not 
done much along these lines - I hope it is not too nightmarish.


- John D. Burger
  MITRE


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [GENERAL] Primary keys for companies and people

2006-02-06 Thread Michael Glaesemann


On Feb 4, 2006, at 2:23 , Merlin Moncure wrote:


If you kind determine an easy natural differentiator, invent one:
create table contact
(
  account text, name text, memo text,
  primary key(account, name, memo)
);

The memo field is blank in most cases unlees it's needed.  Suppose you
were filling contact information in your databse and Taking your
second John Smith from an account...your operator says, 'we already
have a john smith for your account, can you give us something to
identify him?'  Put that in the memo field and there you go.


Merlin,

Thanks for the blissfully simple solution! I think this can work well  
for me.


Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] Primary keys for companies and people

2006-02-03 Thread Merlin Moncure
 I definitely agree with you here, Merlin. Mutability is not the issue
 at hand. May I ask what strategies you use for determining uniqueness
 for people?

Well, that depends on the particular problem at hand.  If you had two
john smiths in your system, how would you distinguish them? If you
assinged an account number in your system and gave it to the person to
refer back to you, this ok...this is not a surrogate key per se, but a
meaningful alias.

However, that technique can't always be applied, take the case of the
'contacts' table for an account.  Since you don't give each contact of
each accunt a number and you don't print a number representing them on
various reports (there would be no reason to), adding a surrogate to
the table adds nothing to your database, it's just meaningless
infromation with no semantic value.  There *must* be a semantic
difference between the two John Smiths or you should be storing one
record in the database, not two.

If you kind determine an easy natural differentiator, invent one:
create table contact
(
  account text, name text, memo text,
  primary key(account, name, memo)
);

The memo field is blank in most cases unlees it's needed.  Suppose you
were filling contact information in your databse and Taking your
second John Smith from an account...your operator says, 'we already
have a john smith for your account, can you give us something to
identify him?'  Put that in the memo field and there you go.

Now your operator is taking information which has value pertaining to
the scope of the problem domain your application exists in.  This is
just one example of how to approach the problem  Now there is no
ambiguiity about which john smith you are dealing with. This may not
be a perfect solution for every application but there is basically has
to be a method of finding semantic unquenes to your data or you have a
hole in your data model.  Glossing over that hole with artificial
information solves nothing.

There are pracitcal reasons to use surrogates but the uniqueness
argument generally holds no water.  By 'generating' uniqueness you are
breaking your data on mathematical terms.  Until you truly understand
the ramifcations of that statement you can't really understand when
the practical cases apply.  Many of the arguments for surrugates based
on mutability and uniqueness are simply illogical.

The performance issue is more complex and leads to the issue of
practicality.  I wouldn't find any fault with a modeler who
benchmarked his app with a surrogate vs. a 5 part key  and chose the
former as long as he truly understood the downside to doing that
(extra joins).

Merlin

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread Leif B. Kristensen
On Thursday 02 February 2006 09:05, Michael Glaesemann wrote:

For people I'm more or less stumped. I can't think of a combination
of things that I know I'll be able to get from people that I'll want
to be able to add to the database. Starting off we'll have at least
7,000 individuals in the database, and I don't think that just family
and given names are going to be enough. I don't think we'll be able
to get telephone numbers for all of them, and definitely aren't going
to be getting birthdays for all.

I'm very interested to hear what other use in their applications for
holding people and companies.

I've been thinking long and hard about the same thing myself, in 
developing my genealogy database. For identification of people, there 
seems to be no realistic alternative to an arbitrary ID number.

Still, I'm struggling with the basic concept of /identity/, eg. is the 
William Smith born to John Smith and Jane Doe in 1733, the same William 
Smith who marries Mary Jones in the same parish in 1758? You may never 
really know. Still, collecting such disparate facts under the same ID 
number, thus taking the identity more or less for granted, is the modus 
operandi of computer genealogy. Thus, one of the major objectives of 
genealogy research, the assertion of identity, becomes totally hidden 
the moment that you decide to cluster disparate evidence about what may 
actually have been totally different persons, under a single ID number.

The alternative is of course to collect each cluster of evidence under a 
separate ID, but then the handling of a person becomes a programmer's 
nightmare.

I have been writing about my genealogy data model here: 
url:http://solumslekt.org/forays/blue.php The model has been slightly 
modified since I wrote this; due to what I perceive as 'gotchas' in the 
PostgreSQL implementation of table inheritance, I have dropped the 
'citations' table. Besides, I've dropped some of the surrogate keys, 
and more will follow. I really should update this article soon.

I should perhaps be posting this under another subject, but I feel that 
beneath the surface, Michael's problem and my own are strongly related.
-- 
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread David Goodenough
On Thursday 02 February 2006 09:07, Leif B. Kristensen wrote:
 On Thursday 02 February 2006 09:05, Michael Glaesemann wrote:
 For people I'm more or less stumped. I can't think of a combination
 of things that I know I'll be able to get from people that I'll want
 to be able to add to the database. Starting off we'll have at least
 7,000 individuals in the database, and I don't think that just family
 and given names are going to be enough. I don't think we'll be able
 to get telephone numbers for all of them, and definitely aren't going
 to be getting birthdays for all.
 
 I'm very interested to hear what other use in their applications for
 holding people and companies.

 I've been thinking long and hard about the same thing myself, in
 developing my genealogy database. For identification of people, there
 seems to be no realistic alternative to an arbitrary ID number.

 Still, I'm struggling with the basic concept of /identity/, eg. is the
 William Smith born to John Smith and Jane Doe in 1733, the same William
 Smith who marries Mary Jones in the same parish in 1758? You may never
 really know. Still, collecting such disparate facts under the same ID
 number, thus taking the identity more or less for granted, is the modus
 operandi of computer genealogy. Thus, one of the major objectives of
 genealogy research, the assertion of identity, becomes totally hidden
 the moment that you decide to cluster disparate evidence about what may
 actually have been totally different persons, under a single ID number.

 The alternative is of course to collect each cluster of evidence under a
 separate ID, but then the handling of a person becomes a programmer's
 nightmare.

 I have been writing about my genealogy data model here:
 url:http://solumslekt.org/forays/blue.php The model has been slightly
 modified since I wrote this; due to what I perceive as 'gotchas' in the
 PostgreSQL implementation of table inheritance, I have dropped the
 'citations' table. Besides, I've dropped some of the surrogate keys,
 and more will follow. I really should update this article soon.

 I should perhaps be posting this under another subject, but I feel that
 beneath the surface, Michael's problem and my own are strongly related.
There is also the problem that a name can change.  People change names
by deed-poll, and also women can adopt a married name or keep their old
one.  All in all an ID is about the only answer.

David

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread Martijn van Oosterhout
On Thu, Feb 02, 2006 at 10:36:54AM +, David Goodenough wrote:
  Still, I'm struggling with the basic concept of /identity/, eg. is the
  William Smith born to John Smith and Jane Doe in 1733, the same William
  Smith who marries Mary Jones in the same parish in 1758? You may never
  really know. Still, collecting such disparate facts under the same ID
  number, thus taking the identity more or less for granted, is the modus
  operandi of computer genealogy. Thus, one of the major objectives of
  genealogy research, the assertion of identity, becomes totally hidden
  the moment that you decide to cluster disparate evidence about what may
  actually have been totally different persons, under a single ID number.
 
  The alternative is of course to collect each cluster of evidence under a
  separate ID, but then the handling of a person becomes a programmer's
  nightmare.

 There is also the problem that a name can change.  People change names
 by deed-poll, and also women can adopt a married name or keep their old
 one.  All in all an ID is about the only answer.

True, the issue being ofcourse that changing a name doesn't change
their identity.

To the GP, your page is an interesting one and raises several
interesting points. In particular the one about the person being the
conclusion of the rest of the database. You essentially have a set of
facts A married B in C on date D and you're trying to correlate
these. In the end it's just a certain amount of guess work, especially
since back then they wern't that particular about spelling as they are
today.

My naive view is that you're basically assigning trust values to each
fact and the chance that two citations refer to the same person. In
principle you'd be able to cross-reference all these citations and
build the structure quasi-automatically. I suppose in practice this is
done by hand.

As for your question, I think you're stuck with having a person ID.
Basically because you need to identify a person somehow. Given you
still have the original citiations, you can split a person into
multiple if the situation appears to not work out.

One thing I find odd though, your person objects have no birthdate or
deathdate. Or birth place either. I would have thought these elements
would be fundamental in determining if two people are the same, given
that they can't change and people are unlikely to forget them.

Put another way, two people with the same birthday in the same place
with similar names are very likely to be the same. If you can
demostrate this is not the case that's another fact. In the end you're
dealing with probabilities, you can never know for sure.

Anyway, hope this helps. It's a subject I've been vaguely interested in
but never really had the time to look into.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


signature.asc
Description: Digital signature


Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread Ted Byers


- Original Message - 
From: Leif B. Kristensen [EMAIL PROTECTED]

To: pgsql-general@postgresql.org
Sent: Thursday, February 02, 2006 4:07 AM
Subject: Re: [GENERAL] Primary keys for companies and people



[snip]
I'm very interested to hear what other use in their applications for
holding people and companies.


I've been thinking long and hard about the same thing myself, in
developing my genealogy database. For identification of people, there
seems to be no realistic alternative to an arbitrary ID number.

Still, I'm struggling with the basic concept of /identity/, eg. is the
William Smith born to John Smith and Jane Doe in 1733, the same William

[snip]

I have long been interested in this issue, and it is one that transcends the 
problem of IDs in IT.  For my second doctorate, I examined this in the 
context of historical investigation, applying numerical classification 
techniques to biographical information that can be extracted from historical 
documents.  It is, I fear, a problem for which only a probabilistic answer 
can be obtained in most historical cases.  For example, there was an 
eleventh century viking king Harold who as a teenager was part of his 
cousin's court, and then found it necessary to flee to Kiev when his cousin 
found hiimself on the losing side of a rebellion.  He then made his way into 
the Byzantine empire and served the emperor as a mercenary through much of 
the mediterranean, finally returning in fame and glory to Norway where he 
found another relative (a nephew IIRC) on the throne, which he inherited 
about a year after his return.  Impagine yourself as a historian trying to 
write his biography.  You'd find various documents all over the western 
world (as known in the viking age) written in a variety of languages, and 
using different names to refer to him.  It isn't an easy task to determine 
which documents refer specifically to him.  And to make things even more 
interesting, many documents refer to a given person only by his official 
title, and in other cases, the eldest son of each generation was given the 
same name as his father.


In my own case, in the time I was at the University of Toronto, I know of 
four other men who had precisely the same name I have.  I know this from 
strange phone calls from faculty I never studied with about assignments and 
examinations for courses I had never taken.  In each case, the professor 
checked again with the university's records department and found the correct 
student.  The last case was particularly disturbing since in that case, 
things were a bit different in that I had taken a graduate course with the 
professor in question, and he stopped me on campus and asked about an 
assignment for a given advanced undergraduate course that I had not taken, 
but my namesake had.  What made this disturbing is that not only did the 
other student carry my name, but he also looked enough like me that our 
professor could mistake me for him on campus!  I can only hope that he is a 
well behaved, law abiding citizen!  The total time period in question was 18 
years.  In general, the problem only gets more challenging as the length, 
and as the age, of the historical period considered increases.


The point is, not only are the combinations of family and given names not 
reliably unique, even certain biological data, such as photographs of the 
human face, not adequately unique.  Even DNA fingerprints, putatively the 
best available biometric technology, are not entirely reliable since even 
that can not distinguish between identical twins, and at the same time, 
there can be, admittedly extremely rare as far as we know, developmental 
anomalies resulting in a person being his own twin (this results from twin 
fetuses merging, with the consequence that the resulting person has some 
organ systems from one of the original fetuses and some from the other). 
For historical questions, I don't believe one can get any better than 
inference based on a balance of probabilities.  A geneologist has no option 
but to become an applied statistician!  For purposes of modern investigation 
or for the purpose of modern business, one may do better through an 
appropriate use of a combination of technologies.  This is a hard problem, 
even with the use of the best available technologies and especially given 
the current problems associated with identity theft.


For software developers in general, and database developers in particular, 
there are several distinct questions to consider.:


1) How does one reliably determine identity to begin with, and then use that 
identity with whatever technology one might use to represent it?


2) How good does this technology, and identification process need to be?  In 
other words, how does the cost of a mistake (esp. in identification) relate 
to the increased cost of using better technology?  In this analysis, one 
needs to consider both the cost of such a mistake to the person identified

Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread Merlin Moncure
  I should perhaps be posting this under another subject, but I feel that
  beneath the surface, Michael's problem and my own are strongly related.
 There is also the problem that a name can change.  People change names
 by deed-poll, and also women can adopt a married name or keep their old
 one.  All in all an ID is about the only answer.

I'll take the other side of this issue.  The fact that a primary key
is mutable does not make it any less primary.  As long as we can can
count on it to be unique, how often identiying info changes has no
bearing on its selection as a p-key from a relational standpoint.  SQL
gives us RI to help deal with this but in this is not always practical
if for example you you have a changing p-key that cascades to a
million records.  The performance issue has zero meaning in a
conceptual sense however and I think you are trying to grapple things
in conceptual terms.

By assigning a surrogate key to a person, you are simply moving the
guess work from one place to another.  If you can't logically
determine who 'John Smith' is, how can you possibly expect to relate
information to him? (which john smith? why, etc)...you are just hiding
a guess behind a number.  Put into other words, *a record must have
unique identifiying criteria or the table containing it cannot be
expected to give correct results*.  This is, more or less, a
mathematical truth.  The non key attributes of the tuple are now
undefined because they can give two or more different answers to the
same question.  Surrogates do not change this principle, they just
hide it but it still has to be dealt with.

Merlin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread Michael Glaesemann


On Feb 3, 2006, at 7:25 , Merlin Moncure wrote:

There is also the problem that a name can change.  People change  
names
by deed-poll, and also women can adopt a married name or keep  
their old

one.  All in all an ID is about the only answer.


I'll take the other side of this issue.  The fact that a primary key
is mutable does not make it any less primary.  As long as we can can
count on it to be unique, how often identiying info changes has no
bearing on its selection as a p-key from a relational standpoint.




The performance issue has zero meaning in a
conceptual sense however and I think you are trying to grapple things
in conceptual terms.


I definitely agree with you here, Merlin. Mutability is not the issue  
at hand. May I ask what strategies you use for determining uniqueness  
for people?


Michael Glaesemann
grzm myrealbox com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] Primary keys for companies and people

2006-02-02 Thread Mark Dilger

Michael Glaesemann wrote:

Hello, all!

Recently there was quite a bit of discussion regarding surrogate keys  
and natural keys. I'm not interested in discussing the pros and cons  of 
surrogate keys. What I'd like to find out are the different  methods 
people actually use to uniquely identify companies and people  *besides* 
surrogate keys.


I'm currently working on an application that will include contact  
information, so being able to uniquely identify these two entities is  
of interest to me. Right now I'm thinking of uniquely identifying  
companies by telephone number. For example:


create table companies (
company_id integer primary key -- telephone number, not serial
, company_name text not null
);

Of course, the company may have more than one telephone number  
associated with it, so there will also be a table associating  telephone 
numbers and companies.


create table companies__telephone_numbers (
company_id integer not null
references companies (company_id)
on update cascade on delete cascade
, telephone_number integer not null
, unique (company_id, telephone_number)
);

There should also be a trigger that will check that the company_id  
matches an existing telephone number associated with the company,  
something like:


create function assert_company_id_telephone_number_exists
returns trigger
language plpgsql as $$
begin
if exists (
select company_id
from companies
except
select company_id
from companies
join companies__telephone_numbers on (company_id = telephone_number)
)
then raise exception 'company_id must match existing company  telephone 
number';

end if;
return null;
end;
$$;

For people I'm more or less stumped. I can't think of a combination  of 
things that I know I'll be able to get from people that I'll want  to be 
able to add to the database. Starting off we'll have at least  7,000 
individuals in the database, and I don't think that just family  and 
given names are going to be enough. I don't think we'll be able  to get 
telephone numbers for all of them, and definitely aren't going  to be 
getting birthdays for all.


I'm very interested to hear what other use in their applications for  
holding people and companies.


Michael Glaesemann
grzm myrealbox com




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq



Telephone numbers make bad primary keys because they get recycled.  A phone 
number that belongs to me this year may belong to somebody else next year.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match