Re: [GENERAL] Primary Keys
On 04/29/2016 11:07 AM, Dustin Kempter wrote: Hi all, Is there a query I can run that will scan through all the tables of a database and give me a list of all tables without a primary key? Im not having any luck with this. Two options: First http://www.postgresql.org/docs/9.5/interactive/catalog-pg-class.html select * from pg_class where relhaspkey ='f' and relkind ='r' and relname not like 'pg_%'; *NOTE* from above link: " relhaspkey boolTrue if the table has (or once had) a primary key " So it may not totally reflect current reality. Second http://www.postgresql.org/docs/9.5/interactive/information-schema.html http://www.postgresql.org/docs/9.5/interactive/infoschema-table-constraints.html http://www.postgresql.org/docs/9.5/interactive/infoschema-tables.html I restricted the below to exclude system and information_schema tables: select * from information_schema.tables where table_catalog = 'test' and table_schema !='pg_catalog' and table_schema != 'information_schema' and table_name not in(select table_name from information_schema.table_constraints where constraint_type = 'PRIMARY KEY'); *NOTE* The information returned is dependent on the privileges of the user running the query, so if you want to see everything run as a superuser. Thanks in advance! -- Adrian Klaver adrian.kla...@aklaver.com -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Primary Keys
On Fri, Apr 29, 2016 at 1:20 PM, Melvin Davidsonwrote: > > > On Fri, Apr 29, 2016 at 2:07 PM, Dustin Kempter < > dust...@consistentstate.com> wrote: > >> Hi all, >> Is there a query I can run that will scan through all the tables of a >> database and give me a list of all tables without a primary key? Im not >> having any luck with this. >> >> Thanks in advance! > > > >> Please, ALWAYS provide Postgresql version & O/S, regardless of whether > you think it is pertinet. > > > Now try this: > > SELECT n.nspname, c.relname as table > FROM pg_class c > JOIN pg_namespace n ON (n.oid =c.relnamespace ) > WHERE relkind = 'r' AND >relname NOT LIKE 'pg_%' AND >relname NOT LIKE 'sql_%' AND >relhaspkey = FALSE > ORDER BY n.nspname, c.relname; > That gives a list of all tables in all schemas in the database. But this needs to be refined to those without a primary key. Using the -E switch, I looked at the output from \di+ , which will list whether the table has a primary key or not, that command is implemented via multiple SELECT statements which I haven't reviewed yet. > > > -- > *Melvin Davidson* > I reserve the right to fantasize. Whether or not you > wish to share my fantasy is entirely up to you. > -- The unfacts, did we have them, are too imprecisely few to warrant our certitude. Maranatha! <>< John McKown
Re: [GENERAL] Primary Keys
On Fri, Apr 29, 2016 at 2:07 PM, Dustin Kempterwrote: > Hi all, > Is there a query I can run that will scan through all the tables of a > database and give me a list of all tables without a primary key? Im not > having any luck with this. > > Thanks in advance! > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > Please, ALWAYS provide Postgresql version & O/S, regardless of whether you think it is pertinet. Now try this: SELECT n.nspname, c.relname as table FROM pg_class c JOIN pg_namespace n ON (n.oid =c.relnamespace ) WHERE relkind = 'r' AND relname NOT LIKE 'pg_%' AND relname NOT LIKE 'sql_%' AND relhaspkey = FALSE ORDER BY n.nspname, c.relname; -- *Melvin Davidson* I reserve the right to fantasize. Whether or not you wish to share my fantasy is entirely up to you.
[GENERAL] Primary Keys
Hi all, Is there a query I can run that will scan through all the tables of a database and give me a list of all tables without a primary key? Im not having any luck with this. Thanks in advance! -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] primary keys
Hi Folks I'm looking for a bit of advice regarding alpha primary keys. I have a table (designed by someone else) that has a numeric primary key and also a unique non-null email address field. The use of the primary key is causing me some headaches in that in multiple database server environment each server allocates a unique number from a range (and that works fine) but when the table is replicated (master-master- master) the exception handling is a bit tricky because each database server may have records that are duplicate at the email address field - with a different primary key number. I don't know why it was done this way but it seems to me that the email addresses are unique, non null and could be used as the primary key. This would make the replication much faster and simpler. Does anyone out there think the change (number to email address as primary key) would be a bad idea ? your thoughts would be appreciated. regards Grant -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] primary keys
On Sat, Sep 12, 2009 at 12:35 PM, Grant Maxwell grant.maxw...@maxan.com.au wrote: Hi Folks I'm looking for a bit of advice regarding alpha primary keys. I have a table (designed by someone else) that has a numeric primary key and also a unique non-null email address field. The use of the primary key is causing me some headaches in that in multiple database server environment each server allocates a unique number from a range (and that works fine) but when the table is replicated (master-master-master) the exception handling is a bit tricky because each database server may have records that are duplicate at the email address field - with a different primary key number. I don't know why it was done this way but it seems to me that the email addresses are unique, non null and could be used as the primary key. This would make the replication much faster and simpler. Does anyone out there think the change (number to email address as primary key) would be a bad idea ? your thoughts would be appreciated. I think it's a fine idea so long as you understand: *) the index(es) will be larger *) referring tables will have to be updated (via RI) if/when the email address changes *) lots of people will tell you that you are doing it wrong :-) *) if you need to specialize p-key further (add another field), it can be a big change on the plus side: *) you get to drop an index because you obviously had to index the key serparately *) if you query a referring table and are interested in email address (and no dependent props), you get to skip a join *) sorting be email address can be free *) less data transfer headaches merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] primary keys
On Sun, 13 Sep 2009 02:35:02 +1000 Grant Maxwell grant.maxw...@maxan.com.au wrote: Hi Folks I'm looking for a bit of advice regarding alpha primary keys. I have a table (designed by someone else) that has a numeric primary key and also a unique non-null email address field. The use of the primary key is causing me some headaches in that in multiple database server environment each server allocates a unique number from a range (and that works fine) but when the table is replicated (master-master- master) the exception handling is a bit tricky because each database server may have records that are duplicate at the email address field - with a different primary key number. I don't know why it was done this way but it seems to me that the email addresses are unique, non null and could be used as the primary key. This would make the replication much faster and simpler. Does anyone out there think the change (number to email address as primary key) would be a bad idea ? your thoughts would be appreciated. I think you have two issues here, and if you identify them as such, you'll come up with a better decision. The first is, Should the email address be my primary key. And the answer to that really depends on whether you expect it to change a lot, whether you have a lot of FK relations and how easy/difficult it will be to manage those. The second, which may influence the first, is how do I manage conflicts when data from different sources is merged? which is a huge, complex question. One potential answer is to replace your integer surrogate key with a GUID key, which will be unique from all the different sources. You could also use a prefix system to ensure uniqueness. Another option is to use the email address itself. I'm surprised by your approach, as personally, I've seen a LOT of people who share a single email address (husband/wife, for example). It's not horribly common, since getting addresses is easy, but it happens, and it's something to consider when setting this up: if two people share an email address and try to add their records at different places, how will the system handle it? In any event, if all those factors tell you that you should switch to using the email address as the PK, I doubt you'll regret your decision. Just be sure to take into account the foreign key factor early on and you shouldn't have too many problems (in my experience, a lot of people are unaware of ON DELETE CASCADE and ON UPDATE CASCADE). Hope this helps. -Bill -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] primary keys
On Sat, Sep 12, 2009 at 2:11 PM, Bill Moran wmo...@potentialtech.com wrote: On Sun, 13 Sep 2009 02:35:02 +1000 Grant Maxwell grant.maxw...@maxan.com.au wrote: Hi Folks I'm looking for a bit of advice regarding alpha primary keys. I have a table (designed by someone else) that has a numeric primary key and also a unique non-null email address field. The use of the primary key is causing me some headaches in that in multiple database server environment each server allocates a unique number from a range (and that works fine) but when the table is replicated (master-master- master) the exception handling is a bit tricky because each database server may have records that are duplicate at the email address field - with a different primary key number. I don't know why it was done this way but it seems to me that the email addresses are unique, non null and could be used as the primary key. This would make the replication much faster and simpler. Does anyone out there think the change (number to email address as primary key) would be a bad idea ? your thoughts would be appreciated. I think you have two issues here, and if you identify them as such, you'll come up with a better decision. The first is, Should the email address be my primary key. And the answer to that really depends on whether you expect it to change a lot, whether you have a lot of FK relations and how easy/difficult it will be to manage those. The second, which may influence the first, is how do I manage conflicts when data from different sources is merged? which is a huge, complex question. One potential answer is to replace your integer surrogate key with a GUID key, which will be unique from all the different sources. You guid solves the surrogate issue wrt data transfer, but glosses over what happens when you have duplicates. could also use a prefix system to ensure uniqueness. Another option is to use the email address itself. I'm surprised by your approach, as personally, I've seen a LOT of people who share a single email address (husband/wife, for example). It's not horribly common, since getting that's going to depend on how it's defined in the app. couple of different approaches: email _must_ be unique, and we care not to whom it belongs: email (email primary key) contact (email references email [...]) we do care option 1: email (email, memo text, primary key(email, memo)) contact (email, memo, references email(email, memo)) storing info in memo field to distinguish the different users we do care option 2: email (email primary key) contact (contact_id) email_contact_map ( contact_id references contact, email references email, memo, -- 'person a', 'person b', etc primary key(contact_id, email), ) The point is this: if you (the user) needs to distinguish between email adress users, that information should be in the database...allowing multiple entry of email addresses via serial or guid allows you to sneak by this requirement merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] primary keys
On 13/09/2009, at 2:46 AM, Tom Lane wrote: Grant Maxwell grant.maxw...@maxan.com.au writes: I don't know why it was done this way but it seems to me that the email addresses are unique, non null and could be used as the primary key. This would make the replication much faster and simpler. Does anyone out there think the change (number to email address as primary key) would be a bad idea ? One thing that's often considered a useful attribute of a primary key is that it be immutable. In your application, do users ever change their email addresses? If so, what should happen --- is it okay to treat that as effectively a new entry? This would be ok. The table keeps a list of email addresses and some stats on them users may have multiple addresses but would never alter a specific record. Also, if you have any other tables referencing this one via foreign keys, you'd have to have them storing the email address instead of the serial number; it'll be bulkier and address updates will be that much more expensive. so text PKs will be less efficient than numeric ? You can find lots and lots and lots of discussion of this topic if you search the archives for talk about natural versus surrogate keys. Good pointer = thanks Tom regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] primary keys
Grant Maxwell grant.maxw...@maxan.com.au writes: I don't know why it was done this way but it seems to me that the email addresses are unique, non null and could be used as the primary key. This would make the replication much faster and simpler. Does anyone out there think the change (number to email address as primary key) would be a bad idea ? One thing that's often considered a useful attribute of a primary key is that it be immutable. In your application, do users ever change their email addresses? If so, what should happen --- is it okay to treat that as effectively a new entry? Also, if you have any other tables referencing this one via foreign keys, you'd have to have them storing the email address instead of the serial number; it'll be bulkier and address updates will be that much more expensive. You can find lots and lots and lots of discussion of this topic if you search the archives for talk about natural versus surrogate keys. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] primary keys
In article [EMAIL PROTECTED], Klint Gore [EMAIL PROTECTED] wrote: % works for me on version 8.1.3 % % SELECT attname % FROM pg_index %JOIN pg_class ON (indrelid = pg_class.oid) %JOIN pg_attribute ON (attrelid = pg_class.oid) % WHERE indisprimary IS TRUE %AND attnum = any(indkey) %AND relname = $tablename; This will work on 7.4, 8.0, or 8.1 SELECT attname FROM pg_index JOIN pg_class as c1 ON (indrelid = c1.oid) JOIN pg_class as c2 ON (indexrelid = c2.oid) JOIN pg_attribute ON (attrelid = c2.oid) WHERE indisprimary AND c1.relname = $tablename ; No arrays are hurt by this query. -- Patrick TJ McPhee North York Canada [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[GENERAL] primary keys
I'm trying to craft a query that will determine what column(s) are the primary key for a given table. I have succeeded but the query is so ugly that it borders on silly and cannot work for an arbitrary number of tables since indkey is an int2vect and the ANY keyword does not work on it. Please tell me there's an easier way to do this. Here is the query for tablename $table. SELECT attname FROM pg_index JOIN pg_class ON (indrelid = pg_class.oid) JOIN pg_attribute ON (attrelid = pg_class.oid) WHERE indisprimary IS TRUE AND (attnum = indkey[0] OR attnum = indkey[1] OR attnum = indkey[2]) AND relname = '$table'; Orion ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] primary keys
On Wed, 19 Apr 2006 19:39:45 -0700, Orion Henry [EMAIL PROTECTED] wrote: I'm trying to craft a query that will determine what column(s) are the primary key for a given table. I have succeeded but the query is so ugly that it borders on silly and cannot work for an arbitrary number of tables since indkey is an int2vect and the ANY keyword does not work on it. Please tell me there's an easier way to do this. Here is the query for tablename $table. SELECT attname FROM pg_index JOIN pg_class ON (indrelid = pg_class.oid) JOIN pg_attribute ON (attrelid = pg_class.oid) WHERE indisprimary IS TRUE AND (attnum = indkey[0] OR attnum = indkey[1] OR attnum = indkey[2]) AND relname = '$table'; Orion works for me on version 8.1.3 SELECT attname FROM pg_index JOIN pg_class ON (indrelid = pg_class.oid) JOIN pg_attribute ON (attrelid = pg_class.oid) WHERE indisprimary IS TRUE AND attnum = any(indkey) AND relname = $tablename; or on v7 you could try select pcl.relname, (select array_accum(attname) from pg_attribute where attrelid = pco.conrelid and attnum = any(pco.conkey)) as cols from pg_constraint pco join pg_class pcl on pcl.oid = pco.conrelid where pcl.relname = $tablename and pco.contype = 'p' klint. +---+-+ : Klint Gore: Non rhyming: : EMail : [EMAIL PROTECTED] : slang - the: : Snail : A.B.R.I.: possibilities : : Mail University of New England : are useless : : Armidale NSW 2351 Australia : L.J.J. : : Fax : +61 2 6772 5376 : : +---+-+ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] Primary keys for companies and people
Leif B. Kristensen wrote: Still, I'm struggling with the basic concept of /identity/, eg. is the William Smith born to John Smith and Jane Doe in 1733, the same William Smith who marries Mary Jones in the same parish in 1758? You may never really know. Still, collecting such disparate facts under the same ID number, thus taking the identity more or less for granted, is the modus operandi of computer genealogy. Thus, one of the major objectives of genealogy research, the assertion of identity, becomes totally hidden the moment that you decide to cluster disparate evidence about what may actually have been totally different persons, under a single ID number. We have a similar issue in a database in which we are integrating multiple geographic gazetteers, e.g., USGS, NGA, Wordnet. We cannot be sure that source A's Foobar City is the same as source B's. Our approach is to =always= import them as separate entities, and use a table of equivalences that gets filled out using various heuristics. For example, if each source indicates that its Foobar City is in Baz County, and we decide to equate the counties, we may equate the cities. The alternative is of course to collect each cluster of evidence under a separate ID, but then the handling of a person becomes a programmer's nightmare. Our intent is to have views and/or special versions of the database that collapse equivalent entities, but I must confess that we have not done much along these lines - I hope it is not too nightmarish. - John D. Burger MITRE ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] Primary keys for companies and people
On Feb 4, 2006, at 2:23 , Merlin Moncure wrote: If you kind determine an easy natural differentiator, invent one: create table contact ( account text, name text, memo text, primary key(account, name, memo) ); The memo field is blank in most cases unlees it's needed. Suppose you were filling contact information in your databse and Taking your second John Smith from an account...your operator says, 'we already have a john smith for your account, can you give us something to identify him?' Put that in the memo field and there you go. Merlin, Thanks for the blissfully simple solution! I think this can work well for me. Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [GENERAL] Primary keys for companies and people
I definitely agree with you here, Merlin. Mutability is not the issue at hand. May I ask what strategies you use for determining uniqueness for people? Well, that depends on the particular problem at hand. If you had two john smiths in your system, how would you distinguish them? If you assinged an account number in your system and gave it to the person to refer back to you, this ok...this is not a surrogate key per se, but a meaningful alias. However, that technique can't always be applied, take the case of the 'contacts' table for an account. Since you don't give each contact of each accunt a number and you don't print a number representing them on various reports (there would be no reason to), adding a surrogate to the table adds nothing to your database, it's just meaningless infromation with no semantic value. There *must* be a semantic difference between the two John Smiths or you should be storing one record in the database, not two. If you kind determine an easy natural differentiator, invent one: create table contact ( account text, name text, memo text, primary key(account, name, memo) ); The memo field is blank in most cases unlees it's needed. Suppose you were filling contact information in your databse and Taking your second John Smith from an account...your operator says, 'we already have a john smith for your account, can you give us something to identify him?' Put that in the memo field and there you go. Now your operator is taking information which has value pertaining to the scope of the problem domain your application exists in. This is just one example of how to approach the problem Now there is no ambiguiity about which john smith you are dealing with. This may not be a perfect solution for every application but there is basically has to be a method of finding semantic unquenes to your data or you have a hole in your data model. Glossing over that hole with artificial information solves nothing. There are pracitcal reasons to use surrogates but the uniqueness argument generally holds no water. By 'generating' uniqueness you are breaking your data on mathematical terms. Until you truly understand the ramifcations of that statement you can't really understand when the practical cases apply. Many of the arguments for surrugates based on mutability and uniqueness are simply illogical. The performance issue is more complex and leads to the issue of practicality. I wouldn't find any fault with a modeler who benchmarked his app with a surrogate vs. a 5 part key and chose the former as long as he truly understood the downside to doing that (extra joins). Merlin ---(end of broadcast)--- TIP 6: explain analyze is your friend
[GENERAL] Primary keys for companies and people
Hello, all! Recently there was quite a bit of discussion regarding surrogate keys and natural keys. I'm not interested in discussing the pros and cons of surrogate keys. What I'd like to find out are the different methods people actually use to uniquely identify companies and people *besides* surrogate keys. I'm currently working on an application that will include contact information, so being able to uniquely identify these two entities is of interest to me. Right now I'm thinking of uniquely identifying companies by telephone number. For example: create table companies ( company_id integer primary key -- telephone number, not serial , company_name text not null ); Of course, the company may have more than one telephone number associated with it, so there will also be a table associating telephone numbers and companies. create table companies__telephone_numbers ( company_id integer not null references companies (company_id) on update cascade on delete cascade , telephone_number integer not null , unique (company_id, telephone_number) ); There should also be a trigger that will check that the company_id matches an existing telephone number associated with the company, something like: create function assert_company_id_telephone_number_exists returns trigger language plpgsql as $$ begin if exists ( select company_id from companies except select company_id from companies join companies__telephone_numbers on (company_id = telephone_number) ) then raise exception 'company_id must match existing company telephone number'; end if; return null; end; $$; For people I'm more or less stumped. I can't think of a combination of things that I know I'll be able to get from people that I'll want to be able to add to the database. Starting off we'll have at least 7,000 individuals in the database, and I don't think that just family and given names are going to be enough. I don't think we'll be able to get telephone numbers for all of them, and definitely aren't going to be getting birthdays for all. I'm very interested to hear what other use in their applications for holding people and companies. Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] Primary keys for companies and people
On Thursday 02 February 2006 09:05, Michael Glaesemann wrote: For people I'm more or less stumped. I can't think of a combination of things that I know I'll be able to get from people that I'll want to be able to add to the database. Starting off we'll have at least 7,000 individuals in the database, and I don't think that just family and given names are going to be enough. I don't think we'll be able to get telephone numbers for all of them, and definitely aren't going to be getting birthdays for all. I'm very interested to hear what other use in their applications for holding people and companies. I've been thinking long and hard about the same thing myself, in developing my genealogy database. For identification of people, there seems to be no realistic alternative to an arbitrary ID number. Still, I'm struggling with the basic concept of /identity/, eg. is the William Smith born to John Smith and Jane Doe in 1733, the same William Smith who marries Mary Jones in the same parish in 1758? You may never really know. Still, collecting such disparate facts under the same ID number, thus taking the identity more or less for granted, is the modus operandi of computer genealogy. Thus, one of the major objectives of genealogy research, the assertion of identity, becomes totally hidden the moment that you decide to cluster disparate evidence about what may actually have been totally different persons, under a single ID number. The alternative is of course to collect each cluster of evidence under a separate ID, but then the handling of a person becomes a programmer's nightmare. I have been writing about my genealogy data model here: url:http://solumslekt.org/forays/blue.php The model has been slightly modified since I wrote this; due to what I perceive as 'gotchas' in the PostgreSQL implementation of table inheritance, I have dropped the 'citations' table. Besides, I've dropped some of the surrogate keys, and more will follow. I really should update this article soon. I should perhaps be posting this under another subject, but I feel that beneath the surface, Michael's problem and my own are strongly related. -- Leif Biberg Kristensen | Registered Linux User #338009 http://solumslekt.org/ | Cruising with Gentoo/KDE ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] Primary keys for companies and people
On Thursday 02 February 2006 09:07, Leif B. Kristensen wrote: On Thursday 02 February 2006 09:05, Michael Glaesemann wrote: For people I'm more or less stumped. I can't think of a combination of things that I know I'll be able to get from people that I'll want to be able to add to the database. Starting off we'll have at least 7,000 individuals in the database, and I don't think that just family and given names are going to be enough. I don't think we'll be able to get telephone numbers for all of them, and definitely aren't going to be getting birthdays for all. I'm very interested to hear what other use in their applications for holding people and companies. I've been thinking long and hard about the same thing myself, in developing my genealogy database. For identification of people, there seems to be no realistic alternative to an arbitrary ID number. Still, I'm struggling with the basic concept of /identity/, eg. is the William Smith born to John Smith and Jane Doe in 1733, the same William Smith who marries Mary Jones in the same parish in 1758? You may never really know. Still, collecting such disparate facts under the same ID number, thus taking the identity more or less for granted, is the modus operandi of computer genealogy. Thus, one of the major objectives of genealogy research, the assertion of identity, becomes totally hidden the moment that you decide to cluster disparate evidence about what may actually have been totally different persons, under a single ID number. The alternative is of course to collect each cluster of evidence under a separate ID, but then the handling of a person becomes a programmer's nightmare. I have been writing about my genealogy data model here: url:http://solumslekt.org/forays/blue.php The model has been slightly modified since I wrote this; due to what I perceive as 'gotchas' in the PostgreSQL implementation of table inheritance, I have dropped the 'citations' table. Besides, I've dropped some of the surrogate keys, and more will follow. I really should update this article soon. I should perhaps be posting this under another subject, but I feel that beneath the surface, Michael's problem and my own are strongly related. There is also the problem that a name can change. People change names by deed-poll, and also women can adopt a married name or keep their old one. All in all an ID is about the only answer. David ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] Primary keys for companies and people
On Thu, Feb 02, 2006 at 10:36:54AM +, David Goodenough wrote: Still, I'm struggling with the basic concept of /identity/, eg. is the William Smith born to John Smith and Jane Doe in 1733, the same William Smith who marries Mary Jones in the same parish in 1758? You may never really know. Still, collecting such disparate facts under the same ID number, thus taking the identity more or less for granted, is the modus operandi of computer genealogy. Thus, one of the major objectives of genealogy research, the assertion of identity, becomes totally hidden the moment that you decide to cluster disparate evidence about what may actually have been totally different persons, under a single ID number. The alternative is of course to collect each cluster of evidence under a separate ID, but then the handling of a person becomes a programmer's nightmare. There is also the problem that a name can change. People change names by deed-poll, and also women can adopt a married name or keep their old one. All in all an ID is about the only answer. True, the issue being ofcourse that changing a name doesn't change their identity. To the GP, your page is an interesting one and raises several interesting points. In particular the one about the person being the conclusion of the rest of the database. You essentially have a set of facts A married B in C on date D and you're trying to correlate these. In the end it's just a certain amount of guess work, especially since back then they wern't that particular about spelling as they are today. My naive view is that you're basically assigning trust values to each fact and the chance that two citations refer to the same person. In principle you'd be able to cross-reference all these citations and build the structure quasi-automatically. I suppose in practice this is done by hand. As for your question, I think you're stuck with having a person ID. Basically because you need to identify a person somehow. Given you still have the original citiations, you can split a person into multiple if the situation appears to not work out. One thing I find odd though, your person objects have no birthdate or deathdate. Or birth place either. I would have thought these elements would be fundamental in determining if two people are the same, given that they can't change and people are unlikely to forget them. Put another way, two people with the same birthday in the same place with similar names are very likely to be the same. If you can demostrate this is not the case that's another fact. In the end you're dealing with probabilities, you can never know for sure. Anyway, hope this helps. It's a subject I've been vaguely interested in but never really had the time to look into. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. signature.asc Description: Digital signature
Re: [GENERAL] Primary keys for companies and people
- Original Message - From: Leif B. Kristensen [EMAIL PROTECTED] To: pgsql-general@postgresql.org Sent: Thursday, February 02, 2006 4:07 AM Subject: Re: [GENERAL] Primary keys for companies and people [snip] I'm very interested to hear what other use in their applications for holding people and companies. I've been thinking long and hard about the same thing myself, in developing my genealogy database. For identification of people, there seems to be no realistic alternative to an arbitrary ID number. Still, I'm struggling with the basic concept of /identity/, eg. is the William Smith born to John Smith and Jane Doe in 1733, the same William [snip] I have long been interested in this issue, and it is one that transcends the problem of IDs in IT. For my second doctorate, I examined this in the context of historical investigation, applying numerical classification techniques to biographical information that can be extracted from historical documents. It is, I fear, a problem for which only a probabilistic answer can be obtained in most historical cases. For example, there was an eleventh century viking king Harold who as a teenager was part of his cousin's court, and then found it necessary to flee to Kiev when his cousin found hiimself on the losing side of a rebellion. He then made his way into the Byzantine empire and served the emperor as a mercenary through much of the mediterranean, finally returning in fame and glory to Norway where he found another relative (a nephew IIRC) on the throne, which he inherited about a year after his return. Impagine yourself as a historian trying to write his biography. You'd find various documents all over the western world (as known in the viking age) written in a variety of languages, and using different names to refer to him. It isn't an easy task to determine which documents refer specifically to him. And to make things even more interesting, many documents refer to a given person only by his official title, and in other cases, the eldest son of each generation was given the same name as his father. In my own case, in the time I was at the University of Toronto, I know of four other men who had precisely the same name I have. I know this from strange phone calls from faculty I never studied with about assignments and examinations for courses I had never taken. In each case, the professor checked again with the university's records department and found the correct student. The last case was particularly disturbing since in that case, things were a bit different in that I had taken a graduate course with the professor in question, and he stopped me on campus and asked about an assignment for a given advanced undergraduate course that I had not taken, but my namesake had. What made this disturbing is that not only did the other student carry my name, but he also looked enough like me that our professor could mistake me for him on campus! I can only hope that he is a well behaved, law abiding citizen! The total time period in question was 18 years. In general, the problem only gets more challenging as the length, and as the age, of the historical period considered increases. The point is, not only are the combinations of family and given names not reliably unique, even certain biological data, such as photographs of the human face, not adequately unique. Even DNA fingerprints, putatively the best available biometric technology, are not entirely reliable since even that can not distinguish between identical twins, and at the same time, there can be, admittedly extremely rare as far as we know, developmental anomalies resulting in a person being his own twin (this results from twin fetuses merging, with the consequence that the resulting person has some organ systems from one of the original fetuses and some from the other). For historical questions, I don't believe one can get any better than inference based on a balance of probabilities. A geneologist has no option but to become an applied statistician! For purposes of modern investigation or for the purpose of modern business, one may do better through an appropriate use of a combination of technologies. This is a hard problem, even with the use of the best available technologies and especially given the current problems associated with identity theft. For software developers in general, and database developers in particular, there are several distinct questions to consider.: 1) How does one reliably determine identity to begin with, and then use that identity with whatever technology one might use to represent it? 2) How good does this technology, and identification process need to be? In other words, how does the cost of a mistake (esp. in identification) relate to the increased cost of using better technology? In this analysis, one needs to consider both the cost of such a mistake to the person identified
Re: [GENERAL] Primary keys for companies and people
I should perhaps be posting this under another subject, but I feel that beneath the surface, Michael's problem and my own are strongly related. There is also the problem that a name can change. People change names by deed-poll, and also women can adopt a married name or keep their old one. All in all an ID is about the only answer. I'll take the other side of this issue. The fact that a primary key is mutable does not make it any less primary. As long as we can can count on it to be unique, how often identiying info changes has no bearing on its selection as a p-key from a relational standpoint. SQL gives us RI to help deal with this but in this is not always practical if for example you you have a changing p-key that cascades to a million records. The performance issue has zero meaning in a conceptual sense however and I think you are trying to grapple things in conceptual terms. By assigning a surrogate key to a person, you are simply moving the guess work from one place to another. If you can't logically determine who 'John Smith' is, how can you possibly expect to relate information to him? (which john smith? why, etc)...you are just hiding a guess behind a number. Put into other words, *a record must have unique identifiying criteria or the table containing it cannot be expected to give correct results*. This is, more or less, a mathematical truth. The non key attributes of the tuple are now undefined because they can give two or more different answers to the same question. Surrogates do not change this principle, they just hide it but it still has to be dealt with. Merlin ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] Primary keys for companies and people
On Feb 3, 2006, at 7:25 , Merlin Moncure wrote: There is also the problem that a name can change. People change names by deed-poll, and also women can adopt a married name or keep their old one. All in all an ID is about the only answer. I'll take the other side of this issue. The fact that a primary key is mutable does not make it any less primary. As long as we can can count on it to be unique, how often identiying info changes has no bearing on its selection as a p-key from a relational standpoint. The performance issue has zero meaning in a conceptual sense however and I think you are trying to grapple things in conceptual terms. I definitely agree with you here, Merlin. Mutability is not the issue at hand. May I ask what strategies you use for determining uniqueness for people? Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [GENERAL] Primary keys for companies and people
Michael Glaesemann wrote: Hello, all! Recently there was quite a bit of discussion regarding surrogate keys and natural keys. I'm not interested in discussing the pros and cons of surrogate keys. What I'd like to find out are the different methods people actually use to uniquely identify companies and people *besides* surrogate keys. I'm currently working on an application that will include contact information, so being able to uniquely identify these two entities is of interest to me. Right now I'm thinking of uniquely identifying companies by telephone number. For example: create table companies ( company_id integer primary key -- telephone number, not serial , company_name text not null ); Of course, the company may have more than one telephone number associated with it, so there will also be a table associating telephone numbers and companies. create table companies__telephone_numbers ( company_id integer not null references companies (company_id) on update cascade on delete cascade , telephone_number integer not null , unique (company_id, telephone_number) ); There should also be a trigger that will check that the company_id matches an existing telephone number associated with the company, something like: create function assert_company_id_telephone_number_exists returns trigger language plpgsql as $$ begin if exists ( select company_id from companies except select company_id from companies join companies__telephone_numbers on (company_id = telephone_number) ) then raise exception 'company_id must match existing company telephone number'; end if; return null; end; $$; For people I'm more or less stumped. I can't think of a combination of things that I know I'll be able to get from people that I'll want to be able to add to the database. Starting off we'll have at least 7,000 individuals in the database, and I don't think that just family and given names are going to be enough. I don't think we'll be able to get telephone numbers for all of them, and definitely aren't going to be getting birthdays for all. I'm very interested to hear what other use in their applications for holding people and companies. Michael Glaesemann grzm myrealbox com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq Telephone numbers make bad primary keys because they get recycled. A phone number that belongs to me this year may belong to somebody else next year. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [GENERAL] Primary keys and speed
Hi, I have not yet seen an answer to the following, can I assume it's not a problem? On Thu, 2001-09-06 at 19:58, Rob Brown-Bayliss wrote: Hello. I am looking at useing uuid's as primary keys rather than a normal sequence of numbers. The uuids are long text strings like so: 7559e648-a29c-11d5-952f-00c026a18838 The reason for useing them is that it's almost gaurenteed that imported data from another site is going to have a unique identifier. But I was wondering if this will impact on the speed of the database. In the long run the application does not need to be blindingly fast as 99% of the time it is waiting on human interaction. Any ideas? -- Rob Brown-Bayliss ---==o==--- ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [GENERAL] Primary keys and speed
I am looking at useing uuid's as primary keys rather than a normal sequence of numbers. The uuids are long text strings like so: 7559e648-a29c-11d5-952f-00c026a18838 The reason for useing them is that it's almost gaurenteed that imported data from another site is going to have a unique identifier. But I was wondering if this will impact on the speed of the database. In the long run the application does not need to be blindingly fast as 99% of the time it is waiting on human interaction. Any ideas? I have not done *comparative* studies of using these kind of strings versus sequence numbers, but I have done a fair amount of this sort of thing in one of the applications I've been working on. On old hardware (dual ppro 200 with 512MB and RAID5) I got lookups times at around 10 milliseconds, from a table with 20 million records. The key was a 40 byte hex string. The timing was from running set show_query_stats = true; and then looking at the tail of the postgres serverlog. My best advice is to load some sample data in your schema on your hardware and give it a try. Hope this helps, -- Joe ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
[GENERAL] Primary keys and speed
Hello. I am looking at useing uuid's as primary keys rather than a normal sequence of numbers. The uuids are long text strings like so: 7559e648-a29c-11d5-952f-00c026a18838 The reason for useing them is that it's almost gaurenteed that imported data from another site is going to have a unique identifier. But I was wondering if this will impact on the speed of the database. In the long run the application does not need to be blindingly fast as 99% of the time it is waiting on human interaction. Any ideas? -- Rob Brown-Bayliss ---==o==--- ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly