[GENERAL] How best to represent relationships in a database generically?

2007-07-27 Thread Lincoln Yeoh

Hi,

Sorry, this really isn't postgresql specific, but I figure there are 
lots of smarter people around here.


Say I have lots of different objects (thousands or even millions?). 
Example: cow, grass, tiger, goat, fish, penguin.


BUT I'm not so interested in defining things by linking them to 
categories or giving them names, I'm trying to figure out a way to 
define things by their relationships with other things, and more 
importantly do searches and other processing by those relationships.


So, what would be the best way to store them so that a search for the 
relationship like grass is to cow, will also turn up cow is to 
tiger, and goat is to tiger, and fish is to penguin (and penguin is 
to bigger fish ;) ), and electricity is to computer. And a search for 
cow is to goat, could turn up tiger is to lion, and goat is to cow.


Is the only way to store all the links explicitly? e.g. have a huge 
link table storing stuff like obj = cow, subj = grass, type = 
consumes, probability=90% ( = means points/links to). Or even just 
have one table (links are objects too).


Or is it possible to somehow put the objects in a multidimensional 
space (1000 dimensions?) and keep trying to arrange the objects so 
that their relationships/vectors with/from each other are fairly 
consistent/reasonable based on current knowledge? Trouble is in 
some cases the grass eventually eats the cow, so maybe that doesn't 
work at all ;).


Or even do both? Maybe use the first as a cache, and the second for 
deeper stuff (flash of insight or got the punchline = figure 
out better arrangement/ joining of disparate maps).


My worry about the first approach is that the number of links might 
go up very much faster as you add more objects. But perhaps this 
won't be true in practice. The worry about the second approach is 
that it might get stuck, or run out of dimensions.


Is there a better way to do this? There must be right?

Wait for huge quantum computers and use qubits for each 
multidimensional coordinate? ;).


Regards,
Link.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] How best to represent relationships in a database generically?

2007-07-27 Thread Ted Byers
Here are just a few observations.
   
  What you have described is a classic taxonomy problem.  The simplest approach 
is to create a table to contain the basic attributes of the obtjects in the 
database.  Using your example, such attributes may be anatomical (two eyes, two 
ears, four limbs, one backbone, c.), ecological (categories of food items, 
such as plant, animal, fungal, c., temperature tolerances, c.), conventional 
taxonomic data (such as genus, family, class, phylum, c.).  You might even get 
creative and choose to use additional novel or unconventional kinds of 
attributes.  And you could go so far as to add a dynamic classification scheme 
based on conventional or novel numeric taxonomic algorithms, so the data itself 
provides a natural classification hierarchy.  Additional tables may provide 
more useful predictive capability, such as a table representing empiric 
observations (based either on field observation or analysis of gut contents) of 
what eats what (e.g. on date x animal y was observed
 eating z by Fred at location Q).  Analysis of such a table could provide bases 
for inferences such as animal x eats mammals of the family or animal y eats 
only insects, and all insects that will fit in its mouth.  This avoids a need 
for predefined relationships, and a suite of probable relationships can be 
constructed directly from the basic data.  It also provides support for a 
measure of predictive capability.  For example, if animal x has been observed 
eating mule deer, you might predict that it may also eat white tail deer where 
they coexist even if the animal has not been observed eating white tail deer.
   
  I'd suggest avoiding the route involving defining very high dimensional 
spaces.  That gets unweldy very quickly, and you'll find that most of such high 
dimensional spaces is empty.
   
  The probability of observing grass eating a cow is identically 0.  But the 
probability of observing pitcher plants, sun dews, or venus fly traps eating 
insects is identically 1, on the condition you make an effort to study 
carnivorous plants.  :-)  The number of possible relationships between objects 
increases as n! where n is the number of objects, but the number of kinds of 
relationships will be small.
   
  No one can tell you what the right approach is or what a better way to do 
what you want to do might be, at least from what you have said so far.  You 
haven't said precisely what you want to accomplish.  Is it simple information 
management, to tell your users what they already know?  Or perhaps it is a 
research tool.  Or maybe it is simple record keeping, and may rarely if ever be 
used except for audit purposes.
   
  The only way to provide the kind of answers you want, regarding a comparison 
of options for managing and using your data, is to do a formal analysis of your 
problem and how your users may want to use your product.  You might want to 
begin with a problem definition, from the perspective of a client or end user, 
and then work up a few use case scenarios.  And then, from that, you can work 
up a prototype data model with an entity-relationship diagram (NB: these 
relationships may or may not be related to the relationships you want to 
capture from your data - there are different kinds of relationships just as 
there are different kinds of models - don't you just love the fact natural 
language is multivocal?).  Until this homework is done, the best you'll likely 
get are a few general observations of the sort I have offered (and I do so from 
the perspective of an ecologist who happens to develop software).  You will 
want to ensure that any example you offer to illustrate what you
 want to do is directly related to the actual problem you are trying to solve.  
Otherwise the advice you get may be inappropriate.  The data model I'd use for 
a decision support tool for government officials responsible for environmental 
protection (such as at the EPA in the US) will be quite different from what I'd 
use for a university researcher, and both will be different from what I'd use 
for a financial consultant or a human resources office.  What you need to do 
will be as dependant on what real world problem you're trying to solve as it is 
on what data is available.  A tool to support research in the life sciences 
will probably not want to store relationships among living things explicitly, 
preferring to infer them from data analysis, but a tool for a government 
environmental regulator will instead likely want to store them explicitly (e.g. 
as input to simulation models used to assess dose and risk issues).
   
  Why wait for quantum computers?  Don't sell the computers available today 
short.  You can run mesoscale weather models on a modern desktop (the ones that 
used to run on mainframes only a few years ago, not the highly detailed ones 
used on mainframes now).  I have done ecosystem scale modeling using desktop 
machines, even those available ten years ago.  

Re: [GENERAL] How best to represent relationships in a database generically?

2007-07-27 Thread Edward Macnaghten

Lincoln Yeoh wrote:

Hi, importantly do searches and other processing by those relationships.

So, what would be the best way to store them so that a search for the 
relationship like grass is to cow, will also turn up cow is to 
tiger, and goat is to tiger, and fish is to penguin (and penguin is to 
bigger fish ;) ), and electricity is to computer. And a search for cow 
is to goat, could turn up tiger is to lion, and goat is to cow.


Is the only way to store all the links explicitly? e.g. have a huge 
link table storing stuff like obj = cow, subj = grass, type = 
consumes, probability=90% ( = means points/links to). Or even just 
have one table (links are objects too).


Hi

This is a generic database design problem rather than a Postgres or SQL 
one, but here goes


Excuse ASCII art..

What you really have is a multi - multi relationship, such as


   A  ---  B


Where A is a table containing grass, cow, fish
and B is the table containing cow, tiger and penguin

I know, A and B are the same table, so the multi - relationship is in fact

  A  A

As you cannot have a multi-multi relationship in a RDBMS, you need a 
link table...


 A  --- C B

or more precisely

A  --- C  --- A

This would be represented as tables as something like

create table thingy (
  thingy_key varchar(12)  primary key,
  thingy_desc  varchar(30)
  
 );
or whatever

and...

create table munchies (
 eater varchar(12) not null,
 dinner varchar(12)  not null
 probablility_pc number(4,2)
 constraing pkey_munchies primary key(eater, dinner) );

or whatever, where eater and dinner are foreign keys for thingy_key

The munchies table can get big, but do not worry about that.  It is 
small and RDBMS (especially Postgres) should handle it well even on a 
smallish machine.


Hope that makes sense

Eddy



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match