[sqlalchemy] similiraty search

2011-01-29 Thread NiL
Hi all,

let's say I have documents and keywords, join by a many to many
relation

when I display a document detail, I wish to also display other
document that have a maximum of common keyword.

so, Doc1 had kwA, kwB and kwC

if Doc2 has all 3 kw, and Doc3 2 kw only, I'd like to request them in
this order

Is it posible ?


Now, I would be happy enough with an answer to this first scenario,
but 

we could imagine that the kw themselves have values, kwA is twice the
weight of kwB


Of course I'm in a sqla environement

regards

NiL

-- 
You received this message because you are subscribed to the Google Groups 
sqlalchemy group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to 
sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/sqlalchemy?hl=en.



Re: [sqlalchemy] similiraty search

2011-01-29 Thread Michael Bayer

On Jan 29, 2011, at 3:14 PM, NiL wrote:

 Hi all,
 
 let's say I have documents and keywords, join by a many to many
 relation
 
 when I display a document detail, I wish to also display other
 document that have a maximum of common keyword.
 
 so, Doc1 had kwA, kwB and kwC
 
 if Doc2 has all 3 kw, and Doc3 2 kw only, I'd like to request them in
 this order
 
 Is it posible ?
 
 
 Now, I would be happy enough with an answer to this first scenario,
 but 
 
 we could imagine that the kw themselves have values, kwA is twice the
 weight of kwB
 
 
 Of course I'm in a sqla environement

You're in luck since this is a fun relational problem -  describing the answer 
in such a way that illustrates how I come up with it is a little time 
consuming.The first thing that stands out is selecting records from a table 
that are similar to records in that same table.This says that some table is 
going to be related to itself, in this case its the association table between 
documents and keywords, and aliases will be used to use the same table twice in 
one query, with conditions that relate them together and exclude their equality.

The other thing is that the query is aware of how many of something there 
are, so its a given that COUNT() and GROUP BY will be used, which then leads 
inevitably that some subquery will exist that gets the raw counts and ids which 
then relates back to the main thing we're selecting (the pattern with 
count/group by in a subquery that relates back to related entities is very 
common, and an example is at 
http://www.sqlalchemy.org/docs/orm/tutorial.html#using-subqueries ).   

This is quicker to show straight in SQL.   Here's some tables 'd', 'k', and 
'd_k':

sqlite create table d (id integer primary key, name varchar(10));
sqlite create table k (id integer primary key, name varchar(10));
sqlite create table d_k (d_id integer references d(id), k_id integer 
references k(id), primary key (d_id, k_id));

here's data like you describe:

d's:

sqlite insert into d (id, name) values (1, 'd1');
sqlite insert into d (id, name) values (2, 'd2');
sqlite insert into d (id, name) values (3, 'd3');

k's:

sqlite insert into k (id, name) values (1, 'k1');
sqlite insert into k (id, name) values (2, 'k2');
sqlite insert into k (id, name) values (3, 'k3');

d1 has 3 kw, d2 has 3 kw, d3 has 2 kw:

sqlite insert into d_k (d_id, k_id) values (1, 1);
sqlite insert into d_k (d_id, k_id) values (1, 2);
sqlite insert into d_k (d_id, k_id) values (1, 3);
sqlite insert into d_k (d_id, k_id) values (2, 1);
sqlite insert into d_k (d_id, k_id) values (2, 2);
sqlite insert into d_k (d_id, k_id) values (2, 3);
sqlite insert into d_k (d_id, k_id) values (3, 2);
sqlite insert into d_k (d_id, k_id) values (3, 3);

First thing we figure out, how to get all the 'd's that have keywords 
overlapping to d1.   We'll relate d_k to itself using two aliases:

sqlite select d.*, k.name 
   ... from d join d_k as rel_d on d.id=rel_d.d_id
   ... join k on k.id=rel_d.k_id 
   ... join d_k as our_d on rel_d.k_id=our_d.k_id
   ... where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2|d2|k1
2|d2|k2
3|d3|k2
2|d2|k3
3|d3|k3

Above we want to select d's (d.*), the keyword they relate on (k.name), to get 
those we select d-d_k as rel_d-k .   Then to relate rel_d to d1, join to 
d_k as our_d on rel_d.k_id=our_d.k_id where our_d.d_id=1, and also exclude d1 
from the result with rel_d.d_id!=1, or more generally 
rel_d.d_id!=our_d.d_id. It's again a common pattern when selecting from a 
table relating to itself to add a left_side.id != right_side.id which 
excludes the equivalence row.

When I know I'm going to be using COUNT to count something, I usually have to 
test the queries without the COUNT to wrap my head around what it is I want to 
be counting.  in this case, we can see the above result listed out 'd2' three 
times and 'd3' twice.  We want to count that and package that information up.   
 The information we want is, the d ids and how many 'k's each has in common 
with our parent 'd'. We don't actually need the k.name at this point, so 
taking out k.name:

sqlite select d.id
   ... from d join d_k as rel_d on d.id=rel_d.d_id
   ... join d_k as our_d on rel_d.k_id=our_d.k_id 
   ... where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2
2
3
2
3

those are our related 'd' ids, now we count:

sqlite select d.id, count(d.id) as d_count
   ... from d join d_k as rel_d on d.id=rel_d.d_id
   ... join d_k as our_d on rel_d.k_id=our_d.k_id 
   ... where our_d.d_id=1 and rel_d.d_id!=our_d.d_id 
   ... group by d.id;
2|3
3|2

That's the information we want.  Notice we're only getting 'd.id' here, not the 
whole 'd' row.  If we were using MySQL, MySQL would let us GROUP BY just 'd.id' 
and request the rest of the row; MySQL lets us do that since it assumes we know 
that we're grouping on unique keys (and if we're not, it in fact returns 
non-deterministic results, which is the ultimate no-no