On Jan 29, 2011, at 3:14 PM, NiL wrote:
Hi all,
let's say I have documents and keywords, join by a many to many
relation
when I display a document detail, I wish to also display other
document that have a maximum of common keyword.
so, Doc1 had kwA, kwB and kwC
if Doc2 has all 3 kw, and Doc3 2 kw only, I'd like to request them in
this order
Is it posible ?
Now, I would be happy enough with an answer to this first scenario,
but
we could imagine that the kw themselves have values, kwA is twice the
weight of kwB
Of course I'm in a sqla environement
You're in luck since this is a fun relational problem - describing the answer
in such a way that illustrates how I come up with it is a little time
consuming.The first thing that stands out is selecting records from a table
that are similar to records in that same table.This says that some table is
going to be related to itself, in this case its the association table between
documents and keywords, and aliases will be used to use the same table twice in
one query, with conditions that relate them together and exclude their equality.
The other thing is that the query is aware of how many of something there
are, so its a given that COUNT() and GROUP BY will be used, which then leads
inevitably that some subquery will exist that gets the raw counts and ids which
then relates back to the main thing we're selecting (the pattern with
count/group by in a subquery that relates back to related entities is very
common, and an example is at
http://www.sqlalchemy.org/docs/orm/tutorial.html#using-subqueries ).
This is quicker to show straight in SQL. Here's some tables 'd', 'k', and
'd_k':
sqlite create table d (id integer primary key, name varchar(10));
sqlite create table k (id integer primary key, name varchar(10));
sqlite create table d_k (d_id integer references d(id), k_id integer
references k(id), primary key (d_id, k_id));
here's data like you describe:
d's:
sqlite insert into d (id, name) values (1, 'd1');
sqlite insert into d (id, name) values (2, 'd2');
sqlite insert into d (id, name) values (3, 'd3');
k's:
sqlite insert into k (id, name) values (1, 'k1');
sqlite insert into k (id, name) values (2, 'k2');
sqlite insert into k (id, name) values (3, 'k3');
d1 has 3 kw, d2 has 3 kw, d3 has 2 kw:
sqlite insert into d_k (d_id, k_id) values (1, 1);
sqlite insert into d_k (d_id, k_id) values (1, 2);
sqlite insert into d_k (d_id, k_id) values (1, 3);
sqlite insert into d_k (d_id, k_id) values (2, 1);
sqlite insert into d_k (d_id, k_id) values (2, 2);
sqlite insert into d_k (d_id, k_id) values (2, 3);
sqlite insert into d_k (d_id, k_id) values (3, 2);
sqlite insert into d_k (d_id, k_id) values (3, 3);
First thing we figure out, how to get all the 'd's that have keywords
overlapping to d1. We'll relate d_k to itself using two aliases:
sqlite select d.*, k.name
... from d join d_k as rel_d on d.id=rel_d.d_id
... join k on k.id=rel_d.k_id
... join d_k as our_d on rel_d.k_id=our_d.k_id
... where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2|d2|k1
2|d2|k2
3|d3|k2
2|d2|k3
3|d3|k3
Above we want to select d's (d.*), the keyword they relate on (k.name), to get
those we select d-d_k as rel_d-k . Then to relate rel_d to d1, join to
d_k as our_d on rel_d.k_id=our_d.k_id where our_d.d_id=1, and also exclude d1
from the result with rel_d.d_id!=1, or more generally
rel_d.d_id!=our_d.d_id. It's again a common pattern when selecting from a
table relating to itself to add a left_side.id != right_side.id which
excludes the equivalence row.
When I know I'm going to be using COUNT to count something, I usually have to
test the queries without the COUNT to wrap my head around what it is I want to
be counting. in this case, we can see the above result listed out 'd2' three
times and 'd3' twice. We want to count that and package that information up.
The information we want is, the d ids and how many 'k's each has in common
with our parent 'd'. We don't actually need the k.name at this point, so
taking out k.name:
sqlite select d.id
... from d join d_k as rel_d on d.id=rel_d.d_id
... join d_k as our_d on rel_d.k_id=our_d.k_id
... where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2
2
3
2
3
those are our related 'd' ids, now we count:
sqlite select d.id, count(d.id) as d_count
... from d join d_k as rel_d on d.id=rel_d.d_id
... join d_k as our_d on rel_d.k_id=our_d.k_id
... where our_d.d_id=1 and rel_d.d_id!=our_d.d_id
... group by d.id;
2|3
3|2
That's the information we want. Notice we're only getting 'd.id' here, not the
whole 'd' row. If we were using MySQL, MySQL would let us GROUP BY just 'd.id'
and request the rest of the row; MySQL lets us do that since it assumes we know
that we're grouping on unique keys (and if we're not, it in fact returns
non-deterministic results, which is the ultimate no-no