Re: [sqlalchemy] similiraty search

Michael Bayer Sat, 29 Jan 2011 19:16:47 -0800

On Jan 29, 2011, at 3:14 PM, NiL wrote:

> Hi all,
> 
> let's say I have documents and keywords, join by a many to many
> relation
> 
> when I display a document detail, I wish to also display other
> document that have a maximum of common keyword.
> 
> so, Doc1 had kwA, kwB and kwC
> 
> if Doc2 has all 3 kw, and Doc3 2 kw only, I'd like to request them in
> this order
> 
> Is it posible ?
> 
> 
> Now, I would be happy enough with an answer to this first scenario,
> but ....
> 
> we could imagine that the kw themselves have values, kwA is twice the
> weight of kwB
> 
> 
> Of course I'm in a sqla environement


You're in luck since this is a fun relational problem -  describing the answer 
in such a way that illustrates how I come up with it is a little time 
consuming.    The first thing that stands out is selecting records from a table 
that are similar to records in that same table.    This says that some table is 
going to be related to itself, in this case its the association table between 
documents and keywords, and aliases will be used to use the same table twice in 
one query, with conditions that relate them together and exclude their equality.

The other thing is that the query is aware of "how many" of something there 
are, so its a given that COUNT() and GROUP BY will be used, which then leads 
inevitably that some subquery will exist that gets the raw counts and ids which 
then relates back to the main thing we're selecting (the pattern with 
count/group by in a subquery that relates back to "related entities" is very 
common, and an example is at 
http://www.sqlalchemy.org/docs/orm/tutorial.html#using-subqueries ).   

This is quicker to show straight in SQL.   Here's some tables 'd', 'k', and 
'd_k':

sqlite> create table d (id integer primary key, name varchar(10));
sqlite> create table k (id integer primary key, name varchar(10));
sqlite> create table d_k (d_id integer references d(id), k_id integer 
references k(id), primary key (d_id, k_id));

here's data like you describe:

d's:

sqlite> insert into d (id, name) values (1, 'd1');
sqlite> insert into d (id, name) values (2, 'd2');
sqlite> insert into d (id, name) values (3, 'd3');

k's:

sqlite> insert into k (id, name) values (1, 'k1');
sqlite> insert into k (id, name) values (2, 'k2');
sqlite> insert into k (id, name) values (3, 'k3');

d1 has 3 kw, d2 has 3 kw, d3 has 2 kw:

sqlite> insert into d_k (d_id, k_id) values (1, 1);
sqlite> insert into d_k (d_id, k_id) values (1, 2);
sqlite> insert into d_k (d_id, k_id) values (1, 3);
sqlite> insert into d_k (d_id, k_id) values (2, 1);
sqlite> insert into d_k (d_id, k_id) values (2, 2);
sqlite> insert into d_k (d_id, k_id) values (2, 3);
sqlite> insert into d_k (d_id, k_id) values (3, 2);
sqlite> insert into d_k (d_id, k_id) values (3, 3);

First thing we figure out, how to get all the 'd's that have keywords 
overlapping to d1.   We'll relate d_k to itself using two aliases:

sqlite> select d.*, k.name 
   ...>     from d join d_k as rel_d on d.id=rel_d.d_id
   ...>     join k on k.id=rel_d.k_id 
   ...>     join d_k as our_d on rel_d.k_id=our_d.k_id
   ...>     where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2|d2|k1
2|d2|k2
3|d3|k2
2|d2|k3
3|d3|k3

Above we want to select d's (d.*), the keyword they relate on (k.name), to get 
those we select d->"d_k as rel_d"->k .   Then to relate "rel_d" to d1, join to 
"d_k as our_d on rel_d.k_id=our_d.k_id" where our_d.d_id=1, and also exclude d1 
from the result with "rel_d.d_id!=1", or more generally 
"rel_d.d_id!=our_d.d_id".     It's again a common pattern when selecting from a 
table relating to itself to add a "left_side.id != right_side.id" which 
excludes the equivalence row.

When I know I'm going to be using COUNT to count something, I usually have to 
test the queries without the COUNT to wrap my head around what it is I want to 
be counting.  in this case, we can see the above result listed out 'd2' three 
times and 'd3' twice.  We want to count that and package that information up.   
 The information we want is, the "d" ids and how many 'k's each has in common 
with our parent 'd'.     We don't actually need the k.name at this point, so 
taking out k.name:

sqlite> select d.id
   ...>     from d join d_k as rel_d on d.id=rel_d.d_id
   ...>     join d_k as our_d on rel_d.k_id=our_d.k_id 
   ...>     where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2
2
3
2
3

those are our related 'd' ids, now we count:

sqlite> select d.id, count(d.id) as d_count
   ...>     from d join d_k as rel_d on d.id=rel_d.d_id
   ...>     join d_k as our_d on rel_d.k_id=our_d.k_id 
   ...>     where our_d.d_id=1 and rel_d.d_id!=our_d.d_id 
   ...>     group by d.id;
2|3
3|2

That's the information we want.  Notice we're only getting 'd.id' here, not the 
whole 'd' row.  If we were using MySQL, MySQL would let us GROUP BY just 'd.id' 
and request the rest of the row; MySQL lets us do that since it assumes we know 
that we're grouping on unique keys (and if we're not, it in fact returns 
non-deterministic results, which is the ultimate no-no in relational 
databases).   But I like to write SQL that works on Postgresql/Oracle too and 
is SQL compliant, and those DBs require that we GROUP BY every column that 
isn't sent into an aggregate function (i.e. the COUNT function).

Easy enough to get our full 'd' row related to this subset.   Lift and move the 
above query into a subquery, (<select statement>) AS d_counts, then JOIN back 
to 'd', order by d_counts.d_count:

sqlite> select d.*, d_counts.d_count
   ...> FROM d JOIN (
   ...>     select d.id, count(d.id) as d_count
   ...>         from d join d_k as rel_d on d.id=rel_d.d_id
   ...>         join d_k as our_d on rel_d.k_id=our_d.k_id 
   ...>         where our_d.d_id=1 and rel_d.d_id!=our_d.d_id 
   ...>         group by d.id
   ...> ) AS d_counts on d.id=d_counts.id 
   ...> ORDER BY d_counts.d_count desc;
2|d2|3
3|d3|2

and that's answer one.

we just riff on this same idea, add a new column, to factor in keywords that 
have 'weight':

sqlite> alter table k add column weight integer;
sqlite> update k set weight=10  where id=1;
sqlite> update k set weight=5 where id in (2, 3);

let's go back to our query that got us 'd's and their keywords related to d1 
and also ask for the weight:

sqlite> select d.*, k.name, k.weight 
   ...>      from d join d_k as rel_d on d.id=rel_d.d_id
   ...>      join k on k.id=rel_d.k_id 
   ...>      join d_k as our_d on rel_d.k_id=our_d.k_id
   ...>      where our_d.d_id=1 and rel_d.d_id!=our_d.d_id;
2|d2|k1|10
2|d2|k2|5
3|d3|k2|5
2|d2|k3|5
3|d3|k3|5

we can zap the name out again, use the sum of the weights, also an aggregate 
function, instead of counts for each group of common 'd' ids:

sqlite> select d.id, sum(k.weight) as total_weight
   ...>      from d join d_k as rel_d on d.id=rel_d.d_id
   ...>      join k on k.id=rel_d.k_id 
   ...>      join d_k as our_d on rel_d.k_id=our_d.k_id
   ...>      where our_d.d_id=1 and rel_d.d_id!=our_d.d_id
   ...>      group by d.id
   ...> ;
2|20
3|10

wrap in the subquery and bake:

sqlite> select d.*, d_weights.total_weight
   ...> from d join (
   ...>     select d.id, sum(k.weight) as total_weight
   ...>          from d join d_k as rel_d on d.id=rel_d.d_id
   ...>          join k on k.id=rel_d.k_id 
   ...>          join d_k as our_d on rel_d.k_id=our_d.k_id
   ...>          where our_d.d_id=1 and rel_d.d_id!=our_d.d_id
   ...>          group by d.id
   ...>     ) as d_weights on d.id=d_weights.id
   ...>     order by d_weights.total_weight desc;
2|d2|20
3|d3|10

I'll illustrate a SQLAlchemy ORM query for that last one, assume entities 
Document, Keyword, related by relationship() using the Table object 
"document_keywords" as the "secondary" (i.e., the pattern illustrated at 
http://www.sqlalchemy.org/docs/orm/relationships.html#many-to-many ):

from sqlalchemy import func

rel_d = document_keywords.alias()
our_d = document_keywords.alias()

weight_subq = session.query(Document.id, 
func.sum(Keyword.weight).label('total_weight')).\
                                join((rel_d, Document.id==rel_d.c.d_id)).\
                                join((Keyword, Keyword.id==rel_d.c.k_id)).\
                                join((our_d, rel_d.c.k_id==our_d.c.k_id)).\
                                filter(our_d.c.d_id==1).\
                                filter(rel_d.c.d_id!=our_d.c.d_id).\
                                group_by(Document.id).\
                                subquery()

related_documents = session.query(Document, weight_subq.c.total_weight).\
                        join((weight_subq, Document.id==weight_subq.c.id)).\
                        order_by(weight_subq.c.total_weight.desc())

If you notice all those join(()) calls with the tuples, the extra "tuple" won't 
be needed in 0.7 which is nearing beta releases.
































> 
> regards
> 
> NiL
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "sqlalchemy" group.
> To post to this group, send email to sqlalchemy@googlegroups.com.
> To unsubscribe from this group, send email to 
> sqlalchemy+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/sqlalchemy?hl=en.
> 

-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to 
sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/sqlalchemy?hl=en.

Re: [sqlalchemy] similiraty search

Reply via email to