On Jan 29, 2011, at 3:14 PM, NiL wrote: > Hi all, > > let's say I have documents and keywords, join by a many to many > relation > > when I display a document detail, I wish to also display other > document that have a maximum of common keyword. > > so, Doc1 had kwA, kwB and kwC > > if Doc2 has all 3 kw, and Doc3 2 kw only, I'd like to request them in > this order > > Is it posible ? > > > Now, I would be happy enough with an answer to this first scenario, > but .... > > we could imagine that the kw themselves have values, kwA is twice the > weight of kwB > > > Of course I'm in a sqla environement
You're in luck since this is a fun relational problem - describing the answer in such a way that illustrates how I come up with it is a little time consuming. The first thing that stands out is selecting records from a table that are similar to records in that same table. This says that some table is going to be related to itself, in this case its the association table between documents and keywords, and aliases will be used to use the same table twice in one query, with conditions that relate them together and exclude their equality. The other thing is that the query is aware of "how many" of something there are, so its a given that COUNT() and GROUP BY will be used, which then leads inevitably that some subquery will exist that gets the raw counts and ids which then relates back to the main thing we're selecting (the pattern with count/group by in a subquery that relates back to "related entities" is very common, and an example is at http://www.sqlalchemy.org/docs/orm/tutorial.html#using-subqueries ). This is quicker to show straight in SQL. Here's some tables 'd', 'k', and 'd_k': sqlite> create table d (id integer primary key, name varchar(10)); sqlite> create table k (id integer primary key, name varchar(10)); sqlite> create table d_k (d_id integer references d(id), k_id integer references k(id), primary key (d_id, k_id)); here's data like you describe: d's: sqlite> insert into d (id, name) values (1, 'd1'); sqlite> insert into d (id, name) values (2, 'd2'); sqlite> insert into d (id, name) values (3, 'd3'); k's: sqlite> insert into k (id, name) values (1, 'k1'); sqlite> insert into k (id, name) values (2, 'k2'); sqlite> insert into k (id, name) values (3, 'k3'); d1 has 3 kw, d2 has 3 kw, d3 has 2 kw: sqlite> insert into d_k (d_id, k_id) values (1, 1); sqlite> insert into d_k (d_id, k_id) values (1, 2); sqlite> insert into d_k (d_id, k_id) values (1, 3); sqlite> insert into d_k (d_id, k_id) values (2, 1); sqlite> insert into d_k (d_id, k_id) values (2, 2); sqlite> insert into d_k (d_id, k_id) values (2, 3); sqlite> insert into d_k (d_id, k_id) values (3, 2); sqlite> insert into d_k (d_id, k_id) values (3, 3); First thing we figure out, how to get all the 'd's that have keywords overlapping to d1. We'll relate d_k to itself using two aliases: sqlite> select d.*, k.name ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join k on k.id=rel_d.k_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id; 2|d2|k1 2|d2|k2 3|d3|k2 2|d2|k3 3|d3|k3 Above we want to select d's (d.*), the keyword they relate on (k.name), to get those we select d->"d_k as rel_d"->k . Then to relate "rel_d" to d1, join to "d_k as our_d on rel_d.k_id=our_d.k_id" where our_d.d_id=1, and also exclude d1 from the result with "rel_d.d_id!=1", or more generally "rel_d.d_id!=our_d.d_id". It's again a common pattern when selecting from a table relating to itself to add a "left_side.id != right_side.id" which excludes the equivalence row. When I know I'm going to be using COUNT to count something, I usually have to test the queries without the COUNT to wrap my head around what it is I want to be counting. in this case, we can see the above result listed out 'd2' three times and 'd3' twice. We want to count that and package that information up. The information we want is, the "d" ids and how many 'k's each has in common with our parent 'd'. We don't actually need the k.name at this point, so taking out k.name: sqlite> select d.id ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id; 2 2 3 2 3 those are our related 'd' ids, now we count: sqlite> select d.id, count(d.id) as d_count ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id ...> group by d.id; 2|3 3|2 That's the information we want. Notice we're only getting 'd.id' here, not the whole 'd' row. If we were using MySQL, MySQL would let us GROUP BY just 'd.id' and request the rest of the row; MySQL lets us do that since it assumes we know that we're grouping on unique keys (and if we're not, it in fact returns non-deterministic results, which is the ultimate no-no in relational databases). But I like to write SQL that works on Postgresql/Oracle too and is SQL compliant, and those DBs require that we GROUP BY every column that isn't sent into an aggregate function (i.e. the COUNT function). Easy enough to get our full 'd' row related to this subset. Lift and move the above query into a subquery, (<select statement>) AS d_counts, then JOIN back to 'd', order by d_counts.d_count: sqlite> select d.*, d_counts.d_count ...> FROM d JOIN ( ...> select d.id, count(d.id) as d_count ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id ...> group by d.id ...> ) AS d_counts on d.id=d_counts.id ...> ORDER BY d_counts.d_count desc; 2|d2|3 3|d3|2 and that's answer one. we just riff on this same idea, add a new column, to factor in keywords that have 'weight': sqlite> alter table k add column weight integer; sqlite> update k set weight=10 where id=1; sqlite> update k set weight=5 where id in (2, 3); let's go back to our query that got us 'd's and their keywords related to d1 and also ask for the weight: sqlite> select d.*, k.name, k.weight ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join k on k.id=rel_d.k_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id; 2|d2|k1|10 2|d2|k2|5 3|d3|k2|5 2|d2|k3|5 3|d3|k3|5 we can zap the name out again, use the sum of the weights, also an aggregate function, instead of counts for each group of common 'd' ids: sqlite> select d.id, sum(k.weight) as total_weight ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join k on k.id=rel_d.k_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id ...> group by d.id ...> ; 2|20 3|10 wrap in the subquery and bake: sqlite> select d.*, d_weights.total_weight ...> from d join ( ...> select d.id, sum(k.weight) as total_weight ...> from d join d_k as rel_d on d.id=rel_d.d_id ...> join k on k.id=rel_d.k_id ...> join d_k as our_d on rel_d.k_id=our_d.k_id ...> where our_d.d_id=1 and rel_d.d_id!=our_d.d_id ...> group by d.id ...> ) as d_weights on d.id=d_weights.id ...> order by d_weights.total_weight desc; 2|d2|20 3|d3|10 I'll illustrate a SQLAlchemy ORM query for that last one, assume entities Document, Keyword, related by relationship() using the Table object "document_keywords" as the "secondary" (i.e., the pattern illustrated at http://www.sqlalchemy.org/docs/orm/relationships.html#many-to-many ): from sqlalchemy import func rel_d = document_keywords.alias() our_d = document_keywords.alias() weight_subq = session.query(Document.id, func.sum(Keyword.weight).label('total_weight')).\ join((rel_d, Document.id==rel_d.c.d_id)).\ join((Keyword, Keyword.id==rel_d.c.k_id)).\ join((our_d, rel_d.c.k_id==our_d.c.k_id)).\ filter(our_d.c.d_id==1).\ filter(rel_d.c.d_id!=our_d.c.d_id).\ group_by(Document.id).\ subquery() related_documents = session.query(Document, weight_subq.c.total_weight).\ join((weight_subq, Document.id==weight_subq.c.id)).\ order_by(weight_subq.c.total_weight.desc()) If you notice all those join(()) calls with the tuples, the extra "tuple" won't be needed in 0.7 which is nearing beta releases. > > regards > > NiL > > -- > You received this message because you are subscribed to the Google Groups > "sqlalchemy" group. > To post to this group, send email to sqlalchemy@googlegroups.com. > To unsubscribe from this group, send email to > sqlalchemy+unsubscr...@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/sqlalchemy?hl=en. > -- You received this message because you are subscribed to the Google Groups "sqlalchemy" group. To post to this group, send email to sqlalchemy@googlegroups.com. To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.