> > You are right - once you have the keywords it is a simple AND query. > The tricky part is once you've selected "vacation", what are the > keywords (the subset) that will be displayed to the user? That query > is: > > - select all photos that have the keyword 'vacation' > - return all (distinct) keywords of all those photos > > This query is performed often - it is what I'm trying to optimise. It > gets slightly more complex as you drill down. For example, say > "vacation" and "sports" are selected. The query is then: > > - select all photos that have the keywords 'vacation' and 'sports' > - return all (distinct) keywords of all those photos >
ok, if I understand you correctly, I think I have a nice solution, which should be very fast, and neat. I am assuming the following schema: create table photo( id integer primary key, data varchar(20) ); create table keyword( id integer primary key, data varchar(20) ); create table pk( idPhoto integer, idKeyword integer); create index pk_1 on pk (idPhoto); (these indexes are crucial to fast operation) create index pk_2 on pk (idkeyword); >> >> I am assuming the following GUI: say we have a window that is split into two panes. The left hand pane contains a tree structure with keywords. The right hand pane is a grid that shows the selected photos. The semantics of the treeview as as follows: The first level of the tree contains a distinct list of all available keywords. each node is initially collapsed. when a node is selected, the right-hand pane fills with the photos that have this keyword (we are at the top level, still). The tricky part, which I did not understand from your initial description, is "what are the children of a keyword node". I am assuming that what you want is this: the children of the keyword are those keywords that exist in all photos that have the active keyword. in other words, if I select "vacation", and this gives me 10 photos, and these photos in turn have each 5 more keywords, I expect to see a list of max 50 keywords (minus any non-distinct). If this is what you are doing, it is very clever, because it assures that your user queries will always produce (some) results. If my assumptions are correct, then you have two different issues to handle : a. given a list of keywords, which photos contain all of them ? this is needed in order to populate the right hand pane, and I think that it is NOT a simple AND thing b. given a list of keywords (which is produced by traversing the tree from the root up to the selected node), which keywords should appear as children of this node? it turns out that there are very simple sql constructs that can give these results with a few statements, instead of a programmatic solution which would indeed fire a large number of small subqueries. The idea is to utilize the sql SET (asw in mathematical sets) operations, which is why SQL was created in the first place. Let us take issue (a) first: say that the user has selected 2 keywords, whose IDS your tree structure knows (it stores them as data behind each node). further, the keyword path (in ID form is :) 100/745 (100 is vacation 745 is sports etc etc) The following query populates your right hand side: select idPhoto,data from pk join photo on id=idPhoto where idKeyword=100 intersect select idPhoto,data from pk join photo on id=idPhoto where idKeyword=745 the trick is the intersect operator, which gives you the desired AND effect. The joins are done in order to avoid re-querying for the actual photo data. in a simpler,essential form, this would be : select idPhoto from pk where idKeyword=100 intersect select idPhoto from pk where idKeyword=745 you can see how this can be expanded for any number of keywords, and how to build this statement dynamically, based on the current active tree node. Having solved (a), the solution to (b) is similar, and in fact uses the same query, only as a sub-query: what you need here is to find the proper children keywords for keyword 745. so, you do this: select distinct idKeyword from pk where idPhoto in ( select idPhoto from pk where idKeyword=100 intersect select idPhoto from pk where idKeyword=745 ) notice that the subquery in this case is the same as the one that would have run for node path 100/745. This suggests that clever caching can be applied here. for example, you could have saved the query that you run for the right hand side (the results pane), and re-use it: create temp table x as select idPhoto,data from pk join photo on id=idPhoto where idKeyword=100 intersect select idPhoto,data from pk join photo on id=idPhoto where idKeyword=745 then, select * from x populates your RHS, and select distinct idKeyword from pk where idPhoto in (Select idPhoto from x) populates your tree sub-nodes and, in fact you only need to run the (b) query the first time a node is expanded, not all the time, assuming static data. so, to finalize, in order to service a GUI such as the one I described, (and as I understand it, your requirements are similar if not identical), you need at most 3 queries, plus a cleanup (drop table x). I run a simple test, with the following data : 10,000 photos 1,000 keywords each photo has 10 keywords, randomly selected from the 1,000 available. on my machine, none of the above queries took more than 100 ms, and this is with a disk-based db.so, node navigation should cost you 200-300 ms, which is probably acceptable. Assuming a normal distribution of keywords, this solution should scale to millions of photos, with no perceptible degradation. of course, if ALL your photos have the same keyword, the intermeddiate results would be enormous. in such a case, you should amend your keyword table with a count of how many photos actually use this keyword, and use this to optimize the queries. since the INTERSECT technique can also be written with IN, you can "plan" the execution path in such a way that the smallest result set is evaluated first. but this may not be necessary, try the technique with real data, and see if it is needed. let me know what you think of this approach.