RE: [sqlite] temp_store assumptions

CARIOTOGLOU MIKE Fri, 08 Oct 2004 08:08:42 -0700

> 
> You are right - once you have the keywords it is a simple AND query. 
> The tricky part is once you've selected "vacation", what are the 
> keywords (the subset) that will be displayed to the user? That query 
> is:
> 
> - select all photos that have the keyword 'vacation'
> - return all (distinct) keywords of all those photos
> 
> This query is performed often - it is what I'm trying to optimise. It 
> gets slightly more complex as you drill down. For example, say 
> "vacation" and "sports" are selected. The query is then:
> 
> - select all photos that have the keywords 'vacation' and 'sports'
> - return all (distinct) keywords of all those photos
>


ok, if I understand you correctly, I think I have a nice solution, which
should be very fast, and neat.

I am assuming the following schema:

create table photo(
id integer primary key,
data varchar(20)
);

create table keyword(
id integer primary key,
data varchar(20)
);

create table pk(
idPhoto integer,
idKeyword integer);

create index pk_1 on pk (idPhoto); (these indexes are crucial to fast
operation)
create index pk_2 on pk (idkeyword);      >>       >>


I am assuming the following GUI:

say we have a window that is split into two panes. The left hand pane
contains a tree structure with keywords. The right
hand pane is a grid that shows the selected photos.
The semantics of the treeview as as follows:

The first level of the tree contains a distinct list of all available
keywords. each node is initially collapsed.
when a node is selected, the right-hand pane fills with the photos that have
this keyword (we are at the top level, still).
The tricky part, which I did not understand from your initial description,
is "what are the children of a keyword node".
I am assuming that what you want is this: the children of the keyword are
those keywords that exist in all photos that have the active keyword. in
other words, if I select "vacation", and this gives me 10 photos, and these
photos in turn have each 5 more keywords, I expect to see a list of max 50
keywords (minus any non-distinct). If this is what you are doing, it is very
clever, because it assures that your user queries will always produce (some)
results.

If my assumptions are correct, then you have two different issues to handle
:

a. given a list of keywords, which photos contain all of them ?
this is needed in order to populate the right hand pane, and I think that it
is NOT a simple AND thing

b. given a list of keywords (which is produced by traversing the tree from
the root up to the selected node), which keywords should appear as children
of this node?

it turns out that there are very simple sql constructs that can give these
results with a few statements, instead of a programmatic solution which
would indeed fire a large number of small subqueries. The idea is to utilize
the sql SET (asw in mathematical sets) operations, which is why SQL was
created in the first place.

Let us take issue (a) first:

say that the user has selected 2 keywords, whose IDS your tree structure
knows (it stores them as data behind each node).
further, the keyword path (in ID form is :)

100/745 (100 is vacation 745 is sports etc etc)

The following query populates your right hand side:

select idPhoto,data from pk join photo on id=idPhoto where idKeyword=100
intersect
select idPhoto,data from pk join photo on id=idPhoto where idKeyword=745

the trick is the intersect operator, which gives you the desired AND effect.
The joins are done in order to avoid re-querying for the actual photo data.
in a simpler,essential form, this would be :

select idPhoto from pk where idKeyword=100
intersect
select idPhoto from pk where idKeyword=745

you can see how this can be expanded for any number of keywords, and how to
build this statement dynamically, based on the
current active tree node.

Having solved (a), the solution to (b) is similar, and in fact uses the same
query, only as a sub-query:

what you need here is to find the proper children keywords for keyword 745.
so, you do this:

select distinct idKeyword from pk where idPhoto in
(
select idPhoto from pk where idKeyword=100
intersect
select idPhoto from pk where idKeyword=745
)

notice that the subquery in this case is the same as the one that would have
run for node path 100/745. This suggests that clever caching can be applied
here. for example, you could have saved the query that you run for the right
hand side (the results pane), and re-use it:

create temp table x as 
 select idPhoto,data from pk join photo on id=idPhoto where idKeyword=100
 intersect
 select idPhoto,data from pk join photo on id=idPhoto where idKeyword=745


then, 
select * from x 
populates your RHS, and

select distinct idKeyword from pk where idPhoto in (Select idPhoto from x)
populates your tree sub-nodes

and, in fact you only need to run the (b) query the first time a node is
expanded, not all the time, assuming static data.

so, to finalize, in order to service a GUI such as the one I described, (and
as I understand it, your requirements are similar if not identical), you
need at most 3 queries, plus a cleanup (drop table x). I run a simple test,
with the following
data :

10,000 photos
1,000 keywords

each photo has 10 keywords, randomly selected from the 1,000 available.

on my machine, none of the above queries took more than 100 ms, and this is
with a disk-based db.so, node navigation should cost you 200-300 ms, which
is probably acceptable. Assuming a normal distribution of keywords, this
solution should scale to millions of photos, with no perceptible
degradation. of course, if ALL your photos have the same keyword, the
intermeddiate results would be enormous. in such a case, you should amend
your keyword table with a count of how many photos actually use this
keyword, and use this to optimize the queries. since the INTERSECT technique
can also be written with IN, you can "plan" the execution path in such a way
that the smallest result set is evaluated first. but this may not be
necessary, try the technique with real data, and see if it is needed.

let me know what you think of this approach.

RE: [sqlite] temp_store assumptions

Reply via email to