RE: [sqlite] temp_store assumptions

2004-10-08 Thread CARIOTOGLOU MIKE
> 
> You are right - once you have the keywords it is a simple AND query. 
> The tricky part is once you've selected "vacation", what are the 
> keywords (the subset) that will be displayed to the user? That query 
> is:
> 
> - select all photos that have the keyword 'vacation'
> - return all (distinct) keywords of all those photos
> 
> This query is performed often - it is what I'm trying to optimise. It 
> gets slightly more complex as you drill down. For example, say 
> "vacation" and "sports" are selected. The query is then:
> 
> - select all photos that have the keywords 'vacation' and 'sports'
> - return all (distinct) keywords of all those photos
> 

ok, if I understand you correctly, I think I have a nice solution, which
should be very fast, and neat.

I am assuming the following schema:

create table photo(
id integer primary key,
data varchar(20)
);

create table keyword(
id integer primary key,
data varchar(20)
);

create table pk(
idPhoto integer,
idKeyword integer);

create index pk_1 on pk (idPhoto); (these indexes are crucial to fast
operation)
create index pk_2 on pk (idkeyword);  >>   >>


I am assuming the following GUI:

say we have a window that is split into two panes. The left hand pane
contains a tree structure with keywords. The right
hand pane is a grid that shows the selected photos.
The semantics of the treeview as as follows:

The first level of the tree contains a distinct list of all available
keywords. each node is initially collapsed.
when a node is selected, the right-hand pane fills with the photos that have
this keyword (we are at the top level, still).
The tricky part, which I did not understand from your initial description,
is "what are the children of a keyword node".
I am assuming that what you want is this: the children of the keyword are
those keywords that exist in all photos that have the active keyword. in
other words, if I select "vacation", and this gives me 10 photos, and these
photos in turn have each 5 more keywords, I expect to see a list of max 50
keywords (minus any non-distinct). If this is what you are doing, it is very
clever, because it assures that your user queries will always produce (some)
results.

If my assumptions are correct, then you have two different issues to handle
:

a. given a list of keywords, which photos contain all of them ?
this is needed in order to populate the right hand pane, and I think that it
is NOT a simple AND thing

b. given a list of keywords (which is produced by traversing the tree from
the root up to the selected node), which keywords should appear as children
of this node?

it turns out that there are very simple sql constructs that can give these
results with a few statements, instead of a programmatic solution which
would indeed fire a large number of small subqueries. The idea is to utilize
the sql SET (asw in mathematical sets) operations, which is why SQL was
created in the first place.

Let us take issue (a) first:

say that the user has selected 2 keywords, whose IDS your tree structure
knows (it stores them as data behind each node).
further, the keyword path (in ID form is :)

100/745 (100 is vacation 745 is sports etc etc)

The following query populates your right hand side:

select idPhoto,data from pk join photo on id=idPhoto where idKeyword=100
intersect
select idPhoto,data from pk join photo on id=idPhoto where idKeyword=745

the trick is the intersect operator, which gives you the desired AND effect.
The joins are done in order to avoid re-querying for the actual photo data.
in a simpler,essential form, this would be :

select idPhoto from pk where idKeyword=100
intersect
select idPhoto from pk where idKeyword=745

you can see how this can be expanded for any number of keywords, and how to
build this statement dynamically, based on the
current active tree node.

Having solved (a), the solution to (b) is similar, and in fact uses the same
query, only as a sub-query:

what you need here is to find the proper children keywords for keyword 745.
so, you do this:

select distinct idKeyword from pk where idPhoto in
(
select idPhoto from pk where idKeyword=100
intersect
select idPhoto from pk where idKeyword=745
)

notice that the subquery in this case is the same as the one that would have
run for node path 100/745. This suggests that clever caching can be applied
here. for example, you could have saved the query that you run for the right
hand side (the results pane), and re-use it:

create temp table x as 
 select idPhoto,data from pk join photo on id=idPhoto where idKeyword=100
 intersect
 select idPhoto,data from pk join photo on id=idPhoto where idKeyword=745


then, 
select * from x 
populates your RHS, and

select distinct idKeyword from pk where idPhoto in (Select idPhoto from x)
populates your tree sub-nodes

and, in fact you only need to run the (b) query the first time a node is
expanded, not all the time, assuming static data.

so, to finalize, in order to service a GUI such as the one I 

Re: [sqlite] temp_store assumptions

2004-10-08 Thread Demitri Muna
Mike,
Thank you very much for your useful reply.
the way to do this is to use a second db instance, open it to 
:MEMORY:, and
copy the data (once) to this new db. then, you will have a RAM-based 
db,
which should be fast.
Yes, this is what I was hoping to do with the first method. Thanks - 
I'll give this a try.

I am not sure I understand your GUI topology, though. dirlling down is 
a
process of elimination, as I understand it, which means executing 
queries
against progressively smaller result sets (temporary tables would do 
fine
here), and should be very fast with proper indexes. once I select
"vacation", I have a list of vacation photos. What is the user's next 
action
? do they select "sports" ? If so, is this meant to be "select photos 
with
vacation AND sports keywords" ? if so, this is trivial and very fast. 
just
do a join with the previous result set (saved in a temporary table).
You are right - once you have the keywords it is a simple AND query. 
The tricky part is once you've selected "vacation", what are the 
keywords (the subset) that will be displayed to the user? That query 
is:

- select all photos that have the keyword 'vacation'
- return all (distinct) keywords of all those photos
This query is performed often - it is what I'm trying to optimise. It 
gets slightly more complex as you drill down. For example, say 
"vacation" and "sports" are selected. The query is then:

- select all photos that have the keywords 'vacation' and 'sports'
- return all (distinct) keywords of all those photos
and, are you sure this is a hierarchical data base ? it sounds like the
typical many-to-many with split relation, which cannot be represented
hirerarchically, since the data is *not* hierarchical in nature.
No, the data is not inherently hierarchical at all. I'm kind of faking 
it by massaging it into something that looks hierarchical. I've had a 
very hard time communicating what I'm trying to accomplish, but I do 
have working code that I'm happy with. I just need to speed up my 
searches by an order of magnitude or two. I'm trying to do it in memory 
since I have not yet come up with a clever way to simplify the actual 
queries.

Thanks again for your help!
Demitri


Re: [sqlite] temp_store assumptions

2004-10-07 Thread mike cariotoglou
>
> - set the pragma "temp_store" to MEMORY
> - CREATE TEMPORARY VIEW temp_table AS ()
>
> I am assuming that temp_table is completely in memory, and any queries
> against it will not go back to the disk. Since the table itself is
> small, I am hoping that the overhead of reading the entire table into
> RAM will be shadowed by the speed improvement of the largish number of
> queries that don't have to go to the disk. Does this sound reasonable?
>
> --
I suspect not. creating a view does not actually generate any data it is
only when you SELECT from that view that any
data access happens. so your scheme is not caching into RAM at all.

the way to do this is to use a second db instance, open it to :MEMORY:, and
copy the data (once) to this new db. then, you will have a RAM-based db,
which should be fast.

I am not sure I understand your GUI topology, though. dirlling down is a
process of elimination, as I understand it, which means executing queries
against progressively smaller result sets (temporary tables would do fine
here), and should be very fast with proper indexes. once I select
"vacation", I have a list of vacation photos. What is the user's next action
? do they select "sports" ? If so, is this meant to be "select photos with
vacation AND sports keywords" ? if so, this is trivial and very fast. just
do a join with the previous result set (saved in a temporary table).

and, are you sure this is a hierarchical data base ? it sounds like the
typical many-to-many with split relation, which cannot be represented
hirerarchically, since the data is *not* hierarchical in nature.

it may well be that you have a pseudo-problem, created by a GUI that does
not match the nature of the data.



[sqlite] temp_store assumptions

2004-10-07 Thread Demitri Muna
Hello,
I have a large number of queries (maybe up to a few hundred at a time) 
that I am making against a single table. When performing these queries, 
the table that I query does not change. The table itself is small (one 
of the rows is a foreign key). I am trying to get better performance 
out of this process; I had an idea and want to see if all of my 
assumptions are correct before I code it (rather than come up with 
another method).

- set the pragma "temp_store" to MEMORY
- CREATE TEMPORARY VIEW temp_table AS ()
I am assuming that temp_table is completely in memory, and any queries 
against it will not go back to the disk. Since the table itself is 
small, I am hoping that the overhead of reading the entire table into 
RAM will be shadowed by the speed improvement of the largish number of 
queries that don't have to go to the disk. Does this sound reasonable?

--
In anticipation of someone asking what I am actually doing to see if 
there is a better way to approach my problem, I'll elaborate here, but 
this is extra credit. :)

Imagine a program that catalogs photographs and assigns keywords to 
them (think iPhoto). A simple way to describe this is to have three 
tables: photo, photo_to_keyword, and keyword. Simple. Now, if you think 
about it this information can be displayed in a hierarchical manner. 
Imagine the list of all distinct keywords. Select "Vacation" - this 
will filter the full list of photos to ones that have that keyword. Of 
all the photos that have the keyword "Vacation", some will be 
"Bermuda", some will be "Podunk, Nebraska", etc. The UI will display 
that subset, i.e. the list of all keywords of all photos with 
"Vacation" as a keyword. Selecting the next will drill down further.

The problem that I am having is "converting" the three relational 
tables into a hierarchical data structure. I can do it, but the it 
requires some fairly complex queries, and a large number of them. The 
result works, but it's horribly slow. Since searches are being done 
against "keyword"s (strings), the query essentially has to convert the 
keyword string into a row id, go to the join table, search for all 
photos with that keyword, get the list of all keyword row ids of those 
photos, and convert that list to strings for display. This search has 
to be done not only once for each keyword, but again for the second 
level since I will need to know if any of them "branches" to a second 
level. Hence the incredibly slow results.

So, in the case above the temporary table I was hoping to make would be:
CREATE TEMPORARY VIEW photo_to_keyword_temp AS SELECT ptk.photo_id, 
k.label FROM photo_to_keyword ptk, keyword k WHERE ptk.keyword_id = 
k.id

This would at least save the trip from and back to the keyword table.
I would appreciate any suggestions or tips!!
Cheers,
Demitri Muna