Re: [Freevo-devel] Re: improve thumbnailing

Jason Tackaberry Wed, 20 Apr 2005 16:43:24 -0700

On Wed, 2005-04-20 at 20:07 +0200, Dirk Meyer wrote:
> Adding is not a problem. If we have to index a whole directory,
> mmpython also takes its time. I need good SELECT speed (see below).


It's true that initial indexing is not as important in terms of
performance.  Still, it's worth looking at INSERT speed to see if using
a database is prohibitively slow.  Pickling is much faster there, by up
to an order of magnitude.  But, as you say,
mmpython/epeg/mplayer/whatever is slow, so relative to the overall
indexing time, the difference in performance between pickle and sqlite
is epsilon.

> No, I had a field dirname in each record. So I selected by this. Take
> a look at WIP/Dischi/mediadb for the code.

Our approach for table design is somewhat different.  I have a files
table which holds common data (basically stuff you get from stat) for
all files that are interesting (i.e. videos, images, audio) where each
file has a unique, integer id.  Then for each of the media-specific
tables, the file id is a foreign key reference to the files table.
(Well, sqlite doesn't enforce referential integrity of course, but the
design is still useful.)
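
As a rough sketch of this layout (not the actual MeBox code): the files
columns are the four from my timing test below, while the images table,
its width/height columns, the index name and the helper functions are
made up purely for illustration.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the illustrative schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        -- Common, stat-like data for every interesting file.
        CREATE TABLE files (
            file_id  INTEGER PRIMARY KEY,
            dir_id   INTEGER NOT NULL,
            filename TEXT NOT NULL,
            mtime    INTEGER NOT NULL
        );
        CREATE INDEX files_dir_idx ON files (dir_id);

        -- Hypothetical media-specific table. file_id refers back to
        -- files.file_id; whether sqlite enforces the reference depends
        -- on the sqlite version and its foreign_keys pragma.
        CREATE TABLE images (
            file_id INTEGER PRIMARY KEY REFERENCES files (file_id),
            width   INTEGER,
            height  INTEGER
        );
    """)
    return conn

def add_image(conn, dir_id, filename, mtime, width, height):
    """Insert the common row first, then the image-specific row."""
    cur = conn.execute(
        "INSERT INTO files (dir_id, filename, mtime) VALUES (?, ?, ?)",
        (dir_id, filename, mtime))
    file_id = cur.lastrowid
    conn.execute(
        "INSERT INTO images (file_id, width, height) VALUES (?, ?, ?)",
        (file_id, width, height))
    return file_id

def images_in_dir(conn, dir_id):
    """Return (filename, mtime, width, height) for one directory."""
    return conn.execute(
        "SELECT f.filename, f.mtime, i.width, i.height "
        "FROM files f JOIN images i ON i.file_id = f.file_id "
        "WHERE f.dir_id = ? ORDER BY f.filename",
        (dir_id,)).fetchall()
```

The point of the split is that a directory listing only needs the files
table, and the join is paid only when media-specific fields are wanted.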

Still, aside from data duplication, I don't think the way you've done
things should be very slow as long as you create an index on dirname,
which you did.  How much slower is this versus the pickling approach?

> > Here are some timing examples.  When dealing with 10000 simple records
> > in a files table (4 fields: file_id, dir_id, filename, mtime):
> >
> >    10k INSERTs: 1.48 seconds
> >    1 SELECT on whole directory (WHERE dir_id=1): 0.26 seconds
> 
> 0.26 seconds is a long time for that. The whole process reading the
> pickle + checking for new/changed/deleted files for 2k photos takes
> 0.03 - 0.08 seconds now. Everything except showing the menu is 0.26
> seconds. 

Well yes, but how meaningful is the above?  Firstly you're comparing my
results dealing with 10000 files with your results dealing with 2000
files.  And secondly our systems are likely rather different.

I upgraded to pysqlite2 from svn which has some performance improvements
and, on my system, doing a select on 10k records and putting those
records into a list of tuples takes 0.146 seconds (3 samples averaged).
Loading a pickle of the same data set takes 0.121 seconds (3 samples
averaged).  This is not a huge difference.
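
For anyone who wants to repeat the comparison on their own system, here
is a minimal sketch of such a measurement. It is not the script I used:
it assumes the modern stdlib sqlite3 module rather than pysqlite2 from
svn, and the file names, record contents and function names are
invented. The numbers it prints will differ from mine.

```python
import os
import pickle
import sqlite3
import tempfile
import time

def make_records(n=10000):
    """Build n simple (file_id, dir_id, filename, mtime) tuples."""
    return [(i, 1, "file%05d.jpg" % i, 1114000000 + i)
            for i in range(1, n + 1)]

def bench(records, samples=3):
    """Return (sqlite_secs, pickle_secs, sqlite_rows, pickle_rows)."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "files.db"))
        conn.execute("CREATE TABLE files (file_id INTEGER PRIMARY KEY, "
                     "dir_id INTEGER, filename TEXT, mtime INTEGER)")
        conn.execute("CREATE INDEX files_dir_idx ON files (dir_id)")
        conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", records)
        conn.commit()

        pkl_path = os.path.join(tmp, "files.pickle")
        with open(pkl_path, "wb") as f:
            pickle.dump(records, f, pickle.HIGHEST_PROTOCOL)

        def load_sqlite():
            # SELECT a whole directory into a list of tuples.
            return conn.execute(
                "SELECT file_id, dir_id, filename, mtime "
                "FROM files WHERE dir_id = 1").fetchall()

        def load_pickle():
            with open(pkl_path, "rb") as f:
                return pickle.load(f)

        def average(fn):
            # Average wall-clock time over `samples` runs.
            start = time.perf_counter()
            for _ in range(samples):
                result = fn()
            return (time.perf_counter() - start) / samples, result

        sqlite_secs, sqlite_rows = average(load_sqlite)
        pickle_secs, pickle_rows = average(load_pickle)
        conn.close()
    return sqlite_secs, pickle_secs, sqlite_rows, pickle_rows

if __name__ == "__main__":
    s, p, _, _ = bench(make_records())
    print("sqlite SELECT: %.3f s   pickle load: %.3f s" % (s, p))
```

Both loads run against warm OS caches after the first sample, so this
says nothing about cold-start behaviour.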

I won't dispute that pickling is faster.  The question up in the air is:
is using an sqlite backend prohibitively slower -- i.e. does the
performance loss outweigh the benefits in flexibility?  Or, perhaps
another question is: does using an SQL backend really offer much
advantage over coding query routines that work with in-memory data
structures?  Certainly we don't need the full flexibility of a
relational database.  The use-cases in terms of how the user will want
to mine the data are fairly predictable in our applications.

I'm not claiming the sqlite approach is better.  In fact I don't fully
know yet.  Some in-memory caching of data will still be needed, so
sqlite might complicate the design to a degree that outweighs the
benefits it brings for querying.  I'll have to do some more
playing to see what's the right solution for MeBox.  But in my initial
tests, sqlite performs well enough to warrant consideration.

> You don't need os.listdir. If the directory mtime is still the same
> (you need an extra SELECT to get that information, I don't), you can
> skip checking for new/deleted files.

Yes, you're right of course.
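
A minimal sketch of that check, assuming the cached directory mtime and
file list are already at hand (the function name and return shape are
invented for illustration):

```python
import os

def scan_directory(path, cached_mtime, cached_files):
    """Return (dir_mtime, files, rescanned).

    If the directory's mtime matches the cached value, no entries were
    added, removed or renamed, so os.listdir is skipped entirely.
    """
    dir_mtime = os.stat(path).st_mtime
    if cached_mtime is not None and dir_mtime == cached_mtime:
        return dir_mtime, cached_files, False
    return dir_mtime, sorted(os.listdir(path)), True
```

Note that the directory mtime only covers new/deleted/renamed entries.
A file whose contents changed in place does not touch it, so detecting
changed files still needs a stat per file.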

> Yes, I add a progress box when there are more than x changed/new
> items. 

I don't want any popups for this sort of thing.  The only exception is
if the number of files in the directory is unusually large and it will
take very long to do the initial listdir/stat.  Better to get a list of
filenames and other data that can be gotten from stat(), display the
files to the user with some "loading" icons, and fill them in as they're
loaded by the vfs asynchronously.

> OK, you won't see all thumbnails at first, but they come later and the
> gui still works.

The way I've done my vfs is that directory loads take a synchronous
timeout value, where after X seconds (0.2 by default but I may tweak
that based on how it feels) it will return, and load the remaining file
metadata asynchronously.  So if the time it takes to get the
filelist/stats is 0.05 seconds, it has 0.15 seconds to load metadata and
thumbnails.  This means that you will see thumbnails at first, but only
as many as you can load in 0.2 seconds.  The rest will get loaded in the
background and the UI updated as it goes.
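
Roughly, the idea looks like the sketch below. This is not the MeBox
vfs itself: the function and callback names are hypothetical, and a
plain thread stands in for whatever asynchronous mechanism the real
code uses.

```python
import os
import threading
import time

def load_directory(path, load_metadata, on_loaded, timeout=0.2):
    """Load as much metadata as fits in `timeout` seconds, then return.

    load_metadata(filepath) is the slow per-file call; on_loaded(name,
    metadata) is invoked from a background thread for every file that
    did not make it into the synchronous window.
    """
    # The clock starts before the listing, so a slow listdir eats into
    # the time left for metadata.
    deadline = time.monotonic() + timeout
    names = sorted(os.listdir(path))

    loaded = {}
    index = 0
    while index < len(names) and time.monotonic() < deadline:
        name = names[index]
        loaded[name] = load_metadata(os.path.join(path, name))
        index += 1
    remaining = names[index:]

    def finish():
        for name in remaining:
            on_loaded(name, load_metadata(os.path.join(path, name)))

    worker = threading.Thread(target=finish, daemon=True)
    worker.start()
    return loaded, worker
```

The caller draws the items in `loaded` right away and shows "loading"
icons for the rest.  Since on_loaded runs in the worker thread here, a
real UI would have to hand the update over to its main loop.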

>  And if you pre-cache the thumbnails with a helper, you don't have that 
> problem. 

Well, if the helper is in another process, pre-caching isn't going to
help much.  (Unless by pre-cache you mean a read-ahead at the OS level.)

> browse it. I have a DVD with the 2k photos. Indexing takes too much
> time, and having to wait for the thumbnails to be created as well will
> be too much for the user. But there is not much time between inserting
> the disc and showing it. You can't force the user to wait until an
> extra app has checked the disc.

I'm not sure I follow here.  With a fully asynchronous design, there is
no such wait.  Of course the user will have to wait for the directory to
be fully indexed, but the UI won't block while this is being done.

> The helper sounds nice, but it should be optional.

Why?  The user doesn't need to know about such design details.

I wonder if we're talking about different things.  I mean, I'm not
suggesting how Freevo should be designed, but rather how I am doing the
vfs for MeBox.  Maybe we can use ideas from each other. :)

> > Also if there are other processes, say a web server, that wants
> > to do directory monitoring, it can just talk to the monitor process via
> > IPC, so we don't have multiple processes polling the same directory.
> 
> Replace mbus with IPC and I like your idea :)

Well, I meant IPC as a general term.  mbus can be used for IPC. :)

> P.S.: The signature is random, but fits
> 
> -- 
> "Everything should be made as simple as possible, but not simpler."
>         (A.Einstein)

True. :)

Cheers,
Jason.



