On Wed, 2005-04-20 at 20:07 +0200, Dirk Meyer wrote:
> Adding is not a problem. If we have to index a whole directory,
> mmpython also takes it time. I need good SELECT speed (see below).

It's true that initial indexing is not as important in terms of
performance. Still, it's worth looking at INSERT speed to see if using a
database is prohibitively slow. Pickling is much faster there, by up to
an order of magnitude. But, as you say, mmpython/epeg/mplayer/whatever
is slow, so relative to the overall indexing time, the difference in
performance between pickle and sqlite is epsilon.

> No, I had a field dirname in each record. So I selected by this. Take
> a look at WIP/Dischi/mediadb for the code.

Our approach for table design is somewhat different. I have a files
table which holds common data (basically stuff you get from stat) for
all files that are interesting (i.e. videos, images, audio) where each
file has a unique, integer id. Then for each of the media-specific
tables, the file id is a foreign key reference to the files table.
(Well, sqlite doesn't enforce referential integrity of course, but the
design is still useful.)

Still, aside from data duplication, I don't think the way you've done
things should be very slow as long as you create an index on dirname,
which you did. How much slower is this versus the pickling approach?

> > Here are some timing examples. When dealing with 10000 simple records
> > in a files table (4 fields: file_id, dir_id, filename, mtime):
> >
> > 10k INSERTs: 1.48 seconds
> > 1 SELECT on whole directory (WHERE dir_id=1): 0.26 seconds
>
> 0.26 seconds is a long time for that. The whole process reading the
> pickle + checking for new/changed/deleted files for 2k photos takes
> 0.03 - 0.08 seconds now. Everything except showing the menu is 0.26
> seconds.

Well yes, but how meaningful is the above? Firstly you're comparing my
results dealing with 10000 files with your results dealing with 2000
files. And secondly our systems are likely rather different.

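To take the "different systems, different record counts" problem out of
the comparison, both approaches can be timed on the same machine with
the same data. What follows is only a rough sketch, not the code either
of us actually ran: it assumes the sqlite3 module from the Python
standard library rather than pysqlite2, the helper names (make_rows,
bench_sqlite, bench_pickle, run_benchmark) are made up for illustration,
and the record values are invented. The table layout follows the
4-field files table described above.

```python
import os
import pickle
import sqlite3
import tempfile
import time

N = 10000  # same record count as the files-table test above


def make_rows(n=N, dir_id=1):
    # (file_id, dir_id, filename, mtime); the values are invented test data.
    return [(i, dir_id, "file%05d.jpg" % i, 1114000000 + i) for i in range(n)]


def bench_sqlite(rows, db_path):
    """Time N INSERTs and one SELECT of a whole directory."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE files (file_id INTEGER PRIMARY KEY, dir_id INTEGER,"
        " filename TEXT, mtime INTEGER)"
    )
    conn.execute("CREATE INDEX files_dir_idx ON files (dir_id)")

    start = time.perf_counter()
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    insert_time = time.perf_counter() - start

    start = time.perf_counter()
    result = conn.execute(
        "SELECT file_id, dir_id, filename, mtime FROM files WHERE dir_id = ?",
        (rows[0][1],),
    ).fetchall()  # list of tuples, as in the pysqlite2 test
    select_time = time.perf_counter() - start
    conn.close()
    return insert_time, select_time, result


def bench_pickle(rows, pickle_path):
    """Time dumping and reloading the same records as one pickle."""
    start = time.perf_counter()
    with open(pickle_path, "wb") as f:
        pickle.dump(rows, f, pickle.HIGHEST_PROTOCOL)
    dump_time = time.perf_counter() - start

    start = time.perf_counter()
    with open(pickle_path, "rb") as f:
        result = pickle.load(f)
    load_time = time.perf_counter() - start
    return dump_time, load_time, result


def run_benchmark(n=N):
    rows = make_rows(n)
    with tempfile.TemporaryDirectory() as tmp:
        sq = bench_sqlite(rows, os.path.join(tmp, "files.db"))
        pk = bench_pickle(rows, os.path.join(tmp, "files.pickle"))
    return sq, pk


if __name__ == "__main__":
    (ins, sel, _), (dump, load, _) = run_benchmark()
    print("sqlite: %d INSERTs %.3fs, SELECT by dir_id %.3fs" % (N, ins, sel))
    print("pickle: dump %.3fs, load %.3fs" % (dump, load))
```

The absolute numbers will depend on the disk and the sqlite version, so
only the ratio between the two approaches on one machine says anything.
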
I upgraded to pysqlite2 from svn, which has some performance
improvements. On my system, doing a select on 10k records and putting
those records into a list of tuples takes 0.146 seconds (3 samples
averaged). Loading a pickle of the same data set takes 0.121 seconds (3
samples averaged). This is not a huge difference.

I won't dispute that pickling is faster. The question up in the air is:
is using an sqlite backend prohibitively slower -- i.e. does the
performance loss outweigh the benefits in flexibility? Or, perhaps
another question is: does using an SQL backend really offer an advantage
over coding query routines that work with in-memory data structures?
Certainly we don't need the full flexibility of a relational database.
The use-cases in terms of how the user will want to mine the data are
fairly predictable in our applications.

I'm not claiming the sqlite approach is better. In fact I don't fully
know yet. There is still going to have to be some in-memory caching of
data, so sqlite might complicate the design to a degree that outweighs
the benefits gained for querying. I'll have to do some more playing to
see what's the right solution for MeBox. But in my initial tests, sqlite
performs well enough to warrant consideration.

> You don't need os.listdir. If the directory mtime is still the same
> (you need an extra SELECT to get that information, I don't), you can
> skip checking for new/deleted files.

Yes, you're right of course.

> Yes, I add a progress box when there are more than x changed/new
> items.

I don't want any popups for this sort of thing. The only exception is if
the number of files in the directory is unusually large and it will take
very long to do the initial listdir/stat. Better to get a list of
filenames and other data that can be gotten from stat(), display the
files to the user with some "loading" icons, and fill them in as they're
loaded by the vfs asynchronously.

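To make that last idea concrete, here is a minimal sketch of "stat
first, fill in later". It is not the MeBox vfs code: the function names
(list_directory, fill_in_background) and the load_metadata/on_update
callables are hypothetical, and a plain worker thread stands in for
whatever asynchronous mechanism the vfs really uses.

```python
import os
import threading


def list_directory(path):
    """Return one entry per file using only stat() data.

    The "metadata" field is left as None, which the UI can render as a
    "loading" icon until the real metadata arrives.
    """
    entries = []
    for name in sorted(os.listdir(path)):
        st = os.stat(os.path.join(path, name))
        entries.append({
            "name": name,
            "size": st.st_size,
            "mtime": st.st_mtime,
            "metadata": None,  # None means "still loading"
        })
    return entries


def fill_in_background(path, entries, load_metadata, on_update):
    """Load metadata for each entry on a worker thread.

    load_metadata(full_path) is the slow part (a hypothetical stand-in
    for mmpython, thumbnailing, and so on); on_update(entry) is called
    after each entry has been filled in.
    """
    def worker():
        for entry in entries:
            entry["metadata"] = load_metadata(os.path.join(path, entry["name"]))
            on_update(entry)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The caller would draw the list right after list_directory() returns and
redraw single items from on_update. In a real GUI that callback runs on
the worker thread, so it would have to hand the update over to the main
loop rather than touch widgets directly.
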
> OK, you won't see all thumbnails at first, but they come later and the
> gui still works.

The way I've done my vfs is that directory loads take a synchronous
timeout value, where after X seconds (0.2 by default but I may tweak
that based on how it feels) it will return, and load the remaining file
metadata asynchronously. So if the time it takes to get the
filelist/stats is 0.05 seconds, it has 0.15 seconds to load metadata and
thumbnails. This means that you will see thumbnails at first, but only
as many as you can load in 0.2 seconds. The rest will get loaded in the
background and the UI updated as it goes.

> And if you pre-cache the thumbnails with a helper, you don't have that
> problem.

Well, if the helper is in another process, pre-caching isn't going to
help much. (Unless by pre-cache you mean a read-ahead at the OS level.)

> browse it. I have a DVD with the 2k photos. Indexing creates too much
> time, also creating thumbnails and you have to wait will be too much
> for the user. But you have not much time between inserting the disc
> and showing it. You can't force the user to wait until an extra app
> checked the disc.

I'm not sure I follow here. With a fully asynchronous design, there is
no such wait. Of course the user will have to wait for the directory to
be fully indexed, but the UI won't block while this is being done.

> The helper sounds nice, but it should be optional.

Why? The user doesn't need to know about such design details.

I wonder if we're talking about different things. I mean, I'm not
suggesting how Freevo should be designed, but rather how I am doing the
vfs for MeBox. Maybe we can use ideas from each other. :)

> > Also if there are other processes, say a web server, that wants
> > to do directory monitoring, it can just talk to the monitor process via
> > IPC, so we don't have multiple processes polling the same directory.
>
> Replace mbus with IPC and I like your idea :)

Well, I meant IPC as a general term. mbus can be used for IPC.
:)

> P.S.: The signature is random, but fits
>
> --
> "Everything should be made as simple as possible, but not simpler."
> (A.Einstein)

True. :)

Cheers,
Jason.

_______________________________________________
Freevo-devel mailing list
Freevo-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freevo-devel