On Thu, 2005-04-21 at 21:41 +0200, Dirk Meyer wrote:
> for each file, but from time to time). But I get a problem when I have
> an item with the info, pickle in the background with a new process and
> after that save something using the item. The memory is now different
> from the file. So yes, that is a problem I now have. 

Right.  Then you'd need some mechanism to deal with merging changes.
Could get awkward.

> Correct, still thinking about it. But sending data from process to
> process also takes time. You think it is fast, but think of 10k of
> files. 

Well, fast is relative.  10k files will take about 0.1-0.2 seconds using
my IPC class on my system.  That's not exactly _fast_ but we also need
to be realistic: who is going to have a single directory of 10k files?

> don't need to add that to every file without one). Inside a dict of
> files with the metadata. To keep unpickle fast, some stuff like the
> above cover is only stored when it is different from the directory

I find images are much faster to load directly from disk as PNGs.  I
remember doing a test and found loading a pickled image slower.


> But do we want to keep the whole pickle db in memory all the time?
> Maybe not (my pickled dir is about 6MB). So maybe load to memory
> first. 

This is actually a serious issue.  If we assume 100k records where each
record takes 1024 bytes, then that requires 100MB of memory to keep the
whole db in memory.  sqlite will surely have much saner memory
requirements.  The one area where my test was unfair is that I was comparing
an in-memory search with an on-disk search.  The fact that sqlite was
still faster says a lot here.

A pickled dir of 6MB is very small.  My media collection is also small,
but I want to design MeBox so that it scales well and performs properly
on a media collection of around 100k files and directory sizes up to 10k
files.  I've shown in my last email that approach #1 is feasible
time-wise, but I did completely ignore memory requirements (which I
realized after I sent it).

In the tests I attached in my last email, the pickle approach has a
memory requirement of 63MB and the sqlite approach uses 2MB.  That's a
huge difference.  More importantly, the amount of data stored was by no
means complete.  We can expect much more to be held in the dictionaries.
So for large collections, the in-memory query used in #1 just doesn't
scale memory-wise.

So perhaps sqlite is the way to go after all.  The design is
complicated, but it does scale better.

> The lock is a primary problem. You don't know if it would block. 

Here's a possibility: an IPC call that tells the other process, "I want
to do a query, commit what you have."  Then it does its query and waits
for the other process to finish committing.
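A minimal single-process sketch of that handshake, assuming sqlite as the
backing store.  All names here (Indexer, commit_now, query_files) are made
up for illustration; a real version would run the indexer in a separate
process and replace the direct method call with an IPC request:

```python
import sqlite3

class Indexer:
    """Stands in for the background process that batches up metadata."""
    def __init__(self, conn):
        self.conn = conn
        self.pending = []          # rows not yet committed

    def add(self, filename, size):
        self.pending.append((filename, size))

    def commit_now(self):
        """Handle the 'I want to query, commit what you have' request."""
        self.conn.executemany(
            "INSERT INTO files (filename, size) VALUES (?, ?)",
            self.pending)
        self.conn.commit()
        self.pending = []

def query_files(conn, indexer, where, args):
    # Step 1: ask the indexer to flush, and wait until it has.
    indexer.commit_now()
    # Step 2: run the query against the now-consistent database.
    return conn.execute(
        "SELECT filename, size FROM files WHERE " + where, args).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (filename TEXT, size INTEGER)")
idx = Indexer(conn)
idx.add("a.avi", 700)
idx.add("b.mp3", 5)
rows = query_files(conn, idx, "size > ?", (100,))
# rows == [("a.avi", 700)]
```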

However, for large commits, this may end up being slower than just
transferring data from 10k files over IPC.  Some testing would need to
be done.

> That also was a big problem for me as you can see in my WIP mediadb
> test. Even worse: you don't know what entries a table has for a
> type. A plugin may want to store something to the video table and you
> don't know the variable while you write the vfs.

This is easily solved in design.  A plugin that provides support for a
new file type will register with the vfs: name of type, name of database
table, tuple of fields in database table, tuple of supported extensions,
function to index a file, etc.  As an implementation detail, you can
require that all fields have unique names across all tables (say, each
field is prefixed uniquely), and then put each field into a dict,
mapping to a table name.

So when you get a query, you parse it to see what fields are referenced.
Then you look up in the field-to-table dictionary and get a list of what
tables you need to query on.  Then you run the query on each of these
tables, making sure to remove references to field names that don't exist
on the current table.  You'd need some logic to construct the
appropriate intersection or union (depending if it's an AND or OR) on
the result sets.
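The dispatch logic above can be sketched like this, under the simplifying
assumption that a query is a dict of field -> value and every field maps
to exactly one table (names again hypothetical):

```python
import sqlite3

FIELD_TO_TABLE = {
    "audio_artist": "audio",
    "audio_album": "audio",
    "video_length": "video",
}

def dispatch_query(conn, query, combine="AND"):
    # Group the referenced fields by the table they live in, so each
    # per-table query only mentions fields that exist on that table.
    by_table = {}
    for field, value in query.items():
        by_table.setdefault(FIELD_TO_TABLE[field], {})[field] = value

    results = []
    for table, fields in by_table.items():
        where = " AND ".join("%s = ?" % f for f in fields)
        rows = conn.execute(
            "SELECT filename FROM %s WHERE %s" % (table, where),
            tuple(fields.values()))
        results.append({r[0] for r in rows})

    # Intersect the result sets for AND, union them for OR.
    combined = results[0]
    for s in results[1:]:
        combined = combined & s if combine == "AND" else combined | s
    return combined

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audio (filename TEXT, audio_artist TEXT,"
             " audio_album TEXT)")
conn.execute("CREATE TABLE video (filename TEXT, video_length INTEGER)")
conn.execute("INSERT INTO audio VALUES ('song.mp3', 'Foo', 'Bar')")
conn.execute("INSERT INTO video VALUES ('clip.avi', 90)")
hits = dispatch_query(conn, {"audio_artist": "Foo"})
# hits == {"song.mp3"}
```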

It's workable.

> Yes, I know the problem. And each select takes time.

A common case will probably do a select on 1 or 2 tables.  Even if you
do a select on 4 tables, you're not going to be iterating over 100k
files each time, unless the user has a media collection of 400k files
(100k of each type).  If he has a media collection of 100k files, we
assume a worst-case of iterating over 200k rows: 100k for the main files
table, and at most 100k rows in the other type-specific tables.

It may make sense to not have a general purpose files table, and
duplicate the fields (like filename, modification time, file size, etc.)
in each type-specific table.  Or perhaps not, because then in the common
case where the user enters a directory, we need to search all the media
tables for that directory.  On the other hand, if we don't allow mixed
media types when browsing, you only need to select on one type.  This
may make most sense both in interface and database design.  Did that
make any sense? :)
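I think so.  The "no general files table" layout would look something
like this (schema and column names are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No general-purpose files table: each media type carries its own copy
# of the common fields (filename, dirname, mtime, size).
conn.executescript("""
    CREATE TABLE audio (filename TEXT, dirname TEXT, mtime INTEGER,
                        size INTEGER, artist TEXT);
    CREATE TABLE video (filename TEXT, dirname TEXT, mtime INTEGER,
                        size INTEGER, length INTEGER);
""")
conn.execute("INSERT INTO audio VALUES ('a.mp3', '/music', 0, 5, 'Foo')")
conn.execute("INSERT INTO video VALUES ('b.avi', '/video', 0, 700, 90)")

# Browsing a directory of a known, single type touches one table ...
audio = conn.execute(
    "SELECT filename FROM audio WHERE dirname = ?", ("/music",)).fetchall()

# ... whereas mixed-type browsing has to search every media table.
mixed = [r[0] for t in ("audio", "video")
         for r in conn.execute(
             "SELECT filename FROM %s WHERE dirname = ?" % t, ("/music",))]
```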

> Aha, so a dict isn't even slower. But 100k files may be a small
> system, I guess some people have more files. But I get the point. 

I wonder who would have more than 100k files?  I suppose it's possible.
We want to make sure things scale properly.  I mean, if a user has a
collection of millions of files, he should own a Cray :)

> I did similar tests some weeks ago and came to the same conclusion. 

Conclusions change. :)

> The main problem with that is the size of the database in memory. The
> whole db of my system takes 16MB. Maybe we can live with that.

Or not, as explained above.

> I hope so. As I wrote on IRC I have a test version here with creating
> metadata in the background. Very nice and very fast.

Yeah, it makes all the difference, eh? :)

Cheers,
Jason.



_______________________________________________
Freevo-devel mailing list
Freevo-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freevo-devel
