Jason Tackaberry wrote:
> On Fri, 2005-04-22 at 08:03 +0200, Dirk Meyer wrote:
>> 6 MB pickled is for about 20k files here.
>
> But the memory requirement on disk for the pickle isn't the memory
> requirement when that file is unpickled.  The simulation I did with 100k
> files took an additional 63MB RSS, but when I pickled that data out, it
> was only 7MB on disk.

I know. We can't keep everything in memory. It sounds fast, but it
would kill Freevo.
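
For what it's worth, the gap is easy to reproduce. A rough sketch (the
record shape is made up, the resource module is Unix-only, and
ru_maxrss is kilobytes on Linux):

    import os
    import pickle
    import resource

    # stand-in for mediadb metadata: 100k files, a few attributes each
    files = dict((i, {'name': 'file%06d.avi' % i, 'size': i * 1024,
                      'type': 'video'}) for i in range(100000))

    with open('/tmp/mediadb.pickle', 'wb') as f:
        pickle.dump(files, f, pickle.HIGHEST_PROTOCOL)

    print('on disk: %d KB' % (os.path.getsize('/tmp/mediadb.pickle') // 1024))
    # peak memory of this process, dominated by the live dictionaries
    print('max RSS: %d KB' %
          resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)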

>> Yes. But I still prefer the pickle way of handling things. I only
>> have to find a good way to create an index.
>
> It's simpler, true, ignoring how to create an index.  But with indexing,
> is it so simple anymore?  This is the sort of territory that goes well
> outside my expertise.

Mine, too. I have no idea about databases. There were courses at the
university about database design and how databases work, but other
stuff like networking and operating system internals were more
interesting to me.

>> I'm not sure I understand. Freevo will register a video to the
>> db. After that, the resume plugin wants to store the current
>> position. How do I add that to the table? One idea would be a field
>> I can't search for: each table gets a field 'extra data' which is a
>> pickled string.
>
> Yes, that's one possibility.  Another of course is to simply have a
> position field in the table to begin with.  Or to create another table
> extra_data that maps [attribute, value] tuples to file or directory ids.

OK, pickle may be bad. I talked with crunchy and he pointed out that
it should be possible to use the mediadb (or vfs as you call it) from
other processes, maybe not even Python ones. So a well-designed
database is needed here.
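
For reference, here is a minimal sketch of the extra_data approach you
describe; the table and column names are just my invention:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT);
        -- generic attribute storage, so a plugin can attach data
        -- without the core schema knowing about it
        CREATE TABLE extra_data (
            file_id   INTEGER,
            attribute TEXT,
            value     TEXT
        );
        CREATE INDEX idx_extra ON extra_data (file_id, attribute);
    """)

    # the resume plugin stores the playback position for file 1
    conn.execute("INSERT INTO files VALUES (1, 'movie.avi')")
    conn.execute("INSERT INTO extra_data VALUES (1, 'resume_pos', '4711')")

    pos = conn.execute("SELECT value FROM extra_data WHERE file_id = ? "
                       "AND attribute = ?", (1, 'resume_pos')).fetchone()
    print(pos[0])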

>> And slow. Each query needs time.
>
> That doesn't follow.  Each query needs time, of course, but that doesn't
> mean it will be slow.  It's really important we don't lose sight of
> common use-cases.  But we also must balance this in terms of what we
> _want_ to do.  Specifically, pickling a single directory is going to be
> faster.  And entering a directory is without a doubt the most common
> use-case.  But if we want to add querying of the whole repository, then
> this approach has steep memory requirements.

Sad but true.

>
> Let's have a further look at multiple queries, which you say is slow.  I
> made a 2-dimensional dictionary for 10k files.  I pickled that and
> loaded the pickle back.  cPickle.load() took 0.2964 seconds.
>
> I then populated an sqlite database with 1 million records: 100
> directories with 10k files in each.  But 5k of each directory are video
> files, and 5k are audio files.  Doing a query over video files where the
> result set is 5k took 0.3683 seconds.  Doing a query over two tables where
> all 5k are returned in the first query, and the second query returns
> nothing (i.e. it is a needless query) takes 0.3689 seconds.  Notice this is
> practically the same time.  When I say "do a query", what I specifically mean
> is that I am looking for all rows with a given dir_id.  In other words,
> looking for all files in some directory.

0.29 against 0.37 sounds OK to me. But I speed up the pickle by
storing the mmpython data as an already-pickled string inside the
pickle. I don't need to unpickle it until I really use the item. This
can't be done with sqlite.
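
The trick looks roughly like this (the names are made up; the point is
that the inner string stays opaque until the item is really accessed):

    import pickle

    class LazyItem:
        """Keeps the mmpython attributes as a raw pickled string and
        only unpickles them on first access."""
        def __init__(self, name, raw_attrs):
            self.name = name
            self._raw = raw_attrs      # pickled bytes, not a dict
            self._attrs = None

        @property
        def attrs(self):
            if self._attrs is None:    # unpickle on demand
                self._attrs = pickle.loads(self._raw)
            return self._attrs

    # build the index: the attributes are pickled once and nested
    raw = pickle.dumps({'artist': 'Foo', 'length': 180})
    index = {'song.mp3': LazyItem('song.mp3', raw)}

    # loading the outer pickle later is cheap, because the nested
    # strings are not unpickled until an item is really used
    restored = pickle.loads(pickle.dumps(index))
    print(restored['song.mp3'].attrs['artist'])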

> The basic discovery is this: queries, even over a table with a million
> rows, which return an empty result take practically no time.  0.0005
> seconds on my system.  The caveat here is that we are querying over
> an indexed column.  With my tests I'm querying by dir_id, which is
> indexed.

Then I don't understand why pyepg is so slow sometimes. We have an
index on everything we need and it still doesn't feel right. And that
is on a 2.6 GHz machine; how will it feel for other people?

> When querying over fields which are not indexed, the results are not
> quite so rosy.  Time requirements increase quite a bit.  But this isn't
> the least bit surprising.  All this tells us is that we must consider
> what data we need to query on most, and index the database
> appropriately.  But in the most common use-case of a user entering a
> directory, with an index on the directory id, we scale quite nicely, and
> we lose hardly any time by querying multiple tables.

We know what we will look for. Apart from directories, the most
common queries are artist, album, rating and last played, so we could
index those. I don't think anyone would want a listing of all first
tracks on all albums.
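
A sketch of those indexes, with a made-up audio table:

    import sqlite3

    conn = sqlite3.connect('/tmp/mediadb.sqlite')   # hypothetical path
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS audio (
            id INTEGER PRIMARY KEY,
            dir_id INTEGER,
            artist TEXT,
            album TEXT,
            rating INTEGER,
            last_played INTEGER
        );
        -- index only the columns we actually query on
        CREATE INDEX IF NOT EXISTS idx_audio_dir    ON audio (dir_id);
        CREATE INDEX IF NOT EXISTS idx_audio_artist ON audio (artist);
        CREATE INDEX IF NOT EXISTS idx_audio_album  ON audio (album);
        CREATE INDEX IF NOT EXISTS idx_audio_rating ON audio (rating);
        CREATE INDEX IF NOT EXISTS idx_audio_played ON audio (last_played);
    """)
    conn.commit()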

> When comparing sqlite and pickle with 10k files, we see that pickle is
> more than twice as fast.  But the times I gave you above for sqlite
> weren't all sqlite.  About 30-35% of that time was overhead where I
> converted the result rows into dictionaries.  Without that overhead, the
> 10k query took 0.4659 seconds.  So there is some optimization to be done
> with respect to that overhead.

OK, that overhead I don't need. I wrapped the dict in an extra
object, so it doesn't matter whether it holds a dict or an sqlite
result.
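
Something like this; the Item wrapper is hypothetical, but
sqlite3.Row already allows dict-style access, so the caller never
sees the difference:

    import sqlite3

    class Item:
        """Thin wrapper; the backing store can be a dict (pickle
        backend) or an sqlite3.Row, the caller never notices."""
        def __init__(self, data):
            self._data = data

        def __getitem__(self, key):
            return self._data[key]

    conn = sqlite3.connect(':memory:')
    conn.row_factory = sqlite3.Row      # rows support row['column']
    conn.execute("CREATE TABLE files (name TEXT, size INTEGER)")
    conn.execute("INSERT INTO files VALUES ('a.avi', 42)")
    row = conn.execute("SELECT * FROM files").fetchone()

    for item in (Item({'name': 'a.avi', 'size': 42}), Item(row)):
        print(item['name'], item['size'])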

> The bottom line is that it's "slow" only for certain definitions of
> slow.  If your definition of "slow" is "slower than pickle" then that's
> true.  It's true at least until we start doing user queries, and then
> your definition is in real trouble. :)

My definition of slow is that everything from pressing Enter to
showing the new page should take no more than 0.5 seconds for normal
directory sizes.

But it is possible to speed up sqlite by thinking ahead. When the
user requests /foo, we know that the directory is shown and that the
user might want to select a subdirectory. We could prefetch all
subdirs while the user is browsing to the current item in the menu.
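
A rough sketch of that prefetch idea (the schema and path are
invented); one sqlite connection per thread, since connections cannot
be shared between threads:

    import sqlite3
    import threading

    DB = '/tmp/mediadb.sqlite'          # hypothetical path

    def setup():
        conn = sqlite3.connect(DB)
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS dirs
                (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
            CREATE TABLE IF NOT EXISTS files (dir_id INTEGER, name TEXT);
            CREATE INDEX IF NOT EXISTS idx_files_dir ON files (dir_id);
        """)
        conn.commit()
        conn.close()

    def prefetch_subdirs(dir_id, cache):
        # runs in the background while the user is still moving
        # through the menu, so the data is there when Enter is pressed
        conn = sqlite3.connect(DB)
        for (sub_id,) in conn.execute(
                "SELECT id FROM dirs WHERE parent_id = ?", (dir_id,)):
            cache[sub_id] = conn.execute(
                "SELECT * FROM files WHERE dir_id = ?", (sub_id,)).fetchall()
        conn.close()

    setup()
    cache = {}
    threading.Thread(target=prefetch_subdirs, args=(1, cache)).start()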

>> Yes, and I'm switching from pickle to sqlite and back and to sqlite
>> again. Right now I prefer pickle and I'm coding that way. But I make
>> sure the db stuff can be changed very easily. Maybe I will support
>> both in a few weeks to compare them.
>
> In spite of all the playing I've been doing -- (I was tempted to say
> "testing" but that has formal connotations and really there's nothing
> formal or probably fair about how I got my results) -- it's difficult to
> really know how it will perform in the real world until you just
> implement the damn thing.
>
> So I'm still leaning in sqlite's direction.  I'll work at coding that up
> this weekend. 

Great. I will improve my pickle solution, and if I like your test
code, I may switch.

> Committing pyevas might not happen until next week. :)

No problem.


Dischi

-- 
program, n.:
        A magic spell cast over a computer allowing it to turn one's input
        into error messages.  tr.v. To engage in a pastime similar to banging
        one's head against a wall, but with fewer opportunities for reward.
