On Tue, 2005-10-11 at 17:09 +0200, Dirk Meyer wrote:
> We should keep some things in mind. First, your IPC code is nice but
> not secure. Everyone can connect to the server and call python
> functions and such things.

I can add authentication.  But the data still wouldn't be encrypted, so
in any case this solution isn't suitable for use over a public network.
And I don't even think we should bother trying to make the IPC channel
encrypted.  Not only would it hurt performance, but it's probably
impossible to get it right anyway.  (And if we used m2crypto, say, our
program would leak like crazy.)

I think it's good enough for IPC to use filesystem access control (in
the case of unix sockets).  If the user wants to use it over the LAN I
can add some basic challenge/response authentication to kaa's ipc.
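
Something like this would do.  A minimal sketch with the standard hmac
module; the secret handling and function names are hypothetical, not
anything in kaa.base.ipc today (and note this authenticates, it does
not encrypt):

    import hmac
    import hashlib
    import os

    # Hypothetical: a shared secret read from a file only the user can access.
    SECRET = open(os.path.expanduser("~/.kaa/ipc_secret"), "rb").read()

    def make_challenge():
        # Server: send a random nonce to each connecting client.
        return os.urandom(16)

    def respond(challenge):
        # Client: prove knowledge of the secret without sending it.
        return hmac.new(SECRET, challenge, hashlib.sha1).digest()

    def verify(challenge, response):
        # Server: recompute the HMAC and compare.
        expected = hmac.new(SECRET, challenge, hashlib.sha1).digest()
        return hmac.compare_digest(expected, response)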

But for purposes of epg and vfs, I agree with your basic architecture
of: database on local machine, db reads in thread, db writes over ipc.
For something like managing recording schedules in kaa.record, a simple
authentication mechanism in kaa.base.ipc might do.

> local. We also can't use mbus because it is designed to be a message
> bus, not a bus to transport that much data (but it is secure btw). 

mbus is secure, is it?  High praise indeed.  I wouldn't use that word
about any software. :)  Not even about openvpn, which may well be the
best piece of software on my computer.

> (async). The thread will not only query the db, it will also create
> nice 'Program' objects so the main thread can use it without creating
> something. There should also be a cache to speed up stuff by not using
> the thread with db lookup at all.

Herein lies the main benefit of doing reads in a thread.  The thread can
also take care of putting the data in a manageable form for the main
application.  This is particularly important since Python is a hog when
it comes to object creation.
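
To make that concrete, here's roughly what I have in mind.  The Program
class, table layout, and file name below are made up for the example:

    import sqlite3
    import threading
    import queue

    class Program(object):
        # Plain local object; the main thread uses it without touching the db.
        def __init__(self, title, start, stop):
            self.title, self.start, self.stop = title, start, stop

    def query_programs(db_file, channel, t0, t1, results):
        # Worker thread: open the connection here (sqlite connections must
        # stay in the thread that created them), run the query, and build
        # the Program objects so the main loop never blocks on either step.
        conn = sqlite3.connect(db_file)
        rows = conn.execute(
            "SELECT title, start, stop FROM programs "
            "WHERE channel = ? AND stop >= ? AND start <= ?",
            (channel, t0, t1)).fetchall()
        conn.close()
        results.put([Program(*row) for row in rows])

    results = queue.Queue()
    threading.Thread(target=query_programs,
                     args=("epgdb.sqlite", "wdr", 0, 1 << 32, results)).start()
    # ... main loop keeps running; later:
    programs = results.get()  # ready-made Program objects, no further db work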

> Freevo knows what channels would be visible when entering the tv
> grid. So Freevo will request these channels with programs +- 6 hours
> at startup.

Using my rewrite of kaa.epg (which I've cleverly called kaa.epg2 for
now), this takes 0.2 seconds and returns 1978 program objects.  As a
point of interest, the query itself takes 0.02 seconds to execute,
another 0.1 seconds to convert the rows to tuples, another 0.05 seconds
to normalize the tuples into dicts (including unpickling ATTR_SIMPLE
attributes), and another 0.03 seconds to convert those dicts to
Program objects.  So that 0.05 seconds of normalize time is low-hanging
fruit; trimming it would bring the query down to 0.15 seconds (on my system, at
least).  Not slow, but I agree that it's worth prefetching.

The original kaa.epg executes that same query in 0.17 seconds.  Pretty
comparable performance there.  Keyword searches are a different story,
of course.  Searching for "simpsons" with kaa.epg returns 120 rows and
takes 0.15 seconds.  With kaa.epg2 and using the keyword support in
kaa.base.db, the same query takes 0.015 seconds.

BTW, when parsing my 17MB xmltv file, kaa.epg takes 74 minutes ($!#@)
to execute, and uses 377MB RSS.  My rewrite (whose improvement is
mainly due to my use of libxml2, of course) takes 94 seconds and uses
less than half that memory.  That's a 50X performance improvement.
About 55% of that 94 seconds is due to keyword indexing (ATTR_KEYWORDS
attributes).  I could probably improve that time quite a bit by adding
mass add functionality to the API.  (Sort of like the difference between
pysqlite's execute and executemany.)
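
For the curious, the difference looks like this with pysqlite (spelled
sqlite3 in the stdlib); the table and data are toys for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords (program_id INTEGER, word TEXT)")
    words = [(1, "simpsons"), (1, "homer"), (2, "news")]

    # One statement per row: per-call overhead on every single insert.
    for row in words:
        conn.execute("INSERT INTO keywords VALUES (?, ?)", row)

    # Mass add: one prepared statement, many parameter sets -- much
    # faster once you're inserting thousands of rows.
    conn.executemany("INSERT INTO keywords VALUES (?, ?)", words)
    conn.commit()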

> cache. When you go to the right, freevo will ask the db for data + 10
> hours, just to be sure in case the user needs it. So we can cache in
> the background what we think is needed next in a thread and the main
> loop can display stuff without using the db.

Probably not a bad idea to do prefetches like that.  There's a pretty
high initial overhead, so it's better to get more rows than you need.
For example, querying for the next 2 hours of program data takes 0.1
seconds and returns 200 rows.  Querying for the next 12 hours returns
2000 rows and takes 0.2 seconds.  10X the data for only 2X the
execution time.  Actually, now that I think about it, something
doesn't seem right there.  Smells like an index isn't getting used (or
doesn't exist).
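
Easy enough to check; sqlite will tell you what it's doing.  The table
and column names here are guesses:

    import sqlite3

    conn = sqlite3.connect("epgdb.sqlite")
    plan = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM programs WHERE start >= ? AND start <= ?",
        (0, 2 ** 32)).fetchall()
    print(plan)  # output naming an index means it's used;
                 # a plain table scan means it isn't

    # And if it is scanning, the fix is one statement:
    conn.execute("CREATE INDEX programs_start_idx ON programs (start)")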

Anyway, I agree, prefetching in another thread seems to be the way to
go.

> Back to client / server. When we want to add data to the epg, we spawn
> a child. First we check if a write child is already running (try to
> connect to the ipc, if it doesn't work, start child, I have some test
> code for that). Same for kaa.vfs. One reading thread in each app, one
> writing app on each machine. 

Sounds sensible.

I think I need to change my opinion a bit about kaa.base.ipc.  My
original thinking was that you don't need to write a client API.  You
just grab a proxy to a remote object and use it as if it's local.  This
works in terms of functionality, but in practice, things aren't so
clear.  For example, in the epg example, you do a query and return a
list of 2000 Program objects.  Since objects get proxied by default, all
those Program objects are proxies.  So if we assume epg is a proxied,
remote object:

    for prog in epg.search(keywords="simpsons"):
        prog.description

That would be fairly slow: since 'prog' is a proxied Program
object, each access of the description attribute goes over the wire.
Alternatively you could do this:

    for prog in epg.search(keywords="simpsons", __ipc_copy_result=True):
        prog.description

That'd be fast, because each Program object is pickled (rather than just
a reference to it), so all those objects are local.  But if the Program
object holds a reference to the epg (prog._epg in my case) and you use
__ipc_copy_result, the epg Guide object also gets pickled.  That's not
good.

Ideally you'd want the Program objects themselves to get pickled (so
that attribute accesses are local), while the epg reference stays a
proxy to the remote object.  This isn't something kaa.base.ipc can do
automatically.  It needs some supporting logic.
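
The supporting logic can be pretty small, though.  A sketch of one way
to do it, not anything kaa.base.ipc does today:

    class Program(object):
        def __init__(self, data, epg):
            self.data = data
            self._epg = epg  # reference to the (possibly remote) Guide

        def __getstate__(self):
            # Pickle everything except the guide reference, so copying a
            # Program over IPC never drags the Guide object along with it.
            state = self.__dict__.copy()
            state["_epg"] = None
            return state

    def search(epg_proxy, **kwargs):
        # Client-side wrapper: copy the Program objects over the wire,
        # then point each one back at the local proxy for the remote epg.
        programs = epg_proxy.search(__ipc_copy_result=True, **kwargs)
        for prog in programs:
            prog._epg = epg_proxy
        return programs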

So in reality, we'll need a client API that uses IPC and does
intelligent things for the API it's wrapping.  This isn't really a
problem, it just means that kaa ipc isn't magic pixie dust like I
claimed it was. :)

> And kaa.epg has different sources. One is
> xmltv, a new one could be sync-from-other-db. 

As I mentioned on IRC, unless this is just a straight copy of the sqlite
file, this probably isn't worth it.  Syncing individual rows means
accessing the db through pysqlite in which case we're not really saving
anything.  With libxml2, parsing the xml file is very quick.  Almost all
the time is due to db accesses, so we're not saving much by syncing at
the row level from another db.

Copying the epgdb.sqlite file straight over would be a big win, of
course.  We could implement that eventually.
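
The straight copy itself is trivial; the only subtlety is renaming into
place so a reader never opens a half-copied file.  A sketch (the paths
are made up):

    import os
    import shutil

    def sync_epgdb(src, dst):
        # Copy to a temp name in the same directory, then rename: on
        # POSIX the rename is atomic, so anything opening dst sees either
        # the old file or the new one.  (The writer child should be idle
        # while src is being copied.)
        tmp = dst + ".tmp"
        shutil.copyfile(src, tmp)
        os.rename(tmp, dst)

    sync_epgdb("/net/server/epgdb.sqlite",
               os.path.expanduser("~/.kaa/epgdb.sqlite"))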

Cheers,
Jason.
