dpkg and sqlite redux

sean finney Fri, 13 Apr 2007 15:42:11 -0700

hi folks,

sorry for breaking the thread, but i'm not subscribed and i wasn't cc'd,
and my current MUA sucks.


> from ian:

> Well, I don't know if I still count as one of the `dpkg team' but I
> think this is a terrible idea for lots of reasons.  dpkg needs to be
> very reliable; its databases must not get corrupted even under
> situations of stress.  It is also very useful that the databases are
> tractable with normal tools.  dpkg is very close to the bottom of the
> application stack; making it depend on a big and complex library like
> a SQL engine is a bad idea.

a couple points i'd like to clarify:

- i'm not advocating sqlite necessarily, but some db library.  and not even
  some db library as much as a smarter way of storing information to begin with,
  though some db's do provide advantages (query syntaxes, etc)
- i'm also suggesting that any such db could be used as a cache, so that if
  there are any problems it could be regenerated from the flat files.
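
to make the "db as a cache" idea concrete, here is a minimal sketch in
python (stdlib sqlite3 only).  it assumes the flat files are named
<package>.list with one path per line; the function names, the table
name `files`, and the staleness check are made up for illustration and
are not anything dpkg does today.

```python
import os
import sqlite3


def rebuild_cache(info_dir, db_path):
    """Discard any existing cache and rebuild it from the *.list files."""
    if os.path.exists(db_path):
        os.remove(db_path)
    con = sqlite3.connect(db_path)
    # Hypothetical schema: one row per (package, installed path).
    con.execute("CREATE TABLE files (package TEXT, path TEXT)")
    con.execute("CREATE INDEX files_path ON files (path)")
    for name in sorted(os.listdir(info_dir)):
        if not name.endswith(".list"):
            continue
        package = name[: -len(".list")]
        with open(os.path.join(info_dir, name)) as fh:
            rows = [(package, line.rstrip("\n")) for line in fh if line.strip()]
        con.executemany("INSERT INTO files VALUES (?, ?)", rows)
    con.commit()
    return con


def open_cache(info_dir, db_path):
    """Open the cache; rebuild it if it is missing, stale, or unreadable."""
    lists = [
        os.path.join(info_dir, n)
        for n in os.listdir(info_dir)
        if n.endswith(".list")
    ]
    newest = max((os.path.getmtime(p) for p in lists), default=0.0)
    # Assumed staleness rule: the cache is older than the newest flat file.
    if not os.path.exists(db_path) or os.path.getmtime(db_path) < newest:
        return rebuild_cache(info_dir, db_path)
    con = sqlite3.connect(db_path)
    try:
        con.execute("SELECT 1 FROM files LIMIT 1")
    except sqlite3.DatabaseError:
        # Corrupt or foreign file: the flat files remain the source of truth.
        con.close()
        return rebuild_cache(info_dir, db_path)
    return con
```

the point of the design is that the flat files stay authoritative: the
worst a broken cache can do is cost one rebuild.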

that's not to say there aren't still problems with the suggestion (or
possible workarounds to those problems), but i think it takes some of
the sting out of the objections.

as i see it, there are two main motivations for doing something like
this.

first, you have the speed/efficiency factor, as already mentioned.  i
strongly suspect there's an order of magnitude or two of speedup to be
had, which i would argue is worth some consideration.

secondly, you have the code complexity (wrt dpkg source code anyway).  
my time spent looking in dpkg is relatively short compared to most
folks here i'm sure, but i can't help but notice how much of dpkg's
internals are dedicated to representing, constructing, and manipulating
data structures to make up for the lack of a better storage/query
format.

something like sqlite would not only give a smarter way of storing the
data, it would also lower the complexity of the data structures needed
to work with that data, since much of the overhead could be shifted into
cleverly crafted queries.
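
as a rough illustration of shifting work into queries, the sketch below
answers "which package owns this file?" and "which files are claimed by
more than one package?" with no hand-built lookup structures.  the table
layout, the function names, and the sample rows are all invented for the
example; they are not dpkg's actual data model.

```python
import sqlite3

# Hypothetical schema and made-up sample data, for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (package TEXT, path TEXT)")
con.execute("CREATE INDEX files_path ON files (path)")
con.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [
        ("foo", "/usr/bin/foo"),
        ("foo", "/etc/shared.conf"),
        ("bar", "/usr/bin/bar"),
        ("bar", "/etc/shared.conf"),
    ],
)


def owners(con, path):
    """Return the packages that list the given path, sorted by name."""
    rows = con.execute(
        "SELECT package FROM files WHERE path = ? ORDER BY package", (path,)
    )
    return [row[0] for row in rows]


def shared_paths(con):
    """Return (path, owner_count) for every path listed by 2+ packages."""
    rows = con.execute(
        "SELECT path, COUNT(*) FROM files "
        "GROUP BY path HAVING COUNT(*) > 1 ORDER BY path"
    )
    return [(path, count) for path, count in rows]
```

the index on `path` is what stands in for the lookup tables that would
otherwise have to be built in memory on every run.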

sure, it means you're basically outsourcing a chunk of work to a 3rd
party library that you now have to trust not to totally omg break
things.  that also means that the code complexity in the grand scheme of
things isn't actually less (it's more, even), but it's where it ought to
be.  and really, an absolutist position on this line would mean not
statically linking against zlib/bz2 either.

> In practice I think the problem isn't that the *.list files are too
> inefficient an on-disk representation.  I have a number of machines
> with small CPUs and little memory and they don't have a difficulty
> there.

i don't have the numbers to back it up, but i do think it is a rather
inefficient format.  fwiw, i've tried compiling dpkg with profiling
info enabled (-pg), but this causes it to exit with SIGPROF while
scanning the list files (profiling timer times out)...  does anyone have
a better way to profile various dpkg runs than /usr/bin/time?

also, wrt the status/available parsing... the same arguments outlined
above apply here: a step up from parsing the files every time they are
needed would be to read the data pre-parsed from a db/cache.
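
a small sketch of the "parse once, read pre-parsed afterwards" idea,
assuming the usual layout of blank-line-separated stanzas of
"Field: value" lines with indented continuation lines.  the table name
`status` and the helper names are hypothetical, and the parser is
deliberately naive compared to what dpkg really has to handle.

```python
import sqlite3


def parse_stanzas(text):
    """Yield one dict per blank-line-separated stanza of 'Field: value' lines."""
    stanza, last = {}, None
    for line in text.splitlines():
        if not line.strip():
            # Blank line ends the current stanza.
            if stanza:
                yield stanza
            stanza, last = {}, None
        elif line[0] in " \t" and last is not None:
            # Indented line continues the previous field.
            stanza[last] += "\n" + line.strip()
        else:
            name, _, value = line.partition(":")
            last = name.strip()
            stanza[last] = value.strip()
    if stanza:
        yield stanza


def load_status(con, text):
    """Parse the text once and store every field in a (hypothetical) table."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS status (package TEXT, field TEXT, value TEXT)"
    )
    for stanza in parse_stanzas(text):
        package = stanza.get("Package")
        if package is None:
            continue  # skip stanzas without a Package field
        con.executemany(
            "INSERT INTO status VALUES (?, ?, ?)",
            [(package, name, value) for name, value in stanza.items()],
        )
    con.commit()


def field(con, package, name):
    """Look up one field of one package without touching the flat file."""
    row = con.execute(
        "SELECT value FROM status WHERE package = ? AND field = ?",
        (package, name),
    ).fetchone()
    return row[0] if row else None
```

after `load_status` has run once, later lookups go through `field` and
never re-read or re-parse the text.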



        sean
