It occurred to me that we store a lot of SHA1 hashes in our databases
and they're all twice as big as they need to be because they're in
hex.  So I did some experiments.  First I read all the hashes out of
all the tables and dumped them into a huge file:

$ ls -1sh *.ids
8.3M mtn.ids
10M OE.ids

(no column or row separators in there).  That seemed big enough to be
worth further experimentation, so I wrote a really dumb
pseudo-migrator that just converted all the relevant columns out of
hexadecimal, expecting to save about half the above (4-5M per
database).  I actually save rather more than that, 10M and 14M:

$ ls -1sh *.mtn
115M monotone.mtn.baseline
105M monotone.mtn.nohex
116M OE.mtn.baseline
102M OE.mtn.nohex

The rest of the gain is presumably in reduced sqlite overhead - I'm
betting mostly it's smaller index tables (a lot of those hashes are
used as index keys).
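
For illustration, here is a minimal sketch of the kind of conversion
described above, using Python's sqlite3 module.  It is not the actual
pseudo-migrator; the table "files", its columns, and the helper
dehex_column are hypothetical names invented for the example.

```python
import hashlib
import sqlite3


def dehex_column(conn, table, column):
    """Rewrite a column of 40-character hex strings as 20-byte BLOBs."""
    rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
    conn.executemany(
        f"UPDATE {table} SET {column} = ? WHERE rowid = ?",
        [(bytes.fromhex(hex_id), rowid) for rowid, hex_id in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical table; the columns are left untyped so sqlite stores
# whatever it is given (text before the conversion, blob after).
conn.execute("CREATE TABLE files (id PRIMARY KEY, data)")
for content in (b"one", b"two", b"three"):
    conn.execute("INSERT INTO files VALUES (?, ?)",
                 (hashlib.sha1(content).hexdigest(), content))

# length() counts characters for text and bytes for blobs.
before = conn.execute("SELECT sum(length(id)) FROM files").fetchone()[0]
dehex_column(conn, "files", "id")
after = conn.execute("SELECT sum(length(id)) FROM files").fetchone()[0]
print(before, after)
```

Each SHA1 goes from 40 bytes of hex to 20 bytes of raw data.  Any
index on the column shrinks as well, which would fit the guess above
about where the extra savings come from.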

I'm not sure whether this means we actually want to *do* this for
real.  It will make manual database queries have more binary garbage
in them; there are a lot of places in the code that will have to
change; we'll have to jump through hoops in a few places to get the
hashes to stay the same; we probably don't want to do this to the
netsync protocol, so there will be more conversions to do.  Still,
roughly 10% disk space savings is nothing to sneeze at, and I bet there
would be speed gains too, just from not having to read so much off the
disk.

There is another factor to consider.  There are 217,055 hashes in the
"mtn.ids" file; however, there are only 91,223 *unique* hashes.  (This
is because many of the hashes are used as pointers between tables.)
The ratio is similar for OE.ids.  Thus, it might be worthwhile to yank
all the hashes out into a separate table and reference them by row
number from the rest of the database.  Depending on how sqlite decides
to do things, this might be a *lot* better, as we could use INTEGER
PRIMARY KEYs in a whole bunch of tables where we currently have string
keys.  Technically this is orthogonal to the idea of storing the
hashes as raw data, but it might be enough of a gain by itself that we
don't want to bother with the de-hex-ification too (and, while the
code changes for it would be substantial, I think they'd also be in
fewer places).
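
A minimal sketch of the separate-table idea, assuming a layout that is
purely hypothetical: a "hashes" table holding each unique hash once,
and an "ancestry" table that refers to hashes by row number.  The
helper intern_hash is likewise invented for the example, and the
hashes are stored raw here only for brevity; hex strings would work
the same way.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: every hash lives once in "hashes"; other
# tables point at it through the integer row number.
conn.executescript("""
    CREATE TABLE hashes (num INTEGER PRIMARY KEY, hash BLOB NOT NULL UNIQUE);
    CREATE TABLE ancestry (parent INTEGER NOT NULL REFERENCES hashes(num),
                           child  INTEGER NOT NULL REFERENCES hashes(num));
""")


def intern_hash(conn, raw):
    """Return the row number for a hash, adding it on first sight."""
    conn.execute("INSERT OR IGNORE INTO hashes (hash) VALUES (?)", (raw,))
    row = conn.execute("SELECT num FROM hashes WHERE hash = ?", (raw,)).fetchone()
    return row[0]


a, b, c = (hashlib.sha1(s).digest() for s in (b"a", b"b", b"c"))
for parent, child in ((a, b), (a, c), (b, c)):
    conn.execute("INSERT INTO ancestry VALUES (?, ?)",
                 (intern_hash(conn, parent), intern_hash(conn, child)))

# Six references in "ancestry", but only three hashes actually stored.
references = 2 * conn.execute("SELECT count(*) FROM ancestry").fetchone()[0]
unique = conn.execute("SELECT count(*) FROM hashes").fetchone()[0]
print(references, unique)
```

The cost is an extra lookup (or join) whenever a hash has to be turned
back into something readable, so whether this is a net win would need
measuring against a real database.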

zw


_______________________________________________
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel
