It occurred to me that we store a lot of SHA1 hashes in our databases and they're all twice as big as they need to be because they're in hex. So I did some experiments. First I read all the hashes out of all the tables and dumped them into a huge file:
    $ ls -1sh *.ids
    8.3M mtn.ids
     10M OE.ids

(no column or row separators in there). That seemed big enough to be worth further experimentation, so I wrote a really dumb pseudo-migrator that just converted all the relevant columns out of hexadecimal, expecting to get back about half the above. I actually get rather more than that:

    $ ls -1sh *.mtn
    115M monotone.mtn.baseline
    105M monotone.mtn.nohex
    116M OE.mtn.baseline
    102M OE.mtn.nohex

The rest of the gain is presumably in reduced sqlite overhead - I'm betting mostly it's smaller index tables (a lot of those hashes are used as index keys).

I'm not sure whether this means we actually want to *do* this for real. There are costs:

  - manual database queries will have more binary garbage in them;
  - there are a lot of places in the code that will have to change;
  - we'll have to jump through hoops in a few places to get the hashes to stay the same;
  - we probably don't want to do this to the netsync protocol, so there will be more conversions to do.

Still, a disk space saving of roughly 10% (about 9% for monotone, 12% for OE) is not to be sneezed at, and I bet there would be speed gains too, just from not having to read so much off the disk.

There is another factor to consider. There are 217,055 hashes in the "mtn.ids" file; however, there are only 91,223 *unique* hashes. (This is because many of the hashes are used as pointers between tables.) The ratio is similar for OE.ids. Thus, it might be worthwhile to yank all the hashes out into a separate table and reference them by row number from the rest of the database. Depending on how sqlite decides to do things, this might be a *lot* better, as we could use INTEGER PRIMARY KEYs in a whole bunch of tables where we currently have string keys.

Technically this is orthogonal to the idea of storing the hashes as raw data, but it might be enough of a gain by itself that we don't want to bother with the de-hex-ification too (and, while the code changes for it would be substantial, I think they'd also be in fewer places).
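To make the two ideas concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not monotone's code or schema: the table names (ids, ancestry) and the helpers hex_to_raw and store_edges are made up for illustration. It stores each hash once as a 20-byte BLOB and has the other table point at it by integer row id.

```python
import hashlib
import sqlite3


def hex_to_raw(hex_id):
    """Convert a 40-character hex SHA-1 into its 20-byte raw form."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 20:
        raise ValueError("expected a 40-character hex SHA-1")
    return raw


def store_edges(conn, edges):
    """Store (parent_hex, child_hex) pairs using interned hashes.

    Hypothetical schema: 'ids' holds each distinct hash once as a BLOB,
    and 'ancestry' refers to hashes by their integer row id.
    """
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS ids (
            id   INTEGER PRIMARY KEY,
            hash BLOB NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS ancestry (
            parent INTEGER NOT NULL REFERENCES ids(id),
            child  INTEGER NOT NULL REFERENCES ids(id)
        );
    """)

    def intern(hex_id):
        # Insert the hash if it is new, then look up its row id.
        raw = hex_to_raw(hex_id)
        conn.execute("INSERT OR IGNORE INTO ids (hash) VALUES (?)", (raw,))
        row = conn.execute(
            "SELECT id FROM ids WHERE hash = ?", (raw,)
        ).fetchone()
        return row[0]

    for parent_hex, child_hex in edges:
        conn.execute(
            "INSERT INTO ancestry (parent, child) VALUES (?, ?)",
            (intern(parent_hex), intern(child_hex)),
        )
    conn.commit()


if __name__ == "__main__":
    # Four fake revisions in a chain: six hash references, four unique.
    hexes = [hashlib.sha1(b"rev%d" % i).hexdigest() for i in range(4)]
    conn = sqlite3.connect(":memory:")
    store_edges(conn, list(zip(hexes, hexes[1:])))
    unique = conn.execute("SELECT COUNT(*) FROM ids").fetchone()[0]
    refs = 2 * conn.execute("SELECT COUNT(*) FROM ancestry").fetchone()[0]
    print(refs, "references,", unique, "unique hashes")
```

Whether this actually beats plain de-hexed columns depends on how sqlite lays out the rows and indexes, so it would need measuring on real databases the same way as the numbers above.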
zw