Hi Roger,

Fantastic - thanks for your comprehensive reply. Funnily enough I had
already stripped out the INSERT INTO TABLE VALUES text to see if that
would make a difference but actually it looked like both compression
algorithms were pretty much eliminating all the duplication there
already. One interesting thing to note is that the 54MB of
uncompressed SQL unzips in lightning fast time - seriously, I mean ~1s
on my bog-standard "home" laptop, so that was an interesting finding.
Perhaps of use to others?

CSV is the next logical step and gets me under 5MB with LZMA and just
a shade over 6MB with Deflate. Fyi, BZip2 is just a shade under 6MB so
not a lot better than Deflate. My dataset is already fully
denormalized (although not sorted) and I'm using rowid instead of an
ID column so there's no sequential data either, but I do have a couple
of large, cross-referencing index tables (two of the largest) which I
bet I can squeeze down considerably by storing integer values instead
of strings to a "binary" CSV.

Now the only question is - is the db load (INSERT) performance of my
mobile app actually better than the average download case... and for
that I'm going to have to do some real work instead of just playing
with sqlite3 and 7zip :)

Cheers,
Larry


On 6 March 2012 01:41, Roger Binns <rog...@rogerbinns.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 05/03/12 02:59, Larry Knibb wrote:
>> Currently my database is 41.5MB with indexes and to reduce the download
>> I'm zipping it with max. compression Deflate down to 18.4MB.
>>
>> I noticed I can get it down to 12.1MB using LZMA compression but
>> unfortunately there's no LZMA library for Windows Phone yet... but
>> anyway.
>
> zip uses a compression window size of 32kb. bzip2 is 900kb.  LZMA on
> maximum can be gigabytes.
>
>> I further noticed that if I .dump the SQL to create the database then
>> zip that, the resulting size is just 6.7MB;
>
> A user on a mailing list I was on was trying to save space like you and
> found the same pattern you did.  A raw SQLite database has padding due to
> page size and alignment, plus prefix bytes to values (eg giving
> string/blob lengths), not to mention the btree structure for the records.
>
> SQL text was more efficient because there is no padding/alignment/btree.
> However every line has a redundant "insert into table values ..." so CSV
> is even more efficient as it is using one byte to mark the end of each
> value (the comma).  His data had large numbers somewhat sequential and
> they take more space as text (digits), so the CSV was adapted to be the
> deltas between values/rows which was usually a digit or so.  bzip2 was
> used on the final delta CSV to get the smallest size.  LZMA gained a few
> more percent on bzip2 in size but not enough to bother using.
>
> What works best for you will depend on what your data looks like, how much
> it varies, the types etc.  If there is a lot of text then you can also
> look to the current best text compressors - see
>
>  http://mattmahoney.net/dc/text.html
>
> You may also be able to save even more space by having the data a column
> at a time instead of a row at a time, although it would require
> considerably more temporary space to work out each row before
> reconstructing the database.
>
> Similarly you may be able to use a custom binary delta encoding that uses
> fewer bits to encode each row.
>
> Roger
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
>
> iEYEARECAAYFAk9U+tMACgkQmOOfHg372QT8dgCfXRV3OwrDE+oUksDNCy+GAccu
> h9YAoJ6LaZY7521lfj94BoIxGp1wMpS1
> =rxdX
> -----END PGP SIGNATURE-----
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to