All,
This is how I see things.
RLE is good for record compression due to the following facts
(illustrated by the sketch right after this list):
1) An in-memory record has all its fields expanded up to their maximum
size. Shorter values are padded with zeroes (which are easy to compress).
2) An in-memory record stores the NULL bitmap and at the same time zaps
the space of the NULL values with zeroes (which are easy to compress).
3) An in-memory record has all its fields aligned to their respective
boundaries. Unused gaps are also zapped with zeroes (which are easy to
compress).
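For instance, a minimal sketch (a hypothetical layout, not our actual
record format) of how an expanded record ends up being mostly zeroes:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Assume a table declared as (ID INT, NAME VARCHAR(1000) CHARACTER SET
// UTF8). In UTF8, 1000 characters reserve 4000 bytes in the expanded
// record.
struct ExpandedRecord
{
    uint8_t  nullBitmap[1]; // point (2): NULL flags; NULL values zapped
    int32_t  id;            // point (3): aligned, the gap before it zeroed
    uint16_t nameLength;    // actual length of NAME in bytes
    char     name[4000];    // point (1): expanded to the maximum size
};

int main()
{
    ExpandedRecord rec;
    std::memset(&rec, 0, sizeof rec); // padding, gaps, NULLs: all zeroes

    rec.id = 42;
    rec.nameLength = 3;
    std::memcpy(rec.name, "abc", 3);  // 3 bytes used out of 4000

    size_t zeroes = 0;
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&rec);
    for (size_t i = 0; i < sizeof rec; i++)
        if (p[i] == 0)
            zeroes++;

    // Almost the entire record consists of zero runs, i.e. perfect RLE food.
    std::printf("%zu of %zu bytes are zeroes\n", zeroes, sizeof rec);
    return 0;
}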
Of course, it also works well for small numbers stored in INT/BIGINT
fields and for repeating character sequences in CHAR/VARCHAR fields. But
I'd say that 95% of the RLE effect is covered by the aforementioned
three cases, especially (1) and (2). Longish VARCHAR fields in UTF8 are
an extreme example.
The legacy algorithm compresses up to 128 bytes into 2 bytes, i.e. it
has a maximum compression ratio of 64x. It means that 10KB compresses
into 156 bytes, while theoretically it could be compressed into three
or four bytes. Quite a huge difference. This is what the suggested new
algorithm (actually, a clever modification of the old one) is expected
to solve.
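To make the 64x cap concrete, here is a minimal sketch of such an
encoder, assuming the classic signed-control-byte scheme (a negative
byte -n means "repeat the next byte n times", n up to 128, so a full
run costs 2 bytes; a positive byte n means "the next n bytes are
verbatim"). This is an illustration, not the actual production code:

#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int8_t> rleEncode(const std::vector<uint8_t>& in)
{
    std::vector<int8_t> out;
    std::vector<uint8_t> literals; // pending bytes not worth compressing

    auto flush = [&]() {
        if (!literals.empty())
        {
            out.push_back(static_cast<int8_t>(literals.size()));
            out.insert(out.end(), literals.begin(), literals.end());
            literals.clear();
        }
    };

    size_t i = 0;
    while (i < in.size())
    {
        // Measure the run of identical bytes starting at position i.
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 128)
            run++;

        if (run >= 3) // long enough for the 2-byte run encoding
        {
            flush();
            out.push_back(static_cast<int8_t>(-static_cast<int>(run)));
            out.push_back(static_cast<int8_t>(in[i]));
        }
        else          // too short: keep as literal bytes
        {
            for (size_t k = 0; k < run; k++)
            {
                literals.push_back(in[i + k]);
                if (literals.size() == 127) // control byte is signed
                    flush();
            }
        }
        i += run;
    }
    flush();
    return out;
}

int main()
{
    // 10000 zero bytes: 78 full runs of 128 plus one run of 16, two
    // bytes each, i.e. 158 bytes - close to the ideal 10000/64 = 156.
    std::vector<uint8_t> tenKB(10000, 0);
    std::printf("10000 -> %zu bytes\n", rleEncode(tenKB).size());
    return 0;
}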
So far so good.
My question is: would RLE compression still be useful if points (1) -
(3) disappeared? For example, if the record were packed the following way:
- NULL bitmap
- alignment padding is skipped
- NULL fields are skipped
- VARCHARs are stored in their actual length
- probably the same for CHARs; we'd just need to spend some CPU cycles
to calculate the actual length (*)
In other words, records are packed/unpacked semantically, using their
formats, e.g. as sketched below.
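A minimal sketch (with illustrative names - FieldDesc, Value,
packRecord - rather than our real API; the format is assumed to have
one entry per field):

#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

enum class FieldType { INT32, VARCHAR };

struct FieldDesc
{
    FieldType type;
    size_t    maxLength; // declared size; irrelevant for the packed length
};

using Value = std::optional<std::string>; // raw bytes; nullopt = NULL

std::vector<uint8_t> packRecord(const std::vector<FieldDesc>& format,
                                const std::vector<Value>& values)
{
    std::vector<uint8_t> out;

    // The NULL bitmap goes first, one bit per field.
    out.resize((format.size() + 7) / 8, 0);
    for (size_t i = 0; i < format.size(); i++)
        if (!values[i])
            out[i / 8] |= uint8_t(1u << (i % 8));

    // Then the fields: NULLs skipped entirely, no alignment padding,
    // VARCHARs stored at their actual length behind a 2-byte counter.
    for (size_t i = 0; i < format.size(); i++)
    {
        if (!values[i])
            continue; // a NULL occupies nothing beyond its bitmap bit

        const std::string& v = *values[i];
        if (format[i].type == FieldType::VARCHAR)
        {
            const uint16_t len = static_cast<uint16_t>(v.size());
            out.push_back(uint8_t(len & 0xFF));
            out.push_back(uint8_t(len >> 8));
        }
        out.insert(out.end(), v.begin(), v.end());
    }
    return out;
}

int main()
{
    const std::vector<FieldDesc> format = {
        { FieldType::INT32, 4 }, { FieldType::VARCHAR, 4000 } };
    const std::vector<Value> row = {
        std::string("\x2A\x00\x00\x00", 4), // ID = 42, little-endian
        std::string("abc") };               // 3 bytes, not 4000

    // 1 (bitmap) + 4 (INT) + 2 (counter) + 3 (data) = 10 bytes packed.
    std::printf("%zu bytes\n", packRecord(format, row).size());
    return 0;
}

Note that the small size here falls out of the format itself: nothing
in the packed image depends on zero runs anymore.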
How would such an approach compare to the old RLE algorithm? To the
proposed one? Anyone willing to test?
This way, RLE could still be used for long VARCHARs, optionally. Or
replaced with a value-based encoding or something else, again optionally.
I'd go even further and eliminate point (1) completely:
- record consists of two parts: fixed (described by the format) and
variable (follows the fixed one)
- all fixed-size fields are stored as now
- VARCHAR is stored as {length, offset}
- offset points to its data stored in the tail
i.e. switch to variable-length records (sketched below). It would cost
us some extra reallocations at runtime, but not as many as one might
think. It could dramatically reduce memory usage in some subsystems.
And much-shorter-than-declared strings would no longer need mandatory
RLE compression.
This surely still needs more thinking, but you get the idea.
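A minimal sketch of that layout (illustrative names again; a real
record would likely need a wider offset type):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

struct VarDesc
{
    uint16_t length; // actual byte length of the string
    uint16_t offset; // where its data lives, relative to the record start
};

// Fixed part for a hypothetical (ID INT, NAME VARCHAR, NOTE VARCHAR)
// format. All fixed-size fields are stored as now; each VARCHAR is
// reduced to a {length, offset} descriptor.
struct FixedPart
{
    int32_t id;
    VarDesc name;
    VarDesc note;
};

std::vector<uint8_t> buildRecord(int32_t id, const std::string& name,
                                 const std::string& note)
{
    FixedPart fixed{};
    fixed.id = id;

    std::vector<uint8_t> rec(sizeof(FixedPart)); // reserve the fixed part

    auto append = [&rec](VarDesc& d, const std::string& s) {
        d.length = static_cast<uint16_t>(s.size());
        d.offset = static_cast<uint16_t>(rec.size()); // points into the tail
        rec.insert(rec.end(), s.begin(), s.end());
    };
    append(fixed.name, name);
    append(fixed.note, note);

    std::memcpy(rec.data(), &fixed, sizeof fixed); // patch descriptors in
    return rec;
}

int main()
{
    // sizeof(FixedPart) + 3 + 5 bytes of actual string data (20 bytes
    // on a typical platform), however long the columns were declared.
    std::printf("%zu bytes\n", buildRecord(42, "abc", "hello").size());
    return 0;
}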
(*) I doubt any sane person uses CHARs for strings longer than a few
dozen characters, so I wouldn't care much about optimizing their storage.
Dmitry