TL;DR: Compress your data, and do it early.

Disk latency is high.
Disks (including SSDs) wear out faster with more writes.
Memory (for cache) is expensive.
Memory bandwidth is expensive.
Memory latency is high.
Network bandwidth is expensive.
Network latency is high.
Storage bus bandwidth (SAS, SATA, USB, etc) is expensive.
Storage bus latency sucks.

The CPU overhead for common zlib-based compression is relatively
inexpensive compared to these things.
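As a rough illustration (a minimal sketch; the sample data and
compression level are arbitrary), Python's stdlib zlib shrinks
repetitive text-like data severalfold at modest CPU cost:

```python
import zlib

# Repetitive, text-like data (log lines compress very well).
data = b"GET /index.html HTTP/1.1 200 1024\n" * 1000

compressed = zlib.compress(data, 6)  # 6 is zlib's default level

# The round trip is lossless, and the ratio here is dramatic.
assert zlib.decompress(compressed) == data
print(len(data), len(compressed))
```

Real-world ratios depend heavily on the input, but for logs,
source code, and similar text, cutting bytes moved across any of
the buses above by 3-10x routinely beats the CPU cost of zlib.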

Everything that gets stored on disk is expected to be read at some
point.  Reading it will use memory and memory bandwidth on just
about any OS.  Memory used for caching is not cheap, and neither
are memory bandwidth and latency.

Sure, one could use O_DIRECT, an interface designed by deranged
monkeys[2], to avoid the caching; but it is tricky to use and
most apps need to be modified to use it.


Transparent compression at the filesystem or virtual memory[1]
layers helps at some points, but becomes worthless once your
data needs to be transferred to other machines which do not
compress transparently.


As a bonus, compression formats such as FLAC and gzip tend to come
with integrity checking, too, giving you extra peace of mind when
you have unreliable hardware.


Sometimes compression does not even require special algorithms or
code.  It could be as simple as choosing tabs over spaces for
indentation to get a 16% improvement in grep performance :)

   http://mid.gmane.org/[email protected]
   ("Re: On Tabs and Spaces" - Jeff King on the git mailing list)


Footnotes:
[1] https://en.wikipedia.org/wiki/Virtual_memory_compression
[2] http://man7.org/linux/man-pages/man2/open.2.html
--
unsubscribe: [email protected]
archive: http://80x24.org/misc/
