Hello!

In the case of Apache Ignite, most of the savings come from the BinaryObject
format, which encodes types and fields as recurring byte sequences. Enum and
string flags also end up in the dictionary. Then, as the compressor processes
a record, it fills in the rest using its own adaptive dictionary.

But within a single cache, most if not all entries share an identical
BinaryObject layout, so a tiny dictionary covers that case. Compression
algorithms are not very keen on large dictionaries anyway, preferring to
exploit local regularities in the byte stream.

E.g. if a cache holds large entries with low BinaryObject overhead, they are
served just fine by "generic" (dictionary-less) compression.
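The effect can be sketched with Python's stdlib zlib, which also supports
preset dictionaries (this is an illustration, not Ignite code: the record
layout and field names below are made up, and zlib stands in for Zstd):

```python
import zlib

# Hypothetical small records sharing identical boilerplate, standing in
# for BinaryObject headers and field metadata.
boilerplate = b'{"typeId":1234,"fields":["id","name","status"],"status":"'
records = [boilerplate + str(i).encode() + b'"}' for i in range(1000)]

# A tiny preset dictionary holding just the shared boilerplate.
dictionary = boilerplate

def compress(data: bytes, zdict: bytes = b"") -> bytes:
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

plain = sum(len(compress(r)) for r in records)
with_dict = sum(len(compress(r, dictionary)) for r in records)
print(plain, with_dict)  # the dictionary version is markedly smaller

# Round trip: the decompressor needs the same preset dictionary.
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(compress(records[0], dictionary)) == records[0]
```

Each record compressed alone must re-encode the boilerplate from scratch,
while the preset dictionary lets the compressor reference it for free, which
is why a small dictionary pays off so well on many small, similarly-shaped
entries.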

All of the above is speculation on my part, actually. I just observe that on
a large data set the compression ratio is around 0.4 (2.5x) with a dictionary
of 1024 bytes. The rest is a black box.

Regards,
-- 
Ilya Kasnacheev


On Tue, Sep 4, 2018 at 17:16, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> On Tue, Sep 4, 2018 at 2:55 AM, Ilya Kasnacheev <ilya.kasnach...@gmail.com
> >
> wrote:
>
> > Hello!
> >
> > Each node has a local dictionary (per node currently, per cache planned).
> > Dictionary is never shared between nodes. As data patterns shift,
> > dictionary rotation is also planned.
> >
> > With Zstd, the best dictionary size seems to be 1024 bytes. I imagine it
> is
> > enough to store common BinaryObject boilerplate, and everything else is
> > compressed on the fly. The source sample is 16k records.
> >
> >
> Thanks, Ilya, understood. I think per-cache is a better idea. However, I
> have a question about dictionary size. Ignite stores TBs of data. How do
> you plan the dictionary to fit in 1K bytes?
>
> D.
>
