Hello!

In the case of Apache Ignite, most of the savings are due to the BinaryObject format, which encodes types and fields as byte sequences. Any enum/string flags also end up in the dictionary. Then, as the compressor processes a record, it fills up its individual dictionary.
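The shared-boilerplate idea can be sketched with the preset-dictionary support in Python's standard zlib module (Zstd dictionaries work on the same principle, but zstd is not in the stdlib). The record layout and field names below are invented for illustration; this is not Ignite's actual BinaryObject wire format:

```python
import zlib

# Toy records mimicking cache entries that share per-type boilerplate.
# The layout is invented for illustration, NOT Ignite's real format.
BOILERPLATE = b"typeId=Person;schemaId=42;fields=name,age,city;"
records = [
    BOILERPLATE + ("name=User%d;age=%d;city=City%d"
                   % (i, 20 + i % 50, i % 10)).encode()
    for i in range(100)
]

# A tiny preset dictionary holding the shared boilerplate -- analogous
# to the ~1024-byte Zstd dictionary discussed in this thread.
zdict = BOILERPLATE

def deflate_size(data, preset=None):
    """Size of `data` after DEFLATE, optionally with a preset dictionary."""
    if preset is None:
        c = zlib.compressobj(9)
    else:
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                             zlib.Z_DEFAULT_STRATEGY, preset)
    return len(c.compress(data) + c.flush())

plain = sum(deflate_size(r) for r in records)
with_dict = sum(deflate_size(r, zdict) for r in records)
print("without dict: %d B, with dict: %d B, ratio %.2f"
      % (plain, with_dict, with_dict / plain))
```

Because every record is compressed independently (as individual cache entries would be), the generic compressor cannot see the boilerplate repeating across records; the preset dictionary turns that boilerplate into a cheap back-reference in each one.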
But within one cache, most if not all entries have an identical BinaryObject layout, so a tiny dictionary covers that case. Compression algorithms are not very keen on large dictionaries anyway, preferring to work with local regularities in the byte stream. E.g., if a cache holds large entries with low BinaryObject overhead, they are served just fine by "generic" compression.

All of the above is speculation on my part, actually. I just observe that on a large data set the compression ratio is around 0.4 (2.5x) with a dictionary of 1024 bytes. The rest is a black box.

Regards,
--
Ilya Kasnacheev


Tue, Sep 4, 2018 at 17:16, Dmitriy Setrakyan <dsetrak...@apache.org>:

> On Tue, Sep 4, 2018 at 2:55 AM, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> wrote:
>
> > Hello!
> >
> > Each node has a local dictionary (per node currently, per cache planned).
> > The dictionary is never shared between nodes. As data patterns shift,
> > dictionary rotation is also planned.
> >
> > With Zstd, the best dictionary size seems to be 1024 bytes. I imagine it
> > is enough to store the common BinaryObject boilerplate, and everything
> > else is compressed on the fly. The source sample is 16k records.
>
> Thanks, Ilya, understood. I think per-cache is a better idea. However, I
> have a question about dictionary size. Ignite stores TBs of data. How do
> you plan the dictionary to fit in 1K bytes?
>
> D.