Re: Compression prototype

Ilya Kasnacheev Fri, 31 Aug 2018 08:38:42 -0700

Just as I had started praising Zstd, it began crashing the JVM in native
code during dictionary training :(


I guess there is a limit on the size of the training buffer, beyond which
it behaves erroneously. Maybe we will need to submit a pull request :)
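
A minimal sketch of one possible workaround, assuming the crash really is tied to the size of the training buffer (which is only a guess here): cap the total number of sample bytes collected before they are handed to the dictionary trainer. The class name `BoundedSampleBuffer` and the byte budget are hypothetical, and no Zstd API is called.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: collects training samples but never exceeds a byte budget. */
final class BoundedSampleBuffer {
    private final long maxBytes;
    private long usedBytes;
    private final List<byte[]> samples = new ArrayList<>();

    BoundedSampleBuffer(long maxBytes) {
        if (maxBytes <= 0)
            throw new IllegalArgumentException("maxBytes must be positive");
        this.maxBytes = maxBytes;
    }

    /** Stores a copy of the sample if it fits in the remaining budget; returns false otherwise. */
    boolean offer(byte[] sample) {
        if (sample.length == 0 || usedBytes + sample.length > maxBytes)
            return false;
        samples.add(sample.clone());
        usedBytes += sample.length;
        return true;
    }

    /** Total number of sample bytes accepted so far. */
    long usedBytes() {
        return usedBytes;
    }

    /** Read-only view of the accepted samples, in insertion order. */
    List<byte[]> samples() {
        return Collections.unmodifiableList(samples);
    }
}
```

Samples that do not fit are simply skipped, so the trainer would only ever see at most `maxBytes` of input; what a safe budget is would have to be found by experiment.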

Regards,
-- 
Ilya Kasnacheev


Fri, 31 Aug 2018 at 11:56, Ilya Kasnacheev <ilya.kasnach...@gmail.com>:

> Hello!
>
> I am testing Zstd with a dictionary, and it looks very, very promising. I'm
> confident I can choose settings where it is faster than my own algo while
> giving a better compression ratio on the "cod" dataset.
>
> So I am happily retiring my code and switching to Zstd. This would probably
> mean that we will ship the compression implementation as a separate module.
>
> It is a pity that I did not find out about Zstd dictionary support
> earlier; I could have skipped a few days of work.
>
> Without a dictionary, Zstd compressed worse than my own algo, but it
> was faster.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Mon, 27 Aug 2018 at 14:53, Vyacheslav Daradur <daradu...@gmail.com>:
>
>> According to my benchmarks, the zstd compression algorithm [1] looks very
>> interesting: it has a high compression ratio with quite good speed.
>> AFAIK it supports external dictionaries, but I'm not sure about using
>> it with dictionaries built on the fly. Anyway, have a look at it (it
>> has an ASF 2.0-friendly license).
>>
>> Also, here is a data generator / loader [2]. If it is useful for
>> you, we should ask Nikolay Izhikov to share the public docs to get started.
>>
>> [1] https://github.com/facebook/zstd
>> [2] https://github.com/nizhikov/ignite-cod-data-loader
>> On Mon, Aug 27, 2018 at 2:11 PM Ilya Kasnacheev
>> <ilya.kasnach...@gmail.com> wrote:
>> >
>> > Hello Vyacheslav!
>> >
>> > Unfortunately I have not found any efficient algorithms that will allow
>> > me to use an external dictionary as a pre-processed data structure. If
>> > plain gzip is used without a dictionary, the compression ratio is around
>> > 0.7, as opposed to the 0.4 that I get with the custom implementation;
>> > AFAIR the performance was also worse. I didn't really try it with a
>> > dictionary, but I assume performance will be even worse since it will
>> > have to scan the dictionary before getting to the actual data.
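
For reference, the untried "gzip with a dictionary" variant mentioned above can be sketched with the JDK's own `java.util.zip.Deflater` and its preset-dictionary support. This is only an illustration under stated assumptions: the class name `PresetDictionaryDemo` is hypothetical, and zlib-wrapped deflate stands in for gzip.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical demo class: deflate with an optional preset (external) dictionary. */
final class PresetDictionaryDemo {
    /** Compresses data; if dict is non-null it is loaded as a preset dictionary first. */
    static byte[] compress(byte[] data, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null)
                deflater.setDictionary(dict);
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            while (!deflater.finished())
                out.write(chunk, 0, deflater.deflate(chunk));
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses data, supplying the dictionary when the stream asks for one. */
    static byte[] decompress(byte[] compressed, byte[] dict) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                out.write(chunk, 0, n);
                if (n == 0 && !inflater.finished()) {
                    // No progress: either the stream wants its dictionary or it is broken.
                    if (inflater.needsDictionary() && dict != null)
                        inflater.setDictionary(dict);
                    else
                        throw new IllegalStateException("truncated input or missing dictionary");
                }
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt input", e);
        } finally {
            inflater.end();
        }
    }
}
```

Note that the dictionary is handed to every new `Deflater`, so its cost is paid again for each record, which is the overhead the paragraph above worries about.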
>> >
>> > We have such a huge array of tests that we can just run them all with
>> > compression enabled and see if there are any new failures. But the impact
>> > of my commit is fairly low: it is only triggered when data is written to
>> > a page (maybe to the WAL also?), and we don't really do much frivolous
>> > stuff to pages.
>> >
>> > Still, I am very much interested in finding existing compression
>> > implementations with support for external dictionaries; I am also very
>> > much interested in having different implementations of compression for
>> > Apache Ignite (such as per-page compression) and comparing them by
>> > benchmark and by code impact. I am also very interested in large standard
>> > datasets for Apache Ignite (or generators thereof) so that we can run
>> > precise benchmarks on various compression schemes. If you have any of the
>> > above, please get back to me.
>> >
>> > Regards,
>> > --
>> > Ilya Kasnacheev
>> >
>> >
>> > Mon, 27 Aug 2018 at 11:35, Vyacheslav Daradur <daradu...@gmail.com>:
>> >
>> > > Hi Igniters!
>> > >
>> > > Ilya, I'm glad to see one more person who is interested in the
>> > > compression feature in Ignite.
>> > >
>> > > I looked through the pull request and want to share the following
>> > > thoughts:
>> > >
>> > > It's very dangerous to use a custom algorithm in this way: you store
>> > > serialized data separately from the dictionary, and there are a lot of
>> > > points where we may lose data: rebalancing, serialization errors, node
>> > > restarts and so on.
>> > >
>> > > I'd suggest the following ways to improve reliability:
>> > > - use well-known algorithms (e.g. zstd, deflater, lzma, gzip), which
>> > > allow us to decompress data in any situation
>> > > - store the dictionary inside the page with the data
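
One way to read the second suggestion above, as a rough sketch and not a proposal for Ignite's actual page format: prefix the compressed payload with the dictionary itself, so that a page can be decoded with nothing but its own bytes. The layout `[dictLen:int][dict][deflate stream]` and the class name `SelfContainedPage` are hypothetical, and the JDK's `Deflater` stands in for whichever algorithm is finally chosen.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical page layout: [dictLen:int][dict bytes][deflate stream]. */
final class SelfContainedPage {
    /** Compresses raw against dict and stores the dictionary next to the payload. */
    static byte[] write(byte[] raw, byte[] dict) {
        if (dict.length == 0)
            throw new IllegalArgumentException("dictionary must not be empty");
        Deflater deflater = new Deflater();
        try {
            deflater.setDictionary(dict);
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream packed = new ByteArrayOutputStream();
            byte[] chunk = new byte[512];
            while (!deflater.finished())
                packed.write(chunk, 0, deflater.deflate(chunk));
            return ByteBuffer.allocate(4 + dict.length + packed.size())
                .putInt(dict.length)
                .put(dict)
                .put(packed.toByteArray())
                .array();
        } finally {
            deflater.end();
        }
    }

    /** Restores the original bytes using only what is stored in the page. */
    static byte[] read(byte[] page) {
        ByteBuffer buf = ByteBuffer.wrap(page);
        byte[] dict = new byte[buf.getInt()];
        buf.get(dict);
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(page, buf.position(), buf.remaining());
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            byte[] chunk = new byte[512];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                raw.write(chunk, 0, n);
                if (n == 0 && !inflater.finished()) {
                    if (!inflater.needsDictionary())
                        throw new IllegalStateException("truncated or corrupt page");
                    inflater.setDictionary(dict); // dictionary comes from the page itself
                }
            }
            return raw.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt page", e);
        } finally {
            inflater.end();
        }
    }
}
```

The obvious trade-off is space: a full dictionary copy per page can eat much of the compression gain, so a real design would have to weigh that against the reliability benefit.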
>> > >
>> > > Also, we have had a lot of discussions [1] [2] about compression at the
>> > > BinaryObject and BinaryMarshaller level, and Vladimir Ozerov was
>> > > strictly against compression at this level.
>> > > If something has changed since then, you may look through [1] [2] [3].
>> > > I've done a lot of research comparing algorithms; it may be useful
>> > > for you.
>> > >
>> > > [1] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
>> > > [2] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
>> > > [3] https://issues.apache.org/jira/browse/IGNITE-3592
>> > > [4] https://issues.apache.org/jira/browse/IGNITE-5226
>> > > [5] https://github.com/daradurvs/ignite-compression
>> > > On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <dma...@apache.org> wrote:
>> > > >
>> > > > >
>> > > > > Currently, the dictionary for decompression is only stored on heap.
>> > > > > After restart there's compressed data in the PDS, but there's no
>> > > > > dictionary :)
>> > > >
>> > > >
>> > > > Basically, it means that I've lost my data, right? How about
>> > > > persisting data to disk?
>> > > >
>> > > > Overall, we need Vladimir Ozerov to check the contribution. He was
>> > > > the one who sponsored the IEP and knows the area best.
>> > > >
>> > > > --
>> > > > Denis
>> > > >
>> > > > On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>> > > >
>> > > > > Hello!
>> > > > >
>> > > > > It is more or less a part of IEP-20, since I have updated the IEP
>> > > > > with this particular direction.
>> > > > >
>> > > > > Regards,
>> > > > >
>> > > > > --
>> > > > > Ilya Kasnacheev
>> > > > >
>> > > > > 2018-08-24 2:56 GMT+03:00 Denis Magda <dma...@apache.org>:
>> > > > >
>> > > > > > Hi Ilya,
>> > > > > >
>> > > > > > Sounds terrific! Is this part of the following Ignite enhancement
>> > > > > > proposal?
>> > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
>> > > > > >
>> > > > > > --
>> > > > > > Denis
>> > > > > >
>> > > > > > On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hello!
>> > > > > > >
>> > > > > > > My plan was to add a compression section to the cache
>> > > > > > > configuration, where you can enable compression, enable key
>> > > > > > > compression (which has heavier performance implications), adjust
>> > > > > > > dictionary gathering settings, and in the future possibly choose
>> > > > > > > between algorithms. In fact I'm not sure about the last point,
>> > > > > > > since my assumption is that you can always just use the latest
>> > > > > > > and greatest, but maybe we can have e.g. a very fast but not very
>> > > > > > > strong one vs. a slower but stronger one.
>> > > > > > >
>> > > > > > > I'm not sure yet whether we should share a dictionary between all
>> > > > > > > caches or have a separate dictionary for every cache.
>> > > > > > >
>> > > > > > > With regard to the data format, of course there will be room for
>> > > > > > > further extension.
>> > > > > > >
>> > > > > > > Regards,
>> > > > > > >
>> > > > > > > --
>> > > > > > > Ilya Kasnacheev
>> > > > > > >
>> > > > > > > 2018-08-23 15:13 GMT+03:00 Sergey Kozlov <skoz...@gridgain.com>:
>> > > > > > >
>> > > > > > > > Hi Ilya
>> > > > > > > >
>> > > > > > > > Is there a plan to introduce it as an option of the Ignite
>> > > > > > > > configuration? In that case, instead of a boolean type I suggest
>> > > > > > > > using an enum, to reserve the ability to add compression
>> > > > > > > > algorithms in the future.
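
A minimal sketch of what the enum suggestion above could look like, combined with the key-compression switch from the reply. Every name here (`CompressionAlgorithm`, `CacheCompressionConfig`, the enum constants) is hypothetical and not part of the Ignite API.

```java
import java.util.Objects;

/** Hypothetical set of algorithms; new constants can be added without changing the setter. */
enum CompressionAlgorithm {
    NONE,
    LZW_HUFFMAN,
    ZSTD
}

/** Hypothetical per-cache compression section. */
final class CacheCompressionConfig {
    private CompressionAlgorithm algorithm = CompressionAlgorithm.NONE;
    private boolean compressKeys;

    /** Selects the algorithm; NONE disables compression. */
    CacheCompressionConfig setAlgorithm(CompressionAlgorithm algorithm) {
        this.algorithm = Objects.requireNonNull(algorithm, "algorithm");
        return this;
    }

    CompressionAlgorithm getAlgorithm() {
        return algorithm;
    }

    /** Key compression is a separate switch because of its heavier performance cost. */
    CacheCompressionConfig setCompressKeys(boolean compressKeys) {
        this.compressKeys = compressKeys;
        return this;
    }

    boolean isCompressKeys() {
        return compressKeys;
    }

    /** Replaces the boolean flag: compression is on whenever an algorithm is chosen. */
    boolean isEnabled() {
        return algorithm != CompressionAlgorithm.NONE;
    }
}
```

With this shape, `new CacheCompressionConfig().setAlgorithm(CompressionAlgorithm.ZSTD)` plays the role of the boolean "enabled" flag while leaving room for more algorithms later.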
>> > > > > > > >
>> > > > > > > > On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hello!
>> > > > > > > > >
>> > > > > > > > > I want to share my compression prototype with the developer
>> > > > > > > > > community.
>> > > > > > > > >
>> > > > > > > > > Long story short, it compresses BinaryObject's byte[] as they are
>> > > > > > > > > written to a Durable Memory page, operating on a pre-built
>> > > > > > > > > dictionary. The typical compression ratio is 0.4 (meaning 2.5x
>> > > > > > > > > compression) using custom LZW+Huffman. Metadata, indexes and
>> > > > > > > > > primitive values are entirely unaffected.
>> > > > > > > > >
>> > > > > > > > > This is akin to DB2's table-level compression [1], but
>> > > > > > > > > independently invented.
>> > > > > > > > >
>> > > > > > > > > On Yardstick tests the performance hit is -6% with PDS and up to
>> > > > > > > > > -25% (in throughput) with In-Memory loads. It also means you can
>> > > > > > > > > fit ~twice as much data into the same IM cluster, or have a
>> > > > > > > > > higher RAM/disk ratio with a PDS cluster, saving on hardware or
>> > > > > > > > > decreasing latency.
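
The gap between "2.5x compression" and "~twice as much data" can be illustrated with a small calculation, since metadata, indexes and primitive values are not compressed. The sketch below assumes a simple model in which a fixed fraction of memory is compressible; that fraction is a made-up parameter, not a measured value, and `CapacityEstimate` is a hypothetical name.

```java
/** Hypothetical helper: estimates how much more data fits after compression. */
final class CapacityEstimate {
    /**
     * @param ratio compressed size / original size of the compressible part (0.4 in the message).
     * @param compressibleFraction share of memory taken by compressible bytes (assumed, not measured).
     * @return how many times more data fits into the same amount of memory.
     */
    static double capacityGain(double ratio, double compressibleFraction) {
        if (ratio <= 0 || ratio > 1)
            throw new IllegalArgumentException("ratio must be in (0, 1]");
        if (compressibleFraction < 0 || compressibleFraction > 1)
            throw new IllegalArgumentException("compressibleFraction must be in [0, 1]");
        // Uncompressed part stays as is; compressible part shrinks by the ratio.
        double sizeAfter = (1.0 - compressibleFraction) + compressibleFraction * ratio;
        return 1.0 / sizeAfter;
    }
}
```

Under this model, a ratio of 0.4 gives a gain of 2.5 only if everything is compressible; if about five sixths of the memory is compressible, the gain comes out at about 2.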
>> > > > > > > > >
>> > > > > > > > > The code is available as PR 4295 [2] (set
>> > > > > > > > > IGNITE_ENABLE_COMPRESSION=true to activate). Note that it will
>> > > > > > > > > not presently survive a PDS node restart. The impact is very
>> > > > > > > > > small; the patch should be applicable to most 2.x releases.
>> > > > > > > > >
>> > > > > > > > > Sure, there's a long way to go before this prototype can hope to
>> > > > > > > > > be included, but first I would like to hear input from fellow
>> > > > > > > > > Igniters.
>> > > > > > > > >
>> > > > > > > > > See also IEP-20 [3].
>> > > > > > > > >
>> > > > > > > > > 1. https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
>> > > > > > > > > 2. https://github.com/apache/ignite/pull/4295
>> > > > > > > > > 3. https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
>> > > > > > > > >
>> > > > > > > > > Regards,
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Ilya Kasnacheev
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Sergey Kozlov
>> > > > > > > > GridGain Systems
>> > > > > > > > www.gridgain.com
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards, Vyacheslav D.
>> > >
>>
>>
>>
>> --
>> Best Regards, Vyacheslav D.
>>
>
