Just as I started praising Zstd, it began crashing the JVM in native code during dictionary training :(
I guess there is a limit on the size of the training buffer, beyond which it
behaves erroneously. Maybe we will need to submit a pull request :)

Regards,
--
Ilya Kasnacheev


Fri, 31 Aug 2018 at 11:56, Ilya Kasnacheev <ilya.kasnach...@gmail.com>:

> Hello!
>
> I am testing Zstd with dictionary, and it looks very, very promising. I'm
> confident I can choose settings where it is faster than my own algo while
> bringing better compression ratio, on "cod" dataset.
>
> So I am happily retiring my code and switching to Zstd. Would probably
> mean that we will ship compression implementation as a separate module.
>
> It is a pity that I did not find out about Zstd dictionary support
> earlier, that would mean I could skip a few days of work.
>
> Without dictionary the results of Zstd were worse than my own algo, but it
> was faster.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Mon, 27 Aug 2018 at 14:53, Vyacheslav Daradur <daradu...@gmail.com>:
>
>> According to my benchmarks - zstd compression algorithm [1] looks very
>> interesting, it has a high compression ratio with quite good speed.
>> AFAIK it supports external dictionaries, but I'm not sure about using
>> it with "on the fly building" dictionaries. Anyway, have a look at it (it
>> has an ASF 2.0 friendly license).
>>
>> Also, here is data generator / loader [2]. If it will be useful for
>> you we should ask Nikolay Izhikov to share public docs to start.
>>
>> [1] https://github.com/facebook/zstd
>> [2] https://github.com/nizhikov/ignite-cod-data-loader
>> On Mon, Aug 27, 2018 at 2:11 PM Ilya Kasnacheev
>> <ilya.kasnach...@gmail.com> wrote:
>> >
>> > Hello Vyacheslav!
>> >
>> > Unfortunately I have not found any efficient algorithms that will allow me
>> > to use external dictionary as a pre-processed data structure. If plain gzip
>> > is used without dictionary, the compression is around 0.7, as opposed to
>> > 0.4 that I will get with custom implementation, AFAIR the performance was
>> > also worse.
>> > I didn't really try it with dictionary, but I assume
>> > performance will be even worse since it will have to scan dictionary before
>> > getting to actual data.
>> >
>> > We have such a huge array of tests that we can just run them all with
>> > compression enabled, see if there are any new failures. But the impact of
>> > my commit is fairly low, it is only triggered when data is written to page
>> > (maybe to WAL also?), and we don't really do much frivolous stuff to pages.
>> >
>> > Still, I am very much interested in finding existing compression
>> > implementations with support of external dictionary; I am also very much
>> > interested in having different implementations of compression for Apache
>> > Ignite (such as per page compression) and comparing them by benchmark and
>> > by code impact. I am also very interested in large standard datasets for
>> > Apache Ignite (or generators thereof) so that we can run precise benchmarks
>> > on various compression schemes. If you have any of the above, please
>> > get back to me.
>> >
>> > Regards,
>> > --
>> > Ilya Kasnacheev
>> >
>> >
>> > Mon, 27 Aug 2018 at 11:35, Vyacheslav Daradur <daradu...@gmail.com>:
>> >
>> > > Hi Igniters!
>> > >
>> > > Ilya, I'm glad to see one more person who is interested in the
>> > > compression feature in Ignite.
>> > >
>> > > I looked through the pull request and want to share the following thoughts:
>> > >
>> > > It's very dangerous using a custom algorithm in this way - you store
>> > > serialized data separate from a dictionary and there are a lot of
>> > > points when we may lose data: rebalancing, serialization errors, node
>> > > rebooting and so on.
>> > >
>> > > I'd suggest the following ways to improve reliability:
>> > > - use well-known algorithms (zstd, deflater, lzma, gzip, etc.)
>> > >   that allow us to decompress data in any situation
>> > > - store the dictionary inside page with data
>> > >
>> > > Also, we have had a lot of discussions [1] [2] about compression on
>> > > BinaryObject and BinaryMarshaller level and Vladimir Ozerov was
>> > > strictly against a compression on this level.
>> > > If something has changed since then, you may look through [1] [2] [3].
>> > > I've done a lot of research in algorithms comparison, it may be useful
>> > > for you.
>> > >
>> > > [1] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
>> > > [2] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
>> > > [3] https://issues.apache.org/jira/browse/IGNITE-3592
>> > > [4] https://issues.apache.org/jira/browse/IGNITE-5226
>> > > [5] https://github.com/daradurvs/ignite-compression
>> > > On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <dma...@apache.org> wrote:
>> > > >
>> > > > > Currently, the dictionary for decompression is only stored on heap.
>> > > > > After restart there's compressed data in the PDS, but there's no
>> > > > > dictionary :)
>> > > >
>> > > > Basically, it means that I've lost my data, right? How about persisting
>> > > > data to disk?
>> > > >
>> > > > Overall, we need Vladimir Ozerov to check the contribution. He was the
>> > > > one who sponsored the IEP and knows the area best.
>> > > >
>> > > > --
>> > > > Denis
>> > > >
>> > > > On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hello!
>> > > > >
>> > > > > It is somewhat a part of IEP-20, since I have updated it with this
>> > > > > particular direction.
>> > > > >
>> > > > > Regards,
>> > > > >
>> > > > > --
>> > > > > Ilya Kasnacheev
>> > > > >
>> > > > > 2018-08-24 2:56 GMT+03:00 Denis Magda <dma...@apache.org>:
>> > > > >
>> > > > > > Hi Ilya,
>> > > > > >
>> > > > > > Sounds terrific! Is this part of the following Ignite enhancement
>> > > > > > proposal?
>> > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
>> > > > > >
>> > > > > > --
>> > > > > > Denis
>> > > > > >
>> > > > > > On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hello!
>> > > > > > >
>> > > > > > > My plan was to add a compression section to cache configuration, where
>> > > > > > > you can enable compression, enable key compression (which has heavier
>> > > > > > > performance implications), adjust dictionary gathering settings, and in
>> > > > > > > the future possibly choose between algorithms. In fact I'm not sure, since
>> > > > > > > my assumption is that you can always just use latest&greatest, but maybe
>> > > > > > > we can have e.g. very fast and not very strong vs. slower but stronger
>> > > > > > > one.
>> > > > > > >
>> > > > > > > I'm not sure yet if we should share dictionary between all caches vs.
>> > > > > > > having separate dictionary for every cache.
>> > > > > > >
>> > > > > > > With regards to data format, of course there will be room for further
>> > > > > > > extension.
>> > > > > > >
>> > > > > > > Regards,
>> > > > > > >
>> > > > > > > --
>> > > > > > > Ilya Kasnacheev
>> > > > > > >
>> > > > > > > 2018-08-23 15:13 GMT+03:00 Sergey Kozlov <skoz...@gridgain.com>:
>> > > > > > >
>> > > > > > > > Hi Ilya
>> > > > > > > >
>> > > > > > > > Is there a plan to introduce it as an option of Ignite configuration?
>> > > > > > > > In that case, instead of a boolean type, I suggest using an enum and
>> > > > > > > > reserving the ability to extend the set of compression algorithms in
>> > > > > > > > the future.
>> > > > > > > >
>> > > > > > > > On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hello!
>> > > > > > > > >
>> > > > > > > > > I want to share with the developer community my compression prototype.
>> > > > > > > > >
>> > > > > > > > > Long story short, it compresses BinaryObject's byte[] as they are written
>> > > > > > > > > to Durable Memory page, operating on a pre-built dictionary. Typical
>> > > > > > > > > compression ratio is 0.4 (meaning 2.5x compression) using custom
>> > > > > > > > > LZW+Huffman. Metadata, indexes and primitive values are unaffected
>> > > > > > > > > entirely.
>> > > > > > > > >
>> > > > > > > > > This is akin to DB2's table-level compression[1] but independently
>> > > > > > > > > invented.
>> > > > > > > > >
>> > > > > > > > > On Yardstick tests performance hit is -6% with PDS and up to -25% (in
>> > > > > > > > > throughput) with In-Memory loads. It also means you can fit ~twice as
>> > > > > > > > > much data into the same IM cluster, or have higher ram/disk ratio with
>> > > > > > > > > PDS cluster, saving on hardware or decreasing latency.
>> > > > > > > > >
>> > > > > > > > > The code is available as PR 4295[2] (set IGNITE_ENABLE_COMPRESSION=true
>> > > > > > > > > to activate). Note that it will not presently survive a PDS node restart.
>> > > > > > > > > The impact is very small, the patch should be applicable to most 2.x
>> > > > > > > > > releases.
>> > > > > > > > >
>> > > > > > > > > Sure there's a long way before this prototype can have hope of being
>> > > > > > > > > included, but first I would like to hear input from fellow igniters.
>> > > > > > > > >
>> > > > > > > > > See also IEP-20[3].
>> > > > > > > > >
>> > > > > > > > > 1. https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
>> > > > > > > > > 2. https://github.com/apache/ignite/pull/4295
>> > > > > > > > > 3. https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
>> > > > > > > > >
>> > > > > > > > > Regards,
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Ilya Kasnacheev
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Sergey Kozlov
>> > > > > > > > GridGain Systems
>> > > > > > > > www.gridgain.com
>> > >
>> > > --
>> > > Best Regards, Vyacheslav D.
>>
>> --
>> Best Regards, Vyacheslav D.
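
For readers following the external-dictionary discussion above, here is a minimal sketch of the idea using the JDK's own java.util.zip.Deflater/Inflater preset-dictionary support. It is not the code from PR 4295 and it is not Zstd; the class name PresetDictionaryDemo, the sample dictionary and the sample record are made up for illustration. It only shows the two points raised in the thread: a shared dictionary can help on small records, and the compressed bytes cannot be read back once the dictionary is lost.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical demo class, not part of Ignite. */
public class PresetDictionaryDemo {
    /** Deflates data; a non-null dict is installed as a preset dictionary first. */
    static byte[] compress(byte[] data, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null)
                deflater.setDictionary(dict);
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished())
                out.write(chunk, 0, deflater.deflate(chunk));
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Inflates data; fails if the stream asks for a dictionary and none is given. */
    static byte[] decompress(byte[] compressed, byte[] dict) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                out.write(chunk, 0, n);
                if (n == 0 && !inflater.finished()) {
                    if (inflater.needsDictionary()) {
                        // The zlib header says a preset dictionary was used.
                        if (dict == null)
                            throw new DataFormatException("dictionary required but not available");
                        inflater.setDictionary(dict);
                    } else if (inflater.needsInput()) {
                        throw new DataFormatException("truncated stream");
                    }
                }
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        // Made-up dictionary: the parts that repeat across records.
        byte[] dict = "{\"id\":,\"firstName\":\"\",\"lastName\":\"\",\"city\":\"\",\"country\":\"\"}"
            .getBytes(StandardCharsets.UTF_8);
        byte[] rec = "{\"id\":42,\"firstName\":\"Ada\",\"lastName\":\"Lovelace\",\"city\":\"London\",\"country\":\"UK\"}"
            .getBytes(StandardCharsets.UTF_8);

        byte[] plain = compress(rec, null);
        byte[] withDict = compress(rec, dict);
        System.out.println("raw=" + rec.length + " plain=" + plain.length + " dict=" + withDict.length);
        System.out.println(new String(decompress(withDict, dict), StandardCharsets.UTF_8));
    }
}
```

Calling decompress(withDict, null) throws, which is the same situation as a PDS node restarting with compressed pages on disk and the dictionary only on heap. Storing the dictionary next to the data, as suggested above, avoids that.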