[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774646#comment-16774646 ]

Dinesh Joshi edited comment on CASSANDRA-14482 at 2/22/19 1:23 AM:
---

Thanks, [~benedict] for the insightful comment. I patched the {{zstd-jni}} to add the ability to enable checksumming on the methods that [~bdeggleston] suggested. It was accepted upstream and is now available starting with {{zstd-jni-1.3.8-5}}. I have pulled it in and enabled it. I think that resolves Blake's concerns regarding GC and we get checksumming as well.

was (Author: djoshi3):
Thanks, [~benedict] for the insightful comment. I patched the {{zstd-jni}} to add the ability to enable compression on the methods that [~bdeggleston] suggested. It was accepted upstream and is now available starting with {{zstd-jni-1.3.8-5}}. I have pulled it in and enabled it. I think that resolves Blake's concerns regarding GC and we get checksumming as well.

> ZSTD Compressor support in Cassandra
>
> Key: CASSANDRA-14482
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14482
> Project: Cassandra
> Issue Type: New Feature
> Components: Dependencies, Feature/Compression
> Reporter: Sushma A Devendrappa
> Assignee: Dinesh Joshi
> Priority: Major
> Labels: performance, pull-request-available
> Fix For: 4.x
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff.
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773915#comment-16773915 ]

Benedict edited comment on CASSANDRA-14482 at 2/21/19 10:15 AM:

Going over the data twice is unlikely to incur a much greater penalty than going over it once and doing both things. In fact, if the two behaviours are designed to behave optimally with the CPU pipeline (which compression and checksumming algorithms each certainly are) then mixing the two simultaneously would very likely be slower than running each independently.

Looking at the ZStd code, it looks like it does the sensible thing and executes the checksum independently. It appears to checksum the input stream rather than the output, though, which is odd given that the latter should be smaller (and, modulo any bugs in the compressor, should be just as good).

The only possible advantage ZStd could probably have over us would be to perform the checksum incrementally on, say, pages of data it is also compressing so that it is guaranteed to be in L1, and to guarantee no TLB misses. However, it doesn't *seem* to do this - it seems to assume you provide the data in reasonable chunks. Anyway, there should be no TLB misses on the size of data we're operating over when visiting it twice, and the data should be in L3 at worst, and prefetched to L2/L1.

We could also probably do this ourselves, by providing only page-sized frames to compress and performing the checksum incrementally, though this would mean tighter integration with the C API, and is unlikely to be worth the effort.

I have, though, made some assumptions about the ZStd code on reading it, as I didn't make time to fully read the codebase.

was (Author: benedict):
Going over the data twice is unlikely to incur a much greater penalty than going over it once and doing both things. In fact, if the two behaviours are designed to behave optimally with the CPU pipeline (which compression and checksumming algorithms each certainly are) then mixing the two simultaneously would very likely be slower than running each independently.

Looking at the ZStd code, it looks like it does the sensible thing and executes the checksum independently. It appears to checksum the input stream rather than the output, though, which is odd given that the latter should be smaller (and, modulo any bugs in the compressor, should be just as good).

The only possible advantage ZStd could probably have over us would be to perform the checksum incrementally on, say, pages of data it is also compressing so that it is guaranteed to be in L1, and to guarantee no TLB misses. However, it doesn't *seem* to do this - it seems to assume you provide the data in reasonable chunks. Anyway, there should be no TLB misses on the size of data we're operating over when visiting it twice, and the data should be in L3 at worst, and prefetched to L2.

We could also probably do this ourselves, by providing only page-sized frames to compress and performing the checksum incrementally, though this would mean tighter integration with the C API, and is unlikely to be worth the effort.

I have, though, made some assumptions about the ZStd code on reading it, as I didn't make time to fully read the codebase.
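The two-pass approach discussed in the comment above (compress first, then checksum the smaller compressed output as an independent step) can be sketched roughly as follows. This is only an illustration under stated assumptions: `TwoPassChunk` and its frame layout are hypothetical, and the JDK's `Deflater`/`Inflater` and `CRC32` stand in for Zstd and its checksum so that the sketch needs nothing beyond the standard library. It is not the code from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not Cassandra or zstd-jni code.
// Deflater/Inflater are a stand-in for Zstd; CRC32 is a stand-in checksum.
final class TwoPassChunk {

    // Assumed frame layout: [compressed bytes][4-byte big-endian CRC32 of the compressed bytes].
    static byte[] encode(byte[] input) {
        // Pass 1: compress the whole chunk.
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        byte[] compressed = out.toByteArray();

        // Pass 2: checksum the compressed output, independently of pass 1.
        CRC32 crc = new CRC32();
        crc.update(compressed, 0, compressed.length);

        return ByteBuffer.allocate(compressed.length + 4)
                         .put(compressed)
                         .putInt((int) crc.getValue())
                         .array();
    }

    static byte[] decode(byte[] frame) {
        if (frame.length < 4) {
            throw new IllegalArgumentException("frame too short");
        }
        int bodyLength = frame.length - 4;

        // Verify the checksum before touching the decompressor.
        CRC32 crc = new CRC32();
        crc.update(frame, 0, bodyLength);
        int stored = ByteBuffer.wrap(frame, bodyLength, 4).getInt();
        if ((int) crc.getValue() != stored) {
            throw new IllegalStateException("checksum mismatch");
        }

        Inflater inflater = new Inflater();
        inflater.setInput(frame, 0, bodyLength);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[16 * 1024]; // highly compressible sample input
        byte[] frame = encode(data);
        System.out.println("input=" + data.length + " bytes, frame=" + frame.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(data, decode(frame)));
    }
}
```

Checksumming the output rather than the input means the second pass touches fewer bytes, which is the point made in the comment; the trade-off is that it cannot catch a bug in the compressor itself.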
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773915#comment-16773915 ]

Benedict edited comment on CASSANDRA-14482 at 2/21/19 10:13 AM:

Going over the data twice is unlikely to incur a much greater penalty than going over it once and doing both things. In fact, if the two behaviours are designed to behave optimally with the CPU pipeline (which compression and checksumming algorithms each certainly are) then mixing the two simultaneously would very likely be slower than running each independently.

Looking at the ZStd code, it looks like it does the sensible thing and executes the checksum independently. It appears to checksum the input stream rather than the output, though, which is odd given that the latter should be smaller (and, modulo any bugs in the compressor, should be just as good).

The only possible advantage ZStd could probably have over us would be to perform the checksum incrementally on, say, pages of data it is also compressing so that it is guaranteed to be in L1, and to guarantee no TLB misses. However, it doesn't *seem* to do this - it seems to assume you provide the data in reasonable chunks. Anyway, there should be no TLB misses on the size of data we're operating over when visiting it twice, and the data should be in L3 at worst, and prefetched to L2.

We could also probably do this ourselves, by providing only page-sized frames to compress and performing the checksum incrementally, though this would mean tighter integration with the C API, and is unlikely to be worth the effort.

I have, though, made some assumptions about the ZStd code on reading it, as I didn't make time to fully read the codebase.

was (Author: benedict):
Going over the data twice is unlikely to incur a much greater penalty than going over it once and doing both things. In fact, if the two behaviours are designed to behave optimally with the CPU pipeline (which compression and checksumming algorithms each certainly are) then mixing the two simultaneously would almost certainly be slower than running each independently.

Looking at the ZStd code, it looks like it does the sensible thing and executes the checksum independently. It appears to checksum the input stream rather than the output, though, which is odd given that the latter should be smaller (and, modulo any bugs in the compressor, should be just as good).

The only possible advantage ZStd could probably have over us would be to perform the checksum incrementally on, say, pages of data it is also compressing so that it is guaranteed to be in L1, and to guarantee no TLB misses. However, it doesn't *seem* to do this - it seems to assume you provide the data in reasonable chunks. Anyway, there should be no TLB misses on the size of data we're operating over when visiting it twice, and the data should be in L3 at worst, and prefetched to L2.

We could also probably do this ourselves, by providing only page-sized frames to compress and performing the checksum incrementally, though this would mean tighter integration with the C API, and is unlikely to be worth the effort.

I have, though, made some assumptions about the ZStd code on reading it, as I didn't make time to fully read the codebase.
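The "page-sized frames with an incremental checksum" idea mentioned in the comment above can be illustrated with a small sketch. It assumes a 4096-byte page and uses the JDK's `CRC32` as a stand-in checksum; `PagedChecksum` is a hypothetical name, and the compressor call that a real integration would make on each page is only indicated by a comment.

```java
import java.util.zip.CRC32;

// Hypothetical illustration; not part of Cassandra, Zstd, or zstd-jni.
final class PagedChecksum {
    static final int PAGE_SIZE = 4096; // assumed page size

    // Feeds the checksum one page at a time, so each page is checksummed
    // while it is still hot in cache from being handed to the compressor.
    static long checksumByPage(byte[] data) {
        CRC32 crc = new CRC32();
        for (int offset = 0; offset < data.length; offset += PAGE_SIZE) {
            int length = Math.min(PAGE_SIZE, data.length - offset);
            // A real integration would pass this same slice to the compressor here.
            crc.update(data, offset, length);
        }
        return crc.getValue();
    }

    // Single pass over the whole buffer, for comparison.
    static long checksumWhole(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        System.out.println("same checksum: " + (checksumByPage(data) == checksumWhole(data)));
    }
}
```

Both methods produce the same value because CRC32 is defined over the byte sequence regardless of how the updates are sliced; the paged form only changes when each byte is visited, not the result.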
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773821#comment-16773821 ]

Dinesh Joshi edited comment on CASSANDRA-14482 at 2/21/19 8:40 AM:
---

[~bdeggleston] thanks for the review. The reason I used the streams was that {{Zstd}} does not enable setting the checksumming flag via the {{ZStd::compress}} JNI static helper. I confirmed this with the JNI author and it goes deeper than just the JNI bindings. However, using the compression stream causes GC.

So here are the options we have right now -
# Move forward with checksumming & accept the GC overhead
# Move forward withOUT checksumming
# Allow user to turn on/off checksumming using a compression preference parameter (turning on will incur GC, turning off won't)
# Add our own checksumming (ugly, burns additional CPU and still generates some garbage)
# Work with Zstd & Zstd JNI to enable passing in flags such as checksumming flag

I personally think in the near term we should pick one of options 1-3 and move forward, and open a follow-on ticket to address the GC issue. I am opposed to doing our own checksumming, especially because Zstd already supports it and it is just a matter of plumbing and adding the appropriate APIs to make it happen in a performant manner for JNI.

If anybody has any other ideas, I am all ears. [~aweisberg] [~iamaleksey] [~benedict] [~jjirsa] please feel free to chime in. I am already discussing this issue in the Zstd community and have a working prototype of what we need but I think it is incomplete. I have reached out to [~dikanggu] to help surface it with the Zstd team as well.

was (Author: djoshi3):
[~bdeggleston] thanks for the review. The reason I used the streams was that {{Zstd}} does not enable setting the checksumming flag via the {{ZStd::compress}} JNI static helper. I confirmed this with the JNI author and it goes deeper than just the JNI bindings. However, using the compression stream causes GC.

So here are the options we have right now -
# Move forward with checksumming & accept the GC overhead
# Move forward withOUT checksumming
# Allow user to turn on/off checksumming using a compression preference parameter (turning on will incur GC, turning off won't)
# Add our own checksumming (ugly, burns additional CPU but still generates some garbage)
# Work with Zstd & Zstd JNI to enable passing in flags such as checksumming flag

I personally think in the near term we should pick one of options 1-3 and move forward, and open a follow-on ticket to address the GC issue. I am opposed to doing our own checksumming, especially because Zstd already supports it and it is just a matter of plumbing and adding the appropriate APIs to make it happen in a performant manner for JNI.

If anybody has any other ideas, I am all ears. [~aweisberg] [~iamaleksey] [~benedict] [~jjirsa] please feel free to chime in. I am already discussing this issue in the Zstd community and have a working prototype of what we need but I think it is incomplete. I have reached out to [~dikanggu] to help surface it with the Zstd team as well.
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773821#comment-16773821 ]

Dinesh Joshi edited comment on CASSANDRA-14482 at 2/21/19 8:40 AM:
---

[~bdeggleston] thanks for the review. The reason I used the streams was that {{Zstd}} does not enable setting the checksumming flag via the {{ZStd::compress}} JNI static helper. I confirmed this with the JNI author and it goes deeper than just the JNI bindings. However, using the compression stream causes GC.

So here are the options we have right now -
# Move forward with checksumming & accept the GC overhead
# Move forward withOUT checksumming
# Allow user to turn on/off checksumming using a compression preference parameter (turning on will incur GC, turning off won't)
# Add our own checksumming (ugly, burns additional CPU and still generates some garbage)
# Work with Zstd & Zstd JNI to enable passing in flags such as checksumming flag (no GC overhead)

I personally think in the near term we should pick one of options 1-3 and move forward, and open a follow-on ticket to address the GC issue. I am opposed to doing our own checksumming, especially because Zstd already supports it and it is just a matter of plumbing and adding the appropriate APIs to make it happen in a performant manner for JNI.

If anybody has any other ideas, I am all ears. [~aweisberg] [~iamaleksey] [~benedict] [~jjirsa] please feel free to chime in. I am already discussing this issue in the Zstd community and have a working prototype of what we need but I think it is incomplete. I have reached out to [~dikanggu] to help surface it with the Zstd team as well.

was (Author: djoshi3):
[~bdeggleston] thanks for the review. The reason I used the streams was that {{Zstd}} does not enable setting the checksumming flag via the {{ZStd::compress}} JNI static helper. I confirmed this with the JNI author and it goes deeper than just the JNI bindings. However, using the compression stream causes GC.

So here are the options we have right now -
# Move forward with checksumming & accept the GC overhead
# Move forward withOUT checksumming
# Allow user to turn on/off checksumming using a compression preference parameter (turning on will incur GC, turning off won't)
# Add our own checksumming (ugly, burns additional CPU and still generates some garbage)
# Work with Zstd & Zstd JNI to enable passing in flags such as checksumming flag

I personally think in the near term we should pick one of options 1-3 and move forward, and open a follow-on ticket to address the GC issue. I am opposed to doing our own checksumming, especially because Zstd already supports it and it is just a matter of plumbing and adding the appropriate APIs to make it happen in a performant manner for JNI.

If anybody has any other ideas, I am all ears. [~aweisberg] [~iamaleksey] [~benedict] [~jjirsa] please feel free to chime in. I am already discussing this issue in the Zstd community and have a working prototype of what we need but I think it is incomplete. I have reached out to [~dikanggu] to help surface it with the Zstd team as well.
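Option 3 in the list above (a user-facing switch for checksumming) might look something like the following sketch. Everything named here is an assumption for illustration: the `ChecksumPreference` class, the `enable_checksum` option key, and the default of `false` do not come from the patch, and the sketch assumes compressor options arrive as string key/value pairs.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical option holder; the class and option names are illustrative only.
final class ChecksumPreference {
    static final String ENABLE_CHECKSUM = "enable_checksum"; // hypothetical option key
    static final boolean DEFAULT_ENABLE_CHECKSUM = false;    // assumed default: avoid the GC overhead

    final boolean checksumEnabled;

    private ChecksumPreference(boolean checksumEnabled) {
        this.checksumEnabled = checksumEnabled;
    }

    // Parses the user's preference, rejecting anything other than true/false
    // so that a typo does not silently disable checksumming.
    static ChecksumPreference fromOptions(Map<String, String> options) {
        String raw = options.get(ENABLE_CHECKSUM);
        if (raw == null) {
            return new ChecksumPreference(DEFAULT_ENABLE_CHECKSUM);
        }
        if (raw.equalsIgnoreCase("true")) {
            return new ChecksumPreference(true);
        }
        if (raw.equalsIgnoreCase("false")) {
            return new ChecksumPreference(false);
        }
        throw new IllegalArgumentException(
            "Invalid value for " + ENABLE_CHECKSUM + ": " + raw + " (expected true or false)");
    }

    // Describes which path the compressor would take under the trade-off in option 3.
    String compressionPath() {
        return checksumEnabled
             ? "streaming API: checksummed, incurs GC"
             : "static helper: no checksum, no GC";
    }

    public static void main(String[] args) {
        ChecksumPreference on = fromOptions(Collections.singletonMap(ENABLE_CHECKSUM, "true"));
        ChecksumPreference unset = fromOptions(Collections.<String, String>emptyMap());
        System.out.println(on.compressionPath());
        System.out.println(unset.compressionPath());
    }
}
```

Failing fast on an unrecognised value is a design choice of the sketch, not something stated in the thread; it keeps a misspelled setting from quietly changing the durability guarantees.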
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773821#comment-16773821 ]

Dinesh Joshi edited comment on CASSANDRA-14482 at 2/21/19 8:38 AM:
---

[~bdeggleston] thanks for the review. The reason I used the streams was that {{Zstd}} does not enable setting the checksumming flag via the {{ZStd::compress}} JNI static helper. I confirmed this with the JNI author and it goes deeper than just the JNI bindings. However, using the compression stream causes GC.

So here are the options we have right now -
# Move forward with checksumming & accept the GC overhead
# Move forward withOUT checksumming
# Allow user to turn on/off checksumming using a compression preference parameter (turning on will incur GC, turning off won't)
# Add our own checksumming (ugly, burns additional CPU but still generates some garbage)
# Work with Zstd & Zstd JNI to enable passing in flags such as checksumming flag

I personally think in the near term we should pick one of options 1-3 and move forward, and open a follow-on ticket to address the GC issue. I am opposed to doing our own checksumming, especially because Zstd already supports it and it is just a matter of plumbing and adding the appropriate APIs to make it happen in a performant manner for JNI.

If anybody has any other ideas, I am all ears. [~aweisberg] [~iamaleksey] [~benedict] [~jjirsa] please feel free to chime in. I am already discussing this issue in the Zstd community and have a working prototype of what we need but I think it is incomplete. I have reached out to [~dikanggu] to help surface it with the Zstd team as well.

was (Author: djoshi3):
[~bdeggleston] thanks for the review. The reason I used the streams was that {{Zstd}} does not enable setting the checksumming flag via the `ZStd::compress` JNI static helper. I confirmed this with the JNI author and it goes deeper than just the JNI bindings. However, using the compression stream causes GC.

So here are the options we have right now -
# Move forward with checksumming & accept the GC overhead
# Move forward withOUT checksumming
# Allow user to turn on/off checksumming using a compression preference parameter (turning on will incur GC, turning off won't)
# Add our own checksumming (ugly, burns additional CPU but still generates some garbage)
# Work with Zstd & Zstd JNI to enable passing in flags such as checksumming flag

I personally think in the near term we should pick one of options 1-3 and move forward, and open a follow-on ticket to address the GC issue. I am opposed to doing our own checksumming, especially because Zstd already supports it and it is just a matter of plumbing and adding the appropriate APIs to make it happen in a performant manner for JNI.

If anybody has any other ideas, I am all ears. [~aweisberg] [~iamaleksey] [~benedict] [~jjirsa] please feel free to chime in. I am already discussing this issue in the Zstd community and have a working prototype of what we need but I think it is incomplete. I have reached out to [~dikanggu] to help surface it with the Zstd team as well.
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713792#comment-16713792 ]

Jeff Jirsa edited comment on CASSANDRA-14482 at 12/8/18 10:50 PM:
---

I'm not [~djoshi3], but a few quick notes (he's still the reviewer, just adding my unsolicited 2 cents)
- Please include the license for the library in {{lib/}}
- We're in a freeze, but I can't imagine this would break anyone's testing (modulo something like [~jrwest]'s property-based/quicktheories tests, if he's got one that explores the space of compression algorithms/options/etc), and a major version is a great time for a change like this (and I think we need to be doing more of this; updating to modern algorithms is important for the project). It may be worth floating an email to the dev list to see if anyone objects to including it.

was (Author: jjirsa):
I'm not [~djoshi3], but a few quick notes (he's still the reviewer, just adding my unsolicited 2 cents)
- Please include the license for the library in {{lib/}}
- We're in a freeze, but I can't imagine this would break anyone's testing, and a major version is a great time for a change like this (and I think we need to be doing more of this; updating to modern algorithms is important for the project). It may be worth floating an email to the dev list to see if anyone objects to including it.

> ZSTD Compressor support in Cassandra
>
> Key: CASSANDRA-14482
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14482
> Project: Cassandra
> Issue Type: New Feature
> Components: Compression, Libraries
> Reporter: Sushma A Devendrappa
> Assignee: Sushma A Devendrappa
> Priority: Major
> Labels: performance, pull-request-available
> Fix For: 4.x
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff.
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/
[jira] [Comment Edited] (CASSANDRA-14482) ZSTD Compressor support in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712363#comment-16712363 ]

Dinesh Joshi edited comment on CASSANDRA-14482 at 12/7/18 5:43 AM:
---

[~sushm...@gmail.com] I can help review this.

was (Author: djoshi3):
[~sushm...@gmail.com] I can help review this. Please go ahead and create a GH PR.