[ https://issues.apache.org/jira/browse/CASSANDRA-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773915#comment-16773915 ]

Benedict edited comment on CASSANDRA-14482 at 2/21/19 10:13 AM:
----------------------------------------------------------------

Going over the data twice is unlikely to incur a much greater penalty than going 
over it once and doing both things.  In fact, if the two operations are each 
tuned to use the CPU pipeline optimally (which compression and checksumming 
algorithms certainly are), then interleaving them would very likely be slower 
than running each independently.

Looking at the ZStd code, it appears to do the sensible thing and execute the 
checksum independently.  It checksums the input stream rather than the output, 
though, which is odd given that the latter should be smaller (and, modulo any 
bugs in the compressor, just as good).
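To make the input-vs-output point concrete, here is a minimal sketch (not Cassandra's or zstd's actual code; java.util.zip's Deflater and CRC32 stand in for zstd's compressor and XXH64): checksumming the compressed output visits fewer bytes than checksumming the input, while still detecting corruption of the stored frame.

```java
import java.util.zip.CRC32;
import java.util.zip.Deflater;

public class ChecksumOutput {

    // Compress input into out and return the compressed length.
    // Deflater is an illustrative stand-in for the zstd compressor.
    static int compress(byte[] input, byte[] out) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        int len = deflater.deflate(out);
        deflater.end();
        return len;
    }

    // Checksum only the first len bytes, i.e. the (smaller) compressed output.
    static long checksum(byte[] data, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, len);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] input = new byte[64 * 1024];
        for (int i = 0; i < input.length; i++)
            input[i] = (byte) (i % 7); // highly compressible pattern

        byte[] out = new byte[input.length];
        int len = compress(input, out);

        // The output is smaller than the input, so the checksum pass
        // touches fewer bytes than checksumming the input stream would.
        System.out.println("compressed " + input.length + " -> " + len
                           + " bytes, crc=" + checksum(out, len));
    }
}
```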

The only advantage ZStd could plausibly have over us would be to perform the 
checksum incrementally on, say, pages of data it is also compressing, so that 
the data is guaranteed to be in L1 with no TLB misses.  However, it doesn't 
*seem* to do this - it appears to assume you provide the data in reasonably 
sized chunks.  In any case, there should be no TLB misses at the sizes we 
operate over when visiting the data twice, and the data should be in L3 at 
worst, prefetched into L2.  We could probably do this ourselves by providing 
only page-sized frames to compress and performing the checksum incrementally, 
though this would mean tighter integration with the C API, and is unlikely to 
be worth the effort.
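The incremental scheme described above can be sketched as follows (hypothetical, not Cassandra's code; CRC32 stands in for the real checksum, and the per-page loop is where each chunk would also be handed to the compressor while still hot in L1). The key property is that per-page updates yield the same checksum as a single pass over the whole buffer.

```java
import java.util.zip.CRC32;

public class IncrementalChecksum {
    static final int PAGE = 4096; // assumed page-sized frame

    // Update the checksum one page at a time; each page would also be
    // passed to the compressor at this point, while it is cache-resident.
    static long incremental(byte[] data) {
        CRC32 crc = new CRC32();
        for (int off = 0; off < data.length; off += PAGE) {
            int len = Math.min(PAGE, data.length - off);
            crc.update(data, off, len);
        }
        return crc.getValue();
    }

    // The conventional second pass over the full buffer, for comparison.
    static long oneShot(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = new byte[3 * PAGE + 123];
        for (int i = 0; i < data.length; i++)
            data[i] = (byte) i;

        // Incremental per-page updates equal the one-pass checksum.
        System.out.println(incremental(data) == oneShot(data));
    }
}
```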

I have, though, made some assumptions in reading the ZStd code, as I didn't 
make time to go through the codebase fully.



> ZSTD Compressor support in Cassandra
> ------------------------------------
>
>                 Key: CASSANDRA-14482
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14482
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Dependencies, Feature/Compression
>            Reporter: Sushma A Devendrappa
>            Assignee: Dinesh Joshi
>            Priority: Major
>              Labels: performance, pull-request-available
>             Fix For: 4.x
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> ZStandard offers a great tradeoff between speed and compression ratio. 
> It is an open-source compression library from Facebook.
> More about ZSTD:
> [https://github.com/facebook/zstd]
> https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
