[jira] [Commented] (CASSANDRA-47) SSTable compression

Pavel Yaskevich (JIRA) Tue, 19 Jul 2011 09:51:27 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067831#comment-13067831
 ]


Pavel Yaskevich commented on CASSANDRA-47:
------------------------------------------

bq. Why ? Is there something in compressed files that requires to have an Input 
and a Output class ? Can't we just have CDF seek() method throw an exception if 
the CDF has been opened in "rw" mode ? And if we don't split it (which again, I 
don't see why we would have to, but maybe I'm missing something), I'm pretty 
sure there is very little parts that will require refactoring (skip cache isn't 
one of them, CDF will just set skipCache to false; even though I don't see why 
skip cache would be a problem with compression).

The Input/Output class split was mentioned previously in CASSANDRA-1470. I'm
-1 on having the seek() method throw an exception when the CDF has been opened
in "rw" mode, because that is not a clean interface; I would rather have
separate classes, which is a more reasonable and cleaner design. In any case,
even right now the common ancestor of both is RandomAccessFile (or even
FileDataInput). So I'm -1 on merging CDF and BRAF before BRAF has been
refactored.
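
To make the interface argument concrete, here is a rough sketch contrasting the two shapes. All class and method names (CompressedInput, CompressedOutput, SingleClassFile) are hypothetical and not taken from the patch, and the bodies only track a position instead of doing real I/O.

```java
// Hypothetical names throughout; the bodies track positions only and do no real I/O.

/** Read side: seeking is part of the contract. */
final class CompressedInput {
    private long position;

    void seek(long newPosition) {
        if (newPosition < 0) {
            throw new IllegalArgumentException("negative position: " + newPosition);
        }
        position = newPosition;
    }

    long position() {
        return position;
    }
}

/** Write side: append-only, so there is no seek() to misuse. */
final class CompressedOutput {
    private long length;

    void append(byte[] bytes) {
        length += bytes.length;
    }

    long length() {
        return length;
    }
}

/** Single-class alternative: seek() is always visible but fails at run time in "rw" mode. */
final class SingleClassFile {
    private final boolean writable;
    private long position;

    SingleClassFile(String mode) {
        this.writable = mode.contains("w");
    }

    void seek(long newPosition) {
        if (writable) {
            throw new UnsupportedOperationException("seek() is not supported in rw mode");
        }
        position = newPosition;
    }

    long position() {
        return position;
    }
}
```

With the split, calling seek() on the write side is a compile-time error; with the single class the same mistake only surfaces at run time.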

bq. In any case, having compression optional is a requirement and in my book, 
the more important one. To be clear, I'm -1 on committing anything where 
compression is not optional (we cannot ask people to trust compression on day 
1, and I strongly think that the "let's commit and fix after" is the wrong way 
to go). So we at least need CDF and BRAF to have some common ancestor for that.

To be clear, I'm not proposing "let's commit and fix after"; compression can
be made optional easily with the current state of the patch, and I'm making it
my top priority.

bq. I would prefer putting this index and the header into a separate component 
(a -Compression component ?).

Thinking about that further: I'm a bit concerned about adding one more file to
handle a single SSTable. The main goal of my design here was to make CDF
independent of the other components of the system, to avoid any additional
complexity. So maybe it's better to stream the file offsets to a temporary
file while the SSTable is being written, and after that store the index
section at the end of the file (as an alternative to keeping that index in
memory)?
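
The idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ChunkOffsetSpool, its methods, and the trailer layout (offsets, then an int count, then a long pointing at the index start) are all hypothetical and not what the patch does.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper: spools offsets to a temp file, then appends them to the data file. */
class ChunkOffsetSpool {
    // Assumed trailer written after the offsets: int count + long indexStart.
    private static final int TRAILER_BYTES = 4 + 8;

    private final File spoolFile;
    private final DataOutputStream spool;
    private int count;

    ChunkOffsetSpool() throws IOException {
        spoolFile = File.createTempFile("chunk-offsets", ".tmp");
        spool = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spoolFile)));
    }

    /** Records one offset on disk instead of holding it in memory. */
    void add(long chunkOffset) throws IOException {
        spool.writeLong(chunkOffset);
        count++;
    }

    /** Copies the spooled offsets to the end of the data file, followed by the trailer. */
    void appendTo(RandomAccessFile data) throws IOException {
        spool.close();
        long indexStart = data.length();
        data.seek(indexStart);
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(spoolFile)))) {
            for (int i = 0; i < count; i++) {
                data.writeLong(in.readLong());
            }
        } finally {
            spoolFile.delete();
        }
        data.writeInt(count);
        data.writeLong(indexStart);
    }

    /** Reads the index back by locating it through the trailer. */
    static long[] readIndex(RandomAccessFile data) throws IOException {
        data.seek(data.length() - TRAILER_BYTES);
        long[] offsets = new long[data.readInt()];
        data.seek(data.readLong());
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = data.readLong();
        }
        return offsets;
    }
}
```

A reader finds the index by reading the fixed-size trailer first, so no extra component file is needed; the cost is one additional sequential copy when the SSTable is finished.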

bq. Talking about the header, the control bytes detection is not correct since 
we haven't done this so far: there is no guarantee an existing data file won't 
start by the bytes 'C' then 'D' (having or not having a -Compression component 
could serve this purpose though).

We can use a magic number, the same way gzip does: 
http://en.wikipedia.org/wiki/Gzip#File_format
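
A minimal sketch of such a check, for illustration only: gzip files start with the two bytes 0x1f 0x8b, and the same approach here would be a fixed value at the start of the file. The class name MagicHeader, the method startsWithMagic, and the four-byte value below are all hypothetical and not from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class MagicHeader {
    // Hypothetical 4-byte magic value chosen for illustration; not from the patch.
    static final int MAGIC = 0xCA55C0DE;

    /** Returns true if the stream begins with MAGIC; false if it differs or is too short. */
    static boolean startsWithMagic(InputStream in) throws IOException {
        try {
            // readInt() consumes four bytes in big-endian order.
            return new DataInputStream(in).readInt() == MAGIC;
        } catch (EOFException tooShort) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] tagged = {(byte) 0xCA, 0x55, (byte) 0xC0, (byte) 0xDE, 1, 2, 3};
        byte[] plain = {'C', 'D', 0, 0};
        System.out.println(startsWithMagic(new ByteArrayInputStream(tagged)));
        System.out.println(startsWithMagic(new ByteArrayInputStream(plain)));
    }
}
```

A longer, non-textual magic value makes an accidental match with an existing uncompressed data file much less likely than the bytes 'C' 'D', although it does not rule one out entirely.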

> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>              Labels: compression
>             Fix For: 1.0
>
>         Attachments: CASSANDRA-47-v2.patch, CASSANDRA-47.patch, 
> snappy-java-1.0.3-rc4.jar
>
>
> We should be able to do SSTable compression which would trade CPU for I/O 
> (almost always a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
