[ https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071499#comment-13071499 ]

Benjamin Coverston commented on CASSANDRA-1608:
-----------------------------------------------

bq. Not a deal breaker for me – it's not hard to get old-style compactions to 
back up under sustained writes, either. Given a choice between "block writes 
until compactions catch up" or "let them back up and let the operator deal with 
it how he will," I'll take the latter.

Exposing the number of SSTables in L0 as a JMX property probably isn't a bad idea.
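
Something along these lines is what I have in mind -- just a sketch, and the 
MBean name and wiring below are made up for illustration, not what the patch 
actually registers:

{code}
// Hypothetical JMX hook for watching the L0 backlog; the two types would live
// in separate files, they are shown together here for brevity.
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.ObjectName;

public interface LeveledCompactionMonitorMBean
{
    /** Number of SSTables currently sitting in L0, i.e. flushed but not yet leveled. */
    int getLevel0SSTableCount();
}

public class LeveledCompactionMonitor implements LeveledCompactionMonitorMBean
{
    private final AtomicInteger level0Count = new AtomicInteger();

    public LeveledCompactionMonitor() throws Exception
    {
        // Illustrative object name; operators could watch or alert on it from jconsole.
        ManagementFactory.getPlatformMBeanServer()
                         .registerMBean(this, new ObjectName("org.apache.cassandra.db:type=LeveledCompaction"));
    }

    public int getLevel0SSTableCount()
    {
        return level0Count.get();
    }

    // The compaction strategy would call these as sstables are flushed into L0
    // and promoted out of it.
    public void sstableAddedToL0()     { level0Count.incrementAndGet(); }
    public void sstableRemovedFromL0() { level0Count.decrementAndGet(); }
}
{code}

An operator could then alert when the count stays high under sustained writes 
instead of having writes blocked.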

bq. Is it even worth keeping bloom filters around with such a drastic reduction 
in worst-case number of sstables to check (for read path too)?

I think they are absolutely worth keeping around for unleveled sstables, but 
for leveled sstables the value is certainly questionable. Perhaps some kind of 
LRU cache with an upper bound on the number of bloom filters we keep in memory 
would be wise. Is it possible that we could move these off-heap?
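
To make the LRU idea concrete, a minimal sketch -- the types, cap, and reload 
path here are placeholders, not anything from the patch:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Bounds the number of bloom filters resident on-heap; least-recently-used
// filters are dropped and would be re-read from the sstable's filter component
// (or simply skipped, at the cost of an extra seek) on the next miss.
public class BloomFilterCache<K, F>
{
    private final Map<K, F> cache;

    public BloomFilterCache(final int maxResidentFilters)
    {
        // accessOrder=true makes LinkedHashMap track LRU order for us.
        this.cache = new LinkedHashMap<K, F>(16, 0.75f, true)
        {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, F> eldest)
            {
                return size() > maxResidentFilters;
            }
        };
    }

    public synchronized F get(K sstable)
    {
        return cache.get(sstable); // null means the caller reloads or goes without
    }

    public synchronized void put(K sstable, F filter)
    {
        cache.put(sstable, filter);
    }
}
{code}

Moving the filters themselves off-heap would be a separate, bigger change.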

bq. I'd like to have a better understanding of what the tradeoff is between 
making these settings larger/smaller. Can we make these one-size-fits-all?

There are pros and cons here. The biggest con: with a 64MB flushed sstable and 
a 25MB leveled size, leveling that file into L1 requires compacting 
approximately 314MB of data (25MB * 10 + 64MB). With a 50MB leveled size the 
math is the same, but we end up compacting 564MB of data. Taking into account 
level-based scoring (used to choose the next compaction candidates), these 
settings become somewhat dynamic, and the interplay between flush size and 
sstable size is anything but subtle. A small leveled size combined with a 
large flushing memtable means that each time you merge a flushed SSTable into 
L1 you can trigger many cycles of cascading compactions into L2, and 
potentially into L3 and higher, until the scores for L1, L2, and L3 normalize 
into a range that again triggers compactions from L0 to L1.
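
To put rough numbers on that (this is just back-of-the-envelope arithmetic on 
the figures above, assuming the ~10 overlapping L1 sstables per merge):

{code}
// Worst-case data rewritten to merge one freshly flushed sstable into L1,
// using the 10x overlap from the numbers above. Throwaway illustration only.
public class LevelingCost
{
    static long bytesCompactedMb(long flushedMb, long leveledSizeMb)
    {
        return leveledSizeMb * 10 + flushedMb; // ~10 overlapping L1 files + the new file
    }

    public static void main(String[] args)
    {
        long flushedMb = 64;
        for (long leveledSizeMb : new long[]{ 5, 10, 25, 50 })
            System.out.printf("leveled size %dMB -> ~%dMB compacted per L0->L1 merge%n",
                              leveledSizeMb, bytesCompactedMb(flushedMb, leveledSizeMb));
        // Prints ~114MB, ~164MB, ~314MB and ~564MB respectively -- which is why a
        // small leveled size keeps individual compactions short.
    }
}
{code}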

I wanted to keep the time for each compaction under 10 seconds, so I chose an 
sstable size in the range of 5-10 MB, and that proved effective.

I like the idea of having a one-size-fits-all setting for this, but whatever I 
choose, I think compaction is going to force me to revisit it. Right now this 
setting is part of the schema, and it's a nested schema setting at that. I'm 
leaning toward an "undocumented setting" with a reasonable default for now.

> Redesigned Compaction
> ---------------------
>
>                 Key: CASSANDRA-1608
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Benjamin Coverston
>         Attachments: 1608-v2.txt, 1608-v8.txt, 1609-v10.txt
>
>
> After seeing the I/O issues in CASSANDRA-1470, I've been doing some more 
> thinking on this subject that I wanted to lay out.
> I propose we redo the concept of how compaction works in Cassandra. At the 
> moment, compaction is kicked off based on a write access pattern, not read 
> access pattern. In most cases, you want the opposite. You want to be able to 
> track how well each SSTable is performing in the system. If we were to keep 
> in-memory statistics for each SSTable and prioritize them by access frequency 
> and bloom filter hit/miss ratios, we could intelligently group the sstables 
> that are read most often and schedule them for compaction. We could also 
> schedule lower-priority maintenance on SSTables that are rarely accessed.
> I also propose we limit each SSTable to a fixed size, which gives us the 
> ability to utilize our bloom filters in a predictable manner. At the moment, 
> beyond a certain size the bloom filters become less reliable. This would also 
> allow us to group the most-accessed data. Currently an SSTable can grow to a 
> point where large portions of its data might not actually be accessed often.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

