[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2016-07-21 Thread Wei Deng (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Deng updated CASSANDRA-1608:

Labels: lcs  (was: )

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
>  Labels: lcs
> Fix For: 1.0.0
>
> Attachments: 1608-22082011.txt, 1608-v2.txt, 1608-v4.txt, 1608-v5.txt
>
>
> After seeing the I/O issues in CASSANDRA-1470, I've been doing some more 
> thinking on this subject that I wanted to lay out.
> I propose we redo the concept of how compaction works in Cassandra. At the 
> moment, compaction is kicked off based on a write access pattern, not a read 
> access pattern. In most cases, you want the opposite. You want to be able to 
> track how well each SSTable is performing in the system. If we were to keep 
> in-memory statistics for each SSTable and prioritize SSTables by how often 
> they are accessed and by their bloom filter hit/miss ratios, we could 
> intelligently group the SSTables that are read most often and schedule them 
> for compaction. We could also schedule lower-priority maintenance on SSTables 
> that are not often accessed.
> I also propose we limit each SSTable to a fixed size, which would let us 
> use our bloom filters in a more predictable manner. At the moment, after a 
> certain size, the bloom filters become less reliable. This would also allow 
> us to group the most-accessed data. Currently an SSTable can grow to a point 
> where large portions of its data might not actually be accessed very often.
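
The read-driven prioritization proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and field names (ReadDrivenCompactionSketch, SSTableStats, reads, bloomFalsePositives) and the scoring rule are hypothetical and are not taken from the Cassandra code base or from any of the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names exist in Cassandra.
final class ReadDrivenCompactionSketch {

    /** In-memory read statistics for one SSTable (illustrative fields only). */
    static final class SSTableStats {
        final String name;
        final long reads;               // reads that consulted this SSTable
        final long bloomFalsePositives; // filter said "maybe" but the key was absent

        SSTableStats(String name, long reads, long bloomFalsePositives) {
            this.name = name;
            this.reads = reads;
            this.bloomFalsePositives = bloomFalsePositives;
        }

        /** Fraction of reads that were wasted on a bloom filter false positive. */
        double falsePositiveRatio() {
            return reads == 0 ? 0.0 : (double) bloomFalsePositives / reads;
        }

        /** Assumed scoring rule: read volume, weighted up by the wasted-read ratio. */
        double score() {
            return reads * (1.0 + falsePositiveRatio());
        }
    }

    /** Returns a copy of the input ordered from highest to lowest score. */
    static List<SSTableStats> rank(List<SSTableStats> stats) {
        List<SSTableStats> ranked = new ArrayList<>(stats);
        ranked.sort((a, b) -> Double.compare(b.score(), a.score()));
        return ranked;
    }

    public static void main(String[] args) {
        List<SSTableStats> ranked = rank(Arrays.asList(
                new SSTableStats("cold", 10, 0),
                new SSTableStats("hot", 1000, 100),
                new SSTableStats("warm", 500, 0)));
        for (SSTableStats s : ranked) {
            System.out.println(s.name + " score=" + s.score());
        }
    }
}
```

SSTables at the head of the ranked list would be grouped and compacted first, while the tail would be left for lower-priority maintenance.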



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-25 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v5.txt

Fixed up the patch according to the comments given. Took a stab at culling some 
of the SSTables from the locking mechanism.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt, 1608-v4.txt, 1608-v5.txt
>
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-25 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v5.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt, 1608-v4.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-25 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v5.txt

I've made the changes requested in the last two comments. The latest 
changes/merge seem to have caused a regression when the # of SSTables increases 
beyond a few hundred. The next time I'll be able to look at this is Friday; 
I'll try to figure out what on earth is going on then.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt, 1608-v4.txt, 1608-v5.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-23 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-1608:
--

Attachment: 1608-v4.txt

v4 attached.

Manifest
===

- I noticed that Manifest.generations and lastCompactedKeys could be simplified 
to arrays if we are willing to assume that no node will have more than a PB or 
so of data in a single CF.  Which feels reasonable to me even with capacity 
expanding as fast as it is. :)

- What is the 1.25 supposed to be doing here?
{code}
// skip newlevel if the resulting sstables exceed newlevel threshold
if (maxBytesForLevel(newLevel) < SSTableReader.getTotalBytes(added)
    && SSTableReader.getTotalBytes(getLevel(newLevel + 1)) == 0 * 1.25)
{code}

- Why the "all on the same level" special case?  Is this just saying "L0 
compactions must go into L1?"
{noformat}
// the level for the added sstables is the max of the removed ones,
// plus one if the removed were all on the same level
{noformat}

- Removed this. If L0 is large, it doesn't necessarily follow that L1 is large 
too. I don't see a good reason to second-guess the scoring here.
{code}
if (candidates.size() > 32 && bestLevel == 0)
{
    candidates = getCandidatesFor(1);
}
{code}

- Redid L0 candidate selection to follow the LevelDB algorithm (pick one L0, 
add other L0s and L1s that overlap). This means that if we're doing sequential 
writes we don't do "extra" work compacting non-overlapping L0s unnecessarily. 
(A niche use to be sure, given our emphasis on RandomPartitioner, but it's not 
a lot of code.)

- L0 only gets two sstables before it's over capacity? Are we still allowing L0 
sstables to be large? If so, it's not even two.

- "Exposing number of SSTables in L0 as a JMX property probably isn't a bad 
idea."

- it's not correct for the create/load code to assume that the first data 
directory stays constant across restarts -- it should check all directories 
when loading
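
The L0 candidate selection described above (pick one L0, then add the other L0s and the L1s that overlap it) can be sketched roughly as below. This is not the code from the v4 patch: the class L0CandidateSketch, the simplified SSTable type with long-valued first and last keys, and the choice of the first L0 sstable as the seed are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of LevelDB-style L0 candidate selection.
final class L0CandidateSketch {

    /** Simplified sstable: a name plus the inclusive key range it covers. */
    static final class SSTable {
        final String name;
        final long first;
        final long last;

        SSTable(String name, long first, long last) {
            this.name = name;
            this.first = first;
            this.last = last;
        }

        /** True if this sstable's key range intersects [lo, hi]. */
        boolean overlaps(long lo, long hi) {
            return first <= hi && last >= lo;
        }
    }

    /** Seeds with the first L0 sstable, then pulls in overlapping L0s and L1s. */
    static List<SSTable> candidates(List<SSTable> l0, List<SSTable> l1) {
        List<SSTable> picked = new ArrayList<>();
        if (l0.isEmpty()) {
            return picked;
        }
        SSTable seed = l0.get(0); // assumption: any deterministic seed choice would do
        picked.add(seed);
        long lo = seed.first;
        long hi = seed.last;

        // Keep widening the range until no further L0 sstable overlaps it.
        boolean grew = true;
        while (grew) {
            grew = false;
            for (SSTable s : l0) {
                if (!picked.contains(s) && s.overlaps(lo, hi)) {
                    picked.add(s);
                    lo = Math.min(lo, s.first);
                    hi = Math.max(hi, s.last);
                    grew = true;
                }
            }
        }

        // Add every L1 sstable that overlaps the combined L0 range.
        for (SSTable s : l1) {
            if (s.overlaps(lo, hi)) {
                picked.add(s);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<SSTable> l0 = Arrays.asList(
                new SSTable("a", 0, 10), new SSTable("b", 5, 20), new SSTable("c", 100, 110));
        List<SSTable> l1 = Arrays.asList(
                new SSTable("x", 0, 4), new SSTable("y", 15, 30), new SSTable("z", 50, 60));
        for (SSTable s : candidates(l0, l1)) {
            System.out.println(s.name);
        }
    }
}
```

In the example in main, sstable "c" does not overlap the seed range and is left out, which is the sequential-write case where non-overlapping L0s are not compacted unnecessarily.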

CFS
===
- Not immediately clear to me whether the TODOs in isKeyInRemainingSSTables are 
something I should be concerned about.
- Why do we need the reference mark/unmark now but not before? Is this a bug 
fix independent of 1608?
- Are we losing a lot of cycles to markCurrentViewReferenced on the read path 
now that this is 1000s of sstables instead of 10s?

DataTracker
===
- followed todo's suggestion to move incrementallyBackup to another thread
- why do we use a LinkedList in buildIntervalTree when we know the size 
beforehand?
- suspect that it's going to be faster to use interval tree to prune the search 
space for CollationController.collectTimeOrderedData, then sort that subset by 
timestamp.  Which would simplify DataTracker by not having to keep a list of 
sstables around sorted-by-timestamp -- could get rid of that entirely in favor 
of the tree, I think.
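
The last suggestion (prune by key range first, then sort only the surviving subset by timestamp) can be sketched roughly as below. This is a simplified illustration, not the DataTracker or CollationController code: a linear range check stands in for the interval tree lookup, and the names TimeOrderedPruneSketch, SSTable, maxTimestamp and candidatesFor are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: prune by key range, then order the survivors by timestamp.
final class TimeOrderedPruneSketch {

    /** Simplified sstable: inclusive key range plus the newest timestamp it holds. */
    static final class SSTable {
        final String name;
        final long first;
        final long last;
        final long maxTimestamp;

        SSTable(String name, long first, long last, long maxTimestamp) {
            this.name = name;
            this.first = first;
            this.last = last;
            this.maxTimestamp = maxTimestamp;
        }

        /** True if the key falls inside this sstable's key range. */
        boolean mayContain(long key) {
            return first <= key && key <= last;
        }
    }

    /**
     * Returns the sstables whose range covers the key, newest first.
     * The linear scan is a stand-in for an interval tree query.
     */
    static List<SSTable> candidatesFor(long key, List<SSTable> all) {
        List<SSTable> hits = new ArrayList<>();
        for (SSTable s : all) {
            if (s.mayContain(key)) {
                hits.add(s);
            }
        }
        // Only the pruned subset is sorted, so no globally sorted list is needed.
        hits.sort((a, b) -> Long.compare(b.maxTimestamp, a.maxTimestamp));
        return hits;
    }

    public static void main(String[] args) {
        List<SSTable> all = Arrays.asList(
                new SSTable("old", 0, 100, 1),
                new SSTable("new", 50, 150, 9),
                new SSTable("other", 200, 300, 5));
        for (SSTable s : candidatesFor(60, all)) {
            System.out.println(s.name + " ts=" + s.maxTimestamp);
        }
    }
}
```

With an approach along these lines, the per-read sort cost depends on the number of sstables that cover the key rather than on the total number of sstables.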

Compaction
==
- Did this code get moved somewhere else so that a manual compaction request 
against a single sstable remains a no-op for SizeTiered?
{code}
if (toCompact.size() < 2)
{
    logger.info("Nothing to compact in " + cfs.getColumnFamilyName() + "." +
                "Use forceUserDefinedCompaction if you wish to force compaction of single sstables " +
                "(e.g. for tombstone collection)");
    return 0;
}
{code}



> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt, 1608-v4.txt
>
>

[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-22 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-22082011.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-22 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-22082011.txt

added synchronization to add

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-22082011.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-22 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v11.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-22 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v13.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-22 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-22082011.txt

Rebased and updated with some fixes. All tests should now pass.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-22082011.txt, 1608-v11.txt, 1608-v13.txt, 
> 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-09 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v13.txt

1608 without some of the cruft

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v11.txt, 1608-v13.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-08 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v8.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v11.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-08 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v11.txt

Added level skipping logic.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v11.txt, 1608-v2.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-08-08 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1609-v10.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-07-26 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1609-v10.txt

Updated so that manifests are now in the data directory. Rebased.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt, 1609-v10.txt
>
>




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-07-26 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v7.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-07-26 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v5.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-07-26 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v3.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-07-26 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 1608-v4.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-07-26 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 0001-leveldb-style-compaction.patch)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 1608-v2.txt, 1608-v8.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-29 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v8.txt

First the good:

1. Modified the code so that tombstone purging during minor compactions uses
the interval tree to prune the list of SSTables, speeding up compactions by at
least an order of magnitude where the number of SSTables in a column family
exceeds ~500.

2. Tested reads and writes. Write speeds (unsurprisingly) are not affected by 
this compaction strategy. Reads seem to keep up as well. The interval tree does 
a good job here, making sure that bloom filters are queried only for those 
SSTables that fall into the queried range.

3. Three successive runs of stress inserting 10M keys resulted in ~3GB of data 
stored with the leveldb-style strategy. By comparison, the same run using the 
tiered (default) strategy resulted in ~8GB of data.

The Meh:

Compactions do back up when setting the flush size to 64MB and the leveled 
SSTable size to anywhere between 5-10MB. On the upside, if your load has peaks 
and quieter times, this compaction strategy will trigger a periodic check to 
"catch up" if all event-scheduled compactions complete.

Interestingly, this extra IO has an upside. For datasets that frequently 
overwrite old data that has already been flushed to disk, there is the 
potential for substantial de-duplication of data. Further, during reads the 
number of rows that would need to be merged for a single row is bounded by the 
number of levels + the number of un-leveled sstables.
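
To make the read bound above concrete, here is a minimal Python sketch (not the
Cassandra code). It assumes that SSTables within a level above 0 never overlap
and that level 0 holds the un-leveled SSTables; the names `sstables_to_merge`,
`merge_bound` and `example_levels` are made up for illustration.

```python
from typing import Dict, List, Tuple

# An SSTable is modelled only by the (first_key, last_key) range it covers.
KeyRange = Tuple[str, str]


def sstables_to_merge(levels: Dict[int, List[KeyRange]], key: str) -> int:
    """Count the SSTables whose key range could contain `key`."""
    return sum(
        1
        for ranges in levels.values()
        for first, last in ranges
        if first <= key <= last
    )


def merge_bound(levels: Dict[int, List[KeyRange]]) -> int:
    """Upper bound: one SSTable per populated level, plus every un-leveled one."""
    populated_levels = sum(1 for lvl, ranges in levels.items() if lvl > 0 and ranges)
    unleveled = len(levels.get(0, []))
    return populated_levels + unleveled


# Hypothetical layout: level 0 holds un-leveled SSTables whose ranges may overlap.
example_levels = {
    0: [("a", "z"), ("c", "m")],
    1: [("a", "f"), ("g", "p"), ("q", "z")],
    2: [("a", "k"), ("l", "z")],
}
```

Because ranges inside a leveled level are disjoint, a key can match at most one
SSTable there, so the count never exceeds the bound regardless of the key.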

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt, 1608-v4.txt, 1608-v5.txt, 1608-v7.txt, 1608-v8.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-27 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v7.txt

Added an interval tree to cull sstables that are not needed for point and range 
queries.
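
As a rough illustration of what "culling" means for the two query types, the
Python sketch below shows only the overlap predicates, with a linear scan
standing in for the interval tree. `SSTableRange`, `cull_for_point` and
`cull_for_range` are hypothetical names, and tokens are simplified to integers.

```python
from typing import List, NamedTuple


class SSTableRange(NamedTuple):
    name: str
    first: int  # smallest token stored in the SSTable
    last: int   # largest token stored in the SSTable


def cull_for_point(sstables: List[SSTableRange], token: int) -> List[str]:
    """Keep only SSTables whose range contains the queried token."""
    return [s.name for s in sstables if s.first <= token <= s.last]


def cull_for_range(sstables: List[SSTableRange], lo: int, hi: int) -> List[str]:
    """Keep only SSTables whose range overlaps the closed interval [lo, hi]."""
    return [s.name for s in sstables if s.first <= hi and lo <= s.last]


example_sstables = [
    SSTableRange("s1", 0, 10),
    SSTableRange("s2", 5, 20),
    SSTableRange("s3", 30, 40),
]
```

An interval tree answers the same two questions without visiting every
SSTable, which matters once a column family holds hundreds of them.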

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt, 1608-v4.txt, 1608-v5.txt, 1608-v7.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-24 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: (was: 2608-v6.txt)

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt, 1608-v4.txt, 1608-v5.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-24 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 2608-v6.txt

Added a patch with fixed range filters.

With ~1100 sstables, average latency is substantially increased (~5-10x). I'm 
pretty sure that in order to improve on this we'll need to implement an 
interval tree to get a sub-linear search time for overlapping sstables in 
interval queries.

The problem here is that there aren't any really good RBtree or even binary 
tree implementations that I have found in the dependencies that we currently 
have, and I really don't want to muddy this ticket up with that effort.

There are some potentially useful structures in UIMA that I can use as a basis 
for an interval tree implementation, but right now I'm leaning toward doing 
this in a separate ticket.
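
For readers unfamiliar with the structure, the following is a small Python
sketch of a centered interval tree with a point ("stabbing") query. It is not
the implementation that later landed in the patch; `IntervalNode`, `stab` and
the integer tokens are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int, str]  # (first_token, last_token, sstable_name)


class IntervalNode:
    """Centered interval tree node: intervals crossing `center` are kept here."""

    def __init__(self, intervals: List[Interval]):
        # Use the median endpoint as the center so both subtrees shrink.
        points = sorted(p for lo, hi, _ in intervals for p in (lo, hi))
        self.center = points[len(points) // 2]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left: Optional[IntervalNode] = IntervalNode(left) if left else None
        self.right: Optional[IntervalNode] = IntervalNode(right) if right else None

    def stab(self, point: int) -> List[str]:
        """Return the names of all intervals that contain `point`."""
        found: List[str] = []
        if point <= self.center:
            # Intervals here end at or after center, so only the start matters.
            for lo, _, name in self.by_start:
                if lo > point:
                    break
                found.append(name)
            if point < self.center and self.left:
                found.extend(self.left.stab(point))
        else:
            # Intervals here start at or before center, so only the end matters.
            for _, hi, name in self.by_end:
                if hi < point:
                    break
                found.append(name)
            if self.right:
                found.extend(self.right.stab(point))
        return found


example_intervals = [
    (0, 10, "a"), (5, 15, "b"), (20, 30, "c"), (25, 40, "d"), (12, 22, "e"),
]
example_tree = IntervalNode(example_intervals)
```

Each query walks one root-to-leaf path and stops scanning a node's lists at the
first non-match, so the cost grows with the tree depth plus the number of hits
rather than with the total number of sstables.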

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
>Assignee: Benjamin Coverston
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt, 1608-v4.txt, 1608-v5.txt, 2608-v6.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-21 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v5.txt

Added range filters to reads.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt, 1608-v4.txt, 1608-v5.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-20 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v4.txt

Fixed exception on startup.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt, 1608-v4.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-20 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v3.txt

Attaching the latest version.

Levels are scored according to size rather than number of SSTables.
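
A rough Python sketch of the difference between the two scoring schemes is
below. It assumes a level's capacity is the target sstable size multiplied by
10 per level (the exponent mentioned for the initial patch; whether v3 keeps
it is not stated here). All function names are hypothetical.

```python
MB = 1024 * 1024


def max_bytes_for_level(level: int, sstable_size_in_mb: int, exponent: int = 10) -> int:
    """Assumed capacity of a level: target sstable size times exponent**level."""
    return sstable_size_in_mb * MB * exponent ** level


def score_by_size(sstable_bytes, level: int, sstable_size_in_mb: int) -> float:
    """Score above 1.0 means the level holds more data than its capacity."""
    return sum(sstable_bytes) / max_bytes_for_level(level, sstable_size_in_mb)


def score_by_count(sstable_bytes, level: int, exponent: int = 10) -> float:
    """Count-based alternative: ignores how large each sstable actually is."""
    return len(sstable_bytes) / exponent ** level


# Two unusually large sstables (for example, streamed in by repair) on level 1.
level_one = [40 * MB, 35 * MB]
size_score = score_by_size(level_one, level=1, sstable_size_in_mb=5)
count_score = score_by_count(level_one, level=1)
```

With a 5 MB target size, the two sstables overfill level 1 by size while a
count-based score would still consider the level nearly empty.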

Performance kind of sucked until I made a few tweaks.

You can modify the LevelDB SSTable size with the following command (example) in 
the CLI:

update column family Standard1 with compaction_strategy_options=[{ sstable_size_in_mb : 10 }];


Using a flush size of 64MB and an sstable_size_in_mb of 5 worked pretty well 
for keeping compactions moving through the levels and handling new SSTables as 
they entered the system.

I also enabled concurrent compactions which, to my surprise, helped 
considerably. In testing I also removed compaction throttling, but in the end I 
don't think it mattered too much for me.

This version also adds a manifest and recovery code to the mix. While running, 
you can cat the manifest; it's human-readable, and it's quite beautiful to see 
the levels interact with each other as SSTables are flushed and compactions 
roll through the levels.

Right now I'm getting an exception on startup from the keycache. I'm going to 
investigate that, but I think it may have to do with the fact that I am not 
initializing the compaction manager _after_ the CFS.








> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
> 1608-v3.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-16 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-1608:
--

Attachment: 1608-v2.txt

Thanks, Ben. This is promising!

I pretty much concentrated on the Manifest, which I moved to a top-level 
class.  (Can you summarize what is different in LDBCompactionTask?)

I don't think trying to build levels out of non-leveled data is useful.  Even 
if you tried all permutations, the odds of ending up with something useful are 
infinitesimally small.  I'd suggest adding a startup hook to CompactionStrategy 
instead, and if we start up w/ unleveled SSTables we level them 
before doing anything else.  (This will take a while, but not as long as 
leveling everything naively would, since we can just do a single 
compaction-of-everything, spitting out non-overlapping sstables of the desired 
size, and set those to the appropriate level.)
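
One way to picture the single compaction-of-everything is the Python sketch
below. It is an illustration rather than the proposed hook: rows are plain
(key, timestamp, value) tuples, each input is sorted with unique keys, and a
row count stands in for the desired sstable size in bytes. `level_everything`
is a made-up name.

```python
import heapq
from typing import List, Tuple

Row = Tuple[str, int, str]  # (key, timestamp, value)


def level_everything(sstables: List[List[Row]], max_rows: int) -> List[List[Row]]:
    """Merge all sorted inputs once and cut the result into disjoint chunks."""
    out: List[List[Row]] = []
    current: List[Row] = []
    # heapq.merge yields rows ordered by key, then timestamp.
    for row in heapq.merge(*sstables):
        if current and current[-1][0] == row[0]:
            # Same key seen again: the later timestamp replaces the earlier row.
            current[-1] = row
            continue
        if len(current) >= max_rows:
            # Chunk is full and a new key starts, so close this output sstable.
            out.append(current)
            current = []
        current.append(row)
    if current:
        out.append(current)
    return out


older = [("a", 1, "x"), ("c", 1, "x")]
newer = [("a", 2, "y"), ("b", 1, "z"), ("d", 3, "w")]
leveled = level_everything([older, newer], max_rows=2)
```

Because the merged stream is sorted and a chunk is only closed when a new key
arrives, the output sstables cover disjoint, increasing key ranges.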

Updated DataTracker to add streamed sstables to level 0.  DataTracker public 
API probably needs a more thorough look though to see if we're missing 
anything. (Speaking of streaming, I think we do need to go by data size not 
sstable count b/c streamed sstables from repair can be arbitrarily large or 
small.)

In promote, do we need to check for all the removed ones being on the same 
level?  I can't think of a scenario where we're not merging from multiple 
levels.  If so I'd change that to an assert.  (In fact there should be exactly 
two levels involved, right?)

Did some surgery on getCompactionCandidates.  Generally renamed things to be 
more succinct. Feels like getCompactionCandidates should do lower levels 
before doing higher levels?

We'll also need to think about which parts of the strategy/manifest need to be 
threadsafe. (All of them?)  Should definitely document this in AbstractCS.


> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-14 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 0001-leveldb-style-compaction.patch

Adding a patch for leveldb style compaction. I see this as a 'good start' and 
I'm looking for some further input. I'm not going to be able to work on this 
for the next week or so, so I'm putting it here to start some discussion on 
this approach.

This implementation requires no durable manifest.

Ranges are created at SSTable creation (flush or compaction) or sstable index 
creation.

Exponent used for levels is 10.

Preliminary runs show that high write rates do make level 0 to level 1 
promotions back up substantially, but once cleared, promotions out of level 1 
seem to be very fast.

I found the best performance by removing the compaction throughput throttling 
and setting concurrent compactors to 1.

The SSTable size in this implementation is determined by the flush size in mb 
setting.

The recovery path reads the list of SSTables, groups them by non-overlapping 
ranges, then places each range in its appropriate level.
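
The grouping step could look roughly like the Python sketch below. This is a
guess at the idea, not the patch's recovery code: it greedily partitions key
ranges into groups whose members do not overlap, and leaves open how a group is
then mapped to a level. `group_non_overlapping` is a hypothetical name.

```python
from typing import List, Tuple

KeyRange = Tuple[str, str]  # (first_key, last_key) of one SSTable


def group_non_overlapping(ranges: List[KeyRange]) -> List[List[KeyRange]]:
    """Greedily place each range into the first group it does not overlap."""
    groups: List[List[KeyRange]] = []
    for first, last in sorted(ranges):
        for group in groups:
            # Ranges are added in start order, so the last member ends latest.
            if group[-1][1] < first:
                group.append((first, last))
                break
        else:
            groups.append([(first, last)])
    return groups


example_ranges = [("a", "f"), ("g", "m"), ("c", "k"), ("n", "z"), ("l", "p")]
example_groups = group_non_overlapping(example_ranges)
```

Each resulting group satisfies the property a level needs (no overlapping
ranges), which is why such a group is a candidate for being placed on one
level.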

Finally, credit is due to the leveldb team, as this design was inspired by the 
leveldb implementation.

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Attachments: 0001-leveldb-style-compaction.patch





[jira] Updated: (CASSANDRA-1608) Redesigned Compaction

2010-10-20 Thread Stu Hood (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu Hood updated CASSANDRA-1608:


Comment: was deleted

(was: > I guess Cassandra would only need a fixed count of exactly 2, making it 
> a non-issue.
Yeah, I think we realized this at the same time... rather than deleting from the 
original filter, you could mark it as superseded in a separate filter for the 
sstable, which could be sized to accommodate the number of supersedes/deletes 
you would need to do before you'd want to run a compaction for the sstable: 
fewer than 2 bits in total, actually.

EDIT: Never mind... false positives in the second set would cause the wrong data 
to be dropped.)
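
The problem named in the EDIT can be shown with a small sketch. It assumes a 
deliberately tiny, hypothetical Bloom filter (TinyBloom, 8 bits, one hash) used 
as the separate "superseded" filter; none of these names come from Cassandra. 
Only one key is marked as superseded, yet other live keys also test positive and 
would be treated as dropped.

```python
import hashlib


class TinyBloom:
    """Deliberately small Bloom filter, for illustration only."""

    def __init__(self, num_bits: int = 8, num_hashes: int = 1):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive each bit position from a SHA-256 digest of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # Never a false negative, but may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


superseded = TinyBloom()
superseded.add("row-1")  # only row-1 was really superseded

# Keys that are still live but that the filter reports as superseded.
live_keys = [f"row-{n}" for n in range(2, 1000)]
wrongly_dropped = [k for k in live_keys if superseded.might_contain(k)]
```

A larger filter lowers the false-positive rate but cannot make it zero, so any 
key in wrongly_dropped would lose live data.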

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Fix For: 0.7.1

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (CASSANDRA-1608) Redesigned Compaction

2010-10-12 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-1608:
--

Affects Version/s: (was: 0.7 beta 2)
Fix Version/s: 0.7.1

> Redesigned Compaction
> -
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Chris Goffinet
> Fix For: 0.7.1
