> 1) Am I reading things correctly?
Yes. 
If you do a read/slice by name and more than min compaction level nodes where 
read the data is re-written so that the next read uses fewer SSTables.

> 2) What is really happening here?  Essentially minor compactions can occur 
> between 4 and 32 memtable flushes.  Looking through the code, this seems to 
> only effect a couple types of select statements (when selecting a specific 
> column on a specific key being one of them). During the time between these 
> two values, every "select" statement will perform a write.
Yup, only for readying a row where the column names are specified.
Remember minor compaction when using SizedTiered Compaction (the default) works 
on buckets of the same size. 

Imagine a row that had been around for a while and had fragments in more than 
Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd tier and 2 
sstables in the 1st. So it takes (potentially) 5 SSTable reads. If this row is 
read it will get hoisted back up. 

But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st tier it 
will not hoisted. 

There are a few short circuits in the SliceByName read path. One of them is to 
end the search when we know that no other SSTables contain columns that should 
be considered. So if the 4 columns you read frequently are hoisted into the 1st 
bucket your reads will get handled by that one bucket. 

It's not every select. Just those that touched more the min compaction 
sstables. 


> 3) Is this desired behavior?  Is there something else I should be looking at 
> that could be causing this behavior?
Yes.
https://issues.apache.org/jira/browse/CASSANDRA-2503

Cheers

   
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com> wrote:

> Hello,
> 
> We have an unusual situation that I believe I've reproduced, at least 
> temporarily, in a test environment.  I also think I see where this issue is 
> occurring in the code.
> 
> We have a specific column family that is under heavy read and write load on a 
> nightly basis.   For the purposes of this description, I'll refer to this 
> column family as "Bob".  During this nightly processing, sometimes Bob is 
> under very write load, other times it is very heavy read load.
> 
> The application is such that when something is written to Bob, a write is 
> made to one of two other tables.  We've witnessed a situation where the write 
> count on Bob far outstrips the write count on either of the other tables, by 
> a factor of 3->10.  This is based on the WriteCount available on the column 
> family JMX MBean.  We have not been able to find where in our code this is 
> happening, and we have gone as far as tracing our CQL calls to determine that 
> the relationship between Bob and the other tables are what we expect.
> 
> I brought up a test node to experiment, and see a situation where, when a 
> "select" statement is executed, a write will occur.
> 
> In my test, I perform the following (switching between nodetool and cqlsh):
> 
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> 
> Then, for a period of time (before a minor compaction occurs), a select 
> statement that selects specific columns will cause writes to occur in the 
> write count of the column family:
> 
> select about,changed,data from bob where key='<hex key>';
> 
> This situation will continue until a minor compaction is completed.
> 
> I went into the code and added some traces to CollationController.java:
> 
>   private ColumnFamily collectTimeOrderedData()
>     {
>         logger.debug("collectTimeOrderedData");
> 
>       ... <snip> ...
> 
> ---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min 
> compact: " + cfs.getMinimumCompactionThreshold() );
>             // "hoist up" the requested data into a more recent sstable
>             if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>                 && !cfs.isCompactionDisabled()
>                 && cfs.getCompactionStrategy() instanceof 
> SizeTieredCompactionStrategy)
>             {
>                 RowMutation rm = new RowMutation(cfs.table.name, new 
> Row(filter.key, returnCF.cloneMe()));
>                 try
>                 {
> ---> HERE               logger.debug( "Apply hoisted up row mutation" );      
>                     // skipping commitlog and index updates is fine since 
> we're just de-fragmenting existing data
>                     Table.open(rm.getTable()).apply(rm, false, false);
>                 }
>                 catch (IOException e)
>                 {
>                     // log and allow the result to be returned
>                     logger.error("Error re-writing read results", e);
>                 }
>             } 
> ... <snip> ...
> 
> Performing the steps above, I see the following traces (in the test 
> environment I decreased the minimum compaction threshold to make this easier 
> to reproduce).  After I do a couple of update/flush, I see this in the log:
> 
> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 
> 117) Scheduling a background task check for bob with 
> SizeTieredCompactionStrategy
> 
> Then, until compaction occurs, I see (when performing a select):
> 
> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 
> 86) Disseminating load info ...
> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) 
> execute_cql_query
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL 
> statement type: SELECT
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) 
> Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', 
> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
> columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', 
> columnName='null')', columns=[about,changed,data,])/ONE
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) 
> Blockfor is 1; setting up requests to /10.0.4.20
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) 
> reading data locally
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) 
> LocalReadRunnable reading SliceByNamesReadCommand(table='open', 
> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
> columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', 
> columnName='null')', columns=[about,changed,data,])
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 
> 68) In get top level columns: class 
> org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class 
> org.apache.cassandra.db.marshal.BytesType
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 
> 84) collectTimeOrderedData
> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java 
> (line 188) tables iterated: 4 Min compact: 2
> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java 
> (line 198) Apply hoisted up row mutation
> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying 
> mutation of row 
> 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
> 
> The above traces will occur every time I repeat the above select statement.
> 
> Minor compaction doesn't start until a few minutes after the request was 
> submitted above (note, this is an unloaded test node):
> 
> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 
> 45) Creating IntervalNode from 
> [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
> 
> Once minor compaction occurs, the behavior around write count being 
> incremented stops, until more than the minimum compaction threshold memtables 
> are flush to disk.
> 
> So, my questions are:
> 
> 1) Am I reading things correctly?
> 
> 2) What is really happening here?  Essentially minor compactions can occur 
> between 4 and 32 memtable flushes.  Looking through the code, this seems to 
> only effect a couple types of select statements (when selecting a specific 
> column on a specific key being one of them). During the time between these 
> two values, every "select" statement will perform a write.
> 
> 3) Is this desired behavior?  Is there something else I should be looking at 
> that could be causing this behavior?
> 
> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
> Any help is appreciated,
> Thanks,
> -Mike
> 
> 
>  

Reply via email to