[ 
https://issues.apache.org/jira/browse/CASSANDRA-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049444#comment-13049444
 ] 

Alan Liang edited comment on CASSANDRA-2753 at 6/14/11 9:40 PM:
----------------------------------------------------------------

There are basically 3 places where we need to track max timestamps:

1. Memtable flush
2. During compaction (we simply take the max timestamp already recorded for the 
sstables)
3. Streamed data (normal columns and counter columns)
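
As a rough illustration of the three cases above, here is a minimal sketch. 
MaxTimestampTracker and its methods are hypothetical names made up for this 
example, not classes from the patch. It assumes timestamps are plain longs and 
uses Long.MIN_VALUE to mean "nothing observed yet".

```java
import java.util.List;

// Hypothetical helper, not an actual Cassandra class.
final class MaxTimestampTracker {
    // Long.MIN_VALUE means no column has been observed yet (an assumption of this sketch).
    private long maxTimestamp = Long.MIN_VALUE;

    // Memtable flush or streamed data: call once for every column seen.
    void update(long columnTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, columnTimestamp);
    }

    long get() {
        return maxTimestamp;
    }

    // Compaction: no columns are inspected; the maxima already recorded
    // for the input sstables are simply combined.
    static long mergeForCompaction(List<Long> inputSSTableMaxTimestamps) {
        long max = Long.MIN_VALUE;
        for (long ts : inputSSTableMaxTimestamps) {
            max = Math.max(max, ts);
        }
        return max;
    }
}
```

The compaction path never touches column data, which is why only flush and 
streaming need per-column work.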

The challenge here is to capture the max timestamp for newly streamed data. 

For non-counter streamed data, RowIndexer#doIndexing goes through the streamed 
data files and simply updates the cache for the new rows. It iterates over the 
column families without deserializing the columns. To capture the max timestamp 
here, I deserialize the columns from disk. This costs more CPU, but since the 
indexer is already doing disk seeks when calling 
deserializeFromSSTableNoColumns(), the extra read is less costly than it would 
otherwise be.

For counter streamed data, CommutativeRowIndexer#doIndexing creates new data 
files from the streamed data files. It does this by building an 
AbstractCompactedRow, which can be either a PreCompactedRow or a 
LazilyCompactedRow. Collecting the max timestamp for a PreCompactedRow is easy 
since all the columns are in memory. For a LazilyCompactedRow, the only place 
where I can observe the max timestamp is inside the #write method. Capturing 
the max timestamp inside #write is not ideal since it introduces a side effect 
into a method that should only serialize. Alternatively, I could capture the 
max timestamp by deserializing the entire LazilyCompactedRow a second time, but 
that would mean more IO/CPU.

So it looks like I have to capture the max timestamp inside #write.
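
To make the trade-off concrete, here is a minimal sketch of capturing the max 
timestamp as a side effect of writing. SketchColumn and LazyRowSketch are 
hypothetical stand-ins invented for this example. They do not reflect the real 
LazilyCompactedRow API or its on-disk format.

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a column: just a timestamp and a value.
final class SketchColumn {
    final long timestamp;
    final byte[] value;

    SketchColumn(long timestamp, byte[] value) {
        this.timestamp = timestamp;
        this.value = value;
    }
}

// Hypothetical stand-in for a lazily compacted row: columns are only
// visited while the row is being written out.
final class LazyRowSketch {
    private final Iterable<SketchColumn> columns;
    private long maxTimestamp = Long.MIN_VALUE; // unset until write() runs

    LazyRowSketch(Iterable<SketchColumn> columns) {
        this.columns = columns;
    }

    // Serializes the row and, as a side effect, records the max timestamp.
    void write(DataOutputStream out) throws IOException {
        for (SketchColumn c : columns) {
            maxTimestamp = Math.max(maxTimestamp, c.timestamp);
            out.writeLong(c.timestamp);
            out.writeInt(c.value.length);
            out.write(c.value);
        }
    }

    // Only meaningful after write() has been called.
    long maxTimestamp() {
        return maxTimestamp;
    }
}
```

The cost of this design is that maxTimestamp() returns a placeholder until 
write() has run, so callers have to respect that ordering. The benefit is that 
the row is read only once.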

> Capture the max client timestamp for an SSTable
> -----------------------------------------------
>
>                 Key: CASSANDRA-2753
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2753
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Alan Liang
>            Assignee: Alan Liang
>            Priority: Minor
>         Attachments: 
> 0003-capture-max-timestamp-for-sstable-and-introduced-SST.patch
>
>

