[jira] [Commented] (ACCUMULO-1177) Decrease time it takes to recover after tablet server failures

Keith Turner (JIRA) Mon, 25 Mar 2013 11:43:17 -0700

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612967#comment-13612967
 ]


Keith Turner commented on ACCUMULO-1177:
----------------------------------------

I was thinking about the design for all of the 1.6 walog changes.  Currently, 
each batch of mutations that's written to a walog has a sequence number attached 
to it.  This sequence number relates the mutations to an instance of an 
in-memory map.  Currently this sequence number is recorded in the start and stop 
minor compaction events in the walog.  I'm thinking we can possibly dispense 
with these start and stop minc events in the walog and instead store the seq # 
in the metadata table.  This could be done in the same mutation that writes out 
a new minor compaction file to the metadata table, which would make the update 
atomic.  The seq # could be stored in a new column.  It would need to increase 
monotonically for the lifetime of a tablet.
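
A minimal sketch of the idea, not Accumulo code: the hypothetical 
TabletMetadata class below models one tablet's metadata row in memory, and 
record_minor_compaction stands in for the single mutation that adds the new 
file and the seq # together.  The column name flushed_seq and the 
strictly-increasing check are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TabletMetadata:
    """In-memory stand-in for one tablet's row in the metadata table."""
    files: list = field(default_factory=list)
    # Hypothetical new column; -1 means no minor compaction has been recorded.
    flushed_seq: int = -1

    def record_minor_compaction(self, new_file: str, seq: int) -> None:
        # Assumption: the seq # must strictly increase over the tablet's lifetime.
        if seq <= self.flushed_seq:
            raise ValueError(f"seq {seq} is not greater than {self.flushed_seq}")
        # Validation happens first, so either both fields change or neither
        # does, which models writing them in one atomic mutation.
        self.files.append(new_file)
        self.flushed_seq = seq
```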

I think this will have two benefits.

 * For ACCUMULO-1083, I think this will greatly simplify minor compactions and 
recovery.  I think it avoids having to group logs and consider each group in 
order at recovery time.  We would not need to write start and stop minc events 
to all active walog groups (even if no mutations were written to the current 
walog of the group).
 * For this issue, I think it leads to the possibility of sorting less data at 
recovery time.  We can analyze the metadata table and determine, for each walog, 
which (tablet, seq #) pairs are needed.  Then we would only need to sort 
mutations whose (tablet, seq #) is > what's needed.
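
The filtering step in the second point could look roughly like the sketch 
below.  It is illustrative only: the flushed_seq map (tablet to the seq # read 
from the metadata table) and the (tablet, seq, mutation) tuple layout are 
hypothetical, and it assumes a mutation needs recovery exactly when its seq # 
is strictly greater than the stored value.

```python
def mutations_to_sort(walog_entries, flushed_seq):
    """Yield only the walog entries that would still need to be sorted.

    walog_entries: iterable of (tablet, seq, mutation) tuples (hypothetical layout).
    flushed_seq:   dict mapping tablet -> seq # stored in the metadata table.
    """
    for tablet, seq, mutation in walog_entries:
        # Assumption: a tablet with no recorded seq # has never flushed,
        # so all of its mutations are kept.
        if seq > flushed_seq.get(tablet, -1):
            yield tablet, seq, mutation


# Example: tablet t1 has flushed through seq 5, t2 has no recorded seq #.
entries = [("t1", 4, "m-a"), ("t1", 7, "m-b"), ("t2", 2, "m-c")]
needed = list(mutations_to_sort(entries, {"t1": 5}))
```

Here the entry for t1 at seq 4 is skipped, so less data reaches the sort.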

The drawback of this approach is that the walog will be less self-contained, 
because recovery would also depend on the seq # stored in the metadata table.


                
> Decrease time it takes to recover after tablet server failures
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-1177
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1177
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Keith Turner
>             Fix For: 1.6.0
>
>
> Examine the end-to-end process for recovering from failures and look for ways 
> to speed it up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
