[ 
https://issues.apache.org/jira/browse/KUDU-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387124#comment-15387124
 ] 

Todd Lipcon commented on KUDU-1538:
-----------------------------------

Put up a potential fix at http://gerrit.cloudera.org:8080/3719

> "Orphaned" block deletion can delete live blocks in use by other tablets
> ------------------------------------------------------------------------
>
>                 Key: KUDU-1538
>                 URL: https://issues.apache.org/jira/browse/KUDU-1538
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, tablet
>    Affects Versions: 0.9.1
>            Reporter: Todd Lipcon
>            Priority: Blocker
>
> Currently, we allocate block IDs using a random number generator, ensuring 
> that the blocks we allocate are not already in use. Of course, that doesn't 
> preclude a block which was previously used and then deleted from having its 
> ID reused.
> This interacts quite poorly with the "orphaned block" processing we have in 
> tablet metadata. As a refresher, the "orphaned block" thing is used as 
> follows:
> - during a compaction, we have the output blocks (newly written data) and the 
> input blocks (data which has been compacted and is no longer relevant)
> - when the compaction finishes, we write a new TabletMetadata which swaps in 
> the new blocks and removes the old blocks
> -- following that, we delete the old (input) blocks. Of course, we can't 
> delete the old blocks until after we've flushed the metadata; otherwise, if 
> we crashed before flushing the metadata, we'd have lost track of the new 
> block IDs.
> -- so, we defer the deletion of the input blocks until after the metadata has 
> been flushed
> - this leaves open the opposite hole: if we defer the deletion of the old 
> blocks, and we crash just _after_ flushing metadata, we would leak those old 
> blocks and their disk space, which is no good either.
> -- so, when we flush metadata, we include the 'old blocks' in an 
> 'orphan_blocks' array. On loading of metadata, we try to 'roll forward' the 
> deletion to prevent the above-mentioned leak from being permanent.
> The "roll forward" behavior mentioned above is what seems to be eating 
> blocks. We can now have the following bad interleaving:
> - a compaction in tablet A succeeds and lists block ID "X" as orphaned
> - a different tablet B re-uses block ID "X"
> - we restart the TS, or trigger a remote bootstrap (which also "cleans up" 
> orphan blocks)
> -- it deletes block "X" from underneath tablet "B"
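
The interleaving described in the quoted report can be reproduced with a small toy model. This is only a sketch, not Kudu code: the names BlockManager, Tablet, compact, roll_forward and simulate are hypothetical, the ID space is shrunk to two values so that reuse is certain rather than merely possible, and it assumes the orphan list stays in tablet A's flushed metadata until the next load.

```python
import random


class BlockManager:
    """Toy block store (hypothetical): tracks only which block IDs are live."""

    def __init__(self, id_space, seed=0):
        self.id_space = id_space
        self.rng = random.Random(seed)
        self.live = set()

    def create_block(self):
        # Random allocation that skips IDs currently in use. Nothing stops
        # an ID that was used and then deleted from being handed out again.
        while True:
            block_id = self.rng.randrange(self.id_space)
            if block_id not in self.live:
                self.live.add(block_id)
                return block_id

    def delete_block(self, block_id):
        self.live.discard(block_id)


class Tablet:
    """Toy tablet metadata (hypothetical): live blocks plus an orphan list."""

    def __init__(self, name):
        self.name = name
        self.blocks = set()
        self.orphaned_blocks = set()

    def compact(self, bm, inputs):
        output = bm.create_block()
        # "Flush": swap in the output block and record the inputs as orphaned.
        self.blocks = (self.blocks - inputs) | {output}
        self.orphaned_blocks |= inputs
        # Deferred deletion, performed only after the flush.
        for block_id in inputs:
            bm.delete_block(block_id)
        # Assumption: the orphan list is not cleared here, so it is still
        # present in the metadata the next time the tablet is loaded.

    def roll_forward(self, bm):
        # On load, retry the deletion of every block listed as orphaned.
        for block_id in self.orphaned_blocks:
            bm.delete_block(block_id)
        self.orphaned_blocks.clear()


def simulate():
    bm = BlockManager(id_space=2)  # tiny ID space forces the reuse
    a, b = Tablet("A"), Tablet("B")

    x = bm.create_block()
    a.blocks.add(x)
    a.compact(bm, {x})             # tablet A lists X as orphaned and deletes it

    reused = bm.create_block()     # tablet B is handed the same ID
    b.blocks.add(reused)

    a.roll_forward(bm)             # restart: A's roll-forward deletes X again
    return x, reused, b.blocks, bm.live
```

After simulate() returns, tablet B's metadata still references the block ID, but the block manager no longer considers it live, which is the data loss described in the report. In a real ID space the reuse would be rare rather than guaranteed.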



