[ 
https://issues.apache.org/jira/browse/KUDU-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387075#comment-15387075
 ] 

Todd Lipcon commented on KUDU-1538:
-----------------------------------

A couple thoughts here:

- the orphaned-block roll-forward is trying hard to avoid block leaks when we 
crash just after a metadata flush, but we already have the opposite leak when 
we crash just before a metadata flush (the in-progress blocks being written 
as the compaction output are "committed" in the block manager but not 
referenced anywhere). So, despite our best efforts, we _still_ have to 
worry about a more thorough (e.g. mark-and-sweep-style) "garbage collector" for 
blocks (KUDU-829). Maybe we should just throw away this best effort, accept 
that our current offering is 'data leaky', and come up with a better holistic 
solution?
- the fact that we use randomized block IDs instead of sequential block IDs 
makes reuse much more plausible. With sequentially-allocated IDs, we'd have to 
"wrap around" our extremely large space to make this an issue, which is _way_ 
less likely. (I actually had a patch back in 2014 to do this, with some other 
benefits, but it was only for the file block manager (FBM).)
- maybe we need to "reserve" those block IDs in the block manager until they're 
actually fully removed from the metadata? I'm worried that this could be quite 
complex, though.
- maybe a more 'WAL-like' way of doing the roll-forward, tied to specific 
revisions of the TabletMetadata, is the way to go?
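
A minimal sketch of the "reserve" idea from the third bullet, written in Python 
for brevity rather than Kudu's C++. Every name here (BlockManager, 
orphan_block, release_id) is hypothetical and not taken from the Kudu code 
base; the sketch only shows that an allocator which skips reserved IDs cannot 
hand out an ID that some tablet's metadata still lists as orphaned.

```python
import random


class BlockManager:
    """Toy block manager (hypothetical; not Kudu's real API)."""

    def __init__(self, id_space, seed=0):
        self._rng = random.Random(seed)
        self._id_space = id_space
        self.live = set()      # IDs of blocks that currently exist on disk
        self.reserved = set()  # IDs still named in some tablet's orphan list

    def create_block(self):
        # Random allocation that skips both live and reserved IDs.
        if len(self.live | self.reserved) >= self._id_space:
            raise RuntimeError("block ID space exhausted")
        while True:
            block_id = self._rng.randrange(self._id_space)
            if block_id not in self.live and block_id not in self.reserved:
                self.live.add(block_id)
                return block_id

    def orphan_block(self, block_id):
        # Called when flushed metadata lists the block as orphaned.
        self.reserved.add(block_id)

    def delete_block(self, block_id):
        # Removes the block's data; the ID may still be reserved.
        self.live.discard(block_id)

    def release_id(self, block_id):
        # Called only once no flushed metadata references the ID any more.
        self.reserved.discard(block_id)
```

The complexity worry shows up even here: the reserved set would have to be 
rebuilt from every tablet's metadata at startup before any allocation happens.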


> "Orphaned" block deletion can delete live blocks in use by other tablets
> ------------------------------------------------------------------------
>
>                 Key: KUDU-1538
>                 URL: https://issues.apache.org/jira/browse/KUDU-1538
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, tablet
>    Affects Versions: 0.9.1
>            Reporter: Todd Lipcon
>            Priority: Blocker
>
> Currently, we allocate block IDs using a random number generator, ensuring 
> that the blocks we allocate are not already in use. Of course that doesn't 
> preclude a block which was previously used and then deleted from having its 
> ID reused.
> This interacts quite poorly with the "orphaned block" processing we have in 
> tablet metadata. As a refresher, the "orphaned block" thing is used as 
> follows:
> - during a compaction, we have the output blocks (newly written data) and the 
> input blocks (data which has been compacted and is no longer relevant)
> - when the compaction finishes, we write a new TabletMetadata which swaps in 
> the new blocks and removes the old blocks
> -- after that, we delete the old (input) blocks. Of course we can't 
> delete the old blocks until after we've flushed the metadata, or else if we 
> crashed before flushing the metadata we'd have lost track of the new block 
> IDs.
> -- so, we defer the deletion of the input blocks until after the metadata has 
> been flushed
> - this leaves open the opposite hole: if we defer the deletion of the old 
> blocks, and we crash just _after_ flushing metadata, we would leak those old 
> blocks and their disk space, which is no good either.
> -- so, when we flush metadata, we include the 'old blocks' in an 
> 'orphan_blocks' array. On loading of metadata, we try to 'roll forward' the 
> deletion to prevent the above-mentioned leak from being permanent.
> The "roll forward" behavior mentioned above is what seems to be eating 
> blocks. We can now have the following bad interleaving:
> - a compaction in tablet A succeeds and lists block ID "X" as orphaned
> - a different tablet B re-uses block ID "X"
> - we restart the TS, or trigger a remote bootstrap (which also "cleans up" 
> orphan blocks)
> -- it deletes block "X" from underneath tablet "B"
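
The bad interleaving described in the issue can be reproduced in a toy model. 
This is a sketch in Python, not Kudu code: ToyTabletServer and its methods are 
hypothetical, and it assumes that tablet A's flushed metadata still lists "X" 
as orphaned after the block itself has been deleted, which is what allows the 
ID to be handed out again.

```python
class ToyTabletServer:
    """Toy model of orphaned-block roll-forward (hypothetical names)."""

    def __init__(self):
        self.blocks = {}   # block ID -> tablet that owns the block
        self.orphans = {}  # tablet -> block IDs listed as orphaned in metadata

    def allocate(self, tablet, block_id):
        # Allocation only checks that the ID is not currently in use.
        if block_id in self.blocks:
            raise ValueError("block ID already in use")
        self.blocks[block_id] = tablet

    def compact(self, tablet, old_id, new_id):
        self.allocate(tablet, new_id)
        # Metadata flush: swap in the new block, list the old one as orphaned.
        self.orphans.setdefault(tablet, []).append(old_id)
        # Deferred deletion of the input block succeeds, but the orphan
        # entry remains in the flushed metadata (assumption of this sketch).
        del self.blocks[old_id]

    def restart(self):
        # Roll forward: delete every listed orphan without checking who
        # owns that block ID now.
        for ids in self.orphans.values():
            for block_id in ids:
                self.blocks.pop(block_id, None)
            ids.clear()


def run_scenario():
    ts = ToyTabletServer()
    ts.allocate("A", "X")
    ts.compact("A", old_id="X", new_id="Y")  # A lists "X" as orphaned
    ts.allocate("B", "X")                    # B re-uses block ID "X"
    ts.restart()                             # roll-forward deletes B's block
    return ts.blocks
```

After run_scenario(), tablet B's block "X" is gone even though B never 
compacted or deleted anything; only A's output block "Y" survives.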



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
