[
https://issues.apache.org/jira/browse/KUDU-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387085#comment-15387085
]
Todd Lipcon commented on KUDU-1538:
-----------------------------------
Another thing I just realized related to block ID reuse: we don't aggressively
evict entries from the block cache when a block is deleted. So, if a new block
is allocated with the same ID, we might have "false" cache hits from the prior
block which could cause some serious havoc.
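To make the false-hit scenario concrete, here is a minimal sketch of a cache keyed only by block ID. The `ToyBlockStore` class and its methods are hypothetical and for illustration only; they are not Kudu's block cache or block manager API.

```python
class ToyBlockStore:
    """Hypothetical block store with a read-through cache keyed by block ID."""

    def __init__(self):
        self.blocks = {}  # block_id -> bytes "on disk"
        self.cache = {}   # block_id -> cached bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        # Serve from the cache when possible, otherwise read and populate it.
        if block_id in self.cache:
            return self.cache[block_id]
        data = self.blocks[block_id]
        self.cache[block_id] = data
        return data

    def delete(self, block_id, evict=False):
        # With evict=False the cache entry outlives the block itself.
        del self.blocks[block_id]
        if evict:
            self.cache.pop(block_id, None)


def read_after_reuse(evict):
    store = ToyBlockStore()
    store.write(7, b"old contents")
    store.read(7)                    # warms the cache for block ID 7
    store.delete(7, evict=evict)
    store.write(7, b"new contents")  # block ID 7 is reused
    return store.read(7)


stale = read_after_reuse(evict=False)  # false hit from the prior block
fresh = read_after_reuse(evict=True)   # eviction on delete avoids it
```

In this toy model, either evicting on delete or never reusing an ID removes the stale read.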
Leaning more towards avoiding block ID reuse. With 64-bit block IDs, we'd have
room to allocate a million per second (many orders of magnitude more than
really required) and still last roughly 600,000 years before rolling over. Even if
we use only 48 bits of the ID (as in the FBM, iirc), at 10K allocations/second
that gives us 900 years of runtime, which at least won't be our problem to
solve!
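The rollover estimates above can be checked with a few lines of arithmetic. This is only a sketch: the helper name is made up, and it assumes a 365.25-day year and a constant allocation rate.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_until_rollover(id_bits, allocations_per_second):
    """Years until an ID space of `id_bits` bits is exhausted at a constant rate."""
    return (2 ** id_bits) / allocations_per_second / SECONDS_PER_YEAR


# Full 64-bit IDs at a million allocations per second: about 585,000 years.
years_64 = years_until_rollover(64, 1_000_000)

# Only 48 bits of the ID at 10K allocations per second: about 892 years.
years_48 = years_until_rollover(48, 10_000)

print(f"64-bit: {years_64:,.0f} years, 48-bit: {years_48:,.0f} years")
```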
> "Orphaned" block deletion can delete live blocks in use by other tablets
> ------------------------------------------------------------------------
>
> Key: KUDU-1538
> URL: https://issues.apache.org/jira/browse/KUDU-1538
> Project: Kudu
> Issue Type: Bug
> Components: fs, tablet
> Affects Versions: 0.9.1
> Reporter: Todd Lipcon
> Priority: Blocker
>
> Currently, we allocate block IDs using a random number generator, ensuring
> that the blocks we allocate are not already in use. Of course that doesn't
> preclude a block which was previously used and then deleted from having its
> ID reused.
> This interacts quite poorly with the "orphaned block" processing we have in
> tablet metadata. As a refresher, the "orphaned block" thing is used as
> follows:
> - during a compaction, we have the output blocks (newly written data) and the
> input blocks (data which has been compacted and is no longer relevant)
> - when the compaction finishes, we write a new TabletMetadata which swaps in
> the new blocks and removes the old blocks
> -- after that, we delete the old (input) blocks. Of course we can't
> delete the old blocks until after we've flushed the metadata, or else if we
> crashed before flushing, the on-disk metadata would still reference the
> deleted blocks and we'd have lost track of the new block IDs.
> -- so, we defer the deletion of the input blocks until after the metadata has
> been flushed
> - this leaves open the opposite hole: if we defer the deletion of the old
> blocks, and we crash just _after_ flushing metadata, we would leak those old
> blocks and their disk space, which is no good either.
> -- so, when we flush metadata, we include the 'old blocks' in a
> 'orphan_blocks' array. On loading of metadata, we try to 'roll forward' the
> deletion to prevent the above-mentioned leak from being permanent.
> The "roll forward" behavior mentioned above is what seems to be eating
> blocks. We can now have the following bad interleaving:
> - a compaction in tablet A succeeds and lists block ID "X" as orphaned
> - a different tablet B re-uses block ID "X"
> - we restart the TS, or trigger a remote bootstrap (which also "cleans up"
> orphan blocks)
> -- it deletes block "X" from underneath tablet "B"
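The bad interleaving described in the issue can be reproduced in a toy model. The sketch below is an illustration under stated assumptions, not Kudu's implementation: `ToyTabletServer` and all of its methods are hypothetical, and it assumes the orphan list stays in the flushed metadata after the deferred deletion has already run.

```python
class ToyTabletServer:
    """Hypothetical model of orphaned-block roll-forward with block ID reuse."""

    def __init__(self):
        self.on_disk = set()  # block IDs that currently exist
        self.refs = {}        # tablet -> block IDs its metadata references
        self.orphans = {}     # tablet -> orphaned block IDs in flushed metadata

    def write_block(self, tablet, block_id):
        # Only IDs currently in use are rejected, so deleted IDs can be reused.
        if block_id in self.on_disk:
            raise ValueError(f"block {block_id} is already in use")
        self.on_disk.add(block_id)
        self.refs.setdefault(tablet, set()).add(block_id)

    def compact(self, tablet, input_ids, output_id):
        self.write_block(tablet, output_id)
        # "Flush metadata": swap in the output and record inputs as orphaned.
        self.refs[tablet] -= set(input_ids)
        self.orphans.setdefault(tablet, set()).update(input_ids)
        # Deferred deletion of the inputs, after the flush.
        for block_id in input_ids:
            self.on_disk.discard(block_id)

    def restart(self):
        # Roll forward: delete every block listed as orphaned, whoever owns it now.
        for orphaned_ids in self.orphans.values():
            for block_id in orphaned_ids:
                self.on_disk.discard(block_id)

    def missing_blocks(self, tablet):
        return sorted(self.refs.get(tablet, set()) - self.on_disk)


ts = ToyTabletServer()
ts.write_block("A", 1)
ts.write_block("A", 2)
ts.compact("A", input_ids=[1, 2], output_id=3)  # tablet A orphans blocks 1 and 2
ts.write_block("B", 1)                          # tablet B reuses block ID 1
missing_before = ts.missing_blocks("B")
ts.restart()                                    # roll-forward deletes block 1 again
missing_after = ts.missing_blocks("B")
```

In this model, tablet B has no missing blocks before the restart and is missing block 1 afterwards. An allocator that never hands out a previously used ID would make the second `write_block("B", 1)` impossible, which is the direction the comment above leans towards.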