[
https://issues.apache.org/jira/browse/KUDU-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357612#comment-15357612
]
Todd Lipcon commented on KUDU-1508:
-----------------------------------
To summarize the bug:
- an ext4 file is made up of a set of extents
- the extents are stored in a b-tree with 4KB "pages". Apparently, after
accounting for headers etc., the root page can hold 340 extent pointers.
- If you have more than 340 extents in a file, then the root page ends up
holding up to 340 pointers to other nodes, each of which holds up to 340 extent
pointers (just like you'd expect with a b-tree).
https://digital-forensics.sans.org/blog/2011/03/28/digital-forensics-understanding-ext4-part-3-extent-trees
is a good reference.
- In our case of the log block manager, we can end up with a lot of extents in
a file due to hole punching. Imagine a 1GB container file with 1000x1MB blocks.
If every odd block is deleted, we'd need 500 extents after we've hole-punched
the deleted blocks.
- This would normally be fine, except that the referenced bug means that ext4
fails to update the interior node pointers, which causes an inconsistency.
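The arithmetic in the bullets above can be sketched in a few lines of Python. The 12-byte header and 12-byte entry sizes are taken from the ext4 on-disk extent format; the function names and the model of a container as a list of live/deleted flags are illustrative assumptions, not Kudu code, and the model ignores any extra fragmentation introduced by the filesystem allocator.

```python
EXTENT_HEADER_BYTES = 12  # size of the ext4 extent header
EXTENT_ENTRY_BYTES = 12   # size of one extent (or index) entry


def entries_per_node(block_size=4096):
    """How many extent entries fit in one tree node of the given size."""
    return (block_size - EXTENT_HEADER_BYTES) // EXTENT_ENTRY_BYTES


def count_extents(live_blocks):
    """Count runs of consecutive live blocks.

    Assumption: each run is physically contiguous and maps to one extent.
    """
    extents = 0
    previous_live = False
    for is_live in live_blocks:
        if is_live and not previous_live:
            extents += 1
        previous_live = is_live
    return extents


# A 1GB container of 1000 x 1MB blocks with every odd-numbered block deleted.
container = [index % 2 == 0 for index in range(1000)]

if __name__ == "__main__":
    print(entries_per_node())
    print(count_extents(container))
```

Under these assumptions, `entries_per_node()` is 340 and `count_extents(container)` is 500, so the example container needs more extents than a single 4KB node can hold and the tree has to grow a level.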
It seems that 'fsck' has no trouble fixing the inconsistency, and we haven't seen
any runtime issues due to this bug. It may be entirely harmless. That said,
it's problematic because when systems reboot they sometimes run fsck and may
need manual intervention to tell fsck to fix the issue.

I looked through the kernel changelog and unfortunately this isn't fixed in any
version of el6. It is, however, fixed in el7 and probably any Ubuntu from the
last several years (it was fixed upstream in Dec 2012).

So, it seems we have a few choices here regarding this issue:
a) *Do nothing* - if indeed the problem is a 'harmless' ext4 corruption fixable
by fsck, then we can just document this as an el6 issue, ask Red Hat to backport
this patch into the next maintenance kernel, and let users know that they may
have to look out for this particular error if fsck runs.
b) *Try to avoid multi-level extent trees* - if we limit the number of blocks
per container to a smaller number (say 300) then we'd be quite unlikely to hit
this issue. It's not a sure thing (the system could have arbitrary amounts of
fragmentation) but it is easy to implement and probably would make the issue
rare enough to not be a problem.
c) *Recommend XFS on el6* - XFS has performed better in most of the tests I've
run, and also doesn't exhibit this bug. However, it's a lot to ask of new
users who are installing Kudu on existing clusters that are running ext4.
d) *Avoid hole punching* - we could spend the time to build a block manager
implementation that doesn't rely on hole punching. This is likely a lot of work.
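To make option (b) concrete, here is a rough sketch of why a limit of around 300 blocks per container would usually keep a file within a single extent node, and why it is not a guarantee. The helper names and the `extents_per_run` fragmentation factor are hypothetical, introduced only for illustration; they are not part of Kudu or ext4.

```python
import math

ENTRIES_PER_NODE = 340  # capacity of one 4KB extent node, as quoted above


def worst_case_live_runs(blocks_per_container):
    """Alternating live/deleted blocks give the most separate live runs."""
    return math.ceil(blocks_per_container / 2)


def stays_single_level(blocks_per_container, extents_per_run=1.0):
    """True if the worst case still fits in a single extent node.

    extents_per_run is a hypothetical fudge factor for filesystem
    fragmentation: how many extents one live run needs on average.
    """
    needed = worst_case_live_runs(blocks_per_container) * extents_per_run
    return needed <= ENTRIES_PER_NODE
```

With no extra fragmentation, `stays_single_level(300)` holds (150 runs at worst) while `stays_single_level(1000)` does not. Raising `extents_per_run` to 3.0 pushes even the 300-block case over the limit, which is the "not a sure thing" caveat in option (b).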
> Log block manager triggers ext4 hole punching bug in el6
> --------------------------------------------------------
>
> Key: KUDU-1508
> URL: https://issues.apache.org/jira/browse/KUDU-1508
> Project: Kudu
> Issue Type: Bug
> Components: fs
> Affects Versions: 0.9.0
> Reporter: Todd Lipcon
> Priority: Blocker
>
> I've repeatedly seen that when I reboot an el6 node that was running
> Kudu tservers, fsck reports issues like:
> data6 contains a file system with errors, check forced.
> data6: Interior extent node level 0 of inode 5259348:
> Logical start 154699 does not match logical start 2623046 at next level.
> After some investigation, I've determined that this is due to an ext4 kernel
> bug: https://patchwork.ozlabs.org/patch/206123/
> Details in a comment to follow.