ctree.h:1779!

Hugo Mills Fri, 19 May 2017 18:50:00 -0700

On Fri, May 19, 2017 at 06:25:22PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:57:09AM +0000, Hugo Mills wrote:
> >    I think from the POV of removing these BUG_ONs, it doesn't matter
> > which FS causes them. "All" you need to know is where the error
> > happened. From there, you can (in theory) work out what was wrong and
> > handle it more elegantly than simply stopping.
>  
> Sorry, "you" being the code author, or the user?


   Author.

> If code author, I'd rather this be worked out without the extra steps of
> having to guess or spend more time to see which FS.

   As I understand it, it doesn't really matter which FS it comes
from. The question is: The kernel has hit this BUG_ON. What do you
actually want to do when this happens? You can't bring the kernel to a
grinding halt (BUG_ON), so how do you handle this more elegantly?

   The state of the specific FS that caused this particular problem
doesn't actually matter. The fact is, someone decided to check
on the FS's state, and punted the problem of handling the check's
failure to someone later (the BUG_ON). You(*)'ve got to pick up that punt
and deal with it more cleanly.

(*) You == some kernel developer.

> My FS takes up to a day to scrub and btrfs check, clearly making me do this
> over 3 of them is not a good use of time and a loss of up to 3 days of wall
> clock time.
> Not counting that during that time, I have loss of service on all my
> filesystems because I don't want to mount them read-write.
> 
> >    Obviously it would be nice, from the POV of the sysadmin, to know
> > which FS was complaining, but as an FS developer it's secondary to
> > identifying a BUG_ON which happens in real life, which offers an
> > opportunity to make the error path more elegant.
> 
> If the FS is remounted R/O, further damage is averted and it's obvious to
> the admin which FS has a problem.
> 
> Is there a reason why all errors that are serious enough, do not cause the
> FS to remount R/O instead of having any BUG/BUG_ON at all?

   Simply that it's easier to write a BUG_ON than to write the code to
bubble up a failure to the point that the FS can be made RO. This is a
clean-up kind of process: BUG_ONs should mostly be changed into a
proper error-handling path leading to remount-RO (in the worst
cases). As I understand it, it's not massively difficult, but it's
probably non-trivial effort to get right in each case.
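
   To make that concrete, here's a rough userspace sketch of the
shape of the change. It is not btrfs code: toy_fs, toy_item and all
of the function names are made up for illustration, assert() stands
in for BUG_ON, and setting a flag stands in for the real remount-RO
machinery. The only point is that the check returns an error, and the
caller that knows which FS is involved decides what to do with it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a mounted filesystem; not a btrfs structure. */
struct toy_fs {
    const char *name;
    bool read_only;
};

/* Hypothetical on-disk item whose size field may be corrupt. */
struct toy_item {
    unsigned int size;
    unsigned int max_size;
};

/* Before: the check punts the failure to an assertion, which stops
 * the whole program (the userspace analogue of BUG_ON). */
unsigned int item_size_bug_on(const struct toy_item *it)
{
    assert(it->size <= it->max_size);
    return it->size;
}

/* After: the same check reports the failure to its caller instead. */
int item_size_checked(const struct toy_item *it, unsigned int *out)
{
    if (it->size > it->max_size)
        return -EIO;    /* bubble the error up, don't halt */
    *out = it->size;
    return 0;
}

/* A caller that knows which FS is affected: it logs the FS name and
 * flips that one FS to read-only, leaving everything else running. */
int read_item(struct toy_fs *fs, const struct toy_item *it, unsigned int *out)
{
    int ret = item_size_checked(it, out);

    if (ret < 0) {
        fprintf(stderr, "%s: corrupt item, forcing read-only\n", fs->name);
        fs->read_only = true;
    }
    return ret;
}
```

   Note that the "after" version also answers the sysadmin's question
for free: the error surfaces at a point where the FS is known, so the
message can name it. The non-trivial part in real code is that every
function between the check and that caller has to pass the error along
and clean up after itself on the way.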

> WARN_ON is also fine obviously if the error is not serious enough, or doing
> a WARN_ON + a remount R/O

   Sure, but everything should really be turned into either a proper
error-handling path (most likely remount RO), or explicitly defined as
BUG_ON (i.e. "this must never happen -- if it does, then the hardware
is fucked up, and we're not responsible for the consequences"). It's
that latter definition that's part of the hard decision-making process
for the kernel dev.

   Hugo.

-- 
Hugo Mills             | Great oxymorons of the world, no. 7:
hugo@... carfax.org.uk | The Simple Truth
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!
