On Fri, May 19, 2017 at 06:25:22PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:57:09AM +0000, Hugo Mills wrote:
> > I think from the POV of removing these BUG_ONs, it doesn't matter
> > which FS causes them. "All" you need to know is where the error
> > happened. From there, you can (in theory) work out what was wrong and
> > handle it more elegantly than simply stopping.
>
> Sorry, "you" being the code author, or the user?

Author.

> If code author, I'd rather this be worked out without the extra steps of
> having to guess or spend more time to see which FS.

As I understand it, it doesn't really matter which FS it comes
from. The question is: the kernel has hit this BUG_ON. What do you
actually want to do when this happens? Bringing the kernel to a
grinding halt (BUG_ON) isn't acceptable, so how do you handle this
more elegantly?

The state of the specific FS that caused this particular problem
doesn't actually matter. The fact is, someone decided to check on the
FS's state, and punted the problem of handling the check's failure to
someone later (the BUG_ON). You(*)'ve got to pick up that punt and
deal with it more cleanly.

(*) You == some kernel developer.

> My FS takes up to a day to scrub and btrfs check, clearly making me do this
> over 3 of them is not a good use of time and a loss of up to 3 days of wall
> clock time.
> Not counting that during that time, I have loss of service on all my
> filesystems because I don't want to mount them read-write.
>
> > Obviously it would be nice, from the POV of the sysadmin, to know
> > which FS was complaining, but as an FS developer it's secondary to
> > identifying a BUG_ON which happens in real life, which offers an
> > opportunity to make the error path more elegant.
>
> If the FS is remounted R/O, further damage is averted and it's obvious to
> the admin which FS has a problem.
>
> Is there a reason why all errors that are serious enough, do not cause the
> FS to remount R/O instead of having any BUG/BUG_ON at all?

Simply that it's easier to write a BUG_ON than to write the code
to bubble up a failure to the point that the FS can be made RO. This
is a clean-up kind of process: BUG_ONs should mostly be changed into a
proper error-handling path leading to remount-RO (in the worst
cases). As I understand it, it's not massively difficult, but it takes
non-trivial effort to get right in each case.
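
To make the shape of that clean-up concrete, here is a minimal userspace sketch in C. It is not btrfs or kernel code: `struct toy_fs`, `toy_check_extent`, `toy_write` and `toy_force_readonly` are all hypothetical names invented for illustration. It only shows the pattern described above: the check returns an error code instead of halting, the caller bubbles it up, and the worst case ends with that one filesystem forced read-only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical toy filesystem state; not a real kernel structure. */
struct toy_fs {
    const char *name;   /* lets the log say which FS failed */
    bool read_only;     /* set once a serious error has been seen */
    int last_error;     /* 0, or a negative errno value */
};

/* Worst-case handler: record the error and stop accepting writes. */
static void toy_force_readonly(struct toy_fs *fs, int err)
{
    /* A genuine "this must never happen" invariant stays an assertion. */
    assert(fs != NULL);
    fs->last_error = err;
    fs->read_only = true;
    fprintf(stderr, "toy_fs %s: error %d, forcing read-only\n",
            fs->name, err);
}

/* The consistency check: reports failure instead of halting. */
static int toy_check_extent(long refs)
{
    if (refs < 0)
        return -EIO;    /* where a BUG_ON(refs < 0) might have been */
    return 0;
}

/* The caller picks up the failure and bubbles it up. */
static int toy_write(struct toy_fs *fs, long refs)
{
    int ret;

    if (fs->read_only)
        return -EROFS;  /* earlier damage: refuse further writes */

    ret = toy_check_extent(refs);
    if (ret) {
        toy_force_readonly(fs, ret);
        return ret;
    }
    return 0;
}
```

In this sketch, a failed check leaves the rest of the system running, and the message names the affected filesystem, which is the behaviour the sysadmin asked for. The non-trivial part in real code is that every caller up the chain has to handle the returned error sensibly.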

> WARN_ON is also fine obviously if the error is not serious enough, or doing
> a WARN_ON + a remount R/O

Sure, but everything should really be turned into either a proper
error-handling path (most likely remount RO), or explicitly defined as
BUG_ON (i.e. "this must never happen -- if it does, then the hardware
is fucked up, and we're not responsible for the consequences").

It's that latter definition that's part of the hard decision-making
process for the kernel dev.

Hugo.

-- 
Hugo Mills             | Great oxymorons of the world, no. 7:
hugo@... carfax.org.uk | The Simple Truth
http://carfax.org.uk/  |
PGP: E2AB1DE4          |