Re: Delete a directory, crash the system

2013-07-28 Thread Frank Leonhardt

On 28/07/2013 06:54, Polytropon wrote:

> And here, kids, you can see the strength of an open source
> operating system: You can see _why_ something happens. :-)

Too true!


> On Sat, 27 Jul 2013 20:35:09 +0100, Frank Leonhardt wrote:
>> On 27/07/2013 19:57, David Noel wrote:
>>>> So the system panics in ufs_rmdir(). Maybe the filesystem is
>>>> corrupt? Have you tried to fsck(8) it manually?
>>> fsck worked, though I had to boot from a USB image because I couldn't
>>> get into single user.. for some odd reason.
>>>
>>>> Even if the filesystem is corrupt, ufs_rmdir() shouldn't
>>>> panic(), IMHO, but fail gracefully. Hmmm...
>>> Yeah, I was pretty surprised. I think I tried it like 3 times to be
>>> sure... and yeah, each time... kaboom! Who'd have thought. Do I just
>>> post this to the mailing list and hope some benevolent developer
>>> stumbles upon it and takes it upon him/herself to fix this, or where
>>> do I find the FreeBSD Suggestion Box? I guess I should file a Problem
>>> Report and see what happens from there.


>> I was going to raise an issue when the discussion had died down to a
>> consensus. I also don't think it's reasonable for the kernel to bomb
>> when it encounters corruption on a disk.
>>
>> If you want to patch it yourself, edit sys/ufs/ufs/ufs_vnops.c at around
>> line 2791 and change:
>>
>>   if (dp->i_effnlink < 3)
>>           panic("ufs_dirrem: Bad link count %d on parent",
>>               dp->i_effnlink);
>>
>> To
>>
>>   if (dp->i_effnlink < 3) {
>>           error = EINVAL;
>>           goto out;
>>   }
>>
>> The ufs_link() call has a similar issue.
>>
>> I can't see why my mod will break anything, but there's always
>> unintended consequences.

> One of the core policies usually is to stop _any_ action that
> had failed due to a reason that "cannot be" and make sure it
> won't get worse. This can be seen for example in fsck's behaviour:
> If there is a massive file system error that cannot be repaired
> without further intervention that _could_ destroy data or make
> its retrieval harder or impossible, the operator will be requested
> to make the decision. There are options to automate this process,
> but on the other hand, "always assume 'yes'" can then be a risk,
> as it could prevent recovery. My assumption is that the developers
> chose a similar approach here: We found a situation that should
> not be possible, so we stop the system from messing up the file
> system even more. This carries the attitude of not hiding a
> problem for the sake of convenience by being silent and going
> back to the usual work. Of course it is debatable if this is the
> right decision in _this_ particular case.

The problem I have with this is the assumption that the inode was at 
fault. I said this was the most likely, but it's not the absolute 
reason. At the risk of repeating, it's the /effective/ link count (in 
the vnode) that's out of line here, not the inode count.


If the inode was wrong it could be down to minor FS corruption; an 
interrupted directory creation or deletion would do the trick. The vnode 
could go wrong for all sorts of reasons, probably associated with a race 
during the directory removal, which is not an atomic operation by any 
means. See "The Design of the UNIX Operating System", section 5.16.1, Bach,
Prentice-Hall, 1986.


My guess is that we're looking at an old debugging pragma here, put in 
to cope with a race going wrong if the code wasn't quite right (note 
that the function has since been renamed but the message not updated).


You're right about stopping on internal errors (corruption to the kernel
data structures in this case) but this case is indeed debatable. On the
one hand, now that the system is stable (i.e. we can probably trust the
rmdir code after all this time), the most likely cause is inode corruption
polluting the vnode. On the other hand the pragma may be useful if
people are tinkering with the kernel and you get even more opportunities
for a race with (say) SMP.


I don't expect the kernel to panic on a user-land I/O error, or anything 
else that's expected or recoverable - and a wonky FS meets these 
criteria in my book. David was lucky to find this - I tend to run 
FreeBSD on servers, not laptops, and I'd never have seen this server 
panic live and therefore not been able to discover the cause very 
easily. That's worrying.


So it boils down to:

a) Leave it as is, as it can detect when the kernel has trashed its vnode
table; or


b) It's probably caused by expected FS corruption, so handle it 
gracefully.


Incidentally, if you look at the code you'll see this is only a heuristic
check, and a weak one at that. Most of the time it WILL NOT pick up the
case where the parent directory's link is missing. As far as I can tell
it will go on to unlink the target successfully, with no ill effects. If
this situation really did lead to catastrophe (as suggested by the use
of a panic) then the check used ought to be a lot more reliable! As it
is, removing it entirely, except for debug kernels, is a third option.


Regards, Frank.


Re: Delete a directory, crash the system

2013-07-28 Thread David Noel
Ok folks, thanks again for all the help. Using the feedback I
submitted a PR (#180894) --
http://www.freebsd.org/cgi/query-pr.cgi?pr=180894. I also submitted a
follow-up to it with Frank's code and notes. What next? I don't really
know what happens from here, but I'm guessing/hoping that someone's
monitoring the PR system and will move this forward.

Crossing my fingers, though if anyone knows any better methods of
getting PRs addressed I'm all ears.

-David
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: Delete a directory, crash the system

2013-07-28 Thread Matthew Seaman
On 28/07/2013 06:38, David Noel wrote:
> Ok folks, thanks again for all the help. Using the feedback I
> submitted a PR (#180894) --
> http://www.freebsd.org/cgi/query-pr.cgi?pr=180894. I also submitted a
> follow-up to it with Frank's code and notes. What next? I don't really
> know what happens from here, but I'm guessing/hoping that someone's
> monitoring the PR system and will move this forward.
>
> Crossing my fingers, though if anyone knows any better methods of
> getting PR's addressed I'm all ears.

You've already done the right things: raising a PR and posting about your
problem on freebsd...@freebsd.org, where it is going to come to the
attention of developers working on that area of the system.

Your next move should be to provide whatever additional information
the developers might need to diagnose or reproduce the problem.  This is
really the crucial bit: unless a dev can understand what happened and
how your system came to break in that particular way, it's unlikely
they'll be able to fix it.

If you don't understand what's being asked for, or how to produce any
required information, don't be shy about asking -- either here, or over
on freebsd-fs@...  It's sometimes hard to remember that the sort of
debugging things you'd do routinely and without a second thought as a
developer can appear as pretty arcane mysteries to the uninitiated.

You may find these bits of documentation useful:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/debugging.html

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
   (especially section 10.1 about obtaining a kernel core dump, and
10.2 about using kgdb.)

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.

PGP: http://www.infracaninophile.co.uk/pgpkey
JID: matt...@infracaninophile.co.uk





Re: Delete a directory, crash the system

2013-07-28 Thread Warren Block

On Sun, 28 Jul 2013, Frank Leonhardt wrote:


> So it boils down to:
>
> a) Leave it as is, as it can detect when the kernel has trashed its vnode
> table; or
>
> b) It's probably caused by expected FS corruption, so handle it gracefully.


It would be good to log a system error message like "filesystem may be
corrupt" to give the user some clue other than a seemingly impossible
error with no explanation.
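
A minimal userland sketch of such a hint, for illustration only. The helper name format_corruption_hint, its parameters and the message wording are assumptions made for this example; in the kernel the text would go to the system log rather than into a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the kind of hint suggested above.
 * It writes into a buffer so the sketch stays self-contained;
 * a kernel would send the text to the system log instead.
 * Returns the number of characters the full message needs,
 * as snprintf(3) does.
 */
static int
format_corruption_hint(char *buf, size_t len, const char *mntpoint, int nlink)
{
	assert(buf != NULL && mntpoint != NULL);
	return (snprintf(buf, len,
	    "%s: bad link count %d on parent directory; "
	    "filesystem may be corrupt, run fsck(8)", mntpoint, nlink));
}
```

A caller would pass the mount point and the offending count, then hand the buffer to whatever logging facility is available, so the user sees a reason next to the failed rmdir.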



Re: Delete a directory, crash the system

2013-07-28 Thread cpghost
On 07/27/13 21:12, cpghost wrote:
> A more robust file system would halt all processes, and perform
> an in-kernel fsck on the filesystem and its internal (in-memory)
> structures to repair the damage... and THEN resume the processes.
>
> However, this is a major project, and we don't have a self-healing
> filesystem / kernel (... yet). ;-)
>
> -cpghost.

If we think this further, we may as well start introducing
some elements of self-healing or at least self-inspecting in
the kernel.

How about, for example, a kernel thread that wakes up periodically,
walks through VFS structures, and checks their integrity? Perhaps
also verifying the underlying inodes as well? Think background
fsck, but within the kernel and for kernel structures themselves.

Other parts of the kernel could as well self-inspect for
consistency with a periodic kernel thread. Some parts are
easier than others, so I don't think we could also walk the
VM structures (if those are corrupt, even the repair-thread
will be running amok). But save for that, most parts of the
kernel could use some periodic consistency checking.

Make that checking optional via a sysctl(8), and it won't
even cost performance.
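
A rough userland model of one pass of such a checker, as a sketch only. The type dir_record, the flag consistency_check_enabled (standing in for the sysctl knob) and the function check_dir_records are all hypothetical names; nothing here is taken from the FreeBSD kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory record for a directory; not a real kernel type. */
struct dir_record {
	int nlink;	/* recorded link count */
	int nsubdirs;	/* subdirectories actually found */
};

/* Stand-in for the sysctl knob mentioned above; off by default. */
static bool consistency_check_enabled = false;

/*
 * One pass of the periodic checker: count the records whose link count
 * does not equal 2 + nsubdirs. Returns 0 without scanning when disabled,
 * so the disabled case costs a single branch.
 */
static size_t
check_dir_records(const struct dir_record *recs, size_t n)
{
	size_t bad = 0;

	assert(recs != NULL || n == 0);
	if (!consistency_check_enabled)
		return (0);
	for (size_t i = 0; i < n; i++) {
		if (recs[i].nlink != 2 + recs[i].nsubdirs)
			bad++;
	}
	return (bad);
}
```

A real implementation would also need locking against concurrent updates, which this model ignores.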

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/



Delete a directory, crash the system

2013-07-27 Thread David Noel
I had a strange experience on my laptop yesterday. I was deleting a
directory and the system crashed. It spat out a message along the
lines of "ufs_dirrem bad link count 2 on parent". I thought it was so
strange I repeated the process several times, and each time it
crashed. Is this behavior EXPECTED? I can't for the life of me think
of a time or operating system I've run where I've ever had a system
crash on me from doing something as basic as deleting a file. Anyway I
couldn't boot into single user for some reason so I booted from a USB
image, ran fsck, and then everything was fine.


Re: Delete a directory, crash the system

2013-07-27 Thread Fernando Apesteguía
On 27/07/2013 13:49, David Noel david.i.n...@gmail.com wrote:

> I had a strange experience on my laptop yesterday. I was deleting a
> directory and the system crashed. It spat out a message along the
> lines of "ufs_dirrem bad link count 2 on parent". I thought it was so
> strange I repeated the process several times, and each time it
> crashed. Is this behavior EXPECTED? I can't for the life of me think
> of a time or operating system I've run where I've ever had a system
> crash on me from doing something as basic as deleting a file. Anyway I
> couldn't boot into single user for some reason so I booted from a USB
> image, ran fsck, and then everything was fine.

Was it a kernel crash? Did you get a core?



Re: Delete a directory, crash the system

2013-07-27 Thread David Noel
Yes

On 7/27/13, Fernando Apesteguía fernando.apesteg...@gmail.com wrote:
> Was it a kernel crash? Did you get a core?



Re: Delete a directory, crash the system

2013-07-27 Thread Fernando Apesteguía
On 27/07/2013 14:16, David Noel david.i.n...@gmail.com wrote:

> Yes

Post the stack trace of the core and maybe someone can help you.




Re: Delete a directory, crash the system

2013-07-27 Thread David Noel
> Post the stack trace of the core and maybe someone can help you.

panic: ufs_dirrem: Bad link count 2 on parent
cpuid = 0
KDB: stack backtrace:
#0 0x808680fe at kdb_backtrace+0x5e
#1 0x80832cb7 at panic+0x187
#2 0x80a700e3 at ufs_rmdir+0x1c3
#3 0x80b7d484 at VOP_RMDIR_APV+0x34
#4 0x808ca32a at kern_rmdirat+0x21a
#5 0x80b17cf0 at amd64_syscall+0x450
#6 0x80b03427 at Xfast_syscall+0xf7


Re: Delete a directory, crash the system

2013-07-27 Thread Frank Leonhardt

On 27/07/2013 13:58, David Noel wrote:

>> Post the stack trace of the core and maybe someone can help you.
>
> panic: ufs_dirrem: Bad link count 2 on parent
> cpuid = 0
> KDB: stack backtrace:
> #0 0x808680fe at kdb_backtrace+0x5e
> #1 0x80832cb7 at panic+0x187
> #2 0x80a700e3 at ufs_rmdir+0x1c3
> #3 0x80b7d484 at VOP_RMDIR_APV+0x34
> #4 0x808ca32a at kern_rmdirat+0x21a
> #5 0x80b17cf0 at amd64_syscall+0x450
> #6 0x80b03427 at Xfast_syscall+0xf7



I'm taking a guess here - the effective link count when it came to
removing the parent directory was only two and it should have been three
or more. This gets sanity checked before proceeding, and the kernel panics
if it is not. Why an effective link count of three? We're talking about the
parent of the directory you're trying to zap, right? There's the link to
the directory from its parent, and the '.' link and the '..' link from
the directory you're trying to remove. There may be more if it contains
other directories, but there can't be less.
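
The arithmetic above can be written down as a small userland model. This is only an illustration: expected_dir_nlink and parent_count_plausible_for_rmdir are hypothetical helper names, and the real check in ufs_rmdir() reads dp->i_effnlink rather than calling anything like this.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative userland model (hypothetical helpers, not kernel code).
 * A directory is referenced by:
 *   - its entry in its own parent             (1 link)
 *   - its own "." entry                       (1 link)
 *   - the ".." entry of each child directory  (1 link per subdirectory)
 */
static int
expected_dir_nlink(int nsubdirs)
{
	assert(nsubdirs >= 0);
	return (2 + nsubdirs);
}

/*
 * When a subdirectory is about to be removed, the parent must still hold
 * at least that one subdirectory, so its count cannot be below 3.
 */
static bool
parent_count_plausible_for_rmdir(int parent_nlink)
{
	return (parent_nlink >= expected_dir_nlink(1));
}
```

With the count of 2 from the panic message, parent_count_plausible_for_rmdir(2) is false, which is the situation the kernel check objects to.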


Anyway - if you only had a link count of just two effective links at the
start of the delete process it suggests that the link count was messed
up - either a link never existed or its count was wrong. Should the
kernel panic? Well it's a situation that can never happen - it could
simply remove the directory and pretend everything was okay but I guess
it was decided it was likely to be a symptom of impending disaster.
Other anomalies return an error.


In over ten years with FreeBSD systems I can't say I've ever seen this
"cannot happen" situation arise. I'd guess you had an interrupted (by
power failure) inode operation at some time which caused the corruption.
Removing a directory is a PITA as it can lead to a race - a context swap
could create a file in it mid-way through the process.


Regards, Frank.



Re: Delete a directory, crash the system

2013-07-27 Thread David Noel

Interesting. Thanks for the analysis. I'm not a systems guy (Java,
mostly), so I don't really have the context to make much sense of kgdb
output. What you're saying though makes sense and sounds about right
-- it's a laptop and I've inadvertently run the battery down to
nothing a few times in the past. All the same, it was a very strange
experience. I would not have expected a kernel panic from a simple rm
-rf!


Re: Delete a directory, crash the system

2013-07-27 Thread Jason Lenthe
On 07/27/2013 11:30, David Noel wrote:
> -- it's a laptop and I've inadvertently run the battery down to
> nothing a few times in the past. All the same, it was a very strange
> experience. I would not have expected a kernel panic from a simple rm
> -rf!

You may want to look into running fsck(8) and its myriad of options to
try to clean up the problem (assuming you're using a UFS filesystem, also
see fsck_ufs(8)).  fsck normally runs during startup but perhaps a set
of non-default options will do the trick.

Also make sure you have soft updates enabled on your filesystem and
preferably journaled soft updates, if for some odd reason you don't, as
that is designed to avoid filesystem inconsistencies in the face of
things like power failures.

Sincerely,
Jason



Re: Delete a directory, crash the system

2013-07-27 Thread cpghost
On 07/27/13 14:58, David Noel wrote:
>> Post the stack trace of the core and maybe someone can help you.
>
> panic: ufs_dirrem: Bad link count 2 on parent
> cpuid = 0
> KDB: stack backtrace:
> #0 0x808680fe at kdb_backtrace+0x5e
> #1 0x80832cb7 at panic+0x187
> #2 0x80a700e3 at ufs_rmdir+0x1c3
> #3 0x80b7d484 at VOP_RMDIR_APV+0x34
> #4 0x808ca32a at kern_rmdirat+0x21a
> #5 0x80b17cf0 at amd64_syscall+0x450
> #6 0x80b03427 at Xfast_syscall+0xf7

So the system panics in ufs_rmdir(). Maybe the filesystem is
corrupt? Have you tried to fsck(8) it manually?

Even if the filesystem is corrupt, ufs_rmdir() shouldn't
panic(), IMHO, but fail gracefully. Hmmm...

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/



Re: Delete a directory, crash the system

2013-07-27 Thread David Noel
> You may want to look into running fsck(8) and its myriad of options

fsck did the trick

> Also make sure you have soft updates enabled on your filesystem and
> preferably journaled soft updates

..pretty sure I do but I'll double check, thanks.


Re: Delete a directory, crash the system

2013-07-27 Thread David Noel
> So the system panics in ufs_rmdir(). Maybe the filesystem is
> corrupt? Have you tried to fsck(8) it manually?

fsck worked, though I had to boot from a USB image because I couldn't
get into single user.. for some odd reason.

> Even if the filesystem is corrupt, ufs_rmdir() shouldn't
> panic(), IMHO, but fail gracefully. Hmmm...

Yeah, I was pretty surprised. I think I tried it like 3 times to be
sure... and yeah, each time... kaboom! Who'd have thought. Do I just
post this to the mailing list and hope some benevolent developer
stumbles upon it and takes it upon him/herself to fix this, or where
do I find the FreeBSD Suggestion Box? I guess I should file a Problem
Report and see what happens from there.


Re: Delete a directory, crash the system

2013-07-27 Thread Frank Leonhardt

On 27/07/2013 19:57, David Noel wrote:

>> So the system panics in ufs_rmdir(). Maybe the filesystem is
>> corrupt? Have you tried to fsck(8) it manually?
>
> fsck worked, though I had to boot from a USB image because I couldn't
> get into single user.. for some odd reason.
>
>> Even if the filesystem is corrupt, ufs_rmdir() shouldn't
>> panic(), IMHO, but fail gracefully. Hmmm...
>
> Yeah, I was pretty surprised. I think I tried it like 3 times to be
> sure... and yeah, each time... kaboom! Who'd have thought. Do I just
> post this to the mailing list and hope some benevolent developer
> stumbles upon it and takes it upon him/herself to fix this, or where
> do I find the FreeBSD Suggestion Box? I guess I should file a Problem
> Report and see what happens from there.



I was going to raise an issue when the discussion had died down to a
consensus. I also don't think it's reasonable for the kernel to bomb
when it encounters corruption on a disk.


If you want to patch it yourself, edit sys/ufs/ufs/ufs_vnops.c at around
line 2791 and change:


    if (dp->i_effnlink < 3)
            panic("ufs_dirrem: Bad link count %d on parent",
                dp->i_effnlink);

To

    if (dp->i_effnlink < 3) {
            error = EINVAL;
            goto out;
    }

The ufs_link() call has a similar issue.

I can't see why my mod will break anything, but there's always
unintended consequences. By returning "invalid argument", any code above
it should already be handling that condition although the user will be
scratching their head wondering what's wrong with it. Returning ENOENT
or EACCES or ENOTDIR may be better ("No such directory", "Access denied"
or "Not a valid directory").
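
For readers who want to see the control flow of the patched check in isolation, here is a userland model. It is a sketch under stated assumptions: struct fake_inode and model_rmdir_check are hypothetical stand-ins, and only the field name i_effnlink and the error/goto pattern come from the patch above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-memory inode; only the field used here. */
struct fake_inode {
	int i_effnlink;		/* effective link count */
};

/*
 * Userland model of the patched check: hand an errno value back to the
 * caller instead of calling panic(). Returns 0 when the removal may
 * proceed.
 */
static int
model_rmdir_check(const struct fake_inode *dp)
{
	int error = 0;

	assert(dp != NULL);
	if (dp->i_effnlink < 3) {
		/* ENOENT, EACCES or ENOTDIR are the alternatives discussed. */
		error = EINVAL;
		goto out;
	}
	/* ... the real function would go on to remove the directory ... */
out:
	return (error);
}
```

From the shell, the user would then see rmdir fail with "Invalid argument" rather than the machine going down.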


The trouble is that it's tricky to test properly without finding a good 
way to corrupt the link count :-)


Regards, Frank.



Re: Delete a directory, crash the system

2013-07-27 Thread David Noel

Cool. Thanks for the patch!


Re: Delete a directory, crash the system

2013-07-27 Thread Frank Leonhardt

On 27/07/2013 20:38, David Noel wrote:


> Cool. Thanks for the patch!


Sorry - forgot to mention that you use it entirely at your own risk!




Re: Delete a directory, crash the system

2013-07-27 Thread cpghost
On 07/27/13 20:57, David Noel wrote:
>> So the system panics in ufs_rmdir(). Maybe the filesystem is
>> corrupt? Have you tried to fsck(8) it manually?
>
> fsck worked, though I had to boot from a USB image because I couldn't
> get into single user.. for some odd reason.
>
>> Even if the filesystem is corrupt, ufs_rmdir() shouldn't
>> panic(), IMHO, but fail gracefully. Hmmm...
>
> Yeah, I was pretty surprised. I think I tried it like 3 times to be
> sure... and yeah, each time... kaboom! Who'd have thought. Do I just
> post this to the mailing list and hope some benevolent developer
> stumbles upon it and takes it upon him/herself to fix this, or where
> do I find the FreeBSD Suggestion Box? I guess I should file a Problem
> Report and see what happens from there.

Maybe you could ask on freebsd-fs@. That's the list where the
filesystem hackers are hanging around.

Basically, from /usr/src/sys/ufs/ufs/ufs_vnops.c:ufs_rmdir():

    if (dp->i_effnlink < 3)
            panic("ufs_dirrem: Bad link count %d on parent", dp->i_effnlink);

    if (!ufs_dirempty(ip, dp->i_number, cnp->cn_cred)) {
            error = ENOTEMPTY;
            goto out;
    }

(...)

Basically, the parent directory has less than 3 entries, but
since 2 entries are mandatory (. and ..), the 3rd entry
that is missing must belong to the directory being removed.

This is inconsistent. And if the parent directory is inconsistent,
other bad things could happen. The kernel errs on the side of
caution, and panic()s instead of silently returning EINVAL.
Actually, this is a sensible thing to do in this context.

A more robust file system would halt all processes, and perform
an in-kernel fsck on the filesystem and its internal (in-memory)
structures to repair the damage... and THEN resume the processes.

However, this is a major project, and we don't have a self-healing
filesystem / kernel (... yet). ;-)

-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/



Re: Delete a directory, crash the system

2013-07-27 Thread Adrian Chadd
Yes. It'd be nice if UFS/FFS would just downgrade things to read-only
and not panic.
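
As a rough illustration of that idea, the following userland model marks a mount read-only instead of panicking. Everything in it is hypothetical (struct fake_mount, model_rmdir); it only shows the intended behaviour and says nothing about how UFS would actually implement a downgrade.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mount state; not the kernel's struct mount. */
struct fake_mount {
	bool read_only;
};

/*
 * Model of "downgrade instead of panic": on seeing an impossible link
 * count, mark the mount read-only and refuse the operation. Later write
 * attempts then fail with EROFS until the filesystem has been repaired.
 */
static int
model_rmdir(struct fake_mount *mp, int parent_nlink)
{
	assert(mp != NULL);
	if (mp->read_only)
		return (EROFS);
	if (parent_nlink < 3) {
		mp->read_only = true;	/* stop further damage */
		return (EROFS);
	}
	return (0);
}
```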



-Adrian


Re: Delete a directory, crash the system

2013-07-27 Thread Polytropon
On Sat, 27 Jul 2013 13:57:31 -0500, David Noel wrote:
>> So the system panics in ufs_rmdir(). Maybe the filesystem is
>> corrupt? Have you tried to fsck(8) it manually?
>
> fsck worked, though I had to boot from a USB image because I couldn't
> get into single user.. for some odd reason.

From your initial description, a _severe_ file system defect
seems to be a reasonable assumption. Make sure fsck is run
in foreground prior to bringing up the system. The option
background_fsck="NO" in /etc/rc.conf will make sure you
won't encounter this problem again (_if_ it was related
to the file system). Always make sure you're booting into
a fsck'ed environment.

You could also use a S.M.A.R.T. analysis tool such as smartmontools
(from ports) to make sure the OS didn't panic because of a
hard disk defect. I'm just mentioning this because I have
sufficient experience in this field. :-)





>> Even if the filesystem is corrupt, ufs_rmdir() shouldn't
>> panic(), IMHO, but fail gracefully. Hmmm...
>
> Yeah, I was pretty surprised. I think I tried it like 3 times to be
> sure... and yeah, each time... kaboom!

It's really surprising that a (comparatively) high-level function
could fail in such a drastic way, but on the other hand, one would
assume that there is a _reason_ for this behaviour.





-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


Re: Delete a directory, crash the system

2013-07-27 Thread Polytropon
And here, kids, you can see the strength of an open source
operating system: You can see _why_ something happens. :-)

On Sat, 27 Jul 2013 20:35:09 +0100, Frank Leonhardt wrote:
> On 27/07/2013 19:57, David Noel wrote:
>>> So the system panics in ufs_rmdir(). Maybe the filesystem is
>>> corrupt? Have you tried to fsck(8) it manually?
>> fsck worked, though I had to boot from a USB image because I couldn't
>> get into single user.. for some odd reason.
>>
>>> Even if the filesystem is corrupt, ufs_rmdir() shouldn't
>>> panic(), IMHO, but fail gracefully. Hmmm...
>> Yeah, I was pretty surprised. I think I tried it like 3 times to be
>> sure... and yeah, each time... kaboom! Who'd have thought. Do I just
>> post this to the mailing list and hope some benevolent developer
>> stumbles upon it and takes it upon him/herself to fix this, or where
>> do I find the FreeBSD Suggestion Box? I guess I should file a Problem
>> Report and see what happens from there.
>
> I was going to raise an issue when the discussion had died down to a
> consensus. I also don't think it's reasonable for the kernel to bomb
> when it encounters corruption on a disk.
>
> If you want to patch it yourself, edit sys/ufs/ufs/ufs_vnops.c at around
> line 2791 and change:
>
>   if (dp->i_effnlink < 3)
>           panic("ufs_dirrem: Bad link count %d on parent",
>               dp->i_effnlink);
>
> To
>
>   if (dp->i_effnlink < 3) {
>           error = EINVAL;
>           goto out;
>   }
>
> The ufs_link() call has a similar issue.
>
> I can't see why my mod will break anything, but there's always
> unintended consequences.

One of the core policies usually is to stop _any_ action that
had failed due to a reason that "cannot be" and make sure it
won't get worse. This can be seen for example in fsck's behaviour:
If there is a massive file system error that cannot be repaired
without further intervention that _could_ destroy data or make
its retrieval harder or impossible, the operator will be requested
to make the decision. There are options to automate this process,
but on the other hand, "always assume 'yes'" can then be a risk,
as it could prevent recovery. My assumption is that the developers
chose a similar approach here: We found a situation that should
not be possible, so we stop the system from messing up the file
system even more. This carries the attitude of not hiding a
problem for the sake of convenience by being silent and going
back to the usual work. Of course it is debatable if this is the
right decision in _this_ particular case.



> By returning "invalid argument", any code above
> it should already be handling that condition although the user will be
> scratching their head wondering what's wrong with it.

By determining the inode number and using the fsdb tool, internal
data about inodes can be examined. Will it also show something
that's basically impossible? :-)



> Returning ENOENT
> or EACCES or ENOTDIR may be better ("No such directory", "Access denied"
> or "Not a valid directory").

Depends on the applicable definition of those errors.



> The trouble is that it's tricky to test properly without finding a good
> way to corrupt the link count :-)

There is a _simple_ way to do this, and I have even mentioned it.
Use the fsdb program and manipulate the inode manually. Make
sure that you actually understand that _what_ you are doing there
is creating severe file system inconsistency errors. :-)





-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


Re: Delete a directory, crash the system

2013-07-27 Thread Polytropon
On Sat, 27 Jul 2013 14:57:07 -0700, Adrian Chadd wrote:
> Yes. It'd be nice if UFS/FFS would just downgrade things to read-only
> and not panic.

That would be possible, but it would confuse programs and users.
It's not as if you could walk up to the disk drive and flip the
write protect switch back... ;-)



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...