Re: I've just had a massive file system crash

2003-01-28 Thread Daniel O'Connor
On Sun, 2003-01-26 at 18:38, Greg Lehey wrote:
 Did you use shutdown -p?  If my hypothesis is correct, it's possible
 to get this result with shutdown -h if you press the power switch as
 soon as the System halted message appears, but normally you'd give
 it a few seconds longer.  With shutdown -p, it's immediate, modulo
 delay.

Not certain if I did, but it's likely.

-- 
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
The nice thing about standards is that there
are so many of them to choose from.
  -- Andrew Tanenbaum
GPG Fingerprint - 9A8C 569F 685A D928 5140  AE4B 319B 41F4 5D17 FDD5


To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-current in the body of the message



Re: I've just had a massive file system crash

2003-01-27 Thread Eugene M. Kim
On Sun, Jan 26, 2003 at 04:08:31PM +0800, Greg Lehey wrote:
 On Sunday, 26 January 2003 at 14:24:02 +1030, Daniel O'Connor wrote:
  On Sun, 2003-01-26 at 08:08, David Schultz wrote:
  Good.  I was referring to IDE in this case, because I assume
  that's what Greg's laptop uses.  The ATA driver flushes the cache
  when the device is closed, but I don't think that happens during
  shutdown.  It probably needs to register a shutdown hook like the
  SCSI driver.  Also, the driver is a bit optimistic about how long
  the flush will take; it times out after 5 seconds, whereas the ATA
  spec says a flush can take up to 30 seconds.
 
  I am wondering if I experienced this problem with my -stable laptop..
 
  I shut it down and then booted it up later to find fsck having a nice
  good chew on the drive (deleting REAMS of files).
 
 Did you use shutdown -p?  If my hypothesis is correct, it's possible
 to get this result with shutdown -h if you press the power switch as
 soon as the System halted message appears, but normally you'd give
 it a few seconds longer.  With shutdown -p, it's immediate, modulo
 delay.

Just a random idea: If that poses an issue, how about this patch?

Eugene

--- src/sys/kern_shutdown.c Sun Jan 26 14:24:56 2003
+++ src/sys/kern_shutdown.c.new Sun Jan 26 14:25:42 2003
@@ -545,7 +545,7 @@
 static void 
 poweroff_wait(void *junk, int howto)
 {
-   if(!(howto & RB_POWEROFF) || poweroff_delay <= 0)
+   if(!(howto & (RB_POWEROFF | RB_HALT)) || poweroff_delay <= 0)
        return;
    DELAY(poweroff_delay * 1000);
 }
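
For readers who want to try the condition outside the kernel, here is a
minimal userspace sketch of what the patch changes. It is not the kernel
code: the flag values are stand-ins for the definitions in sys/reboot.h,
and the function name poweroff_wait_usec and its include_halt switch are
hypothetical, added only to compare the old and new behaviour side by side.

```c
#include <assert.h>

/* Stand-in flag values for illustration; the real ones live in <sys/reboot.h>. */
#define RB_HALT     0x008
#define RB_POWEROFF 0x4000

/*
 * Returns how many microseconds the shutdown path would wait, or 0 if
 * it would not wait at all.  delay_ms plays the role of poweroff_delay
 * (milliseconds).  include_halt selects the patched behaviour, where a
 * plain halt also gets the delay.
 */
static long
poweroff_wait_usec(int howto, int delay_ms, int include_halt)
{
    int mask = RB_POWEROFF;

    /* The two flags must be distinct bits for the mask test to work. */
    assert((RB_HALT & RB_POWEROFF) == 0);

    if (include_halt)
        mask |= RB_HALT;
    if (!(howto & mask) || delay_ms <= 0)
        return 0;
    return (long)delay_ms * 1000;
}
```

With include_halt set to 0, a halt gets no delay; with it set to 1, a halt
waits as long as a poweroff does, which appears to be the intent of the
patch.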



Re: I've just had a massive file system crash

2003-01-27 Thread David Schultz

Thus spake Greg Lehey [EMAIL PROTECTED]:
 I've been thinking about what happened, and I have a possibility: the
 session before shutdown included a lot of writing to that file system,
 and I did a shutdown -p.  It's possible that the shutdown powered off
 the system before the disk had flushed its cache.  For the moment I'm
 avoiding shutdown -p, but when I get home I'll try to provoke it
 again.

Just a heads up: Soeren tells me he will commit a fix for this in
his next ATA meta-commit.  I have patches if wanted.

I still can't figure out why the problem would trash your entire
home directory, though.  Even if the disk reordered writes and
failed to write some sectors, directory entries that were not
being actively modified shouldn't have become corrupted, as far as
I know.  (Maybe your disk does track-at-once writes and just
happened to be flushing the last few sectors from its cache when
the power was cut.)  Perhaps someone could ask Kirk, although it
may take an actual hosed filesystem to diagnose what happened.




Re: I've just had a massive file system crash

2003-01-27 Thread Rahul Siddharthan
David Schultz wrote:
 I still can't figure out why the problem would trash your entire   
 home directory, though.  Even if the disk reordered writes and
 failed to write some sectors, directory entries that were not
 being actively modified shouldn't have become corrupted, as far as
 I know.

Something similar happened to me in 4-STABLE several months ago.
After a panic/crash (caused by an unstable USB audio driver) the
automatic fsck failed.  This happened twice; the second time my
filesystem was totally messed up, and after fsck did its thing,
several files were missing, including files in /usr/bin and /usr/sbin
that had not been touched for many weeks (i.e. since the last
installworld).  The damage wasn't as extensive as Greg reports, and my
home directory was spared, but I had to reinstall the base system to
get things working smoothly again.

I then turned off write caching on the IDE drive.  Afterwards I had
several such crashes (caused by the same driver) but never again had
filesystem damage -- automatic fsck always worked.  Nevertheless, as
you say, it's strange that files which had not been touched went missing.

- Rahul




Re: I've just had a massive file system crash

2003-01-26 Thread Greg Lehey
On Sunday, 26 January 2003 at 14:24:02 +1030, Daniel O'Connor wrote:
 On Sun, 2003-01-26 at 08:08, David Schultz wrote:
 Good.  I was referring to IDE in this case, because I assume
 that's what Greg's laptop uses.  The ATA driver flushes the cache
 when the device is closed, but I don't think that happens during
 shutdown.  It probably needs to register a shutdown hook like the
 SCSI driver.  Also, the driver is a bit optimistic about how long
 the flush will take; it times out after 5 seconds, whereas the ATA
 spec says a flush can take up to 30 seconds.

 I am wondering if I experienced this problem with my -stable laptop..

 I shut it down and then booted it up later to find fsck having a nice
 good chew on the drive (deleting REAMS of files).

Did you use shutdown -p?  If my hypothesis is correct, it's possible
to get this result with shutdown -h if you press the power switch as
soon as the System halted message appears, but normally you'd give
it a few seconds longer.  With shutdown -p, it's immediate, modulo
delay.

Greg
--
See complete headers for address and phone numbers




Re: I've just had a massive file system crash

2003-01-25 Thread Brooks Davis
On Fri, Jan 24, 2003 at 11:03:52PM -0800, David Schultz wrote:
 FreeBSD's ``fix'' for this problem is the same as Windows 98's.
 Specifically, there is a 5-second delay (tuneable:
 kern.shutdown.poweroff_delay) after all buffers are flushed but
 before the power is cut.  Maybe we ought to be sending FLUSH
 CACHE commands to all drives and waiting for them to finish.

I've heard it is longer than 5 sec on more recent systems like 2000 or XP.
I even heard one claim that some shops were using 30 sec internally.

-- Brooks

-- 
Any statement of the form X is the one, true Y is FALSE.
PGP fingerprint 655D 519C 26A7 82E7 2529  9BF0 5D8E 8BE9 F238 1AD4





Re: I've just had a massive file system crash

2003-01-25 Thread Nate Lawson
On Fri, 24 Jan 2003, David Schultz wrote:
 Thus spake Greg Lehey [EMAIL PROTECTED]:
  I've been thinking about what happened, and I have a possibility: the
  session before shutdown included a lot of writing to that file system,
  and I did a shutdown -p.  It's possible that the shutdown powered off
  the system before the disk had flushed its cache.  For the moment I'm
  avoiding shutdown -p, but when I get home I'll try to provoke it
  again.
 
 FreeBSD's ``fix'' for this problem is the same as Windows 98's.
 Specifically, there is a 5-second delay (tuneable:
 kern.shutdown.poweroff_delay) after all buffers are flushed but
 before the power is cut.  Maybe we ought to be sending FLUSH
 CACHE commands to all drives and waiting for them to finish.

da(4) does a SYNC CACHE (see daclose() and dashutdown()).

-Nate





Re: I've just had a massive file system crash

2003-01-25 Thread David Schultz
Thus spake Nate Lawson [EMAIL PROTECTED]:
 On Fri, 24 Jan 2003, David Schultz wrote:
  Thus spake Greg Lehey [EMAIL PROTECTED]:
   I've been thinking about what happened, and I have a possibility: the
   session before shutdown included a lot of writing to that file system,
   and I did a shutdown -p.  It's possible that the shutdown powered off
   the system before the disk had flushed its cache.  For the moment I'm
   avoiding shutdown -p, but when I get home I'll try to provoke it
   again.
  
  FreeBSD's ``fix'' for this problem is the same as Windows 98's.
  Specifically, there is a 5-second delay (tuneable:
  kern.shutdown.poweroff_delay) after all buffers are flushed but
  before the power is cut.  Maybe we ought to be sending FLUSH
  CACHE commands to all drives and waiting for them to finish.
 
 da(4) does a SYNC CACHE (see daclose() and dashutdown()).

Good.  I was referring to IDE in this case, because I assume
that's what Greg's laptop uses.  The ATA driver flushes the cache
when the device is closed, but I don't think that happens during
shutdown.  It probably needs to register a shutdown hook like the
SCSI driver.  Also, the driver is a bit optimistic about how long
the flush will take; it times out after 5 seconds, whereas the ATA
spec says a flush can take up to 30 seconds.
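
To make the timeout point concrete, here is a small userspace simulation.
It is only a sketch and does not touch the ATA driver: struct fake_drive,
fake_drive_tick and wait_for_flush are hypothetical names, and time is
simulated rather than measured.  It shows how a flush that needs longer
than the wait budget is reported as a timeout even though the drive would
have finished given the 30 seconds the spec allows.

```c
#include <assert.h>

/* Hypothetical simulated drive: milliseconds of flush work still pending. */
struct fake_drive {
    int flush_ms_remaining;
};

/* Advance simulated time by step_ms, never going below zero. */
static void
fake_drive_tick(struct fake_drive *d, int step_ms)
{
    d->flush_ms_remaining -= step_ms;
    if (d->flush_ms_remaining < 0)
        d->flush_ms_remaining = 0;
}

/*
 * Poll the drive every poll_ms until its cache is clean or timeout_ms
 * of simulated time has passed.  Returns 0 on completion, -1 on timeout.
 */
static int
wait_for_flush(struct fake_drive *d, int timeout_ms, int poll_ms)
{
    int elapsed = 0;

    assert(poll_ms > 0);
    for (;;) {
        if (d->flush_ms_remaining == 0)
            return 0;
        if (elapsed >= timeout_ms)
            return -1;
        fake_drive_tick(d, poll_ms);
        elapsed += poll_ms;
    }
}
```

A simulated drive that needs 12 seconds fails with a 5000 ms budget and
succeeds with a 30000 ms one.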




Re: I've just had a massive file system crash

2003-01-25 Thread Daniel O'Connor
On Sun, 2003-01-26 at 08:08, David Schultz wrote:
 Good.  I was referring to IDE in this case, because I assume
 that's what Greg's laptop uses.  The ATA driver flushes the cache
 when the device is closed, but I don't think that happens during
 shutdown.  It probably needs to register a shutdown hook like the
 SCSI driver.  Also, the driver is a bit optimistic about how long
 the flush will take; it times out after 5 seconds, whereas the ATA
 spec says a flush can take up to 30 seconds.

I am wondering if I experienced this problem with my -stable laptop..

I shut it down and then booted it up later to find fsck having a nice
good chew on the drive (deleting REAMS of files).

I stopped it and then ripped it out of the lappy and mounted it read
only to recover most of my files.

Lots of things in /etc got toasted, and it was rather annoying to
recover from :(

-- 
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
The nice thing about standards is that there
are so many of them to choose from.
  -- Andrew Tanenbaum
GPG Fingerprint - 9A8C 569F 685A D928 5140  AE4B 319B 41F4 5D17 FDD5





I've just had a massive file system crash

2003-01-24 Thread Greg Lehey
I'm rather astounded.  I'm currently at a Linux conference, and have
of course been boasting about the stability of ufs, and today I had a
crash which tore apart my /home file system.

This is on a laptop, one which has been running -CURRENT for years
with no trouble.  At the moment it's running 5.0-RELEASE.  Today I
shut it down cleanly, and a couple of hours later rebooted it.  It has
three file systems, one of which came up dirty.  fsck -y reported
thousands of errors, and when it was finished, my home directory and
some other files were gone, and all the subdirectories of my home
directory were in lost+found, a total of 1.4 GB.  Most of the errors
appear to be duplicate Inode numbers.

Obviously it's too late to work out what happened, but I thought it's
worth mentioning in case somebody else is having the same trouble.

Greg
--
See complete headers for address and phone numbers




Re: I've just had a massive file system crash

2003-01-24 Thread Thomas David Rivers
Greg Lehey [EMAIL PROTECTED] wrote:
 It has
 three file systems, one of which came up dirty.  fsck -y reported
 thousands of errors, and when it was finished, my home directory and
 some other files were gone, and all the subdirectories of my home
 directory were in lost+found, a total of 1.4 GB.  Most of the errors
 appear to be duplicate Inode numbers.
 

 Don't be too hasty to blame UFS.

 Every time this has happened to me (even on Linux) it has been
 because the disk drive was failing.  It has happened to me
 *many* times with IDE drives.   I wind up replacing about 1/4
 of them every year, on average.   But, I did go through
 a run of those bad IBM drives :-)

 Did you happen to drop the laptop? :-)

- Dave Rivers -

--
[EMAIL PROTECTED]    Work: (919) 676-0847
Get your mainframe programming tools at http://www.dignus.com






Re: I've just had a massive file system crash

2003-01-24 Thread Robert Watson
Next time you run fsck -y in this scenario, log the output to an md
partition and stick it somewhere for analysis.  At least, that was the
moral of the story last time I hosed a box in this form (incidentally, I
think it ended up being a failing hard disk).

Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
[EMAIL PROTECTED]  Network Associates Laboratories

On Fri, 24 Jan 2003, Greg Lehey wrote:

 I'm rather astounded.  I'm currently at a Linux conference, and have
 of course been boasting about the stability of ufs, and today I had a
 crash which tore apart my /home file system.
 
 This is on a laptop, one which has been running -CURRENT for years
 with no trouble.  At the moment it's running 5.0-RELEASE.  Today I
 shut it down cleanly, and a couple of hours later rebooted it.  It has
 three file systems, one of which came up dirty.  fsck -y reported
 thousands of errors, and when it was finished, my home directory and
 some other files were gone, and all the subdirectories of my home
 directory were in lost+found, a total of 1.4 GB.  Most of the errors
 appear to be duplicate Inode numbers.
 
 Obviously it's too late to work out what happened, but I thought it's
 worth mentioning in case somebody else is having the same trouble.
 
 Greg
 --
 See complete headers for address and phone numbers
 
 To Unsubscribe: send mail to [EMAIL PROTECTED]
 with unsubscribe freebsd-current in the body of the message
 





RE: I've just had a massive file system crash

2003-01-24 Thread Jaime Bozza
 three file systems, one of which came up dirty.  fsck -y reported
 thousands of errors, and when it was finished, my home directory and
 some other files were gone, and all the subdirectories of my home

This may or may not have anything to do with it, but back in September I
had a problem with a couple of filesystems that showed the error below
(running on RELENG_4 that was very recent at the time):

CG 22: BAD MAGIC NUMBER

fsck -y gave thousands of errors (similar to what you had) and when it was
done, nothing was on the filesystem.  (I didn't think to check lost+found at
the time, just restored the filesystem, so the files may have been placed in
there)

During the space of 2 days, I had a total of 3 of these on two different
systems.  Forcing a mount (without cleaning) on the other two showed a
perfect filesystem (which I backed up, newfs'd and restored).  I even
compared one of these with a backup and there wasn't a single thing
different.  It sort of baffled me at the time, since one of those
filesystems didn't have any writing (other than atime perhaps) and still had
the error.

I haven't had a problem since then, and I know there are quite a few
changes between 4 and 5, but it really does sound similar.  At least the
fsck part sounds almost exactly the same.


Jaime Bozza






Re: I've just had a massive file system crash

2003-01-24 Thread Greg Lehey
On Friday, 24 January 2003 at 20:34:24 +1000, Andy Farkas wrote:

 I'm rather astounded.  I'm currently at a Linux conference, and have
 of course been boasting about the stability of ufs, and today I had a
 crash which tore apart my /home file system.

 This is on a laptop, one which has been running -CURRENT for years
 with no trouble.  At the moment it's running 5.0-RELEASE.  Today I
 shut it down cleanly, and a couple of hours later rebooted it.  It has
 three file systems, one of which came up dirty.  fsck -y reported
 thousands of errors, and when it was finished, my home directory and
 some other files were gone, and all the subdirectories of my home
 directory were in lost+found, a total of 1.4 GB.  Most of the errors
 appear to be duplicate Inode numbers.

 Obviously it's too late to work out what happened, but I thought it's
 worth mentioning in case somebody else is having the same trouble.

 I can only think that your disk is going bad.

That was one of my thoughts too.

 Try a dd if=/dev/ad0 of=/dev/null and see if you get any read
 errors.

Nope, runs fine.  It also doesn't explain why it happened at startup
time.
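
For anyone who wants the same kind of check from a program rather than
dd, here is a rough sketch of a sequential read test.  The function name
scan_fd is hypothetical; it just reads a descriptor to end of file and
stops at the first real read error, much as dd does without conv=noerror.
As noted above, a clean pass only shows the data is readable now; it does
not say whether the data is what was written.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read fd sequentially until end of file.
 * Stores the number of bytes read in *bytes_read.
 * Returns 0 on a clean end of file, -1 on the first read error
 * (errno is left as set by read(2)).
 */
static int
scan_fd(int fd, long long *bytes_read)
{
    char buf[64 * 1024];
    ssize_t n;

    assert(bytes_read != NULL);
    *bytes_read = 0;
    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            *bytes_read += n;
            continue;
        }
        if (n == 0)
            return 0;       /* end of file or device */
        if (errno == EINTR)
            continue;       /* interrupted by a signal, retry */
        return -1;          /* genuine read error */
    }
}
```

To use it on a disk, open the device read-only (for example the /dev/ad0
from the dd command above) and pass the descriptor in.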


On Friday, 24 January 2003 at  6:53:41 -0500, Thomas David Rivers wrote:

  Don't be too hasty to blame UFS.

I'm not.  I've just reported what happened, in case others see it.

On Friday, 24 January 2003 at 11:06:26 -0500, Robert Watson wrote:
 Next time you run fsck -y in this scenario, log the output to an md
 partition and stick it somewhere for analysis.  At least, that was the
 moral of the story last time I hosed a box in this form (incidentally, I
 think it ended up being a failing hard disk).

Yes, if you know it's going to happen.  I could easily have written it
to /var/tmp, which was mounted.  I just wasn't expecting anything like
this to happen.  I've been using UFS on a daily basis for over 10
years, and this is the first time this has happened to me.

I've been thinking about what happened, and I have a possibility: the
session before shutdown included a lot of writing to that file system,
and I did a shutdown -p.  It's possible that the shutdown powered off
the system before the disk had flushed its cache.  For the moment I'm
avoiding shutdown -p, but when I get home I'll try to provoke it
again.

Greg
--
See complete headers for address and phone numbers




Re: I've just had a massive file system crash

2003-01-24 Thread David Schultz
Thus spake Greg Lehey [EMAIL PROTECTED]:
 I've been thinking about what happened, and I have a possibility: the
 session before shutdown included a lot of writing to that file system,
 and I did a shutdown -p.  It's possible that the shutdown powered off
 the system before the disk had flushed its cache.  For the moment I'm
 avoiding shutdown -p, but when I get home I'll try to provoke it
 again.

FreeBSD's ``fix'' for this problem is the same as Windows 98's.
Specifically, there is a 5-second delay (tuneable:
kern.shutdown.poweroff_delay) after all buffers are flushed but
before the power is cut.  Maybe we ought to be sending FLUSH
CACHE commands to all drives and waiting for them to finish.
