Re: [zfs-discuss] Can't remove corrupt file

2006-08-09 Thread Eric Lowe

Eric Schrock wrote:

Well the fact that it's a level 2 indirect block indicates why it can't
simply be removed.  We don't know what data it refers to, so we can't
free the associated blocks.  The panic on move is quite interesting -
after BFU give it another shot and file a bug if it still happens.


I'm still seeing the panic (build 42) when trying to 'mv' the file with
corrupt indirect blocks. The problem looks like 6424466 and 6440780; the
panic string is "data after EOF". Email me offline if you would like to
collect the core from my system.


- Eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can't remove corrupt file

2006-08-09 Thread Mark Maybee

Eric Lowe wrote:

Eric Schrock wrote:


Well the fact that it's a level 2 indirect block indicates why it can't
simply be removed.  We don't know what data it refers to, so we can't
free the associated blocks.  The panic on move is quite interesting -
after BFU give it another shot and file a bug if it still happens.



I'm still seeing the panic (build 42) when trying to 'mv' the file with
corrupt indirect blocks. The problem looks like 6424466 and 6440780; the
panic string is "data after EOF". Email me offline if you would like to
collect the core from my system.


- Eric


Yup, this is a duplicate of 6424466 (6440780 is also probably a dup of
6424466).  You are seeing this panic on a 'mv' because of some old debug
code in dnode_sync() scanning the dnode contents.  The "data after EOF"
message is bogus; the real problem is your data corruption.  Anyway,
this is not going to go away until I put back a fix for 6424466.  Sorry
about that.

-Mark


Re[2]: [zfs-discuss] Can't remove corrupt file

2006-07-21 Thread Robert Milkowski
Hello Bill,

Friday, July 21, 2006, 7:31:25 AM, you wrote:

BM> On Thu, Jul 20, 2006 at 03:45:54PM -0700, Jeff Bonwick wrote:
> > However, we do have the advantage of always knowing when something
> > is corrupted, and knowing what that particular block should have been.
>
> We also have ditto blocks for all metadata, so that even if any block
> of ZFS metadata is destroyed, we always have another copy.
> Bill Moore describes ditto blocks in detail here:
>
> http://blogs.sun.com/roller/page/bill?entry=ditto_blocks_the_amazing_tape

BM> Right.  And I should point out that if Eric had been running build 38 or
BM> later, this data corruption would not have happened - it would have been
BM> automatically repaired using ditto blocks (the bad block was a L2
BM> indirect block - of which there would have been 2 copies).
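
The repair Bill describes can be sketched as a toy model. This is not ZFS code: the function name `read_with_ditto`, the in-memory `copies` list standing in for on-disk copies, and the use of SHA-256 as the checksum are all assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum kept alongside the pointer to a block (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def read_with_ditto(copies: list, expected: str) -> bytes:
    """Return the first copy that matches `expected` and repair the others.

    `copies` is a mutable list of byte strings, a hypothetical stand-in for
    the ditto copies of one metadata block.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # No intact copy is left: the unrepairable case discussed in this thread.
        raise IOError("all copies failed checksum verification")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # rewrite the damaged copy from the good one
    return good


block = b"L2 indirect block"
copies = [b"garbage", block]  # the first copy is corrupt
data = read_with_ditto(copies, checksum(block))
```

With a single copy (as on builds before ditto blocks, per Bill's note) the same read would raise instead of healing, which matches the EIO seen earlier in the thread.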

However, something may be broken there: on two different servers (v240,
T2000) I see CKSUM errors for ditto blocks on a daily basis, and it's
hard to believe that I have a hardware problem that hits only metadata
blocks. More at:

http://www.opensolaris.org/jive/thread.jspa?threadID=9846&tstart=0

-- 
Best regards,
 Robert  mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] Can't remove corrupt file

2006-07-21 Thread Gregory Shaw
After reading the ditto blocks blog (good article, btw), an idea occurred to me:

Since we use ditto blocks to preserve critical filesystem data, would it be practical to add a filesystem property that would cause all files in a filesystem to be stored as mirrored blocks?

That would allow a dual-copy behavior selectable on a filesystem boundary even in a vdev pool.

That could be handy for those that have a little bit of critical data and a lot of not-so-critical data.

On Jul 20, 2006, at 4:45 PM, Jeff Bonwick wrote:

> However, we do have the advantage of always knowing when something
> is corrupted, and knowing what that particular block should have been.
>
> We also have ditto blocks for all metadata, so that even if any block
> of ZFS metadata is destroyed, we always have another copy.
> Bill Moore describes ditto blocks in detail here:
>
> http://blogs.sun.com/roller/page/bill?entry=ditto_blocks_the_amazing_tape
>
> Jeff

-Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382              [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382                 [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds


Re[2]: [zfs-discuss] Can't remove corrupt file

2006-07-21 Thread Robert Milkowski

Hello Gregory,

Friday, July 21, 2006, 3:22:17 PM, you wrote:

> After reading the ditto blocks blog (good article, btw), an idea occurred to me:
>
> Since we use ditto blocks to preserve critical filesystem data, would it be practical to add a filesystem property that would cause all files in a filesystem to be stored as mirrored blocks?
>
> That would allow a dual-copy behavior selectable on a filesystem boundary even in a vdev pool.
>
> That could be handy for those that have a little bit of critical data and a lot of not-so-critical data.

IIRC that's already planned.



--
Best regards,
Robert  mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com





Re: [zfs-discuss] Can't remove corrupt file

2006-07-21 Thread Bill Moore
On Fri, Jul 21, 2006 at 07:22:17AM -0600, Gregory Shaw wrote:
> After reading the ditto blocks blog (good article, btw), an idea
> occurred to me:
>
> Since we use ditto blocks to preserve critical filesystem data, would
> it be practical to add a filesystem property that would cause all
> files in a filesystem to be stored as mirrored blocks?

Yep, that's the plan.  I even mention it in the blog.  :)


--Bill


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Eric Schrock
What does 'zpool status -v' show?  This sounds like you have corruption
in the dnode (a.k.a. metadata).  This corruption is unrepairable at the
moment, since we have no way of knowing the extent of the blocks that
this dnode may be referencing.  You should be able to move this file
aside, however.

- Eric

On Wed, Jul 19, 2006 at 01:27:23PM -0500, Eric Lowe wrote:
> I had a checksum error occur in a file. Since only one file is corrupt
> (and it's a link library at that) I don't want to blow away the whole pool
> to remove the corrupt file. However, I can't figure out any way to unlink
> the file. Using rm to try to unlink the file I get EIO:
>
> % rm llib-lip.ln
> rm: llib-lip.ln not removed: I/O error
>
> Trying to truncate it is also no dice:
> % cat llib-lip.ln
> llib-lip.ln: I/O error
>
> What are the expected paths for recovery here?
>
> I took a look at:
> http://www.sun.com/msg/ZFS-8000-8A
>
> That page isn't helpful since it just says to restore the file. Well,
> you can't restore a file if you can't cleanup the old corrupted one!
>
> (Also BTW that page has a typo, you might want to get the typo fixed, I
> didn't know where the doc bugs should go for those messages)
>
> - Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Eric Lowe

Eric Schrock wrote:

What does 'zpool status -v' show?  This sounds like you have corruption


# zpool status -v
  pool: junk
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        junk        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: The following persistent errors have been detected:

          DATASET  OBJECT  RANGE
          27       4a2e5   lvl=2 blkid=0
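
The persistent-error line above can be pulled apart mechanically. The sketch below is only an illustration: `parse_error_line` is a hypothetical helper, and it assumes that the DATASET and OBJECT columns are hexadecimal (which `4a2e5` suggests) and that `lvl` and `blkid` are decimal.

```python
import re

# One row of the "persistent errors" table: DATASET OBJECT lvl=N blkid=M
ERROR_LINE = re.compile(
    r"^\s*(?P<dataset>[0-9a-fA-F]+)\s+(?P<object>[0-9a-fA-F]+)\s+"
    r"lvl=(?P<lvl>\d+)\s+blkid=(?P<blkid>\d+)\s*$"
)


def parse_error_line(line: str) -> dict:
    """Split one error row into integers (dataset/object assumed to be hex)."""
    m = ERROR_LINE.match(line)
    if m is None:
        raise ValueError(f"not a persistent-error row: {line!r}")
    return {
        "dataset": int(m["dataset"], 16),
        "object": int(m["object"], 16),
        "level": int(m["lvl"]),
        "blkid": int(m["blkid"]),
    }


entry = parse_error_line("  27   4a2e5   lvl=2 blkid=0")
# entry == {'dataset': 39, 'object': 303845, 'level': 2, 'blkid': 0}
```

If the object number corresponds to the file's inode number, the decimal value is what one would compare against `ls -i` output to confirm which file is affected; `lvl=2` is the level-2 indirect block discussed in the replies.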


in the dnode (a.k.a. metadata).  This corruption is unrepairable at the
moment, since we have no way of knowing the extent of the blocks that
this dnode may be referencing.  You should be able to move this file
aside, however.


Trying to move it panic'd my machine.

However I am running build 36 (big disclaimer). It's time for a BFU. ;)

- Eric


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Eric Schrock
Well the fact that it's a level 2 indirect block indicates why it can't
simply be removed.  We don't know what data it refers to, so we can't
free the associated blocks.  The panic on move is quite interesting -
after BFU give it another shot and file a bug if it still happens.
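
Why an unreadable indirect block blocks removal can be shown with a small model. This is a sketch under stated assumptions, not the ZFS on-disk format: the `Block` class, its `readable` flag, and `collect_blocks_to_free` are hypothetical names, and a real file's block tree is far wider.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    addr: int                      # where the block lives on disk
    level: int                     # 0 = data, 1+ = indirect
    children: List["Block"] = field(default_factory=list)
    readable: bool = True          # False models a failed checksum


def collect_blocks_to_free(root: Block) -> List[int]:
    """List every block address that removing the file would have to free."""
    if root.level > 0 and not root.readable:
        # The child pointers live inside this block; without them the
        # addresses of everything below it are unknown.
        raise IOError(f"cannot read lvl={root.level} block at {root.addr}")
    freed = [root.addr]
    for child in root.children:
        freed.extend(collect_blocks_to_free(child))
    return freed


def make_tree(l2_readable: bool) -> Block:
    l1a = Block(10, 1, [Block(100, 0), Block(101, 0)])
    l1b = Block(11, 1, [Block(102, 0), Block(103, 0)])
    return Block(1, 2, [l1a, l1b], readable=l2_readable)


ok = collect_blocks_to_free(make_tree(True))
# ok == [1, 10, 100, 101, 11, 102, 103]
```

Calling `collect_blocks_to_free(make_tree(False))` raises, which mirrors the EIO from `rm`: the level-2 block's own address is known, but the blocks it points to are not.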

- Eric

On Thu, Jul 20, 2006 at 02:28:38PM -0500, Eric Lowe wrote:
> Eric Schrock wrote:
> > What does 'zpool status -v' show?  This sounds like you have corruption
>
> # zpool status -v
>   pool: junk
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         junk        ONLINE       0     0     0
>           raidz     ONLINE       0     0     0
>             c0d0    ONLINE       0     0     0
>             c1d0    ONLINE       0     0     0
>             c1d1    ONLINE       0     0     0
>
> errors: The following persistent errors have been detected:
>
>           DATASET  OBJECT  RANGE
>           27       4a2e5   lvl=2 blkid=0
>
> > in the dnode (a.k.a. metadata).  This corruption is unrepairable at the
> > moment, since we have no way of knowing the extent of the blocks that
> > this dnode may be referencing.  You should be able to move this file
> > aside, however.
>
> Trying to move it panic'd my machine.
>
> However I am running build 36 (big disclaimer). It's time for a BFU. ;)
>
> - Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Darren Dunham
> Well the fact that it's a level 2 indirect block indicates why it can't
> simply be removed.  We don't know what data it refers to, so we can't
> free the associated blocks.  The panic on move is quite interesting -
> after BFU give it another shot and file a bug if it still happens.

What's the long term solution for this type of corruption?  Will there
be a 'fsck'-like utility that can find all valid items and make sure
they're connected properly, or is something else possible?

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOS  http://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Al Hopper
On Thu, 20 Jul 2006, Darren Dunham wrote:

> > Well the fact that it's a level 2 indirect block indicates why it can't
> > simply be removed.  We don't know what data it refers to, so we can't
> > free the associated blocks.  The panic on move is quite interesting -
> > after BFU give it another shot and file a bug if it still happens.
>
> What's the long term solution for this type of corruption?  Will there
> be a 'fsck'-like utility that can find all valid items and make sure
> they're connected properly, or is something else possible?

This is deja vu and positively scary.  In the bad old days, when we were
cursed with hierarchical databases[1], one ran the DB fsck equivalent.
And sometimes it worked; and sometimes it didn't.  And sometimes there
were bugs in the hierarchical fsck/repair utility that could turn your
minor DB issue into a totally trashed DB!  :(

We still have hierarchical databases of course - but today's software
technology and practices have made them far less vulnerable to nasty bugs.
Try running the test suite for the SleepyCat DB sometime when you have a
machine you want to exercise...

ZFS has a very reasonable tree-like data structure - but ... the memory
of hierarchical DB fsck-like utilities really scares me...  There has to be
a better way.

[1] and there were few alternatives.

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Eric Schrock
Note that there are two common reasons to have a fsck-like utility -

1. Detect corruption
2. Repair corruption

For the first, we have scrubbing (and eventually background scrubbing),
so fsck is pointless for that purpose in the ZFS world.  For the latter,
the kinds of things fsck repairs are known pathologies endemic to the
underlying filesystem.  For example, it knows how to reconnect inodes if
you were in the middle of adding the corresponding directory entry, how
to fix up the global inode table, etc.
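
The detection half can be pictured as a pass over every block that compares its data with a stored checksum. The following is a minimal sketch, assuming a hypothetical in-memory `pool` dictionary and using CRC-32 purely as a stand-in checksum; it is not how the scrub code is implemented.

```python
import zlib


def scrub(blocks: dict) -> list:
    """Return the names of blocks whose data no longer matches its checksum.

    `blocks` maps a block name to a (data, stored_checksum) pair.
    """
    return sorted(
        name
        for name, (data, stored) in blocks.items()
        if zlib.crc32(data) != stored
    )


pool = {
    "a": (b"hello", zlib.crc32(b"hello")),
    "b": (b"w0rld", zlib.crc32(b"world")),  # simulated on-disk corruption
}
damaged = scrub(pool)  # ["b"]
```

The pass only reports damage; deciding what to do with a damaged block is the separate repair question.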

For the type of corruption we're talking about, there is no repair
procedure, period.   We cannot deal with arbitrary corruption any more
than other filesystems.  However, we do have the advantage of always
knowing when something is corrupted, and knowing what that particular
block should have been.  The best we can hope for is a) to identify
orphaned blocks resulting from corruption and b) provide a way to
move/free these files so they don't permanently pollute the filesystem
namespace.

- Eric

On Thu, Jul 20, 2006 at 03:34:07PM -0500, Al Hopper wrote:
> On Thu, 20 Jul 2006, Darren Dunham wrote:
>
> > > Well the fact that it's a level 2 indirect block indicates why it can't
> > > simply be removed.  We don't know what data it refers to, so we can't
> > > free the associated blocks.  The panic on move is quite interesting -
> > > after BFU give it another shot and file a bug if it still happens.
>
> > What's the long term solution for this type of corruption?  Will there
> > be a 'fsck'-like utility that can find all valid items and make sure
> > they're connected properly, or is something else possible?
>
> This is deja vu and positively scary.  In the bad old days, when we were
> cursed with hierarchical databases[1], one ran the DB fsck equivalent.
> And sometimes it worked; and sometimes it didn't.  And sometimes there
> were bugs in the hierarchical fsck/repair utility that could turn your
> minor DB issue into a totally trashed DB!  :(
>
> We still have hierarchical databases of course - but today's software
> technology and practices have made them far less vulnerable to nasty bugs.
> Try running the test suite for the SleepyCat DB sometime when you have a
> machine you want to exercise...
>
> ZFS has a very reasonable tree-like data structure - but ... the memory
> of hierarchical DB fsck-like utilities really scares me...  There has to be
> a better way.
>
> [1] and there were few alternatives.
>
> Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
>            Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
> OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
> OpenSolaris Governing Board (OGB) Member - Feb 2006

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock


Re: [zfs-discuss] Can't remove corrupt file

2006-07-20 Thread Darren Dunham
> Basically, the first step is to identify the file in question so the
> user knows what's been lost.  The second step is a way to move these
> blocks into purgatory, where they won't take up filesystem namespace,
> but still account for used space.  The final step is to actually delete
> the blocks and then do a garbage-collection type of operation to find
> which blocks are no longer referenced.

The GC operation is what I was referring to by a 'fsck'-like utility.
(not that it has to be a stand-alone utility in the way fsck is).
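
A garbage-collection pass of this kind is essentially mark-and-sweep. The sketch below is a toy illustration only: `find_orphans`, the `refs` map, and the integer block IDs are hypothetical, and it deliberately ignores snapshots and space accounting, which the thread identifies as the hard part.

```python
def find_orphans(allocated: set, roots: list, refs: dict) -> set:
    """Mark every block reachable from `roots`; return allocated blocks never marked.

    `refs` maps a block ID to the IDs it points at (missing key = no children).
    """
    marked = set()
    stack = list(roots)
    while stack:
        blk = stack.pop()
        if blk in marked:
            continue
        marked.add(blk)
        stack.extend(refs.get(blk, ()))
    return allocated - marked


allocated = {1, 2, 3, 4, 5}
refs = {1: [2, 3]}  # blocks 4 and 5 lost their parent to corruption
orphans = find_orphans(allocated, [1], refs)  # {4, 5}
```

Blocks that are allocated but unreachable are the candidates for freeing once the damaged file has been moved out of the namespace.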

> This is a hugely complicated task, as dealing with snapshots and DMU
> accounting is going to be horrific, if possible at all.  It is, however,
> on our (rather long) list of things to tackle.

Understood.  I was really just interested in learning if this was a "we
wanted to get other stuff out the door first" or a "we're not sure it's
possible" type problem.

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOS  http://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 


[zfs-discuss] Can't remove corrupt file

2006-07-19 Thread Eric Lowe
I had a checksum error occur in a file. Since only one file is corrupt 
(and it's a link library at that) I don't want to blow away the whole pool 
to remove the corrupt file. However, I can't figure out any way to unlink 
the file. Using rm to try to unlink the file I get EIO:


% rm llib-lip.ln
rm: llib-lip.ln not removed: I/O error

Trying to truncate it is also no dice:
% cat llib-lip.ln
llib-lip.ln: I/O error

What are the expected paths for recovery here?

I took a look at:
http://www.sun.com/msg/ZFS-8000-8A

That page isn't helpful since it just says to restore the file. Well, 
you can't restore a file if you can't cleanup the old corrupted one!


(Also BTW that page has a typo, you might want to get the typo fixed, I 
didn't know where the doc bugs should go for those messages)


- Eric


Re: [zfs-discuss] Can't remove corrupt file

2006-07-19 Thread Tim Haley

On Wed, 19 Jul 2006, Eric Lowe wrote:



(Also BTW that page has a typo, you might want to get the typo fixed, I 
didn't know where the doc bugs should go for those messages)


- Eric


Product: event_registry
Category: events
Sub-Category: msg

-tim