Re: [zfs-discuss] ZFS still crashing after patch

2008-05-06 Thread Robert Milkowski
Hello Richard,

Monday, May 5, 2008, 4:12:23 PM, you wrote:

RE Rustam wrote:
 Hello Robert,
   
 Which would happen if you have a problem with HW and you're getting
 wrong checksums on both sides of your mirrors. Maybe the PSU?

 Try memtest anyway, or SunVTS.
 
 Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest 
 requires too much downtime which I cannot afford right now.
   

RE Sometimes if you read the docs, you can get confused by people who
RE intend to confuse you.  SunVTS does work on a wide variety of
RE hardware, though it may not be supported. To fully understand the
RE perspective, SunVTS is used by Sun in the manufacturing process.
RE It is the set of tests run on hardware before shipping to customers.  It is
RE not intended to be a generic test-whatever-hardware-you-find-laying-around
RE product.

Nevertheless, you can actually persuade it to run on non-Sun HW; it's even in
the manual page, IIRC.

-- 
Best regards,
 Robert Milkowski                            mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Rustam
Hello Robert,
 Which would happen if you have a problem with HW and you're getting
 wrong checksums on both sides of your mirrors. Maybe the PSU?

 Try memtest anyway, or SunVTS.
Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest requires 
too much downtime which I cannot afford right now.

However, I've made some interesting observations and now I can reproduce the
crash. It seems that I have one or more bad checksums on disk, and ZFS crashes
each time it tries to read them. Below are two cases:



Case 1: I got a checksum error that was not striped over the mirrors; this
time it was a checksum for a file, not 0x0. I tried to read the file twice.
The first try returned an I/O error; the second try caused a panic. Here's
the log:




core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:
 
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     2
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     2
            c2d1    ONLINE       0     0     4
            c1d1    ONLINE       0     0     4
 
errors: Permanent errors have been detected in the following files:
 
        box5:<0x0>
        /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# ll /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
-rw-------   1 user group        489 Apr 20  2006
/u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# cat /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
cat: input error on 
/u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file: I/O error

core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:
 
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     4
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     4
            c2d1    ONLINE       0     0     8
            c1d1    ONLINE       0     0     8
 
errors: Permanent errors have been detected in the following files:
 
        box5:<0x0>
        /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file

core# cat /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
(Kernel Panic: BAD TRAP: type=e (#pf Page fault) rp=fe8001112490 
addr=fe80882b7000)
...
(after the system boots back up)
core# rm /u02/domains/somedomain/0/1/5/data/sub1/sub2/1145543794.file
core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:
 
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
 
errors: Permanent errors have been detected in the following files:
 
        box5:<0x0>
        box5:<0x4a049a>
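
(An aside, for anyone digging into those hex object numbers: zdb, although an
unsupported interface whose output varies by build, can sometimes map them
back to an object. The object number is passed in decimal, so 0x4a049a
becomes 4850842 -- a sketch:)

core# zdb -dddd box5 4850842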

core# mdb unix.17 vmcore.17
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp 
ufs ip hook neti sctp arp usba uhci fctl nca lofs zfs random nfs ipc sppp 
crypto ptm ]
> ::status
debugging crash dump vmcore.17 (64-bit) from core
operating system: 5.10 Generic_127128-11 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fe8001112490 
addr=fe80882b7000
dump content: kernel pages only
> ::stack
fletcher_2_native+0x13()
zio_checksum_verify+0x27()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
zio_wait_children_done+0x15()
zio_next_stage+0x65()
zio_vdev_io_assess+0x84()
zio_next_stage+0x65()
vdev_cache_read+0x14c()
vdev_disk_io_start+0x135()
vdev_io_start+0x12()
zio_vdev_io_start+0x7b()
zio_next_stage_async+0xae()
zio_nowait+9()
vdev_mirror_io_start+0xa9()
vdev_io_start+0x12()
zio_vdev_io_start+0x7b()
zio_next_stage_async+0xae()
zio_nowait+9()
vdev_mirror_io_start+0xa9()
zio_vdev_io_start+0x116()
zio_next_stage+0x65()
zio_ready+0xec()
zio_next_stage+0x65()
zio_wait_for_children+0x49()
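
(Other standard mdb dcmds are worth running against the same dump for more
context, e.g. the panic summary and the kernel message buffer; a sketch,
output omitted:)

> ::panicinfo
> ::msgbuf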

Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Richard Elling
Rustam wrote:
 Hello Robert,
   
 Which would happen if you have a problem with HW and you're getting
 wrong checksums on both sides of your mirrors. Maybe the PSU?

 Try memtest anyway, or SunVTS.
 
 Unfortunately, SunVTS doesn't run on non-Sun/OEM hardware. And memtest 
 requires too much downtime which I cannot afford right now.
   

Sometimes if you read the docs, you can get confused by people who
intend to confuse you.  SunVTS does work on a wide variety of
hardware, though it may not be supported. To fully understand the
perspective, SunVTS is used by Sun in the manufacturing process.
It is the set of tests run on hardware before shipping to customers.  It is
not intended to be a generic test-whatever-hardware-you-find-laying-around
product.
 -- richard



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Marcelo Leal
Hello,
 If you believe that the problem can be related to the ZIL code, you can try
disabling it to debug (isolate) the problem. If it is not a fileserver (NFS),
disabling the ZIL should not impact consistency.
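
For example, on Solaris 10 (a sketch; zil_disable is a private tunable, so
treat this purely as a temporary debugging aid):

# at runtime, via mdb (affects filesystems mounted afterwards):
echo zil_disable/W0t1 | mdb -kw

# or persistently, by adding this line to /etc/system and rebooting:
set zfs:zil_disable = 1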

 Leal.
 
 


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Rustam
Hello Leal,

I've already been warned
(http://www.opensolaris.org/jive/message.jspa?messageID=231349) that the ZIL
could be a cause, and I ran tests with zil_disable set. I ran a scrub and the
system crashed after exactly the same period with exactly the same error. The
ZIL is known to cause some problems on writes, while all my problems are with
zio_read and checksum_verify.
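
(For anyone reproducing this, a quick sanity check that the tunable actually
took -- a sketch:)

core# echo zil_disable/D | mdb -k    # prints the current value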

This is an NFS file server, but it crashed even with NFS unshared and
nfs/server disabled. So this is not an NFS problem.

I reduced the panic occurrences by setting zfs_prefetch_disable. This avoids
unnecessary reads and reduces the chance of hitting the bad checksums. So far
I've had 24 hours without a crash, which is much better than a few times a
day. However, I know the bad checksums are still there and I need to fix them
somehow.
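
(For reference, a sketch of how that tunable is set on Solaris 10; like
zil_disable it is a private tunable, so this is a workaround, not a fix:)

core# echo zfs_prefetch_disable/W0t1 | mdb -kw    # takes effect immediately

and, to survive reboots, in /etc/system:

set zfs:zfs_prefetch_disable = 1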

--
Rustam
 
 


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Bob Friesenhahn
On Mon, 5 May 2008, Marcelo Leal wrote:

 Hello, If you believe that the problem can be related to the ZIL code,
 you can try disabling it to debug (isolate) the problem. If it is
 not a fileserver (NFS), disabling the ZIL should not impact
 consistency.

In what way is NFS special when it comes to ZFS consistency?  If NFS 
consistency is lost by disabling the ZIL, then local consistency is 
also lost.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Bob Friesenhahn
On Mon, 5 May 2008, eric kustarz wrote:

 That's not true:
 http://blogs.sun.com/erickustarz/entry/zil_disable

 Perhaps people are using consistency to mean different things here...

Consistency means that fsync() assures that the data will be written 
to disk so no data is lost.  It is not the same thing as no 
corruption.  ZFS will happily lose some data in order to avoid some 
corruption if the system loses power.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread Bob Friesenhahn
On Mon, 5 May 2008, Marcelo Leal wrote:
 I'm calling consistency a coherent local view...
 I think that was one option to debug (if not an NFS server), without
 generating a corrupted filesystem.

In other words, your flight reservation will not be lost if the system 
crashes.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-05 Thread eric kustarz

On May 5, 2008, at 4:43 PM, Bob Friesenhahn wrote:

 On Mon, 5 May 2008, eric kustarz wrote:

 That's not true:
 http://blogs.sun.com/erickustarz/entry/zil_disable

 Perhaps people are using consistency to mean different things  
 here...

 Consistency means that fsync() assures that the data will be written  
 to disk so no data is lost.  It is not the same thing as no  
 corruption.  ZFS will happily lose some data in order to avoid some  
 corruption if the system loses power.

Ok, that makes more sense.  You're talking from the application  
perspective, whereas my blog entry is from the file system's  
perspective (disabling the ZIL does not compromise on-disk consistency).

eric


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-03 Thread Rustam
I don't think this is a hardware issue, though I can't rule it out. I'll try
to explain why.

1. I've replaced all the memory modules, which are the most likely cause of
such a problem.

2. There are many different applications running on that server (Apache,
PostgreSQL, etc.). However, if you look at the four different crash dump stack
traces, you see the same picture:

-- crash dump st1 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()

-- crash dump st2 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

-- crash dump st3 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()

-- crash dump st4 --
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
arc_read+0x3cc()
dbuf_prefetch+0x11d()
dmu_prefetch+0x107()
zfs_readdir+0x408()
fop_readdir+0x34()


All four crash dumps show the problem at zio_read/zio_buf_alloc. Three of
them appeared during metadata prefetch (dmu_prefetch) and one during
scrubbing. I don't think that's a coincidence. IMHO, the checksum errors are
the result of this inconsistency.

I tend to think the problem is in ZFS, and that it exists even in the latest
Solaris version (maybe in OpenSolaris as well).
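
(One more data point worth gathering: FMA logs an ereport for every checksum
failure, including which vdev reported it. A sketch:)

core# fmdump -e | grep zfs.checksum    # one line per checksum ereport
core# fmdump -eV                       # full detail, including the vdev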


 
 Lots of CKSUM errors like you see are often indicative of bad hardware.
 Run memtest for 24-48 hours.
 
 -marc
 
 


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-03 Thread Robert Milkowski
Hello Rustam,

Saturday, May 3, 2008, 9:16:41 AM, you wrote:

R I don't think this is a hardware issue, though I can't rule it out. I'll
try to explain why.

R 1. I've replaced all the memory modules, which are the most likely cause of
such a problem.

R 2. There are many different applications running on that server
R (Apache, PostgreSQL, etc.). However, if you look at the four
R different crash dump stack traces, you see the same picture:

R -- crash dump st1 --
R mutex_enter+0xb()
R zio_buf_alloc+0x1a()
R zio_read+0xba()
R spa_scrub_io_start+0xf1()
R spa_scrub_cb+0x13d()

R -- crash dump st2 --
R mutex_enter+0xb()
R zio_buf_alloc+0x1a()
R zio_read+0xba()
R arc_read+0x3cc()
R dbuf_prefetch+0x11d()
R dmu_prefetch+0x107()
R zfs_readdir+0x408()
R fop_readdir+0x34()

R -- crash dump st3 --
R mutex_enter+0xb()
R zio_buf_alloc+0x1a()
R zio_read+0xba()
R arc_read+0x3cc()
R dbuf_prefetch+0x11d()
R dmu_prefetch+0x107()
R zfs_readdir+0x408()
R fop_readdir+0x34()

R -- crash dump st4 --
R mutex_enter+0xb()
R zio_buf_alloc+0x1a()
R zio_read+0xba()
R arc_read+0x3cc()
R dbuf_prefetch+0x11d()
R dmu_prefetch+0x107()
R zfs_readdir+0x408()
R fop_readdir+0x34()


R All four crash dumps show the problem at zio_read/zio_buf_alloc. Three
R of them appeared during metadata prefetch (dmu_prefetch) and one
R during scrubbing. I don't think that's a coincidence. IMHO,
R the checksum errors are the result of this inconsistency.

Which would happen if you have a problem with HW and you're getting
wrong checksums on both sides of your mirrors. Maybe the PSU?

Try memtest anyway, or SunVTS.



-- 
Best regards,
 Robert Milkowski                            mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-02 Thread Rustam
 Seems kind of old.  I am using Generic_127112-11 here.
 
 Probably many hundreds of nasty bugs have been
 eliminated since the version you are using.

I've updated to the latest available kernel, 127128-11 (from 28 Apr), which
includes a number of fixes to the AHCI SATA driver and to ZFS.
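
(A hedged aside: the running kernel and the installed patch can be confirmed
with the stock tools:)

core# uname -v
Generic_127128-11
core# showrev -p | grep 127128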

Didn't help. Keeps crashing.
The worst thing is that I don't know where the problem is. Any more ideas on
how to find it?
 
 


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-02 Thread Marc Bevand
Rustam rustam at code.az writes:
 
 Didn't help. Keeps crashing.
 The worst thing is that I don't know where the problem is. Any more ideas
 on how to find it?

Lots of CKSUM errors like you see are often indicative of bad hardware. Run
memtest for 24-48 hours.

-marc



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-01 Thread Rustam
Today my production server crashed 4 times. THIS IS A NIGHTMARE!
Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.

I cannot fsck it; there's no such tool.
I cannot scrub it; it crashes 30-40 minutes after the scrub starts.
I cannot use it; it crashes a number of times every day! And with every crash
the number of checksum failures grows:

NAME        STATE     READ WRITE CKSUM
box5        ONLINE       0     0     0
...after a few hours...
box5        ONLINE       0     0     4
...after a few hours...
box5        ONLINE       0     0    62
...after another few hours...
box5        ONLINE       0     0   120
...crash! and we start again...
box5        ONLINE       0     0     0
...etc...

Actually, 120 is the record; sometimes it crashes as soon as it boots.

And there's always a permanent error:
errors: Permanent errors have been detected in the following files:
box5:<0x0>

And the very wise self-healing advice:
http://www.sun.com/msg/ZFS-8000-8A
Restore the file in question if possible.  Otherwise restore the entire pool 
from backup.

Thanks, but if I restore it from backup it won't be ZFS anymore, that's for 
sure.

It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is to wait
for repair (I have 10U4, where this is non-configurable). Then why does it
panic?

Recently there were discussions on the failure of the OpenSolaris community.
Now it's been more than half a month since I reported this error. Nobody even
posted something like RTFM. Come on guys, I know you are there and busy with
enterprise customers... but at least give me some troubleshooting ideas. I'm
totally lost.

Just a reminder: it's a heavily loaded FS with 3-4 million files and folders.

Link to original post:
http://www.opensolaris.org/jive/thread.jspa?threadID=57425
 
 


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-01 Thread Bob Friesenhahn
On Thu, 1 May 2008, Rustam wrote:

 Today my production server crashed 4 times. THIS IS A NIGHTMARE!
 Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.

 I cannot fsck it; there's no such tool.
 I cannot scrub it; it crashes 30-40 minutes after the scrub starts.
 I cannot use it; it crashes a number of times every day! And with every
 crash the number of checksum failures grows:

Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is 
it non-redundant?  If non-redundant, then there is not much that ZFS 
can really do if a device begins to fail.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-01 Thread Phillip Wagstrom -- Area SSE MidAmerica
Rustam wrote:
 Today my production server crashed 4 times. THIS IS A NIGHTMARE!
 Self-healing file system?! For me ZFS is a SELF-KILLING filesystem.

 I cannot fsck it; there's no such tool. I cannot scrub it; it crashes
 30-40 minutes after the scrub starts. I cannot use it; it crashes a
 number of times every day! And with every crash the number of checksum
 failures grows:
 
 NAME        STATE     READ WRITE CKSUM
 box5        ONLINE       0     0     0
 ...after a few hours...
 box5        ONLINE       0     0     4
 ...after a few hours...
 box5        ONLINE       0     0    62
 ...after another few hours...
 box5        ONLINE       0     0   120
 ...crash! and we start again...
 box5        ONLINE       0     0     0
 ...etc...
 
 Actually, 120 is the record; sometimes it crashes as soon as it boots.
 
 And there's always a permanent error:
 errors: Permanent errors have been detected in the following files:
 box5:<0x0>

 And the very wise self-healing advice: http://www.sun.com/msg/ZFS-8000-8A
 Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
 
 Thanks, but if I restore it from backup it won't be ZFS anymore,
 that's for sure.

That's a bit harsh.  ZFS is telling you that you have corrupted data 
based on the checksums.  Other types of filesystems would likely simply 
pass the corrupted data on silently.

 It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is to
 wait for repair (I have 10U4, where this is non-configurable). Then why
 does it panic?

Do you have the panic messages?  ZFS won't cause panics based on bad 
checksums.  It will by default cause a panic if it can't write data out to 
any device, if it completely loses access to non-redundant devices, or if it 
loses both redundant devices at the same time.
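
(As an aside: newer ZFS bits expose this policy as a pool-level failmode
property, with values wait, continue and panic; a sketch, not available on
10U4:)

# zpool get failmode box5
# zpool set failmode=continue box5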

 Recently there were discussions on the failure of the OpenSolaris
 community. Now it's been more than half a month since I reported this
 error. Nobody even posted something like RTFM. Come on guys, I know you
 are there and busy with enterprise customers... but at least give me
 some troubleshooting ideas. I'm totally lost.
 
 Just a reminder: it's a heavily loaded FS with 3-4 million files and
 folders.
 
 Link to original post: 
 http://www.opensolaris.org/jive/thread.jspa?threadID=57425

This seems to show the same number of checksum errors across 2 different
channels and 4 different drives. Given that, I'd assume this is likely a
dual-channel HBA of some sort. It would appear that you either have bad
hardware or some sort of driver issue.

Regards,
Phil



Re: [zfs-discuss] ZFS still crashing after patch

2008-05-01 Thread Rustam
 Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is
 it non-redundant? If non-redundant, then there is not much that ZFS
 can really do if a device begins to fail.

It's RAID 10 (more info here: 
http://www.opensolaris.org/jive/thread.jspa?threadID=57425):

NAME        STATE     READ WRITE CKSUM
box5        ONLINE       0     0     4
  mirror    ONLINE       0     0     2
    c1d0    ONLINE       0     0     4
    c2d0    ONLINE       0     0     4
  mirror    ONLINE       0     0     2
    c2d1    ONLINE       0     0     4
    c1d1    ONLINE       0     0     4

Actually, there's no damaged data so far. I don't get any "unable to
read/write" kinds of errors. It's just very strange checksum errors,
synchronized across all the disks.

 That's a bit harsh.  ZFS is telling you that you have corrupted data
 based on the checksums.  Other types of filesystems would likely simply
 pass the corrupted data on silently.

Checksums are good, no complaints about that.

 Do you have the panic messages?  ZFS won't cause panics based on bad
 checksums.  It will by default cause a panic if it can't write data out to
 any device, if it completely loses access to non-redundant devices, or if
 it loses both redundant devices at the same time.

A number of panic messages and crash dump stack traces are attached to the
original post (http://www.opensolaris.org/jive/thread.jspa?threadID=57425).
Here is a short snip:

> ::status
debugging crash dump vmcore.5 (64-bit) from core
operating system: 5.10 Generic_127112-07 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fe800017f8d0 addr=238 
occurred in module unix due to a NULL pointer dereference
dump content: kernel pages only

> ::stack
mutex_enter+0xb()
zio_buf_alloc+0x1a()
zio_read+0xba()
spa_scrub_io_start+0xf1()
spa_scrub_cb+0x13d()
traverse_callback+0x6a()
traverse_segment+0x118()
traverse_more+0x7b()
spa_scrub_thread+0x147()
thread_start+8()

 This seems to show the same number of checksum errors across 2
 different channels and 4 different drives. Given that, I'd assume
 this is likely a dual-channel HBA of some sort. It would appear that
 you either have bad hardware or some sort of driver issue.

You're right, this is Intel's dual-channel ICH6 SATA controller. 10U4 has
native support/drivers for this SATA controller (AHCI drivers, AFAIK). The
thing is that this hardware and ZFS have been in production for almost 2
years (OK, not the best argument), yet this problem only appeared recently
(about 20 days ago). It's even stranger because I didn't make any OS/driver
upgrade or patch during the last 2-3 months.

However, this is a good point. I've seen some new SATA/AHCI drivers available
in 10U5. Maybe I should try to upgrade and see if it helps. Thanks, Phil.

--
Rustam
 
 


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-01 Thread Bob Friesenhahn
On Thu, 1 May 2008, Rustam wrote:

 operating system: 5.10 Generic_127112-07 (i86pc)

Seems kind of old.  I am using Generic_127112-11 here.

Probably many hundreds of nasty bugs have been eliminated since the 
version you are using.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
