savecore -d
System dump time: Sat May 10 11:51:56 2008
Constructing namelist /var/crash/x4500/unix.1
Constructing corefile /var/crash/x4500/vmcore.1
  45% done

When we get the second x4500 in, we can do more testing in that area. But 
more importantly, we need to try to work out why the UFS-exported 
file systems failed to recover properly. They are mounted "hard", so 
I/O should block and wait, yet it seems to just fail the I/O and make the 
mount point invisible. Logging in to each and every server to remount 
the file systems is somewhat tedious.

# df -h
Filesystem             size   used  avail capacity  Mounted on
/                       64G    24G    39G    39%    /
/dev                    64G    24G    39G    39%    /dev
proc                     0K     0K     0K     0%    /proc
[snip]
swap                   1.1G    12K   1.1G     1%    /var/run
df: cannot statvfs /export/test: No such file or directory

# ls -la /export/
total 16
drwxr-xr-x   6 root     sys          512 Mar 26 16:09 .
drwxr-xr-x  19 root     root         512 Mar 19 11:30 ..
drwxr-xr-x  23 root     root         512 Apr 14 11:54 home
drwxr-xr-x   2 root     root         512 Mar 17 16:10 nfs

No "test" directory there.

# mount
/export/test on x4500-01-vip:/export/test 
remote/read/write/setuid/nodevices/vers=3/hard/intr/quota/xattr/dev=4700002 
on Tue Mar 25 11:10:52 2008

# mkdir -p /export/test/roo
mkdir: "/export/test/roo": No such file or directory

# umount /export/test
# mount /export/test
# df -h
test-x4500-01-vip:/export/test
                         98G   4.1G    93G     5%    /export/test
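
The manual umount/mount above could be scripted on each client until the underlying bug is found. A minimal sketch, assuming the stale state always shows up as df failing to statvfs the mount point; the function name and the paths used here are hypothetical, not from this setup:

```shell
#!/bin/sh
# Hypothetical helper: if df can no longer statvfs a mount point (the
# symptom seen with /export/test above), force an umount and remount it.
check_and_remount() {
    mp="$1"
    if df "$mp" >/dev/null 2>&1; then
        echo "ok: $mp"
    else
        echo "stale: $mp -- remounting"
        umount -f "$mp" 2>/dev/null
        mount "$mp" 2>/dev/null
    fi
}

check_and_remount /tmp    # a healthy path; prints "ok: /tmp"
```

Running something like this over ssh in a loop across the client list would at least avoid logging in to each server by hand.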



More info from the panic:

 > $c
top_end_sync+0xcb(fffffffedefea000, ffffff001f4ca424, b, 0)
ufs_fsync+0x1cb(ffffffff3659f980, 10000, ffffffff6f0ccc70)
fop_fsync+0x51(ffffffff3659f980, 10000, ffffffff6f0ccc70)
rfs3_create+0x604(ffffff001f4ca7c8, ffffff001f4ca8b8, ffffff04e7627d80,
ffffff001f4cab20, ffffffff6f0ccc70)
common_dispatch+0x444(ffffff001f4cab20, ffffffffa71cc1c0, 2, 4, fffffffff8553a78, ffffffffc039d3d0)
rfs_dispatch+0x2d(ffffff001f4cab20, ffffffffa71cc1c0)
svc_getreq+0x1c6(ffffffffa71cc1c0, fffffffec69bddc0)
svc_run+0x171(fffffffecc7581c0)
svc_do_run+0x85(1)
nfssys+0x748(e, fec80fc8)
sys_syscall32+0x101()


 > ::panicinfo
              cpu                3
           thread ffffffff17b8c820
          message
BAD TRAP: type=e (#pf Page fault) rp=ffffff001f4ca220 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
              rdi fffffffedefea000
              rsi                9
              rdx                0
              rcx ffffffff17b8c820
               r8                0
               r9 ffffff054797dc48
              rax                0
              rbx          97eaffc
              rbp ffffff001f4ca350
              r10                0
              r11 fffffffec8b93868
              r12         27991000
              r13 fffffffed1b59c00
              r14 fffffffecf8d8cc0
              r15             1000
           fsbase                0
           gsbase fffffffec3d5a580
               ds               4b
               es               4b
               fs                0
               gs              1c3
           trapno                e
              err               10
              rip                0
               cs               30
           rflags            10246
              rsp ffffff001f4ca318
               ss               38
           gdt_hi                0
           gdt_lo         500001ef
           idt_hi                0
           idt_lo         40000fff
              ldt                0
             task               70
              cr0         8005003b
              cr2                0
              cr3        1fcbbc000
              cr4              6f8




Nathan Kroenert - Server ESG wrote:
> Dumping to /dev/dsk/c6t0d0s1
> 
> certainly looks like a non-mirrored dump dev...
> 
> You  might try a manual savecore telling it to ignore the dump valid 
> header and see what you get...
> 
> savecore -d
> 
> and perhaps try telling it to look directly at the dump device...
> 
> savecore -f <device>
> 
> You should also, when you get the chance, deliberately panic the box to 
> make sure you can actually capture a dump...
> 
> dumpadm is your friend as far as checking where you are going to dump 
> to, and if it's one side of your swap mirror, that's bad, M'Kay?
> 
> :)
> 
> Nathan.
> 
> Jorgen Lundman wrote:
>> OK, this is a pretty damn poor panic report if I may say so; not had 
>> much sleep.
>>
>>                  Solaris Express Developer Edition 9/07 snv_70b X86
>>             Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
>>                          Use is subject to license terms.
>>                              Assembled 30 August 2007
>>
>> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>>
>> Even though it dumped, it wrote nothing to /var/crash/. Perhaps 
>> because swap is mirrored.
>>
>>
>>
>> Jorgen Lundman wrote:
>>> We had a panic around noon on Saturday, which it mostly recovered 
>>> itself. All ZFS NFS exports just remounted, but the UFS on zdev NFS 
>>> exports did not, needed manual umount && mount on all clients for 
>>> some reason.
>>>
>>> Is this a known bug we should consider a patch for?
>>>
>>>
>>>
>>> May 10 11:49:46 x4500-01.unix ufs: [ID 912200 kern.notice] quota_ufs: over hard disk limit (pid 477, uid 127409, inum 1047211, fs /export/zero1)
>>> May 10 11:51:26 x4500-01.unix unix: [ID 836849 kern.notice]
>>> May 10 11:51:26 x4500-01.unix ^Mpanic[cpu3]/thread=ffffffff17b8c820:
>>> May 10 11:51:26 x4500-01.unix genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff001f4ca220 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
>>> May 10 11:51:26 x4500-01.unix unix: [ID 100000 kern.notice]
>>> May 10 11:51:26 x4500-01.unix unix: [ID 839527 kern.notice] nfsd:
>>> May 10 11:51:26 x4500-01.unix unix: [ID 753105 kern.notice] #pf Page fault
>>> May 10 11:51:26 x4500-01.unix unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x0
>>> May 10 11:51:26 x4500-01.unix unix: [ID 243837 kern.notice] pid=477, pc=0x0, sp=0xffffff001f4ca318, eflags=0x10246
>>> May 10 11:51:26 x4500-01.unix unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
>>> May 10 11:51:26 x4500-01.unix unix: [ID 354241 kern.notice] cr2: 0 cr3: 1fcbbc000 cr8: c
>>> May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rdi: fffffffedefea000 rsi: 9 rdx: 0
>>> May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rcx: ffffffff17b8c820 r8: 0 r9: ffffff054797dc48
>>> May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rax: 0 rbx: 97eaffc rbp: ffffff001f4ca350
>>> May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] r10: 0 r11: fffffffec8b93868 r12: 27991000
>>> May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] r13: fffffffed1b59c00 r14: fffffffecf8d8cc0 r15: 1000
>>> May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] fsb: 0 gsb: fffffffec3d5a580 ds: 4b
>>> May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
>>> May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] trp: e err: 10 rip: 0
>>> May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] cs: 30 rfl: 10246 rsp: ffffff001f4ca318
>>> May 10 11:51:27 x4500-01.unix unix: [ID 266532 kern.notice] ss: 38
>>> May 10 11:51:27 x4500-01.unix unix: [ID 100000 kern.notice]
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca100 unix:die+c8 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca210 unix:trap+135b ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca220 unix:_cmntrap+e9 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 802836 kern.notice] ffffff001f4ca350 0 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca3d0 ufs:top_end_sync+cb ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca440 ufs:ufs_fsync+1cb ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca490 genunix:fop_fsync+51 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4ca770 nfssrv:rfs3_create+604 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4caa70 nfssrv:common_dispatch+444 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4caa90 nfssrv:rfs_dispatch+2d ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4cab80 rpcmod:svc_getreq+1c6 ()
>>> May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4cabf0 rpcmod:svc_run+171 ()
>>> May 10 11:51:28 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4cac30 rpcmod:svc_do_run+85 ()
>>> May 10 11:51:28 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4caec0 nfs:nfssys+748 ()
>>> May 10 11:51:28 x4500-01.unix genunix: [ID 655072 kern.notice] ffffff001f4caf10 unix:brand_sys_syscall32+1a3 ()
>>> May 10 11:51:28 x4500-01.unix unix: [ID 100000 kern.notice]
>>> May 10 11:51:28 x4500-01.unix genunix: [ID 672855 kern.notice] syncing file systems...
>>> May 10 11:51:28 x4500-01.unix genunix: [ID 733762 kern.notice]  8
>>> May 10 11:51:29 x4500-01.unix genunix: [ID 733762 kern.notice]  5
>>> May 10 11:51:30 x4500-01.unix genunix: [ID 733762 kern.notice]  2
>>> May 10 11:51:54 x4500-01.unix last message repeated 20 times
>>> May 10 11:51:55 x4500-01.unix genunix: [ID 622722 kern.notice]  done (not all i/o completed)
>>> May 10 11:51:56 x4500-01.unix genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c6t0d0s1, offset 65536, content: kernel
>>>
>>>
>>
> 

-- 
Jorgen Lundman       | <[EMAIL PROTECTED]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
