On 08/16/2013 8:49 am, dweimer wrote:
On 08/15/2013 10:00 am, dweimer wrote:
On 08/14/2013 9:43 pm, Shane Ambler wrote:
On 14/08/2013 22:57, dweimer wrote:
I have a few systems running on ZFS with a backup script that creates snapshots, then backs up the .zfs/snapshot/name directory to make sure open files are not missed. This has been working great, but all of a sudden one of my systems has stopped working. It takes the snapshots fine, and zfs list -t snapshot shows them, but an ls on the .zfs/snapshot/ directory returns "Not a directory".
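
The per-filesystem step in the script is roughly the following; the snapshot name matches one of mine below, but the backup destination here is only an illustration, not the actual script:

zfs snapshot zroot/home@home--bsnap
# back up from the snapshot so open files in /home can't be missed
# (the /backup path is just an example destination)
tar -czf /backup/home--bsnap.tar.gz -C /home/.zfs/snapshot/home--bsnap .
zfs destroy zroot/home@home--bsnap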

part of the zfs list output:

NAME                        USED  AVAIL  REFER  MOUNTPOINT
zroot                      4.48G  29.7G    31K  none
zroot/ROOT                 2.92G  29.7G    31K  none
zroot/ROOT/91p5-20130812   2.92G  29.7G  2.92G  legacy
zroot/home                  144K  29.7G   122K  /home

part of the zfs list -t snapshot output:

NAME                                            USED  AVAIL  REFER  MOUNTPOINT
zroot/ROOT/91p5-20130812@91p5-20130812--bsnap   340K      -  2.92G  -
zroot/home@home--bsnap                           22K      -   122K  -

ls /.zfs/snapshot/91p5-20130812--bsnap/
does work right now, since the last reboot, but it wasn't always working; this is my boot environment.

if I do ls /home/.zfs/snapshot/, result is:
ls: /home/.zfs/snapshot/: Not a directory

if I do ls /home/.zfs, result is:
ls: snapshot: Bad file descriptor
shares

I have tried zpool scrub zroot and no errors were found. If I reboot the system I can get one good backup, then I start having problems again. Has anyone else ever run into this, and are there any suggestions for a fix?
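
For completeness, the check was nothing beyond the stock commands, roughly:

zpool scrub zroot
zpool status -v zroot   # run again after the scrub finishes to see any reported errors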

The system is running FreeBSD 9.1-RELEASE-p5 #1 r253764 (Mon Jul 29 15:07:35 CDT 2013); the zpool is version 28 and ZFS is version 5.



I can say I've had this problem, though I'm not certain what fixed it. I do remember deciding to stop snapshotting since I couldn't access the snapshots, and I deleted the existing ones. I later restarted the machine before I went back for another look, and they were working.

So my guess is a restart without existing snapshots may be the key.
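
If you want to try that, one way to clear every snapshot on the pool before the reboot would be something like the following (destructive, so check the list it feeds to zfs destroy first):

# list all snapshots under zroot and destroy each one
zfs list -H -t snapshot -o name -r zroot | xargs -n 1 zfs destroy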

Now if only we could find out what started the issue so we can stop it
happening again.

I had actually rebooted it last night, prior to seeing this message, and I do know it didn't have any snapshots this time.  As I am booting from ZFS using boot environments, I may have had an older boot environment still on the system the last time it was rebooted.  Backups ran great last night after the reboot, and I was able to kick off my pre-backup job and access all the snapshots today.  Hopefully it doesn't come back, but if it does I will see if I can find anything else wrong.
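
For anyone wanting to check for leftover boot environments, with a zroot/ROOT layout like mine it is roughly:

beadm list               # if sysutils/beadm (or similar) is installed; shows BEs and which is active
zfs list -r zroot/ROOT   # the underlying datasets, one per boot environment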

FYI,
It didn't shut down cleanly, so in case this helps anyone find the issue, this is from my system logs:
Aug 14 22:08:04 cblproxy1 kernel:
Aug 14 22:08:04 cblproxy1 kernel: Fatal trap 12: page fault while in kernel mode
Aug 14 22:08:04 cblproxy1 kernel: cpuid = 0; apic id = 00
Aug 14 22:08:04 cblproxy1 kernel: fault virtual address = 0xa8
Aug 14 22:08:04 cblproxy1 kernel: fault code            = supervisor write data, page not present
Aug 14 22:08:04 cblproxy1 kernel: instruction pointer   = 0x20:0xffffffff808b0562
Aug 14 22:08:04 cblproxy1 kernel: stack pointer         = 0x28:0xffffff80002238f0
Aug 14 22:08:04 cblproxy1 kernel: frame pointer         = 0x28:0xffffff8000223910
Aug 14 22:08:04 cblproxy1 kernel: code segment          = base 0x0, limit 0xfffff, type 0x1b
Aug 14 22:08:04 cblproxy1 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 14 22:08:04 cblproxy1 kernel: processor eflags      = interrupt enabled, resume, IOPL = 0
Aug 14 22:08:04 cblproxy1 kernel: current process = 1 (init)
Aug 14 22:08:04 cblproxy1 kernel: trap number           = 12
Aug 14 22:08:04 cblproxy1 kernel: panic: page fault
Aug 14 22:08:04 cblproxy1 kernel: cpuid = 0
Aug 14 22:08:04 cblproxy1 kernel: KDB: stack backtrace:
Aug 14 22:08:04 cblproxy1 kernel: #0 0xffffffff808ddaf0 at kdb_backtrace+0x60
Aug 14 22:08:04 cblproxy1 kernel: #1 0xffffffff808a951d at panic+0x1fd
Aug 14 22:08:04 cblproxy1 kernel: #2 0xffffffff80b81578 at trap_fatal+0x388
Aug 14 22:08:04 cblproxy1 kernel: #3 0xffffffff80b81836 at trap_pfault+0x2a6
Aug 14 22:08:04 cblproxy1 kernel: #4 0xffffffff80b80ea1 at trap+0x2a1
Aug 14 22:08:04 cblproxy1 kernel: #5 0xffffffff80b6c7b3 at calltrap+0x8
Aug 14 22:08:04 cblproxy1 kernel: #6 0xffffffff815276da at zfsctl_umount_snapshots+0x8a
Aug 14 22:08:04 cblproxy1 kernel: #7 0xffffffff81536766 at zfs_umount+0x76
Aug 14 22:08:04 cblproxy1 kernel: #8 0xffffffff809340bc at dounmount+0x3cc
Aug 14 22:08:04 cblproxy1 kernel: #9 0xffffffff8093c101 at vfs_unmountall+0x71
Aug 14 22:08:04 cblproxy1 kernel: #10 0xffffffff808a8eae at kern_reboot+0x4ee
Aug 14 22:08:04 cblproxy1 kernel: #11 0xffffffff808a89c0 at kern_reboot+0
Aug 14 22:08:04 cblproxy1 kernel: #12 0xffffffff80b81dab at amd64_syscall+0x29b
Aug 14 22:08:04 cblproxy1 kernel: #13 0xffffffff80b6ca9b at Xfast_syscall+0xfb
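
If it panics like this again, a crash dump would help pin down where zfsctl_umount_snapshots is blowing up; the generic setup for that (just the standard FreeBSD dump/kgdb steps, nothing specific to this box) is roughly:

# /etc/rc.conf: dump to the configured swap device on panic
dumpdev="AUTO"
# after the next panic, once savecore(8) has written the core to /var/crash:
kgdb /boot/kernel/kernel /var/crash/vmcore.0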

Well, it's back; 3 of the 8 file systems I am taking snapshots of failed in last night's backups.

The only thing different on this system from the 4 others I have running is that it has a second disk with a UFS file system on it.

The setup is 2 disks, both set up with gpart:
=>      34  83886013  da0  GPT  (40G)
        34       256    1  boot0  (128k)
       290  10485760    2  swap0  (5.0G)
  10486050  73399997    3  zroot0  (35G)

=>      34  41942973  da1  GPT  (20G)
        34  41942973    1  squid1  (20G)
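
(For reference, da1 was just the usual gpart/newfs steps; reconstructed from the output above rather than copied from my history, it would have been roughly:)

gpart create -s gpt da1
gpart add -t freebsd-ufs -l squid1 da1   # one partition using the whole disk
newfs -U /dev/gpt/squid1                 # UFS2 with soft updates for the cache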

I didn't want the Squid cache directory on ZFS. The system is running on an ESX 4.1 server backed by an iSCSI SAN. I have 4 other servers running on the same group of ESX servers and SAN, booting from ZFS, without this problem. Two of the other 4 are also running Squid, but they forward to this one, so they run without a local disk cache.

A quick update on this, in case anyone else runs into it: on the 2nd of this month I finally deleted my UFS volume and created a new ZFS volume to replace it. I recreated the Squid cache directories and let Squid start rebuilding its cache from scratch. So far there hasn't been a noticeable impact on performance from the switch, and the snapshot problem has not recurred since making the change. It's only been a week running this way, but the problem previously started within 36-48 hours.
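
The switch itself was nothing fancy; roughly the following, with the pool and dataset names here being illustrative rather than the exact ones used:

gpart modify -i 1 -t freebsd-zfs da1   # retype the old UFS partition
zpool create -f squidpool da1p1        # example pool name; -f since the partition held UFS before
zfs create squidpool/cache             # dataset pointed at by Squid's cache_dir
squid -z                               # let Squid recreate its cache directories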

--
Thanks,
   Dean E. Weimer
   http://www.dweimer.net/
_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"
