Re: Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at kernel boot now, but didn't before... (Re: ZFS MFC heads up))

2009-06-01 Thread Joe Karthauser

on 23/05/2009 05:26 Alexander Motin said the following:

> Hi.
>
> Joe Karthauser wrote:
>> I spoke too soon. It must have just randomly booted, because it is now
>> hanging again. No amount of jiggling cables has made any difference.
>
> Can you provide verbose boot messages of your system from the beginning
> up to the problem? Especially everything related to ATA.

Attached.

> Do you have AHCI mode enabled in BIOS, or are you using legacy ATA emulation?

It's set up as AHCI in the BIOS.

What is strange is that it has now started working again. I can't make
any sense of it. The machine boots up fine. It was definitely hanging
at the ATA probes though, just after the ZFS messages were output.


Joe
Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.2-STABLE #7: Fri May 22 23:10:15 BST 2009
r...@athenaeum.tao.org.uk:/usr/obj/usr/src/sys/ATHENAEUM
Preloaded elf kernel "/boot/kernel/kernel" at 0x80b47000.
Preloaded elf module "/boot/kernel/zfs.ko" at 0x80b4719c.
Preloaded elf module "/boot/kernel/opensolaris.ko" at 0x80b47244.
Preloaded elf module "/boot/kernel/geom_eli.ko" at 0x80b472f4.
Preloaded elf module "/boot/kernel/crypto.ko" at 0x80b473a4.
Preloaded elf module "/boot/kernel/zlib.ko" at 0x80b47450.
Preloaded elf module "/boot/kernel/geom_label.ko" at 0x80b474fc.
Preloaded elf module "/boot/kernel/geom_mirror.ko" at 0x80b475ac.
Preloaded /boot/zfs/zpool.cache "/boot/zfs/zpool.cache" at 0x80b4765c.
Preloaded elf module "/boot/kernel/acpi.ko" at 0x80b476b4.
module_register: module g_label already exists!
Module g_label failed to register: 17
Calibrating clock(s) ... i8254 clock: 1192003 Hz
CLK_USE_I8254_CALIBRATION not specified - using default frequency
Timecounter "i8254" frequency 1193182 Hz quality 0
Calibrating TSC clock ... TSC clock: 2402413236 Hz
CPU: Intel(R) Core(TM)2 Quad CPU Q6600  @ 2.40GHz (2402.41-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x6fb  Stepping = 11
  Features=0xbfebfbff
  Features2=0xe3bd
  AMD Features=0x2010
  AMD Features2=0x1
  Cores per package: 4

Instruction TLB: 4 KB Pages, 4-way set associative, 128 entries
1st-level instruction cache: 32 KB, 8-way set associative, 64 byte line size
1st-level data cache: 32 KB, 8-way set associative, 64 byte line size
L2 cache: 4096 kbytes, 16-way associative, 64 bytes/line
real memory  = 3756916736 (3582 MB)
Physical memory chunk(s):
0x1000 - 0x0009dfff, 643072 bytes (157 pages)
0x0010 - 0x003f, 3145728 bytes (768 pages)
0x00c25000 - 0xdbf7, 3677728768 bytes (897883 pages)
avail memory = 3673681920 (3503 MB)
Table 'FACP' at 0xdfee30c0
Table 'HPET' at 0xdfee7e00
Table 'MCFG' at 0xdfee7e80
Table 'APIC' at 0xdfee7d00
MADT: Found table at 0xdfee7d00
MP Configuration Table version 1.4 found at 0x800f0d00
APIC: Using the MADT enumerator.
MADT: Found CPU APIC ID 0 ACPI ID 0: enabled
SMP: Added CPU 0 (AP)
MADT: Found CPU APIC ID 3 ACPI ID 1: enabled
SMP: Added CPU 3 (AP)
MADT: Found CPU APIC ID 2 ACPI ID 2: enabled
SMP: Added CPU 2 (AP)
MADT: Found CPU APIC ID 1 ACPI ID 3: enabled
SMP: Added CPU 1 (AP)
ACPI APIC Table: 
INTR: Adding local APIC 1 as a target
INTR: Adding local APIC 2 as a target
INTR: Adding local APIC 3 as a target
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
bios32: Found BIOS32 Service Directory header at 0x800fad30
bios32: Entry = 0xfb3f0 (800fb3f0)  Rev = 0  Len = 1
pcibios: PCI BIOS entry at 0xf+0xb420
pnpbios: Found PnP BIOS data at 0x800fbf90
pnpbios: Entry = f:bfc0  Rev = 1.0
Other BIOS signatures found:
APIC: CPU 0 has ACPI ID 0
APIC: CPU 1 has ACPI ID 3
APIC: CPU 2 has ACPI ID 2
APIC: CPU 3 has ACPI ID 1
ULE: setup cpu group 0
ULE: setup cpu 0
ULE: adding cpu 0 to group 0: cpus 1 mask 0x1
ULE: setup cpu group 1
ULE: setup cpu 1
ULE: adding cpu 1 to group 1: cpus 1 mask 0x2
ULE: setup cpu group 2
ULE: setup cpu 2
ULE: adding cpu 2 to group 2: cpus 1 mask 0x4
ULE: setup cpu group 3
ULE: setup cpu 3
ULE: adding cpu 3 to group 3: cpus 1 mask 0x8
This module (opensolaris) contains code covered by the
Common Development and Distribution License (CDDL)
see http://opensolaris.org/os/licensing/opensolaris_license/
ACPI: RSDP @ 0x0xf6c30/0x0014 (v  0 GBT   )
ACPI: RSDT @ 0x0xdfee3040/0x0034 (v  1 GBTGBTUACPI 0x42302E31 GBTU 0x01010101)
ACPI: FACP @ 0x0xdfee30c0/0x0074 (v  1 GBTGBTUACPI 0x42302E31 GBTU 0x01010101)
ACPI: DSDT @ 0x0xdfee3180/0x4B32 (v  1 GBTGBTUACPI 0x1000 MSFT 0x010C)
ACPI: FACS @ 0x0xdfee/0x0040
ACPI: HPET @ 0x0xdfee7e00/0x0038 (v  1 GBTGBTUACPI 0x42302E31 GBTU 0x0098)
ACPI: MCFG @ 0x0xdfee7e80/0x003C (v  1 GBTGBTUACPI 0x42302E31 GBTU 0x01010101)
ACPI: APIC @ 0x0xdfee7d00/0x0084

Re: Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at kernel boot now, but didn't before... (Re: ZFS MFC heads up))

2009-05-22 Thread Alexander Motin

Hi.

Joe Karthauser wrote:
> I spoke too soon. It must have just randomly booted, because it is now
> hanging again. No amount of jiggling cables has made any difference.

Can you provide verbose boot messages of your system from the beginning
up to the problem? Especially everything related to ATA.

Do you have AHCI mode enabled in BIOS, or are you using legacy ATA emulation?

--
Alexander Motin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
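
For anyone following the thread, here is a rough sketch of one way to collect the ATA-related part of a verbose boot log like the one requested above. It assumes the verbose boot was requested with `boot -v` at the loader prompt or with boot_verbose="YES" in /boot/loader.conf, and that the log was saved in /var/run/dmesg.boot. The helper name `ata_lines` and the list of device-name prefixes are illustrative assumptions, not something taken from the messages.

```shell
#!/bin/sh
# Assumption: the machine was booted verbosely beforehand, either with
# `boot -v` at the loader prompt or with boot_verbose="YES" in
# /boot/loader.conf, and the boot messages were saved to the log file.

# ata_lines is a hypothetical helper name. It prints the lines of a saved
# boot log that start with an ata(4)-family device name followed by a unit
# number (atapci0, ata2, ad4, acd0, ...).
# $1 is the log to read; it defaults to /var/run/dmesg.boot.
ata_lines() {
    grep -E '^(atapci|ata|ad|acd)[0-9]+' "${1:-/var/run/dmesg.boot}"
}
```

Something like `ata_lines > ata-boot.txt` would then produce a file that can be attached to a reply. If the boot hangs before the log is written, the messages would have to be captured another way, for example over a serial console.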


Re: Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at kernel boot now, but didn't before... (Re: ZFS MFC heads up))

2009-05-22 Thread Joe Karthauser
I spoke too soon. It must have just randomly booted, because it is now 
hanging again. No amount of jiggling cables has made any difference.


:(.

Joe

on 22/05/2009 20:40 Joe Karthauser said the following:

Hi Alexander,

I'd love it if you were able to provide some insight into this problem.

I'm going to try switching sata cables around next to see if the problem
goes away if I disconnect some combination of bays.

Thanks,
Joe

on 22/05/2009 19:39 Kip Macy said the following:

Motin is your best bet in tracking down ATA problems.

Cheers,
Kip


On Fri, May 22, 2009 at 10:40 AM, Joe Karthauser wrote:

Hi Kip,

I seriously don't understand what has happened. If I boot kernel.old I
still get the same problem. Very confusing. :(.

Joe

on 21/05/2009 19:28 Kip Macy said the following:

I have no idea what is happening. I think our best bet is having
someone with insight into ATA provide us with help in adding
diagnostics.

Sorry for the trouble. Perhaps you can just roll back to 7.2 for now.

Cheers,
Kip


On Thu, May 21, 2009 at 10:50 AM, Joe Karthauser wrote:

Hmm, I've had a bit of a miserable afternoon trying to fight my RELENG_7
server, which now doesn't boot. :(.

So, it's a raidz2 pool with a UFS/gmirror root partition split over 5
disks (gmirror on a 500 MB partition on each of the five disks, and
raidz2 over the rest of each drive).

What I did was to update the userland, and then reboot. I didn't upgrade
the kernel (but I've subsequently done that and have the same problem).

What happens is that the kernel hangs booting just after displaying a
LABEL message or ZFS pool/spool message. I _can_ get it to boot if I boot
single user with ACPI switched off. When I do that I can manually start
ZFS, and mount all the partitions. However, one of the disks is missing
(more on that next).

The machine is running a Gigabyte motherboard (domestic gamer P35 board,
similar to this

http://www.gigabyte.com.tw/Products/Motherboard/Products_Overview.aspx?ProductID=2533,

although it might be a DS4 variant). I've got 5 of the 6 SATA ports wired
to a 5-unit SATA hot swap bay (5 drives vertically mounted into 3 5-1/4"
bays kind of thing).

Now, because of the gmirror I can boot the system on any disk, or
combination of plugged in disks. I should be able to succeed with the
kernel probe up to the attempt to mount the root filesystem irrespective
of any ZFS pool, etc. And, indeed, this has been working fine for about
two years.

But, now it hangs in the same place no matter what disk I boot on (I've
tried every bay).

But, without ACPI enabled it does appear to boot ok... what's going on
here? Is it possible that the machine has developed a hardware fault?

Ok, finally, if I boot with ACPI disabled then one of the disks is
missing. If I unplug it I get a disconnect message from the ata device,
and a reconnect and reinit attempt when I plug it back in, but no device
appears on the bus. Usually I can do an 'atacontrol detach sata4; sleep 1;
atacontrol attach sata4' and the device reappears. This happens on the
other buses, but not on the last one. It's not the disk, because if I swap
it into another bay, it comes up and appears on the bus. On the other hand
it doesn't appear to be that controller or slot in the drive bay because
if I unplug all the other disks the system will boot that disk and get as
far as the hang. Hmm.

Is this a consequence of disabling the ACPI?

Does anyone have a clue what might be going on?

Joe
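
The detach/sleep/attach cycle quoted in the message above can be wrapped up roughly as follows. This is only a sketch: the helper names `run` and `reattach_channel` are made up for illustration, the channel name `sata4` is copied from the message as written (`atacontrol list` shows the channel names a given system actually uses), and the dry-run default is an added safety assumption so that nothing touches the hardware until DRY_RUN=0 is set.

```shell
#!/bin/sh
# run prints the command when DRY_RUN is 1 (the default in this sketch, so
# nothing touches the hardware by accident) and executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# reattach_channel is a hypothetical helper: detach the given ATA channel,
# pause, then attach it again so the channel is re-probed.
# $1 is the channel name, $2 the pause in seconds (default 1).
reattach_channel() {
    channel="$1"
    pause="${2:-1}"
    run atacontrol detach "$channel" &&
    run sleep "$pause" &&
    run atacontrol attach "$channel"
}
```

Calling `reattach_channel sata4` prints the three commands without running them; `DRY_RUN=0 reattach_channel sata4` (as root) would actually run them. A longer pause can be given as a second argument if the drive needs more time to settle.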


RE: Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at kernel boot now, but didn't before... (Re: ZFS MFC heads up))

2009-05-22 Thread Larry Rosenman

I saw really strange stuff with one bad SATA cable on my 6 drive ZFS array.
It would work most of the time, but the scrub would either cough up CRC
errors or hang.

I wound up replacing the disk *AND* the cable, and it's been fine since. 

This is on a SuperMicro chassis with Intel chips.

YMMV
-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683    E-Mail: l...@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893

-Original Message-
From: owner-freebsd-sta...@freebsd.org
[mailto:owner-freebsd-sta...@freebsd.org] On Behalf Of Joe Karthauser
Sent: Friday, May 22, 2009 3:45 PM
To: Alexander Motin
Cc: freebsd-stable@freebsd.org; Kip Macy
Subject: Re: Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at
kernel boot now, but didn't before... (Re: ZFS MFC heads up))

This appears to have gone away now. I unplugged the bay that was causing 
the trouble, and the system booted just fine on the remaining 4 drives. 
Then I plugged the bay back in (live) and did an atacontrol 
detach/attach on that bus (I wonder why I always have to do that). The 
drive was seen, and ZFS resilvered itself. I'm doing a ZFS scrub now to 
make sure that everything is good, and I'll do a reboot and see if it's 
all ok after that.

Strange, so it looks like a cable might have got a little loose or 
something. I wonder why that would have hung the kernel probe though.

Joe


Re: Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at kernel boot now, but didn't before... (Re: ZFS MFC heads up))

2009-05-22 Thread Joe Karthauser
This appears to have gone away now. I unplugged the bay that was causing 
the trouble, and the system booted just fine on the remaining 4 drives. 
Then I plugged the bay back in (live) and did an atacontrol 
detach/attach on that bus (I wonder why I always have to do that). The 
drive was seen, and ZFS resilvered itself. I'm doing a ZFS scrub now to 
make sure that everything is good, and I'll do a reboot and see if it's 
all ok after that.


Strange, so it looks like a cable might have got a little loose or 
something. I wonder why that would have hung the kernel probe though.


Joe




Sudden wierd SATA problem on RELENG_7 (Re: ZFS hanging at kernel boot now, but didn't before... (Re: ZFS MFC heads up))

2009-05-22 Thread Joe Karthauser

Hi Alexander,

I'd love it if you were able to provide some insight into this problem.

I'm going to try switching sata cables around next to see if the problem 
goes away if I disconnect some combination of bays.


Thanks,
Joe
