[releng_7 tinderbox] failure on i386/pc98

2010-05-15 Thread FreeBSD Tinderbox
TB --- 2010-05-16 01:22:51 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-16 01:22:51 - starting RELENG_7 tinderbox run for i386/pc98
TB --- 2010-05-16 01:22:51 - cleaning the object tree
TB --- 2010-05-16 01:23:07 - cvsupping the source tree
TB --- 2010-05-16 01:23:07 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s 
/tinderbox/RELENG_7/i386/pc98/supfile
TB --- 2010-05-16 01:23:15 - building world
TB --- 2010-05-16 01:23:15 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 01:23:15 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 01:23:15 - TARGET=pc98
TB --- 2010-05-16 01:23:15 - TARGET_ARCH=i386
TB --- 2010-05-16 01:23:15 - TZ=UTC
TB --- 2010-05-16 01:23:15 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 01:23:15 - cd /src
TB --- 2010-05-16 01:23:15 - /usr/bin/make -B buildworld
>>> World build started on Sun May 16 01:23:16 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sun May 16 02:27:10 UTC 2010
TB --- 2010-05-16 02:27:10 - generating LINT kernel config
TB --- 2010-05-16 02:27:10 - cd /src/sys/pc98/conf
TB --- 2010-05-16 02:27:10 - /usr/bin/make -B LINT
TB --- 2010-05-16 02:27:10 - building LINT kernel
TB --- 2010-05-16 02:27:10 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 02:27:10 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 02:27:10 - TARGET=pc98
TB --- 2010-05-16 02:27:10 - TARGET_ARCH=i386
TB --- 2010-05-16 02:27:10 - TZ=UTC
TB --- 2010-05-16 02:27:10 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 02:27:10 - cd /src
TB --- 2010-05-16 02:27:10 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sun May 16 02:27:10 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
echo elink_reset elink_idseq > export_syms
awk -f /src/sys/conf/kmod_syms.awk elink.kld  export_syms | xargs -J% objcopy % 
elink.kld
ld -Bshareable  -d -warn-common -o elink.ko elink.kld
objcopy --strip-debug elink.ko
===> em (all)
cc -O2 -fno-strict-aliasing -pipe -DPC98  -D_KERNEL -DKLD_MODULE -std=c99 
-nostdinc  -I/src/sys/modules/em/../../dev/e1000 -DHAVE_KERNEL_OPTION_HEADERS 
-include /obj/pc98/src/sys/LINT/opt_global.h -I. -I@ -I@/contrib/altq 
-finline-limit=8000 --param inline-unit-growth=100 --param 
large-function-growth=1000 -fno-common  -I/obj/pc98/src/sys/LINT 
-mno-align-long-strings -mpreferred-stack-boundary=2  -mno-mmx -mno-3dnow 
-mno-sse -mno-sse2 -mno-sse3 -ffreestanding -Wall -Wredundant-decls 
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith 
-Winline -Wcast-qual  -Wundef -Wno-pointer-sign -fformat-extensions -c 
/src/sys/modules/em/../../dev/e1000/if_em.c
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 
'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 
'em_poll' was here
*** Error code 1

Stop in /src/sys/modules/em.
*** Error code 1

Stop in /src/sys/modules.
*** Error code 1

Stop in /obj/pc98/src/sys/LINT.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
TB --- 2010-05-16 02:43:35 - WARNING: /usr/bin/make returned exit code  1 
TB --- 2010-05-16 02:43:35 - ERROR: failed to build lint kernel
TB --- 2010-05-16 02:43:35 - 4100.66 user 415.24 system 4844.64 real


http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-i386-pc98.full
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


[releng_7 tinderbox] failure on i386/i386

2010-05-15 Thread FreeBSD Tinderbox
TB --- 2010-05-16 00:48:02 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-16 00:48:02 - starting RELENG_7 tinderbox run for i386/i386
TB --- 2010-05-16 00:48:02 - cleaning the object tree
TB --- 2010-05-16 00:48:29 - cvsupping the source tree
TB --- 2010-05-16 00:48:29 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s 
/tinderbox/RELENG_7/i386/i386/supfile
TB --- 2010-05-16 00:48:38 - building world
TB --- 2010-05-16 00:48:38 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 00:48:38 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 00:48:38 - TARGET=i386
TB --- 2010-05-16 00:48:38 - TARGET_ARCH=i386
TB --- 2010-05-16 00:48:38 - TZ=UTC
TB --- 2010-05-16 00:48:38 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 00:48:38 - cd /src
TB --- 2010-05-16 00:48:38 - /usr/bin/make -B buildworld
>>> World build started on Sun May 16 00:48:39 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sun May 16 01:52:32 UTC 2010
TB --- 2010-05-16 01:52:32 - generating LINT kernel config
TB --- 2010-05-16 01:52:32 - cd /src/sys/i386/conf
TB --- 2010-05-16 01:52:32 - /usr/bin/make -B LINT
TB --- 2010-05-16 01:52:32 - building LINT kernel
TB --- 2010-05-16 01:52:32 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 01:52:32 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 01:52:32 - TARGET=i386
TB --- 2010-05-16 01:52:32 - TARGET_ARCH=i386
TB --- 2010-05-16 01:52:32 - TZ=UTC
TB --- 2010-05-16 01:52:32 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 01:52:32 - cd /src
TB --- 2010-05-16 01:52:32 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sun May 16 01:52:32 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
echo elink_reset elink_idseq > export_syms
awk -f /src/sys/conf/kmod_syms.awk elink.kld  export_syms | xargs -J% objcopy % 
elink.kld
ld -Bshareable  -d -warn-common -o elink.ko elink.kld
objcopy --strip-debug elink.ko
===> em (all)
cc -O2 -fno-strict-aliasing -pipe  -D_KERNEL -DKLD_MODULE -std=c99 -nostdinc  
-I/src/sys/modules/em/../../dev/e1000 -DHAVE_KERNEL_OPTION_HEADERS -include 
/obj/src/sys/LINT/opt_global.h -I. -I@ -I@/contrib/altq -finline-limit=8000 
--param inline-unit-growth=100 --param large-function-growth=1000 -fno-common  
-I/obj/src/sys/LINT -mno-align-long-strings -mpreferred-stack-boundary=2  
-mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -ffreestanding -Wall 
-Wredundant-decls -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes 
-Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign 
-fformat-extensions -c /src/sys/modules/em/../../dev/e1000/if_em.c
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 
'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 
'em_poll' was here
*** Error code 1

Stop in /src/sys/modules/em.
*** Error code 1

Stop in /src/sys/modules.
*** Error code 1

Stop in /obj/src/sys/LINT.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
TB --- 2010-05-16 02:12:38 - WARNING: /usr/bin/make returned exit code  1 
TB --- 2010-05-16 02:12:38 - ERROR: failed to build lint kernel
TB --- 2010-05-16 02:12:38 - 4320.77 user 414.50 system 5076.45 real


http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-i386-i386.full


[releng_7 tinderbox] failure on amd64/amd64

2010-05-15 Thread FreeBSD Tinderbox
TB --- 2010-05-15 23:34:21 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-15 23:34:21 - starting RELENG_7 tinderbox run for amd64/amd64
TB --- 2010-05-15 23:34:21 - cleaning the object tree
TB --- 2010-05-15 23:34:45 - cvsupping the source tree
TB --- 2010-05-15 23:34:45 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s 
/tinderbox/RELENG_7/amd64/amd64/supfile
TB --- 2010-05-15 23:34:54 - building world
TB --- 2010-05-15 23:34:54 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-15 23:34:54 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-15 23:34:54 - TARGET=amd64
TB --- 2010-05-15 23:34:54 - TARGET_ARCH=amd64
TB --- 2010-05-15 23:34:54 - TZ=UTC
TB --- 2010-05-15 23:34:54 - __MAKE_CONF=/dev/null
TB --- 2010-05-15 23:34:54 - cd /src
TB --- 2010-05-15 23:34:54 - /usr/bin/make -B buildworld
>>> World build started on Sat May 15 23:34:55 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> stage 5.1: building 32 bit shim libraries
>>> World build completed on Sun May 16 01:04:55 UTC 2010
TB --- 2010-05-16 01:04:55 - generating LINT kernel config
TB --- 2010-05-16 01:04:55 - cd /src/sys/amd64/conf
TB --- 2010-05-16 01:04:55 - /usr/bin/make -B LINT
TB --- 2010-05-16 01:04:55 - building LINT kernel
TB --- 2010-05-16 01:04:55 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 01:04:55 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 01:04:55 - TARGET=amd64
TB --- 2010-05-16 01:04:55 - TARGET_ARCH=amd64
TB --- 2010-05-16 01:04:55 - TZ=UTC
TB --- 2010-05-16 01:04:55 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 01:04:55 - cd /src
TB --- 2010-05-16 01:04:55 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sun May 16 01:04:55 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
ld  -d -warn-common -r -d -o if_ed.ko if_ed.o if_ed_novell.o if_ed_wd80x3.o 
if_ed_rtl80x9.o if_ed_isa.o if_ed_3c503.o if_ed_hpp.o if_ed_sic.o 
if_ed_pccard.o if_ed_pci.o
:> export_syms
awk -f /src/sys/conf/kmod_syms.awk if_ed.ko  export_syms | xargs -J% objcopy % 
if_ed.ko
objcopy --strip-debug if_ed.ko
===> em (all)
cc -O2 -fno-strict-aliasing -pipe  -D_KERNEL -DKLD_MODULE -std=c99 -nostdinc  
-I/src/sys/modules/em/../../dev/e1000 -DHAVE_KERNEL_OPTION_HEADERS -include 
/obj/amd64/src/sys/LINT/opt_global.h -I. -I@ -I@/contrib/altq 
-finline-limit=8000 --param inline-unit-growth=100 --param 
large-function-growth=1000 -fno-common  -fno-omit-frame-pointer 
-I/obj/amd64/src/sys/LINT -mcmodel=kernel -mno-red-zone  -mfpmath=387 -mno-sse 
-mno-sse2 -mno-mmx -mno-3dnow  -msoft-float -fno-asynchronous-unwind-tables 
-ffreestanding -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes  
-Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef 
-Wno-pointer-sign -fformat-extensions -c 
/src/sys/modules/em/../../dev/e1000/if_em.c
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 
'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 
'em_poll' was here
*** Error code 1

Stop in /src/sys/modules/em.
*** Error code 1

Stop in /src/sys/modules.
*** Error code 1

Stop in /obj/amd64/src/sys/LINT.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
TB --- 2010-05-16 01:22:51 - WARNING: /usr/bin/make returned exit code  1 
TB --- 2010-05-16 01:22:51 - ERROR: failed to build lint kernel
TB --- 2010-05-16 01:22:51 - 5484.84 user 580.52 system 6509.85 real


http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-amd64-amd64.full


Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Jeremy Chadwick
On Sat, May 15, 2010 at 11:16:33PM +0200, Pieter de Boer wrote:
> >Attached the SMART output of both disks I replaced about a month ago. It
> >appears I replaced perfectly fine drives with the current disks with
> >errors ;(  One of the old disks is in a USB-enclosure now, so 'da0'.

Regarding the Western Digital RE3 disk (serial WD-WMASY5474089):

The disk looks fine.  The only thing of interest here is the
temperature, which is extremely high (47C).  If this is the drive which
is located in a (non-fan-cooled) enclosure, that would explain it.
There are no UDMA/CRC errors, so I'm not of the belief that there were
bad cables in use either.  Finally, there's no sign of the disk powering
on/off excessively either.  In summary, I can't explain how this disk
would fall off the bus given its condition.

Regarding the Western Digital RE3 disk (serial WD-WMASY5474727):

Similar to the first RE3 disk; everything here looks great, including
disk temperature.

I do wish the FreeBSD ATA layer would give full diagnostic messages when
encountering these conditions.  The request buffer could be printed, and
the response (error) could also be printed.  SCSI CAM's error output is
what I'd be hoping for (sans SK/ASC/ASCQ, which AFAIK ATA doesn't have).
Yes, I know this is available if you use ahci.ko, but this isn't
available to the OP.

Anyway, if heavy disk/controller load appears to be causing these
problems, you could have power-related issues.  Possibly the combination
of two disks + heavy I/O causes enough power draw that the ICH9 starts
to behave oddly.  Voltages which deviate too much can cause odd things
to happen to hardware.  If you have the time/money, you might try
replacing the PSU in your system to see if there's any improvement; your
BIOS should be able to provide you Hardware Monitoring statistics
(voltages).  Write these down before and after the PSU swap.  You don't
need to go crazy and buy a 1000W PSU or anything, but 450-750W is pretty
normal these days.

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Pieter de Boer

> Attached the SMART output of both disks I replaced about a month ago. It
> appears I replaced perfectly fine drives with the current disks with
> errors ;(  One of the old disks is in a USB-enclosure now, so 'da0'.


Let's send those attachments, then.

--
Pieter
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE3 Serial ATA family
Device Model:     WDC WD5002ABYS-18B1B0
Serial Number:    WD-WMASY5474089
Firmware Version: 02.03B03
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat May 15 21:53:04 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (9480) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off
                                        support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 112) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   179   021    Pre-fail  Always       -       4033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       89
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5536
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       71
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       89
194 Temperature_Celsius     0x0022   100   094   000    Old_age   Always       -       47
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5487         -
# 2  Extended offline    Completed without error       00%      492

Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Pieter de Boer

Hi,



> That could be caused by a multitude of other known things.  For
> example, some Western Digital "Green" drives (including the
> Enterprise class ones) are known to perform head parking/offloading
> excessively, which could result in the drive spending more time doing
> that than actually serving overall I/O requests.  There are some
> other reports of Samsung Spinpoint drives experiencing other issues
> (I've since forgotten and would have to dig up the threads).
>
> If you could provide full SMART stats for that drive, it might help.

Attached the SMART output of both disks I replaced about a month ago. It
appears I replaced perfectly fine drives with the current disks with
errors ;(  One of the old disks is in a USB-enclosure now, so 'da0'.



> Yes, it's a DOS-based utility (like most firmware upgraders these
> days). I can provide it if you'd like.  I've been meaning to spend
> some time trying to reverse-engineer the binary to figure out what
> ATA commands it sends to the disk to toggle/adjust the feature (so
> that one could do it in real-time rather than have to boot into DOS).


I'd like to try that tool. Since the old WD disks are now lying around
at home, I have some time to get a DOS boot working to try it out. A
FreeBSD-implementation of the WD tool and possibly other brands would be
really useful indeed.


>> At a certain point in time I had read errors from specific LBA's on
>> ad4. Using dd I was able to pinpoint those to single sectors.
>
> This isn't very effective (dd will read large chunks/amounts of data
> (read: multiple LBAs) from the underlying disk at once, rather than
> the disk itself performing a per-LBA test).  My opinion is that the
> "dd method" should only be used on drives which don't support
> selective LBA scanning via SMART.

Will dd read multiple LBAs even when using 'bs=512'? The process I used
was reading using bs=8192, then zooming in on the LBA's mentioned in
the errors in dmesg with bs=512 to find the actual LBA.
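
The narrowing step described above (a coarse pass with bs=8192, then bs=512 on the suspect range) could be sketched roughly as follows. This is only an illustration: the helper names, the assumed 512-byte sector size and any device name passed in are assumptions, not taken from the thread. It also cannot control whether the driver or the drive itself fetches more than the one requested sector.

```sh
#!/bin/sh
# Hypothetical helpers; nothing here is run against a disk until called.

# Print the first and last 512-byte LBA covered by one dd block of a
# larger block size.
# usage: lbas_in_block BLOCK_NUMBER BLOCK_SIZE
lbas_in_block() {
    block=$1
    bs=$2
    per_block=$((bs / 512))
    first=$((block * per_block))
    last=$((first + per_block - 1))
    echo "$first $last"
}

# Read each LBA in [FIRST, LAST] one sector at a time and report the
# ones that dd fails to read.
# usage: probe_lbas DEVICE FIRST LAST
probe_lbas() {
    dev=$1
    lba=$2
    last=$3
    while [ "$lba" -le "$last" ]; do
        if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "read error at LBA $lba"
        fi
        lba=$((lba + 1))
    done
}
```

For example, if the coarse pass failed on block 100 with bs=8192, `lbas_in_block 100 8192` gives the 16 candidate sectors, which can then be handed to `probe_lbas` together with the device node.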

A selective scan on ad4 did not reveal any errors today: it 'completed 
without error'. On ad6 it's a whole lot slower; at the time of writing 
it's at 2/3.



> All HD vendors have their own quirks/ordeals right now.  You
> basically just have to go with one who works well for you, then if
> things start going downhill, switch to another.  None of them are
> perfect.

I figured as much. What irritates though is that I've had consistent 
problems with 4 disks in this specific system, but not (such) issues 
with any other disk in other systems I've had. I generally replace disks 
when I grow out of them, not because they break down.



> What this indicates to me is that if a disk falls off the bus on an
> ICH9 controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an
> absurd number of interrupts generated from the ICH9.  My guess is
> FreeBSD isn't doing something correctly with the controller when this
> happens; maybe certain commands aren't being sent back to the
> controller or handling of certain events is being done improperly
> when it comes to ICH9 (or possibly earlier ICH revisions too).  This
> should be *very* easy to reproduce.


Unfortunately I'm not really in a position to help reproduce this or
test possible fixes; downtime is currently very unwelcome. Although
one of the previous disks indeed fell off the bus entirely (couldn't get
it back with atacontrol either), that hasn't happened again so far. I
only see timeouts (and a few days ago read errors on ad4) which gmirror
doesn't like. I guess those aren't that simple to reproduce (apart from
on my system ;).



> If you see any of your disks on the ICH9 controller fall off the bus
> or report ATA errors (doesn't matter what kind), please make note of
> the timestamp (should be in the kernel log), and ASAP run "smartctl
> -a" on the disk.  You should compare attributes before and after the
> event.
>
> You might also want to consider using smartd, which can log SMART
> attribute changes on its own.  Note that you might have to tune the
> arguments in smartd.conf to ignore some attributes which fluctuate
> naturally (such as drive temperature and seek error rate).
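
The before/after comparison suggested above could be done by saving "smartctl -a" output to two files and diffing only the raw attribute values. The following is a minimal sketch under that assumption; the function name and the file names in the usage note are made up for illustration, and it assumes the usual ten-column attribute table with the raw value in the last column.

```sh
#!/bin/sh
# Hypothetical helper: list SMART attributes whose RAW_VALUE differs
# between two saved "smartctl -a" outputs.
# usage: smart_changes BEFORE_FILE AFTER_FILE
smart_changes() {
    awk '
        # Skip everything that is not an attribute row (ID, name, 0x flag, ...).
        $1 !~ /^[0-9]+$/ || $3 !~ /^0x/ || NF < 10 { next }
        # First file: remember the raw value per attribute ID.
        NR == FNR { before[$1] = $10; next }
        # Second file: print attributes whose raw value changed.
        before[$1] != $10 { print $2 ": " before[$1] " -> " $10 }
    ' "$1" "$2"
}
```

With snapshots saved as, say, before.txt and after.txt, `smart_changes before.txt after.txt` prints one line per changed attribute and nothing when the raw values are identical.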


I've configured smartd to poll both disks every 5 minutes. I -think- the 
issues happen specifically under load: the periodic scripts of the host 
and its 4 jails appear to trigger it sometimes. At that time I'm 
normally trying to get some sleep, so smartd will have to do for now. 
Although I'll run a "smartctl -a" asap anyway.


--
Pieter






Re: Kernel panic when unpluggin AC adaptor

2010-05-15 Thread Brandon Gooch
On Thu, May 13, 2010 at 7:25 PM, Giovanni Trematerra
 wrote:
> On Thu, May 13, 2010 at 1:09 AM, Brandon Gooch
>  wrote:
>> On Wed, May 12, 2010 at 9:41 AM, Attilio Rao  wrote:
>>> 2010/5/12 David DEMELIER :
>>>> I removed the patch and built the kernel (I updated the src this
>>>> morning) and it does not panic now. It's really odd. If it reappears
>>>> soon I will tell you.
>>>
>>> I looked at the code with Giovanni and I have the feeling that the
>>> race with the idle thread may still be fatal.
>>> We need to fix that.
>>>
>>> Attilio
>>>
>>
>> That seems to be the case, as my laptop shows about an 80-85 % chance
>> of experiencing a panic if left idle for long-ish periods of time (2
>> to 4 hours). I usually rebuild world or big ports overnight, and more
>> often than not I wake up to a panicked machine, same situation every
>> time:
>>
>> ...
>> rman_get_bushandle() at rman_get_bushandle+0x1
>> sched_idletd() at sched_idletd+0x123
>> fork_exit() at fork_exit+0x12a
>> fork_trampoline() at fork_trampoline+0xe
>> ...
>>
>> The kernel/userland is rebuilt, the ports are finished compiling --
>> it's in the time AFTER the completion of all tasks that the machine
>> gets bored and tries to kill itself :)
>>
>> I have seen the AC adapter plug/unplug "hang" in the past on this
>> laptop, but I never made the connection between the events, as
>> nowadays my laptop usually stays plugged in :(
>>
>> Attilio, I hope you can track this one down, let me know if I can do
>> anything to help or test...
>>
>
> Attilio and I came up with this patch. It seems ready for stress
> testing and review
> Please test and report back.
>
> Thank you
>
> P.S: all the faults are only mine.

I tried the patch, and my kernel panics on boot. I have
8.5MB(!) of JPG images (6 of them) if anyone needs to see them. I'm
looking for a place to post them, but if anyone wants, I can send via
e-mail...

-Brandon


Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Jeremy Chadwick
On Sat, May 15, 2010 at 09:04:11AM +0200, Pieter de Boer wrote:
> Thanks for your elaborate reply, it was very useful to see smartctl
> output explained a bit :) I still think there's something else in
> play beside disk failure. I've checked one of the drives I replaced
> earlier, but that one doesn't have any of the errors in its SMART
> output you described, although it did drop out of the mirror
> multiple times during its lifetime.

That could be caused by a multitude of other known things.  For example,
some Western Digital "Green" drives (including the Enterprise class
ones) are known to perform head parking/offloading excessively, which
could result in the drive spending more time doing that than actually
serving overall I/O requests.  There are some other reports of Samsung
Spinpoint drives experiencing other issues (I've since forgotten and
would have to dig up the threads).

If you could provide full SMART stats for that drive, it might help.

> >The WD Caviar Black drives have a useful feature called TLER -- it's
> >disabled by default, for reasons which I don't want to get into here --
> >which can force the drive to internally give up after X seconds (it's
> >user-selectable) when dealing with such remapping/errors.  The idea is
> >to keep the drive from being deemed dead from the OS/controller's point
> >of view.  I believe Seagate, Hitachi, or Samsung (I forget which) have
> >this feature as well, but it's not called TLER.
>
> I've read about this feature, but didn't have the time to try to get
> it turned on (iirc you'd need a specific Western Digital DOS-based
> util or something).

Yes, it's a DOS-based utility (like most firmware upgraders these days).
I can provide it if you'd like.  I've been meaning to spend some time
trying to reverse-engineer the binary to figure out what ATA commands it
sends to the disk to toggle/adjust the feature (so that one could do it
in real-time rather than have to boot into DOS).

> >If you want to find out the exact LBA that has the problem (there may be
> >more than one), I can step you through performing a selective LBA scan
> >using SMART, since this model of disk does support such.  It's easy to
> >do, easy to understand the results, and can be done while the drive is
> >in operation (though I would recommend trying to keep disk I/O to a
> >minimum during this test).  Let me know.
>
> At a certain point in time I had read errors from specific LBA's on
> ad4. Using dd I was able to pinpoint those to single sectors.

This isn't very effective (dd will read large chunks/amounts of data
(read: multiple LBAs) from the underlying disk at once, rather than the
disk itself performing a per-LBA test).  My opinion is that the "dd
method" should only be used on drives which don't support selective LBA
scanning via SMART.

> Overwriting those sectors with what was on ad6 made them readable
> again. What is odd is that the 'remapped sector' count of ad4 is 0.

What may have happened is that the drive took a while to read certain
LBAs (long enough for the OS/controller to time out), but that internal
drive ECC was used to correct the reads and the sectors therefore *did
not* need to be remapped.  I do see that Attribute 1 on ad4 is non-zero,
which could indicate said situation, but WD doesn't provide Attribute
195 (ECC recovery rate), which could help here.

SMART implementations are usually quite good (particularly in recent WD
drives), but I have seen situations where certain counters are,
erroneously, not being incremented or changed.  I've seen a couple brand
new disks come out of the factory with non-zero values (indicating
someone at the fab forgot to clear them before shipping).  I'd love to
get my hands on a WD utility that zeros out the counters and re-flashes
the drive firmware to rule out any oddities.

It's been proven already that WD will re-use the same F/W version
number despite some code being changed.  There was a FreeBSD user who
got a F/W fix from WD for the head offloading/parking ordeal (see above,
re: WD GP), and the version numbers of the old and the new firmware
were the same.  Tracking stuff like this down is basically impossible
unless MD5/SHAs of the firmware files can be provided (good luck).

All HD vendors have their own quirks/ordeals right now.  You basically
just have to go with one who works well for you, then if things start
going downhill, switch to another.  None of them are perfect.

> Still I'd like to know how to perform such a scan.

smartctl -t select,0-max 

This will start a selective LBA scan from LBA 0 to the end of the disk.
If any error is encountered, the scan stops and the error -- including
the LBA where an error was seen -- is output in the SMART self-test and
SMART selective self-test logs.  You can then write down the LBA, and
then re-run the above command replacing "0" with the LBA+1 where the
error was seen.
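
The "re-run with LBA+1" step above amounts to a small piece of arithmetic on the span argument. As a hedged sketch (the function name and the example LBA and device in the usage note are hypothetical; only the `select,N-max` form comes from the command shown above):

```sh
#!/bin/sh
# Hypothetical helper: build the "-t" argument that resumes a selective
# scan one sector past the LBA where the previous scan stopped.
# usage: resume_span FAILED_LBA
resume_span() {
    echo "select,$(( $1 + 1 ))-max"
}
```

If a scan stopped at LBA 1234567 on a disk the caller knows as /dev/ad4, the next run would be `smartctl -t "$(resume_span 1234567)" /dev/ad4`, repeated until the scan reaches the end of the disk without an error.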

Here's an example of what a failed selective scan looks like (taken from
a Hitachi disk I just dealt w

Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Miroslav Lachman

Pieter de Boer wrote:

> Hi there,
>
>> what kind of disk I/O is going on. If actual I/O is very little, then
>> something weird is going on with regards to the number of interrupts
>> being seen on IRQ 23. mav@ might have some ideas, otherwise I'd
>> recommend rebooting the machine and seeing if the number drops. If so,
>> it may be that the OS has some sort of bug where a disk timing out or
>> falling off the bus causes interrupt problems. (It's too bad you don't
>> have AHCI on this system. It handles stuff like this much more
>> elegantly...)
>
> Well, due to a UFS snapshot panic the box was rebooted, and now I only
> see around 1500 interrupts per second, while syncing the mirror.


I have seen high interrupt rates on 7.x systems after pulling one drive
out of a gmirror and putting it back in [1], even when it was successfully
disconnected with gmirror remove + atacontrol detach and reconnected with
atacontrol attach + gmirror insert.
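
For readers unfamiliar with the sequence mentioned above, here is a rough sketch of the four steps in order. The function only prints the commands rather than running them, and the mirror name, provider and ATA channel are placeholders that must be supplied by the caller; none of them come from this thread.

```sh
#!/bin/sh
# Hypothetical helper: print (do not run) the command sequence for
# pulling one gmirror member and putting it back.
# usage: swap_member_cmds MIRROR PROVIDER CHANNEL
swap_member_cmds() {
    mirror=$1
    prov=$2
    chan=$3
    echo "gmirror remove $mirror $prov"   # take the disk out of the mirror
    echo "atacontrol detach $chan"        # detach the ATA channel before pulling the disk
    echo "atacontrol attach $chan"        # re-attach the channel once the disk is back
    echo "gmirror insert $mirror $prov"   # add the disk back to the mirror
}
```

For instance, `swap_member_cmds gm0 ad6 ata3` prints the sequence for review; comparing `vmstat -i` output before and after the swap is one way to check whether the interrupt rate has jumped.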

It was not 100% reproducible, but it seems the bug is still there in 8.x.

[1] 
http://lists.freebsd.org/pipermail/freebsd-stable/2008-October/046003.html


Miroslav Lachman


Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Terry Kennedy

> Interesting. Which version of FreeBSD is this system running? I guess
> you didn't experience any of the timeouts I'm seeing?


 8-STABLE as of the 11th of this month, or thereabouts. No, I've never
seen a disk timeout on that box.


> Yeah, this R300 was bought second-hand and unfortunately the owner
> pulled the RAID card out. It's something to consider, getting one of
> those cards. Do you use the RAID-features of the drive and if so, does
> that work well? I'm a bit hesitant to use hardware raid; it would be a
> big plus if the RAID disks could also be used stand-alone if need be
> (which is easy with gmirror because of its metadata being stored in the
> drive's last sector).


Does your system have hot-swap drive bays and the SAS backplane? If it
at least has hot-swap bays, then you could always add the backplane,
cable, and controller.

 I'm using the hardware mirroring on the SAS 6/iR card (with a pair of
WD3000HLFS drives, since the previous owner took the factory drives out
before selling the system).

 I haven't tried taking one of those drives and seeing if it will boot
on a standalone SATA port. I have removed both drives, installed a scratch
drive, and installed Windows on it to run one of the Dell update install-
ers (not all of them come in DOS or Linux flavors). The controller didn't
mind the swap a bit (or the swap back to the 2 RAID drives). That's a lot
better than the old amr-based RAID cards.

   Terry Kennedy http://www.tmk.com
   te...@tmk.com New York, NY USA


Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Pieter de Boer

Hi there,


> what kind of disk I/O is going on.  If actual I/O is very little, then
> something weird is going on with regards to the number of interrupts
> being seen on IRQ 23.  mav@ might have some ideas, otherwise I'd
> recommend rebooting the machine and seeing if the number drops.  If so,
> it may be that the OS has some sort of bug where a disk timing out or
> falling off the bus causes interrupt problems.  (It's too bad you don't
> have AHCI on this system.  It handles stuff like this much more
> elegantly...)

Well, due to a UFS snapshot panic the box was rebooted, and now I only
see around 1500 interrupts per second, while syncing the mirror.


--
Pieter


www/apache22: purpose of WITHOUT_APACHE_OPTIONS?

2010-05-15 Thread Stefan Bethke
Hi,

I was quite surprised that I need to set WITHOUT_APACHE_OPTIONS to have any
command line options honored by the makefile.  All other ports seem to override
the config options (that may or may not be set) with the WITH and WITHOUT
variables specified on the make command line or through pkgtools.conf.  What's
the reason for this difference?


Thanks,
Stefan

-- 
Stefan Bethke   Fon +49 151 14070811





Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Pieter de Boer

Hi Terry,


> I have a bunch of R300's here. From one that is using the on-board SATA
> and 2 drives in a gmirror setup (very similar to the OP) after 18 hours
> of uptime:
>
> [0:2] speedtest:~> vmstat -i
> interrupt                          total       rate
> irq23: atapci0                    254116          3

Interesting. Which version of FreeBSD is this system running? I guess
you didn't experience any of the timeouts I'm seeing?



> I also have another R300 with Dell's "SAS 6/iR" card (a re-branded LSI
> 1068-something, seen as "mpt" by FreeBSD). While Dell only sells that as
> part of a package deal with the hot-swap backplane and redundant power
> supplies, there's no reason you couldn't pick one up on eBay and add it
> yourself. You'll need some sort of breakout cable to get from the big
> connector on the SAS 6 to individual SATA ports.

Yeah, this R300 was bought second-hand and unfortunately the owner
pulled the RAID card out. It's something to consider, getting one of
those cards. Do you use the RAID-features of the drive and if so, does
that work well? I'm a bit hesitant to use hardware raid; it would be a
big plus if the RAID disks could also be used stand-alone if need be
(which is easy with gmirror because of its metadata being stored in the
drive's last sector).


Thanks,
Pieter



Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Pieter de Boer

Hi Jeremy,


> Lots to say about all of this.


Thanks for your elaborate reply, it was very useful to see smartctl 
output explained a bit :) I still think there's something else in play 
beside disk failure. I've checked one of the drives I replaced earlier, 
but that one doesn't have any of the errors in its SMART output you 
described, although it did drop out of the mirror multiple times during 
its lifetime.



> The WD Caviar Black drives have a useful feature called TLER -- it's
> disabled by default, for reasons which I don't want to get into here --
> which can force the drive to internally give up after X seconds (it's
> user-selectable) when dealing with such remapping/errors.  The idea is
> to keep the drive from being deemed dead from the OS/controller's point
> of view.  I believe Seagate, Hitachi, or Samsung (I forget which) have
> this feature as well, but it's not called TLER.

I've read about this feature, but didn't have the time to try to get it
turned on (iirc you'd need a specific Western Digital DOS-based util or
something).



> If you want to find out the exact LBA that has the problem (there may be
> more than one), I can step you through performing a selective LBA scan
> using SMART, since this model of disk does support such.  It's easy to
> do, easy to understand the results, and can be done while the drive is
> in operation (though I would recommend trying to keep disk I/O to a
> minimum during this test).  Let me know.

At a certain point in time I had read errors from specific LBAs on ad4.
Using dd I was able to pinpoint those to single sectors. Overwriting
those sectors with what was on ad6 made them readable again. What is odd
is that the 'remapped sector' count of ad4 is 0.
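
A sketch of the kind of single-sector dd read meant here, assuming
512-byte sectors. The helper name and the LBA are made up for
illustration, and the command is printed rather than run:

```sh
#!/bin/sh
# read_sector_cmd: hypothetical helper that prints a dd command reading
# exactly one sector at the given LBA and discarding the data.
# Assumes 512-byte sectors, so skip=<lba> addresses the sector directly.
read_sector_cmd() {
    dev=$1; lba=$2
    echo "dd if=$dev of=/dev/null bs=512 skip=$lba count=1"
}

# Example with a made-up LBA on ad4.
read_sector_cmd /dev/ad4 1234567
```

If dd reports a read error for that one sector, the problem is at that LBA.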


Still I'd like to know how to perform such a scan.

> Finally, your vmstat -i output:
>
>> # vmstat -i
>> interrupt                          total       rate
>> irq23: atapci0                 371021299      10423
>
> Good to know there's no IRQ sharing going on, but what does worry me is
> the interrupt rate (10K interrupts/second).  That seems *extremely*
> high, but it also depends on what kind of disk I/O is happening on this
> system -- especially since you have 2 disks attached to the same
> controller.

The rate is higher than 1 also at idle. During a gmirror sync from
ad6 to ad4, it's about 10670.
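
Since the rate column of vmstat -i is an average since boot (total
divided by uptime), the current rate is easier to judge from two samples
of the total column. A small sketch; the function name and the sample
totals are made up:

```sh
#!/bin/sh
# current_rate: hypothetical helper. Given two readings of the "total"
# column and the number of seconds between them, print the average
# interrupts per second over just that interval (integer division).
current_rate() {
    first=$1; second=$2; seconds=$3
    echo $(( (second - first) / seconds ))
}

# Two made-up samples taken 10 seconds apart.
current_rate 371021299 371127999 10
```

With these sample numbers the result is 10670 interrupts per second over
the interval.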



> "iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
> what kind of disk I/O is going on.  If actual I/O is very little, then
> something weird is going on with regards to the number of interrupts
> being seen on IRQ 23.  mav@ might have some ideas, otherwise I'd
> recommend rebooting the machine and seeing if the number drops.  If so,
> it may be that the OS has some sort of bug where a disk timing out or
> falling off the bus causes interrupt problems.  (It's too bad you don't
> have AHCI on this system.  It handles stuff like this much more
> elegantly...)

If neither mav@ nor anyone else has further insight into the interrupt
rate, I guess a reboot will at least show whether it's persistent or
related to the errors. I'll try to do a reboot when convenient (probably
Sunday morning or something).


Thanks,
Pieter






[releng_7 tinderbox] failure on sparc64/sparc64

2010-05-15 Thread FreeBSD Tinderbox
TB --- 2010-05-15 05:57:23 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-15 05:57:23 - starting RELENG_7 tinderbox run for sparc64/sparc64
TB --- 2010-05-15 05:57:23 - cleaning the object tree
TB --- 2010-05-15 05:57:51 - cvsupping the source tree
TB --- 2010-05-15 05:57:51 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/sparc64/sparc64/supfile
TB --- 2010-05-15 05:58:04 - building world
TB --- 2010-05-15 05:58:04 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-15 05:58:04 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-15 05:58:04 - TARGET=sparc64
TB --- 2010-05-15 05:58:04 - TARGET_ARCH=sparc64
TB --- 2010-05-15 05:58:04 - TZ=UTC
TB --- 2010-05-15 05:58:04 - __MAKE_CONF=/dev/null
TB --- 2010-05-15 05:58:04 - cd /src
TB --- 2010-05-15 05:58:04 - /usr/bin/make -B buildworld
>>> World build started on Sat May 15 05:58:05 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sat May 15 06:58:39 UTC 2010
TB --- 2010-05-15 06:58:39 - generating LINT kernel config
TB --- 2010-05-15 06:58:39 - cd /src/sys/sparc64/conf
TB --- 2010-05-15 06:58:39 - /usr/bin/make -B LINT
TB --- 2010-05-15 06:58:39 - building LINT kernel
TB --- 2010-05-15 06:58:39 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-15 06:58:39 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-15 06:58:39 - TARGET=sparc64
TB --- 2010-05-15 06:58:39 - TARGET_ARCH=sparc64
TB --- 2010-05-15 06:58:39 - TZ=UTC
TB --- 2010-05-15 06:58:39 - __MAKE_CONF=/dev/null
TB --- 2010-05-15 06:58:39 - cd /src
TB --- 2010-05-15 06:58:39 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sat May 15 06:58:39 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
[...]
===> em (depend)
@ -> /src/sys
machine -> /src/sys/sparc64/include
awk -f @/tools/makeobjops.awk @/kern/device_if.m -h
awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h
awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h
ln -sf /obj/sparc64/src/sys/LINT/opt_inet.h opt_inet.h
make: don't know how to make if_lem.c. Stop
*** Error code 2

Stop in /src/sys/modules.
*** Error code 1

Stop in /obj/sparc64/src/sys/LINT.
*** Error code 1

Stop in /src.
*** Error code 1

Stop in /src.
TB --- 2010-05-15 07:00:06 - WARNING: /usr/bin/make returned exit code  1 
TB --- 2010-05-15 07:00:06 - ERROR: failed to build lint kernel
TB --- 2010-05-15 07:00:06 - 3200.38 user 331.62 system 3763.34 real


http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-sparc64-sparc64.full