Re: Cas driver fails to load first time after boot.

2013-01-28 Thread Paul Keusemann


On 01/25/13 17:34, Marius Strobl wrote:

On Fri, Jan 25, 2013 at 01:14:51PM -0600, Paul Keusemann wrote:

On 01/25/13 10:19, Marius Strobl wrote:

On Thu, Jan 24, 2013 at 08:48:04PM -0600, Paul Keusemann wrote:

On 01/24/13 15:50, Marius Strobl wrote:

On Thu, Jan 24, 2013 at 12:39:44PM -0600, Paul Keusemann wrote:

On 01/24/13 09:09, Marius Strobl wrote:

On Tue, Jan 22, 2013 at 02:46:48PM -0600, Paul Keusemann wrote:

Hi,

I've got a Dell R200 which I'm trying to build into a gateway with a Sun
QGE (501-6738-10).  The cas driver fails to load the first time I try to
load it but succeeds the second time.  Is this a problem with the card,
the driver, my karma?

Wrong phase of the moon, apparently :)
The MII setup of these chips is a bit tricky and I'm not sure whether
I've hit all code paths during development of the driver. I certainly
didn't test with a 501-6738, these have been reported as working before,
though. It also doesn't make much sense that attaching the devices
succeeds on the second attempt. Could you please use a if_cas.ko built
with the attached patch and report the debug output for one of the
interfaces in both the working and the non-working case?

I would love to give you output from the working and non-working case
but apparently the phase of the moon has changed, I can't get it to fail
now.  The messages output from the working case is attached.


Thanks but unfortunately this doesn't make any sense either. In general,
printf()s cause delays which can be relevant. In the locations I've put
them they hardly can make such a difference though.
If you haven't already done so, could you please power off the machine
before doing the test with the patched module? Is the problem still gone
if you revert to the original module?

OK, power-cycling makes a difference.  The driver fails to attach all of
the devices after power-cycling most of the time if not all of the
time.  The number of devices attached varies, the attached message file
fragment is from my last test.  Three of the devices were attached on
the first load attempt and all four of them on the second attempt.

Okay, so we now at least have a way to reproduce the problem.
Unfortunately, it's still unclear what exactly causes it. At least
the problem is not what I suspected, and hoped, it would most likely be.
Could you please test how things behave after a power-cycle with the
attached patch (after reverting the previous one)?

The patched driver fails to compile with the following error message:


...


I found the following definition of nitems in the iwn and usb/wlan drivers:

#define nitems(_a)  (sizeof((_a)) / sizeof((_a)[0]))

so I added it to if_cas.c and rebuilt without errors.
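
For anyone backporting similar patches, the workaround above can be sketched as follows. This is a minimal illustration, not the actual change made to if_cas.c: the #ifndef guard is an added assumption so that the local definition does not clash on releases whose system headers already supply nitems(), and example_table and example_table_len() are hypothetical names used only to show the macro at work.

```c
#include <stddef.h>

/*
 * Fallback definition for releases (such as 8.3) that lack nitems().
 * The guard is an assumption added here so the macro is not redefined
 * where the system headers already provide it.
 */
#ifndef nitems
#define nitems(_a)  (sizeof((_a)) / sizeof((_a)[0]))
#endif

/* Hypothetical table, for illustration only; not taken from if_cas.c. */
static const int example_table[] = { 10, 20, 30, 40 };

/* Returns the number of elements in example_table. */
size_t
example_table_len(void)
{
	return (nitems(example_table));
}
```

Note that nitems() only works on true arrays; applied to a pointer it silently yields the wrong count.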


Sorry, I didn't think of 8.3 not having nitems(), yet. Actually,
this part of the patch is orthogonal to your problem and just a
change I had in that tree.


This looks like it fixed the problem.  I ran three tests from
power-up to loading the driver and the driver loaded successfully all
three times.  I then added if_cas_load=YES to /boot/loader.conf and
did two more successful reboots from power-up.

Great! Thanks a lot for testing!


Will this driver work on FreeBSD 9.1?


Yes, the patch should also solve the problem in 9.1. I suspect the
hang you are seeing there isn't specific to cas(4) but rather a
general regression that came in with the VIMAGE changes. Now, if
a network device driver fails to attach during boot and tries to
clean up by detaching and freeing its interface again at that stage,
this causes problems. I already talked to bz@ about this and, from
what I remember of his reply, this is an ordering issue that is at
least very hard to fix.


OK.  I've successfully upgraded from 8.3-Release to 9.1-Release.  I 
stupidly powered-down the machine after the upgrade, so I had to remove 
the QGE card to get it to boot 9.1 and build a custom kernel.  The patch 
applied cleanly, the kernel built without errors and boots from power-up 
without problems.  I've attached the most recent messages file, dmesg, 
kldstat and ifconfig output if you're interested.  The only odd thing I 
noticed was that cas0 and cas3 log messages:  cannot disable RX MAC 
but cas1 and cas2 do not.  I haven't actually tried any of the 
interfaces yet but I assume they'll work as expected.


Let me know if there's any further testing you'd like me to do.

Thanks so much for your help with this, it is much appreciated.

Paul




Marius




--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378

Copyright (c) 1992-2012 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.1-RELEASE #4: Mon Jan 28 09:02:45 CST 2013
toor@lucid:/usr/obj/usr/src/sys/LUCID amd64
CPU: Intel(R

Re: Cas driver fails to load first time after boot.

2013-01-24 Thread Paul Keusemann


On 01/24/13 09:09, Marius Strobl wrote:

On Tue, Jan 22, 2013 at 02:46:48PM -0600, Paul Keusemann wrote:

Hi,

I've got a Dell R200 which I'm trying to build into a gateway with a Sun
QGE (501-6738-10).  The cas driver fails to load the first time I try to
load it but succeeds the second time.  Is this a problem with the card,
the driver, my karma?

Wrong phase of the moon, apparently :)
The MII setup of these chips is a bit tricky and I'm not sure whether
I've hit all code paths during development of the driver. I certainly
didn't test with a 501-6738, these have been reported as working before,
though. It also doesn't make much sense that attaching the devices
succeeds on the second attempt. Could you please use a if_cas.ko built
with the attached patch and report the debug output for one of the
interfaces in both the working and the non-working case?


I would love to give you output from the working and non-working case 
but apparently the phase of the moon has changed, I can't get it to fail 
now.  The messages output from the working case is attached.


Let me know if there's anything else I can do.


Marius



--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378

Jan 24 11:00:01 lucid newsyslog[2087]: logfile turned over due to size>100K
Jan 24 11:47:39 lucid shutdown: reboot by toor: 
Jan 24 11:47:41 lucid syslogd: exiting on signal 15
Jan 24 11:48:51 lucid syslogd: kernel boot file is /boot/kernel/kernel
Jan 24 11:48:51 lucid kernel: Copyright (c) 1992-2012 The FreeBSD Project.
Jan 24 11:48:51 lucid kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 
1991, 1992, 1993, 1994
Jan 24 11:48:51 lucid kernel: The Regents of the University of California. All 
rights reserved.
Jan 24 11:48:51 lucid kernel: FreeBSD is a registered trademark of The FreeBSD 
Foundation.
Jan 24 11:48:51 lucid kernel: FreeBSD 8.3-RELEASE #0: Thu Jan 24 11:15:13 CST 
2013
Jan 24 11:48:51 lucid kernel: toor@lucid:/usr/obj/usr/src/sys/LUCID amd64
Jan 24 11:48:51 lucid kernel: Timecounter i8254 frequency 1193182 Hz quality 0
Jan 24 11:48:51 lucid kernel: CPU: Intel(R) Xeon(R) CPU   X3210  @ 
2.13GHz (2133.42-MHz K8-class CPU)
Jan 24 11:48:51 lucid kernel: Origin = GenuineIntel  Id = 0x6fb  Family = 6  
Model = f  Stepping = 11
Jan 24 11:48:51 lucid kernel: Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Jan 24 11:48:51 lucid kernel: Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
Jan 24 11:48:51 lucid kernel: AMD Features=0x20100800<SYSCALL,NX,LM>
Jan 24 11:48:51 lucid kernel: AMD Features2=0x1<LAHF>
Jan 24 11:48:51 lucid kernel: TSC: P-state invariant
Jan 24 11:48:51 lucid kernel: real memory  = 4294967296 (4096 MB)
Jan 24 11:48:51 lucid kernel: avail memory = 4099231744 (3909 MB)
Jan 24 11:48:51 lucid kernel: ACPI APIC Table: DELL   PE_SC3  
Jan 24 11:48:51 lucid kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 
CPUs
Jan 24 11:48:51 lucid kernel: FreeBSD/SMP: 1 package(s) x 4 core(s)
Jan 24 11:48:51 lucid kernel: cpu0 (BSP): APIC ID:  0
Jan 24 11:48:51 lucid kernel: cpu1 (AP): APIC ID:  1
Jan 24 11:48:51 lucid kernel: cpu2 (AP): APIC ID:  2
Jan 24 11:48:51 lucid kernel: cpu3 (AP): APIC ID:  3
Jan 24 11:48:51 lucid kernel: ioapic0: Changing APIC ID to 4
Jan 24 11:48:51 lucid kernel: ioapic1: Changing APIC ID to 5
Jan 24 11:48:51 lucid kernel: ioapic0 Version 2.0 irqs 0-23 on motherboard
Jan 24 11:48:51 lucid kernel: ioapic1 Version 2.0 irqs 32-55 on motherboard
Jan 24 11:48:51 lucid kernel: kbd1 at kbdmux0
Jan 24 11:48:51 lucid kernel: acpi0: DELL PE_SC3 on motherboard
Jan 24 11:48:51 lucid kernel: acpi0: [ITHREAD]
Jan 24 11:48:51 lucid kernel: acpi0: Power Button (fixed)
Jan 24 11:48:51 lucid kernel: Timecounter ACPI-fast frequency 3579545 Hz 
quality 1000
Jan 24 11:48:51 lucid kernel: acpi_timer0: 24-bit timer at 3.579545MHz port 
0x808-0x80b on acpi0
Jan 24 11:48:51 lucid kernel: cpu0: ACPI CPU on acpi0
Jan 24 11:48:51 lucid kernel: cpu1: ACPI CPU on acpi0
Jan 24 11:48:51 lucid kernel: cpu2: ACPI CPU on acpi0
Jan 24 11:48:51 lucid kernel: cpu3: ACPI CPU on acpi0
Jan 24 11:48:51 lucid kernel: pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on 
acpi0
Jan 24 11:48:51 lucid kernel: pci0: ACPI PCI bus on pcib0
Jan 24 11:48:51 lucid kernel: pcib1: ACPI PCI-PCI bridge irq 16 at device 1.0 
on pci0
Jan 24 11:48:51 lucid kernel: pci1: ACPI PCI bus on pcib1
Jan 24 11:48:51 lucid kernel: pcib2: ACPI PCI-PCI bridge irq 16 at device 
28.0 on pci0
Jan 24 11:48:51 lucid kernel: pci2: ACPI PCI bus on pcib2
Jan 24 11:48:51 lucid kernel: pcib3: ACPI PCI-PCI bridge at device 0.0 on pci2
Jan 24 11:48:51 lucid kernel: pci3: ACPI PCI bus on pcib3
Jan 24 11:48:51 lucid kernel: pcib4: PCI-PCI bridge at device 2.0 on pci3
Jan 24 11:48:51 lucid kernel: pci4: PCI bus on pcib4
Jan 24 11:48:51 lucid kernel: pci4

Re: Cas driver fails to load first time after boot.

2013-01-24 Thread Paul Keusemann


On 01/24/13 15:50, Marius Strobl wrote:

On Thu, Jan 24, 2013 at 12:39:44PM -0600, Paul Keusemann wrote:

On 01/24/13 09:09, Marius Strobl wrote:

On Tue, Jan 22, 2013 at 02:46:48PM -0600, Paul Keusemann wrote:

Hi,

I've got a Dell R200 which I'm trying to build into a gateway with a Sun
QGE (501-6738-10).  The cas driver fails to load the first time I try to
load it but succeeds the second time.  Is this a problem with the card,
the driver, my karma?

Wrong phase of the moon, apparently :)
The MII setup of these chips is a bit tricky and I'm not sure whether
I've hit all code paths during development of the driver. I certainly
didn't test with a 501-6738, these have been reported as working before,
though. It also doesn't make much sense that attaching the devices
succeeds on the second attempt. Could you please use a if_cas.ko built
with the attached patch and report the debug output for one of the
interfaces in both the working and the non-working case?

I would love to give you output from the working and non-working case
but apparently the phase of the moon has changed, I can't get it to fail
now.  The messages output from the working case is attached.


Thanks but unfortunately this doesn't make any sense either. In general,
printf()s cause delays which can be relevant. In the locations I've put
them they hardly can make such a difference though.
If you haven't already done so, could you please power off the machine
before doing the test with the patched module? Is the problem still gone
if you revert to the original module?


OK, power-cycling makes a difference.  The driver fails to attach all of 
the devices after power-cycling most of the time if not all of the 
time.  The number of devices attached varies, the attached message file 
fragment is from my last test.  Three of the devices were attached on 
the first load attempt and all four of them on the second attempt.


In the interest of full disclosure, I did build a new kernel but it is 
just a copy of GENERIC.  This is a




Marius




--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378

Jan 24 20:32:32 lucid kernel: Copyright (c) 1992-2012 The FreeBSD Project.
Jan 24 20:32:32 lucid kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 
1991, 1992, 1993, 1994
Jan 24 20:32:32 lucid kernel: The Regents of the University of California. All 
rights reserved.
Jan 24 20:32:32 lucid kernel: FreeBSD is a registered trademark of The FreeBSD 
Foundation.
Jan 24 20:32:32 lucid kernel: FreeBSD 8.3-RELEASE #0: Thu Jan 24 11:15:13 CST 
2013
Jan 24 20:32:32 lucid kernel: toor@lucid:/usr/obj/usr/src/sys/LUCID amd64
Jan 24 20:32:32 lucid kernel: Timecounter i8254 frequency 1193182 Hz quality 0
Jan 24 20:32:32 lucid kernel: CPU: Intel(R) Xeon(R) CPU   X3210  @ 
2.13GHz (2133.42-MHz K8-class CPU)
Jan 24 20:32:32 lucid kernel: Origin = GenuineIntel  Id = 0x6fb  Family = 6  
Model = f  Stepping = 11
Jan 24 20:32:32 lucid kernel: Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Jan 24 20:32:32 lucid kernel: Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
Jan 24 20:32:32 lucid kernel: AMD Features=0x20100800<SYSCALL,NX,LM>
Jan 24 20:32:32 lucid kernel: AMD Features2=0x1<LAHF>
Jan 24 20:32:32 lucid kernel: TSC: P-state invariant
Jan 24 20:32:32 lucid kernel: real memory  = 4294967296 (4096 MB)
Jan 24 20:32:32 lucid kernel: avail memory = 4099231744 (3909 MB)
Jan 24 20:32:32 lucid kernel: ACPI APIC Table: DELL   PE_SC3  
Jan 24 20:32:32 lucid kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 
CPUs
Jan 24 20:32:32 lucid kernel: FreeBSD/SMP: 1 package(s) x 4 core(s)
Jan 24 20:32:32 lucid kernel: cpu0 (BSP): APIC ID:  0
Jan 24 20:32:32 lucid kernel: cpu1 (AP): APIC ID:  1
Jan 24 20:32:32 lucid kernel: cpu2 (AP): APIC ID:  2
Jan 24 20:32:32 lucid kernel: cpu3 (AP): APIC ID:  3
Jan 24 20:32:32 lucid kernel: ioapic0: Changing APIC ID to 4
Jan 24 20:32:32 lucid kernel: ioapic1: Changing APIC ID to 5
Jan 24 20:32:32 lucid kernel: ioapic0 Version 2.0 irqs 0-23 on motherboard
Jan 24 20:32:32 lucid kernel: ioapic1 Version 2.0 irqs 32-55 on motherboard
Jan 24 20:32:32 lucid kernel: kbd1 at kbdmux0
Jan 24 20:32:32 lucid kernel: acpi0: DELL PE_SC3 on motherboard
Jan 24 20:32:32 lucid kernel: acpi0: [ITHREAD]
Jan 24 20:32:32 lucid kernel: acpi0: Power Button (fixed)
Jan 24 20:32:32 lucid kernel: Timecounter ACPI-fast frequency 3579545 Hz 
quality 1000
Jan 24 20:32:32 lucid kernel: acpi_timer0: 24-bit timer at 3.579545MHz port 
0x808-0x80b on acpi0
Jan 24 20:32:32 lucid kernel: cpu0: ACPI CPU on acpi0
Jan 24 20:32:32 lucid kernel: cpu1: ACPI CPU on acpi0
Jan 24 20:32:32 lucid kernel: cpu2: ACPI CPU on acpi0
Jan 24 20:32:32 lucid kernel: cpu3: ACPI CPU on acpi0
Jan 24 20:32:32 lucid kernel: pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on 
acpi0

Cas driver fails to load first time after boot.

2013-01-22 Thread Paul Keusemann
-FDX, 
100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 
1000baseT-FDX-master,auto, auto-flow

Jan 22 14:04:33 lucid kernel: cas3: 16kB RX FIFO, 9kB TX FIFO
Jan 22 14:04:33 lucid kernel: cas3: Ethernet address: 00:14:4f:25:ca:13
Jan 22 14:04:33 lucid kernel: cas3: [FILTER]


The following are attached:
/var/run/dmesg.boot
dmesg output after the second attempt to load the cas driver.
/var/log/messages after the second attempt to load the cas driver.


--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378

Copyright (c) 1992-2012 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.3-RELEASE #0: Mon Apr  9 21:23:18 UTC 2012
r...@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(R) CPU   X3210  @ 2.13GHz (2133.42-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x6fb  Family = 6  Model = f  Stepping = 11
  
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 4294967296 (4096 MB)
avail memory = 4099231744 (3909 MB)
ACPI APIC Table: DELL   PE_SC3  
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
ioapic0: Changing APIC ID to 4
ioapic1: Changing APIC ID to 5
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 32-55 on motherboard
kbd1 at kbdmux0
acpi0: DELL PE_SC3 on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x808-0x80b on acpi0
cpu0: ACPI CPU on acpi0
cpu1: ACPI CPU on acpi0
cpu2: ACPI CPU on acpi0
cpu3: ACPI CPU on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
pcib1: ACPI PCI-PCI bridge irq 16 at device 1.0 on pci0
pci1: ACPI PCI bus on pcib1
pcib2: ACPI PCI-PCI bridge irq 16 at device 28.0 on pci0
pci2: ACPI PCI bus on pcib2
pcib3: ACPI PCI-PCI bridge at device 0.0 on pci2
pci3: ACPI PCI bus on pcib3
pcib4: PCI-PCI bridge at device 2.0 on pci3
pci4: PCI bus on pcib4
pci4: network, ethernet at device 0.0 (no driver attached)
pci4: network, ethernet at device 1.0 (no driver attached)
pci4: network, ethernet at device 2.0 (no driver attached)
pci4: network, ethernet at device 3.0 (no driver attached)
pcib5: ACPI PCI-PCI bridge irq 16 at device 28.4 on pci0
pci5: ACPI PCI bus on pcib5
bge0: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x004201 mem 
0xdf3f-0xdf3f irq 16 at device 0.0 on pci5
bge0: CHIP ID 0x4201; ASIC REV 0x04; CHIP REV 0x42; PCI-E
miibus0: MII bus on bge0
brgphy0: BCM5750 10/100/1000baseTX PHY PHY 1 on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge0: Ethernet address: 00:19:b9:fa:82:51
bge0: [ITHREAD]
pcib6: ACPI PCI-PCI bridge irq 17 at device 28.5 on pci0
pci6: ACPI PCI bus on pcib6
bge1: Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x004201 mem 
0xdf4f-0xdf4f irq 17 at device 0.0 on pci6
bge1: CHIP ID 0x4201; ASIC REV 0x04; CHIP REV 0x42; PCI-E
miibus1: MII bus on bge1
brgphy1: BCM5750 10/100/1000baseTX PHY PHY 1 on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge1: Ethernet address: 00:19:b9:fa:82:52
bge1: [ITHREAD]
uhci0: Intel 82801I (ICH9) USB controller port 0xdc60-0xdc7f irq 21 at device 
29.0 on pci0
uhci0: [ITHREAD]
usbus0: Intel 82801I (ICH9) USB controller on uhci0
uhci1: Intel 82801I (ICH9) USB controller port 0xdc80-0xdc9f irq 20 at device 
29.1 on pci0
uhci1: [ITHREAD]
usbus1: Intel 82801I (ICH9) USB controller on uhci1
uhci2: Intel 82801I (ICH9) USB controller port 0xdca0-0xdcbf irq 21 at device 
29.2 on pci0
uhci2: [ITHREAD]
usbus2: Intel 82801I (ICH9) USB controller on uhci2
ehci0: Intel 82801I (ICH9) USB 2.0 controller mem 0xdf2ffc00-0xdf2f irq 
21 at device 29.7 on pci0
ehci0: [ITHREAD]
usbus3: EHCI version 1.0
usbus3: Intel 82801I (ICH9) USB 2.0 controller on ehci0
pcib7: ACPI PCI-PCI bridge at device 30.0 on pci0
pci7: ACPI PCI bus on pcib7
vgapci0: VGA-compatible display port 0xec00-0xecff mem 
0xd000-0xd7ff,0xdf5f-0xdf5f irq 19 at device 5.0 on pci7
isab0: PCI-ISA bridge at device 31.0 on pci0
isa0: ISA bus on isab0
atapci0

Re: Debugging dropped shell connections over a VPN

2011-07-27 Thread Paul Keusemann

On 07/27/11 06:50, Gary Palmer wrote:

On Tue, Jul 26, 2011 at 01:35:16PM -0500, Paul Keusemann wrote:

On 07/26/11 08:05, Gary Palmer wrote:

On Tue, Jul 26, 2011 at 06:53:59AM -0500, Paul Keusemann wrote:

Again, sorry for the sluggish response.

On 07/20/11 15:15, Gary Palmer wrote:

On Tue, Jul 12, 2011 at 02:26:34PM -0500, Paul Keusemann wrote:

On 07/07/11 14:39, Chuck Swiger wrote:

On Jul 7, 2011, at 4:45 AM, Paul Keusemann wrote:

My setup is something like this:
- My local network is a mix of AIX, HP-UX, Linux, FreeBSD and Solaris
machines running various OS versions.
- My gateway / firewall  machine is running FreeBSD-8.1-RELEASE-p1
with
ipfw, nat and racoon for the firewall and VPN.

The problem is that rlogin, ssh and telnet connections over the VPN
get
dropped after some period of inactivity.

You're probably getting NAT timeouts against the VPN connection if it
is
left idle.  racoon ought to have a config setting called natt_keepalive
which sends periodic keepalives-- see whether that's disabled.

Regards,

Thanks for the suggestions Chuck, sorry it's taken so long to respond
but I had to reconfigure and rebuild my kernel to enable IPSEC_NAT_T in
order to try this out.

One thing that I did not explicitly mention before is that I am routing
a network over the VPN.

Hi Paul,

Even if you are not being NAT'd on the VPN there may be a firewall (or
other active network component like a load balancer) with an
overflowing state table somewhere at the remote end.  We see this
frequently where I work with customer networks and the
firewall/VPN/network
admin denies that it's a timeout issue so there is likely some device in
the network that has a state table and if the connection is idle for a
few minutes it gets dropped.

Hmmm,  this seems likely.  Have you had any luck in finding the culprit
and resolving the problem?

Unfortunately no.  We know the problem exists but as a vendor we have
very little success in getting the customer to identify the problematic
device inside their network as it only seems to affect our connections
to them when we are helping them with problems, so there is almost
always something more important going on and the timeout issue gets put
on the back burner and forgotten.  We've worked around it in some
places by using the ssh 'ServerAliveInterval' directive to make ssh
send packets and keep the session open even if we're idle, but that
doesn't always work.

OK, I found the ClientAliveInterval and ClientAliveCountMax settings in
the ssh_config man page.  I assume these are what you are referring to.
I tried setting ClientAliveInterval to 15 seconds with
ClientAliveCountMax set to 3 and this seems to help.  I've only tried
this a couple of times but I have seen an ssh session stay alive for
over an hour.  The bad news is that the sessions are still getting
dropped, at least now I know when it happens.  Now I'm getting the
following message:

 Received disconnect from 10.64.20.69: 2: Timeout, your session not
responding.

 From a quick perusal of the openssh source, it is not obvious whether
this message is coming from the client or the server side.   Initially,
because the keep alive timer is a server side setting, I assumed the
message was coming from the server side but if the session is not
responding how is the message getting to the client?  If it is a client
side problem, then I have much more flexibility to fix.  All I can do is
whine about server side problems.


Hi Paul,

ServerAliveInterval is actually a client setting.  e.g.  put this in
your ~/.ssh/config file

host *
    ServerAliveInterval 15

will set the client to ping the server every 15 seconds and try to
keep the connection alive.  You can replace '*' with a specific host
pattern if you want to be more targeted in your configuration.
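
ServerAliveInterval works inside the SSH protocol, so it does nothing for the rlogin and telnet sessions mentioned at the start of the thread. For programs whose source is at hand, the nearest analogue is TCP-level keepalive. The following is a minimal sketch under stated assumptions: enable_keepalive() is a hypothetical helper name, the 15-second values are only illustrative, and TCP_KEEPIDLE and TCP_KEEPINTVL are not available on every system, hence the #ifdef guards.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: turn on TCP keepalive probes for a TCP socket.
 * idle_secs is the idle time before the first probe and intvl_secs the
 * gap between probes; both are applied only where the platform defines
 * the corresponding socket options.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt()).
 */
int
enable_keepalive(int fd, int idle_secs, int intvl_secs)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
#ifdef TCP_KEEPIDLE
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
	    sizeof(idle_secs)) == -1)
		return (-1);
#endif
#ifdef TCP_KEEPINTVL
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs,
	    sizeof(intvl_secs)) == -1)
		return (-1);
#endif
	/* Silence unused-parameter warnings where the options are missing. */
	(void)idle_secs;
	(void)intvl_secs;
	return (0);
}
```

A caller would invoke enable_keepalive(fd, 15, 15) right after socket() or connect(). Whether the probes are enough to keep an entry alive in an intermediate state table depends on the device in question, so this is something to try rather than a guaranteed fix.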


Ah, I see.  I was looking at the Solaris ssh_config man page.  The 
OpenSSH ssh_config man page is third in the sequence.  The ServerAlive* 
options are not documented in the Solaris ssh_config man page.  I'll try 
it out too.  Thanks.



I've never played with the server side settings for various reasons.

Regards,

Gary
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org




--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378



Re: Debugging dropped shell connections over a VPN

2011-07-26 Thread Paul Keusemann
Once again, apologies for my sluggish response.  The VPN problem is a 
background job worked on when I can or when I'm too annoyed by it to do 
anything else.


On 07/12/11 17:42, Chuck Swiger wrote:

On Jul 12, 2011, at 12:26 PM, Paul Keusemann wrote:

So, any other ideas on how to debug this?

Gather data with tcpdump.  If you do it on one of the VPN endpoints, you ought 
to see the VPN contents rather than just packets going by in the encrypted 
tunnel.



I assume by endpoint, you are talking about the target of the remote 
shell.  Unfortunately, running tcpdump on the endpoint shows only the 
initial negotiation (and any interactive keyboard traffic) but nothing 
to indicate the connection has been dropped or timed out.


If I can get some time when I don't actually need to use the VPN for 
work, I'm going to try to run tcpdump on the tunnel to see if there's 
anything going across it that might shed some light on the cause of the 
dropped connections.



Anybody know how to get racoon to log everything to one file?  Right now, 
depending on the log level, I am getting messages in racoon.log (specified with 
-l at startup), messages and debug.log.  It would really be nice to have just 
one log to look at.

This is likely governed by /etc/syslog.conf, but if you specify -l then racoon 
shouldn't use syslog logging.


My syslog.conf foo is not good but it seems that some stuff  from racoon 
always ends up in the messages file, even when the -l option to racoon 
is specified.


Thanks again for the tips.

--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378



Re: Debugging dropped shell connections over a VPN

2011-07-26 Thread Paul Keusemann

Again, sorry for the sluggish response.

On 07/20/11 15:15, Gary Palmer wrote:

On Tue, Jul 12, 2011 at 02:26:34PM -0500, Paul Keusemann wrote:

On 07/07/11 14:39, Chuck Swiger wrote:

On Jul 7, 2011, at 4:45 AM, Paul Keusemann wrote:

My setup is something like this:
- My local network is a mix of AIX, HP-UX, Linux, FreeBSD and Solaris
machines running various OS versions.
- My gateway / firewall  machine is running FreeBSD-8.1-RELEASE-p1 with
ipfw, nat and racoon for the firewall and VPN.

The problem is that rlogin, ssh and telnet connections over the VPN get
dropped after some period of inactivity.

You're probably getting NAT timeouts against the VPN connection if it is
left idle.  racoon ought to have a config setting called natt_keepalive
which sends periodic keepalives-- see whether that's disabled.

Regards,

Thanks for the suggestions Chuck, sorry it's taken so long to respond
but I had to reconfigure and rebuild my kernel to enable IPSEC_NAT_T in
order to try this out.

One thing that I did not explicitly mention before is that I am routing
a network over the VPN.

Hi Paul,

Even if you are not being NAT'd on the VPN there may be a firewall (or
other active network component like a load balancer) with an
overflowing state table somewhere at the remote end.  We see this
frequently where I work with customer networks and the firewall/VPN/network
admin denies that it's a timeout issue so there is likely some device in
the network that has a state table and if the connection is idle for a
few minutes it gets dropped.


Hmmm,  this seems likely.  Have you had any luck in finding the culprit 
and resolving the problem?




Regards,

Gary




--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378



Re: Debugging dropped shell connections over a VPN

2011-07-26 Thread Paul Keusemann

On 07/26/11 08:05, Gary Palmer wrote:

On Tue, Jul 26, 2011 at 06:53:59AM -0500, Paul Keusemann wrote:

Again, sorry for the sluggish response.

On 07/20/11 15:15, Gary Palmer wrote:

On Tue, Jul 12, 2011 at 02:26:34PM -0500, Paul Keusemann wrote:

On 07/07/11 14:39, Chuck Swiger wrote:

On Jul 7, 2011, at 4:45 AM, Paul Keusemann wrote:

My setup is something like this:
- My local network is a mix of AIX, HP-UX, Linux, FreeBSD and Solaris
machines running various OS versions.
- My gateway / firewall  machine is running FreeBSD-8.1-RELEASE-p1 with
ipfw, nat and racoon for the firewall and VPN.

The problem is that rlogin, ssh and telnet connections over the VPN get
dropped after some period of inactivity.

You're probably getting NAT timeouts against the VPN connection if it is
left idle.  racoon ought to have a config setting called natt_keepalive
which sends periodic keepalives-- see whether that's disabled.

Regards,

Thanks for the suggestions Chuck, sorry it's taken so long to respond
but I had to reconfigure and rebuild my kernel to enable IPSEC_NAT_T in
order to try this out.

One thing that I did not explicitly mention before is that I am routing
a network over the VPN.

Hi Paul,

Even if you are not being NAT'd on the VPN there may be a firewall (or
other active network component like a load balancer) with an
overflowing state table somewhere at the remote end.  We see this
frequently where I work with customer networks and the firewall/VPN/network
admin denies that it's a timeout issue so there is likely some device in
the network that has a state table and if the connection is idle for a
few minutes it gets dropped.

Hmmm,  this seems likely.  Have you had any luck in finding the culprit
and resolving the problem?

Unfortunately no.  We know the problem exists but as a vendor we have
very little success in getting the customer to identify the problematic
device inside their network as it only seems to affect our connections
to them when we are helping them with problems, so there is almost
always something more important going on and the timeout issue gets put
on the back burner and forgotten.  We've worked around it in some
places by using the ssh 'ServerAliveInterval' directive to make ssh
send packets and keep the session open even if we're idle, but that
doesn't always work.


OK, I found the ClientAliveInterval and ClientAliveCountMax settings in
the ssh_config man page.  I assume these are what you are referring to.  
I tried setting ClientAliveInterval to 15 seconds with 
ClientAliveCountMax set to 3 and this seems to help.  I've only tried 
this a couple of times but I have seen an ssh session stay alive for 
over an hour.  The bad news is that the sessions are still getting 
dropped, at least now I know when it happens.  Now I'm getting the 
following message:


Received disconnect from 10.64.20.69: 2: Timeout, your session not 
responding.


From a quick perusal of the openssh source, it is not obvious whether 
this message is coming from the client or the server side.   Initially, 
because the keep alive timer is a server side setting, I assumed the 
message was coming from the server side but if the session is not 
responding how is the message getting to the client?  If it is a client 
side problem, then I have much more flexibility to fix.  All I can do is 
whine about server side problems.


Paul



Gary




--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378



Re: Debugging dropped shell connections over a VPN

2011-07-12 Thread Paul Keusemann

On 07/07/11 14:39, Chuck Swiger wrote:

On Jul 7, 2011, at 4:45 AM, Paul Keusemann wrote:

My setup is something like this:
- My local network is a mix of AIX, HP-UX, Linux, FreeBSD and Solaris machines 
running various OS versions.
- My gateway / firewall  machine is running FreeBSD-8.1-RELEASE-p1 with ipfw, 
nat and racoon for the firewall and VPN.

The problem is that rlogin, ssh and telnet connections over the VPN get dropped 
after some period of inactivity.

You're probably getting NAT timeouts against the VPN connection if it is left 
idle.  racoon ought to have a config setting called natt_keepalive which sends 
periodic keepalives-- see whether that's disabled.

Regards,


Thanks for the suggestions Chuck, sorry it's taken so long to respond 
but I had to reconfigure and rebuild my kernel to enable IPSEC_NAT_T in 
order to try this out.


One thing that I did not explicitly mention before is that I am routing 
a network over the VPN.


I did not previously have NAT-Traversal enabled nor was it configured in
my kernel.  After reconfiguring, compiling and installing the new 
kernel, I added the following to the phase 1 configuration for my VPN:


timer
{
        # Default is 20 seconds.
        natt_keepalive 10 sec;
}

# Enable NAT traversal.
#nat_traversal on;
nat_traversal force;

# Enable IKE fragmentation.
ike_frag on;

# Enable ESP fragmentation at 552 bytes.
esp_frag 552;

The only immediately noticeable change is that I am no longer getting 
the following warnings at racoon startup:


WARNING: setsockopt(UDP_ENCAP_ESPINUDP_NON_IKE): UDP_ENCAP Invalid argument


I assume this is the result of adding IPSEC_NAT_T to the kernel config.  
My shell connections are still being dropped, so I'm pretty much back to 
square one.


So, any other ideas on how to debug this?

Anybody know how to get racoon to log everything to one file?  Right 
now, depending on the log level, I am getting messages in racoon.log 
(specified with -l at startup), messages and debug.log.  It would really 
be nice to have just one log to look at.


--
Paul Keusemann  pkeu...@visi.com
4266 Joppa Court  (952) 894-7805
Savage, MN  55378
