Re: Bizarre arp entry corruption
Joe Holden(m...@m.jwh.me.uk) on 2017.03.09 13:41:26 +:
On 09/03/2017 11:51, Martin Pieuchot wrote:
> On 07/03/17(Tue) 19:38, Joe Holden wrote:
> > On 12/12/2016 16:55, Joe Holden wrote:
> > > On 12/12/2016 10:27, Martin Pieuchot wrote:
> > > > On 11/12/16(Sun) 00:50, Joe Holden wrote:
> > > > > On 10/12/2016 08:43, Mihai Popescu wrote:
> > > > > > > > seeing some bizarre behaviour on one box, on one specific
> > > > > > > > interface:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > This looks like some stupid TV game, where contesters are given some
> > > > > > clues from time to time and they have to guess what is the real shit.
> > > > > >
> > > > > > Do post your FULL dmesg and configurations for network if you really
> > > > > > want someone to even think at your issue. Isn't that obvious?
> > > > > >
> > > > > > Bye!
> > > > > >
> > > > >
> > > > > Appreciate the useless response (but still better than nothing!), the
> > > > > affected box has since been reverted to older snapshot and thus no more
> > > > > debugging can be done - someone else will have to do it.
> > > >
> > > > I'd appreciate to see the output of 'netstat -rnf inet' when it is
> > > > relevant. Without that information it's hard to understand.
> > > >
> > > > But there's a bug somewhere, it has to be fixed.
> > > >
> > > > > Not that dmesg is even relevant since it is a userland bug not a kernel
> > > > > problem but anyway:
> > > >
> > > > It's a kernel problem.
> > > >
> > > I'll see if I can recreate it but I'm not holding my breath - it only
> > > breaks once BGP loaded the table which leads me to think it is actually
> > > bgpd that is updating the llinfo with bogus info and even though I have
> > > a feed in my lab it doesn't do the same thing.
> > >
> > Ok so, inadvertently recreated this (pretty much exactly the same) issue on
> > a lab/test setup:
> >
> > For the purposes of debug, ignore the fact that the interfaces are tap
> > interfaces, they're still emulated ethernet...
> >
> > Wall of text incoming, various info...
> >
> > box#1:
> >
> > tap1: flags=8843 mtu 1500
> >         lladdr fe:e1:ba:d1:be:f3
> >         index 7 priority 0 llprio 3
> >         groups: tap
> >         status: active
> >         inet 172.20.230.72 netmask 0xfffe
> >
> > box#2:
> >
> > tap1: flags=8843 mtu 1500
> >         lladdr fe:e1:ba:d1:cf:92
> >         index 7 priority 0 llprio 3
> >         groups: tap
> >         status: active
> >         inet 172.20.230.73 netmask 0xfffe
> >
> > All is fine after starting ospfd, but as soon as I start bgpd, box#2 shows
> > the following:
> >
> > Host                 Ethernet Address   Netif Expire     Flags
> > 172.20.230.72        00:00:00:00:20:12      ? 12m30s
> >
> > # route -n get 172.20.230.72
> >    route to: 172.20.230.72
> > destination: 172.20.230.72
> >        mask: 255.255.255.255
> >   interface: tap1
> >  if address: 172.20.230.73
> >    priority: 3 ()
> >       flags:
> >      use       mtu    expire
> >       20         0       702
> >
> > flags destination          gateway          lpref   med aspath origin
> > IS*>  172.20.230.72/31     172.20.230.64      200     0 i
> >
> > .64 is the loopback on one of its connected boxes that doesn't have broken
> > entries
> >
> > tcpdump looks ok, afterwards:
> >
> > 19:14:23.723876 arp who-has 172.20.230.72 tell 172.20.230.73
> > 19:14:23.901883 arp reply 172.20.230.72 is-at fe:e1:ba:d1:be:f3
> > 19:14:24.022948 arp who-has 172.20.230.72 tell 172.20.230.73
> > 19:14:24.201095 arp reply 172.20.230.72 is-at fe:e1:ba:d1:be:f3
> >
> > but the correct entry is never installed, after I delete the broken arp
> > entry it never readds a new one.
> >
> > This only happens with redist connected as far as I can tell, but bgpd
> > probably shouldn't be able to mangle arp entries and prevent the correct one
> > being added.
>
> Here's the fix.
>
> Index: net/rtsock.c
> ===================================================================
> RCS file: /cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.232
> diff -u -p -r1.232 rtsock.c
> --- net/rtsock.c	7 Mar 2017 09:23:27 -	1.232
> +++ net/rtsock.c	8 Mar 2017 16:06:22 -
> @@ -895,10 +895,22 @@ rtm_output(struct rt_msghdr *rtm, struct
>  				}
>  			}
>  change:
> -			if (info->rti_info[RTAX_GATEWAY] != NULL && (error =
> -			    rt_setgate(rt, info->rti_info[RTAX_GATEWAY],
> -			    tableid)))
> -				break;
> +			if (info->rti_info[RTAX_GATEWAY] != NULL) {
> +				/*
> +				 * When updating the gateway, make sure it's
> +				 * valid.
> +				 */
> +				if (!newgate && rt->rt_gateway->sa_family !=
> +				    info->rti_info[RTAX_GATEWAY]->sa_family) {
> +					error = EINVAL;
> +					break;
> +				}
> +
> +				error = rt_setgate(rt,
> +				    info->rti_info[RTAX_GATEWAY], tableid);
> +				if (error)
> +					break;
> +			}
>  #ifdef MPLS
>  			if ((rtm->rtm_flags & RTF_MPLS) &&
>  			    info->rti_info[RTAX_SRC] != NULL) {

Looking good - have tried to break it since and it's fine, thanks for
your help!

Will this make it into 6.1?
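Editorial sketch of the check the patch in this thread adds, pulled out into a userland function so it can be read and exercised on its own. This is not the kernel code: the function name check_gateway_change() is hypothetical, newgate is passed in as a plain flag standing in for the kernel's local variable of the same name, and the AF_LINK fallback value exists only so the example builds on systems that do not define it. The comment about an AF_INET gateway landing on a link-layer (ARP) entry is one reading of the symptoms reported above, not something the thread states outright.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* AF_LINK is BSD-specific; this fallback is for illustration only. */
#ifndef AF_LINK
#define AF_LINK 18
#endif

/*
 * Hypothetical userland mirror of the condition added to rtm_output().
 *   old_gw:  gateway currently stored in the route entry
 *   new_gw:  gateway supplied in the routing message, or NULL
 *   newgate: stand-in for the kernel's local flag of the same name
 * Returns 0 if the update may go on to rt_setgate(), EINVAL otherwise.
 */
static int
check_gateway_change(const struct sockaddr *old_gw,
    const struct sockaddr *new_gw, int newgate)
{
	if (new_gw == NULL)
		return 0;	/* no gateway in the message, nothing to check */
	if (!newgate && old_gw->sa_family != new_gw->sa_family)
		return EINVAL;	/* assumed case: AF_INET over an AF_LINK entry */
	return 0;
}
```

Calling it with an AF_LINK entry and an AF_INET replacement returns EINVAL, which is the outcome the patch enforces before rt_setgate() is reached; matching families pass through unchanged.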
Re: Bizarre arp entry corruption
On 07/03/17(Tue) 19:38, Joe Holden wrote:
> On 12/12/2016 16:55, Joe Holden wrote:
> > On 12/12/2016 10:27, Martin Pieuchot wrote:
> > > On 11/12/16(Sun) 00:50, Joe Holden wrote:
> > > > On 10/12/2016 08:43, Mihai Popescu wrote:
> > > > > > > seeing some bizarre behaviour on one box, on one specific
> > > > > > > interface:
> > > > >
> > > > > Hello,
> > > > >
> > > > > This looks like some stupid TV game, where contesters are given some
> > > > > clues from time to time and they have to guess what is the real shit.
> > > > >
> > > > > Do post your FULL dmesg and configurations for network if you really
> > > > > want someone to even think at your issue. Isn't that obvious?
> > > > >
> > > > > Bye!
> > > > >
> > > >
> > > > Appreciate the useless response (but still better than nothing!), the
> > > > affected box has since been reverted to older snapshot and thus no more
> > > > debugging can be done - someone else will have to do it.
> > >
> > > I'd appreciate to see the output of 'netstat -rnf inet' when it is
> > > relevant. Without that information it's hard to understand.
> > >
> > > But there's a bug somewhere, it has to be fixed.
> > >
> > > > Not that dmesg is even relevant since it is a userland bug not a kernel
> > > > problem but anyway:
> > >
> > > It's a kernel problem.
> > >
> > I'll see if I can recreate it but I'm not holding my breath - it only
> > breaks once BGP loaded the table which leads me to think it is actually
> > bgpd that is updating the llinfo with bogus info and even though I have
> > a feed in my lab it doesn't do the same thing.
> >
> Ok so, inadvertently recreated this (pretty much exactly the same) issue on
> a lab/test setup:
>
> For the purposes of debug, ignore the fact that the interfaces are tap
> interfaces, they're still emulated ethernet...
>
> Wall of text incoming, various info...
>
> box#1:
>
> tap1: flags=8843 mtu 1500
>         lladdr fe:e1:ba:d1:be:f3
>         index 7 priority 0 llprio 3
>         groups: tap
>         status: active
>         inet 172.20.230.72 netmask 0xfffe
>
> box#2:
>
> tap1: flags=8843 mtu 1500
>         lladdr fe:e1:ba:d1:cf:92
>         index 7 priority 0 llprio 3
>         groups: tap
>         status: active
>         inet 172.20.230.73 netmask 0xfffe
>
> All is fine after starting ospfd, but as soon as I start bgpd, box#2 shows
> the following:
>
> Host                 Ethernet Address   Netif Expire     Flags
> 172.20.230.72        00:00:00:00:20:12      ? 12m30s
>
> # route -n get 172.20.230.72
>    route to: 172.20.230.72
> destination: 172.20.230.72
>        mask: 255.255.255.255
>   interface: tap1
>  if address: 172.20.230.73
>    priority: 3 ()
>       flags:
>      use       mtu    expire
>       20         0       702
>
> flags destination          gateway          lpref   med aspath origin
> IS*>  172.20.230.72/31     172.20.230.64      200     0 i
>
> .64 is the loopback on one of its connected boxes that doesn't have broken
> entries
>
> tcpdump looks ok, afterwards:
>
> 19:14:23.723876 arp who-has 172.20.230.72 tell 172.20.230.73
> 19:14:23.901883 arp reply 172.20.230.72 is-at fe:e1:ba:d1:be:f3
> 19:14:24.022948 arp who-has 172.20.230.72 tell 172.20.230.73
> 19:14:24.201095 arp reply 172.20.230.72 is-at fe:e1:ba:d1:be:f3
>
> but the correct entry is never installed, after I delete the broken arp
> entry it never readds a new one.
>
> This only happens with redist connected as far as I can tell, but bgpd
> probably shouldn't be able to mangle arp entries and prevent the correct one
> being added.

Here's the fix.

Index: net/rtsock.c
===================================================================
RCS file: /cvs/src/sys/net/rtsock.c,v
retrieving revision 1.232
diff -u -p -r1.232 rtsock.c
--- net/rtsock.c	7 Mar 2017 09:23:27 -	1.232
+++ net/rtsock.c	8 Mar 2017 16:06:22 -
@@ -895,10 +895,22 @@ rtm_output(struct rt_msghdr *rtm, struct
 				}
 			}
 change:
-			if (info->rti_info[RTAX_GATEWAY] != NULL && (error =
-			    rt_setgate(rt, info->rti_info[RTAX_GATEWAY],
-			    tableid)))
-				break;
+			if (info->rti_info[RTAX_GATEWAY] != NULL) {
+				/*
+				 * When updating the gateway, make sure it's
+				 * valid.
+				 */
+				if (!newgate && rt->rt_gateway->sa_family !=
+				    info->rti_info[RTAX_GATEWAY]->sa_family) {
+					error = EINVAL;
+					break;
+				}
+
+				error = rt_setgate(rt,
+				    info->rti_info[RTAX_GATEWAY], tableid);
+				if (error)
+					break;
+			}
 #ifdef MPLS
 			if ((rtm->rtm_flags & RTF_MPLS) &&
 			    info->rti_info[RTAX_SRC] != NULL) {
Re: Bizarre arp entry corruption
On 12/12/2016 16:55, Joe Holden wrote:
> On 12/12/2016 10:27, Martin Pieuchot wrote:
> > On 11/12/16(Sun) 00:50, Joe Holden wrote:
> > > On 10/12/2016 08:43, Mihai Popescu wrote:
> > > > > > seeing some bizarre behaviour on one box, on one specific
> > > > > > interface:
> > > >
> > > > Hello,
> > > >
> > > > This looks like some stupid TV game, where contesters are given some
> > > > clues from time to time and they have to guess what is the real shit.
> > > >
> > > > Do post your FULL dmesg and configurations for network if you really
> > > > want someone to even think at your issue. Isn't that obvious?
> > > >
> > > > Bye!
> > > >
> > >
> > > Appreciate the useless response (but still better than nothing!), the
> > > affected box has since been reverted to older snapshot and thus no more
> > > debugging can be done - someone else will have to do it.
> >
> > I'd appreciate to see the output of 'netstat -rnf inet' when it is
> > relevant. Without that information it's hard to understand.
> >
> > But there's a bug somewhere, it has to be fixed.
> >
> > > Not that dmesg is even relevant since it is a userland bug not a kernel
> > > problem but anyway:
> >
> > It's a kernel problem.
> >
> I'll see if I can recreate it but I'm not holding my breath - it only
> breaks once BGP loaded the table which leads me to think it is actually
> bgpd that is updating the llinfo with bogus info and even though I have
> a feed in my lab it doesn't do the same thing.
>

Ok so, inadvertently recreated this (pretty much exactly the same) issue on
a lab/test setup:

For the purposes of debug, ignore the fact that the interfaces are tap
interfaces, they're still emulated ethernet...

Wall of text incoming, various info...

box#1:

tap1: flags=8843 mtu 1500
        lladdr fe:e1:ba:d1:be:f3
        index 7 priority 0 llprio 3
        groups: tap
        status: active
        inet 172.20.230.72 netmask 0xfffe

box#2:

tap1: flags=8843 mtu 1500
        lladdr fe:e1:ba:d1:cf:92
        index 7 priority 0 llprio 3
        groups: tap
        status: active
        inet 172.20.230.73 netmask 0xfffe

All is fine after starting ospfd, but as soon as I start bgpd, box#2 shows
the following:

Host                 Ethernet Address   Netif Expire     Flags
172.20.230.72        00:00:00:00:20:12      ? 12m30s

# route -n get 172.20.230.72
   route to: 172.20.230.72
destination: 172.20.230.72
       mask: 255.255.255.255
  interface: tap1
 if address: 172.20.230.73
   priority: 3 ()
      flags:
     use       mtu    expire
      20         0       702

flags destination          gateway          lpref   med aspath origin
IS*>  172.20.230.72/31     172.20.230.64      200     0 i

.64 is the loopback on one of its connected boxes that doesn't have broken
entries

tcpdump looks ok, afterwards:

19:14:23.723876 arp who-has 172.20.230.72 tell 172.20.230.73
19:14:23.901883 arp reply 172.20.230.72 is-at fe:e1:ba:d1:be:f3
19:14:24.022948 arp who-has 172.20.230.72 tell 172.20.230.73
19:14:24.201095 arp reply 172.20.230.72 is-at fe:e1:ba:d1:be:f3

but the correct entry is never installed, after I delete the broken arp
entry it never readds a new one.

This only happens with redist connected as far as I can tell, but bgpd
probably shouldn't be able to mangle arp entries and prevent the correct one
being added.

If someone thinks they can diag/fix it then hit me up off-list and I can
fire over ssh details.

Thanks
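Editorial sketch relating the addresses in the lab report: both tap1 addresses are the two hosts of 172.20.230.72/31, the prefix shown in the BGP table, while the next hop 172.20.230.64 sits outside it. The helper name in_prefix() is hypothetical and the code only does the prefix arithmetic; it says nothing about how bgpd or the kernel select routes.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Return 1 if dotted-quad addr lies inside net/plen, 0 if it does not,
 * and -1 if either string fails to parse or plen is out of range.
 */
static int
in_prefix(const char *addr, const char *net, int plen)
{
	struct in_addr a, n;
	uint32_t mask;

	if (plen < 0 || plen > 32)
		return -1;
	if (inet_pton(AF_INET, addr, &a) != 1 ||
	    inet_pton(AF_INET, net, &n) != 1)
		return -1;
	/* A 32-bit shift by 32 is undefined, so handle plen == 0 apart. */
	mask = plen == 0 ? 0 : 0xffffffffu << (32 - plen);
	return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}
```

With the values from the report, in_prefix("172.20.230.73", "172.20.230.72", 31) is 1 and in_prefix("172.20.230.64", "172.20.230.72", 31) is 0, so the BGP route for the /31 covers the neighbour's own address while pointing at a next hop that is not on the link.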
Re: Bizarre arp entry corruption
On 12/12/2016 10:27, Martin Pieuchot wrote:
> On 11/12/16(Sun) 00:50, Joe Holden wrote:
> > On 10/12/2016 08:43, Mihai Popescu wrote:
> > > > > seeing some bizarre behaviour on one box, on one specific
> > > > > interface:
> > >
> > > Hello,
> > >
> > > This looks like some stupid TV game, where contesters are given some
> > > clues from time to time and they have to guess what is the real shit.
> > >
> > > Do post your FULL dmesg and configurations for network if you really
> > > want someone to even think at your issue. Isn't that obvious?
> > >
> > > Bye!
> > >
> >
> > Appreciate the useless response (but still better than nothing!), the
> > affected box has since been reverted to older snapshot and thus no more
> > debugging can be done - someone else will have to do it.
>
> I'd appreciate to see the output of 'netstat -rnf inet' when it is
> relevant. Without that information it's hard to understand.
>
> But there's a bug somewhere, it has to be fixed.
>
> > Not that dmesg is even relevant since it is a userland bug not a kernel
> > problem but anyway:
>
> It's a kernel problem.
>

I'll see if I can recreate it but I'm not holding my breath - it only
breaks once BGP loaded the table which leads me to think it is actually
bgpd that is updating the llinfo with bogus info and even though I have
a feed in my lab it doesn't do the same thing.
Re: Bizarre arp entry corruption
On 11/12/16(Sun) 00:50, Joe Holden wrote:
> On 10/12/2016 08:43, Mihai Popescu wrote:
> > > > seeing some bizarre behaviour on one box, on one specific interface:
> >
> > Hello,
> >
> > This looks like some stupid TV game, where contesters are given some
> > clues from time to time and they have to guess what is the real shit.
> >
> > Do post your FULL dmesg and configurations for network if you really
> > want someone to even think at your issue. Isn't that obvious?
> >
> > Bye!
> >
>
> Appreciate the useless response (but still better than nothing!), the
> affected box has since been reverted to older snapshot and thus no more
> debugging can be done - someone else will have to do it.

I'd appreciate to see the output of 'netstat -rnf inet' when it is
relevant. Without that information it's hard to understand.

But there's a bug somewhere, it has to be fixed.

> Not that dmesg is even relevant since it is a userland bug not a kernel
> problem but anyway:

It's a kernel problem.
Re: Bizarre arp entry corruption
On 10/12/2016 08:43, Mihai Popescu wrote:
> > > seeing some bizarre behaviour on one box, on one specific interface:
>
> Hello,
>
> This looks like some stupid TV game, where contesters are given some
> clues from time to time and they have to guess what is the real shit.
>
> Do post your FULL dmesg and configurations for network if you really
> want someone to even think at your issue. Isn't that obvious?
>
> Bye!
>

Appreciate the useless response (but still better than nothing!), the
affected box has since been reverted to older snapshot and thus no more
debugging can be done - someone else will have to do it.

Not that dmesg is even relevant since it is a userland bug not a kernel
problem but anyway:

OpenBSD 6.0-current (GENERIC.MP) #19: Wed Dec 7 12:07:13 MST 2016
    bu...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 4273471488 (4075MB)
avail mem = 4139397120 (3947MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.6 @ 0x9d000 (74 entries)
bios0: vendor American Megatrends Inc. version "1ADQW068" date 11/16/2010
bios0: Sun Microsystems SUN FIRE X4150
acpi0 at bios0: rev 2
acpi0: sleep states S0 S1 S5
acpi0: tables DSDT FACP APIC SPCR MCFG SSDT OEMB HPET EINJ BERT ERST HEST
acpi0: wakeup devices SPE4(S1) SPE2(S1) SPE1(S1) P8PC(S1) P0P1(S1) UAR1(S1) P0P5(S1) P0P6(S1) P0P7(S1) NPE4(S1) NPE5(S1) NPE6(S1) NPE7(S1) USB0(S1) USB1(S1) USB2(S1) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, 4189.89 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1,XSAVE,LONG,LAHF,PERF,SENSOR
cpu0: 6MB 64b/line 16-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 7 var ranges, 88 fixed ranges
cpu0: apic clock running at 332MHz
cpu0: mwait min=64, max=64, C-substates=0.2.2.2, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, 2992.51 MHz
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1,XSAVE,LONG,LAHF,PERF,SENSOR
cpu1: 6MB 64b/line 16-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, 2992.51 MHz
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1,XSAVE,LONG,LAHF,PERF,SENSOR
cpu2: 6MB 64b/line 16-way L2 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 3 (application processor)
cpu3: Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, 2992.52 MHz
cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1,XSAVE,LONG,LAHF,PERF,SENSOR
cpu3: 6MB 64b/line 16-way L2 cache
cpu3: smt 0, core 3, package 0
ioapic0 at mainbus0: apid 4 pa 0xfec0, version 20, 24 pins
ioapic1 at mainbus0: apid 5 pa 0xfec8, version 20, 24 pins
acpimcfg0 at acpi0 addr 0xe000, bus 0-255
acpihpet0 at acpi0: 14318179 Hz
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus 1 (NPES)
acpiprt2 at acpi0: bus 2 (SPE4)
acpiprt3 at acpi0: bus -1 (SPE2)
acpiprt4 at acpi0: bus 3 (SPE1)
acpiprt5 at acpi0: bus 4 (P8PC)
acpiprt6 at acpi0: bus 15 (P0P1)
acpiprt7 at acpi0: bus -1 (P0P5)
acpiprt8 at acpi0: bus -1 (P0P6)
acpiprt9 at acpi0: bus -1 (P0P7)
acpiprt10 at acpi0: bus 7 (NPE4)
acpiprt11 at acpi0: bus 11 (NPE5)
acpiprt12 at acpi0: bus 12 (NPE6)
acpiprt13 at acpi0: bus 13 (NPE7)
acpiprt14 at acpi0: bus 14 (P0P4)
acpiprt15 at acpi0: bus -1 (BR1E)
acpicpu0 at acpi0: C1(@1 halt!)
acpicpu1 at acpi0: C1(@1 halt!)
acpicpu2 at acpi0: C1(@1 halt!)
acpicpu3 at acpi0: C1(@1 halt!)
"PNP0501" at acpi0 not configured
"PNP0501" at acpi0 not configured
acpibtn0 at acpi0: PWRB
"IPI0001" at acpi0 not configured
ipmi at mainbus0 not configured
pci0 at mainbus0 bus 0
pchb0 at pci0 dev 0 function 0 "Intel 5000P Host" rev 0xb1
ppb0 at pci0 dev 2 function 0 "Intel 5000 PCIE" rev 0xb1
pci1 at ppb0 bus 1
ppb1 at pci1 dev 0 function 0 "Intel 6321ESB PCIE" rev 0x01
pci2 at ppb1 bus 2
ppb2 at pci2 dev 0 function 0 "Intel 6321ESB PCIE" rev 0x01
pci3 at ppb2 bus 3
ppb3 at pci2 dev 2 function 0 "Intel 6321ESB PCIE" rev 0x01
pci4 at ppb3 bus 4
em0 at pci4 dev 0 function 0 "Intel 80003ES2" rev 0x01: msi, address 00:23:8b:57:b4:9e
em1 at pci4 dev 0 function 1 "Intel 80003ES2" rev 0x01: msi, address 00:23:8b:57:b4:9f
ppb4 at pci1 dev 0 function 3 "Intel 6321ESB PCIE-PCIX" rev 0x01
pci5 at ppb4 bus 5
ppb5 at pci0 dev 3 function 0 "Intel 5000 PCIE" rev 0xb1
pci6 at ppb5 bus 6
Re: Bizarre arp entry corruption
>> seeing some bizarre behaviour on one box, on one specific interface:

Hello,

This looks like some stupid TV game, where contesters are given some
clues from time to time and they have to guess what is the real shit.

Do post your FULL dmesg and configurations for network if you really
want someone to even think at your issue. Isn't that obvious?

Bye!
Re: Bizarre arp entry corruption
On 08/12/2016 14:35, Joe Holden wrote:
> On 08/12/2016 13:56, Joe Holden wrote:
> > Hi guys,
> >
> > I've just updated a couple of boxes to the Dec 7th snapshot and I'm
> > seeing some bizarre behaviour on one box, on one specific interface:
> >
> > The box in question is an OSPF and BGP speaker, and the following
> > happens when booted:
> >
> > After OSPF and BGP tables load, a couple of minutes later the following
> > appear:
> >
> > Dec 8 06:33:03 edge-pe-2 /bsd: arp_rtrequest: bad gateway value: em0
> > Dec 8 06:33:03 edge-pe-2 last message repeated 2 times
> > Dec 8 06:33:04 edge-pe-2 /bsd: arpresolve: X.X.X.X: incorrect arp information
> >
> > Then some seconds later:
> >
> > Dec 8 06:41:41 edge-pe-2 /bsd: arpresolve: unresolved and rt_expire == 0
> >
> > At this point the arp entry for the neighbour in question has been
> > updated so that the lladdr is all zeros and the interface is simply '?'
> > according to arp -n.
> >
> > The box it is paired with that has a pretty much identical config
> > doesn't exhibit the same problem and this only occurs on the single em0
> > interface (the box has about 6 active in total, mix of em and ix).
>
> I should clarify that this isn't CARP, but rather the box it is directly
> connected to.
>
> > OpenBSD 6.0-current (GENERIC.MP) #19: Wed Dec 7 12:07:13 MST 2016
> >     bu...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> > I don't see any odd behaviour on the wire, according to pcap the who-has
> > and associated reply is seen once as expected with the correct lladdr,
> > but at some point it gets overwritten with the above.
> >
> > Previous kernel was about 2 months old which leaves a large number of
> > commits to check through - I can't see anything that might cause this
> > from a quick look though so I was hoping someone might have an idea.
> >
> > For now I've had to add a static arp entry with permanent to prevent it
> > misbehaving but that has stopped working at least once so far.
> >
> > I also have limited debug ability as the box is part of a live network
> > and obviously it causes disruption, and I can't recreate it in a lab
> > with identical configurations.
> >
> > Any pointers appreciated!
> >
> > Cheers

Actually looks like it breaks when BGP comes up, a route -nvd get looks
ok, but what else should I be checking?

After it breaks it doesn't seem to want to do any arp resolution on the
interface until I do down/up...
Re: Bizarre arp entry corruption
On 08/12/2016 13:56, Joe Holden wrote:
> Hi guys,
>
> I've just updated a couple of boxes to the Dec 7th snapshot and I'm
> seeing some bizarre behaviour on one box, on one specific interface:
>
> The box in question is an OSPF and BGP speaker, and the following
> happens when booted:
>
> After OSPF and BGP tables load, a couple of minutes later the following
> appear:
>
> Dec 8 06:33:03 edge-pe-2 /bsd: arp_rtrequest: bad gateway value: em0
> Dec 8 06:33:03 edge-pe-2 last message repeated 2 times
> Dec 8 06:33:04 edge-pe-2 /bsd: arpresolve: X.X.X.X: incorrect arp information
>
> Then some seconds later:
>
> Dec 8 06:41:41 edge-pe-2 /bsd: arpresolve: unresolved and rt_expire == 0
>
> At this point the arp entry for the neighbour in question has been
> updated so that the lladdr is all zeros and the interface is simply '?'
> according to arp -n.
>
> The box it is paired with that has a pretty much identical config
> doesn't exhibit the same problem and this only occurs on the single em0
> interface (the box has about 6 active in total, mix of em and ix).

I should clarify that this isn't CARP, but rather the box it is directly
connected to.

> OpenBSD 6.0-current (GENERIC.MP) #19: Wed Dec 7 12:07:13 MST 2016
>     bu...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> I don't see any odd behaviour on the wire, according to pcap the who-has
> and associated reply is seen once as expected with the correct lladdr,
> but at some point it gets overwritten with the above.
>
> Previous kernel was about 2 months old which leaves a large number of
> commits to check through - I can't see anything that might cause this
> from a quick look though so I was hoping someone might have an idea.
>
> For now I've had to add a static arp entry with permanent to prevent it
> misbehaving but that has stopped working at least once so far.
>
> I also have limited debug ability as the box is part of a live network
> and obviously it causes disruption, and I can't recreate it in a lab
> with identical configurations.
>
> Any pointers appreciated!
>
> Cheers
Bizarre arp entry corruption
Hi guys,

I've just updated a couple of boxes to the Dec 7th snapshot and I'm
seeing some bizarre behaviour on one box, on one specific interface:

The box in question is an OSPF and BGP speaker, and the following
happens when booted:

After OSPF and BGP tables load, a couple of minutes later the following
appear:

Dec 8 06:33:03 edge-pe-2 /bsd: arp_rtrequest: bad gateway value: em0
Dec 8 06:33:03 edge-pe-2 last message repeated 2 times
Dec 8 06:33:04 edge-pe-2 /bsd: arpresolve: X.X.X.X: incorrect arp information

Then some seconds later:

Dec 8 06:41:41 edge-pe-2 /bsd: arpresolve: unresolved and rt_expire == 0

At this point the arp entry for the neighbour in question has been
updated so that the lladdr is all zeros and the interface is simply '?'
according to arp -n.

The box it is paired with that has a pretty much identical config
doesn't exhibit the same problem and this only occurs on the single em0
interface (the box has about 6 active in total, mix of em and ix).

OpenBSD 6.0-current (GENERIC.MP) #19: Wed Dec 7 12:07:13 MST 2016
    bu...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

I don't see any odd behaviour on the wire, according to pcap the who-has
and associated reply is seen once as expected with the correct lladdr,
but at some point it gets overwritten with the above.

Previous kernel was about 2 months old which leaves a large number of
commits to check through - I can't see anything that might cause this
from a quick look though so I was hoping someone might have an idea.

For now I've had to add a static arp entry with permanent to prevent it
misbehaving but that has stopped working at least once so far.

I also have limited debug ability as the box is part of a live network
and obviously it causes disruption, and I can't recreate it in a lab
with identical configurations.

Any pointers appreciated!

Cheers