from:"Hrvoje Popovski"

bgpctl show mrt file - Segmentation fault (core dumped)

2024-06-03 Thread Hrvoje Popovski

Hi,

Here at Srce we are running OpenBSD 7.5-release as a route server. I
wanted to collect some additional MRT data, so I have this in bgpd.conf:

dump table-v2 "/data/bgpdumps/bgp-rib-dump-%y_%m_%d-%H_%M" 300
dump all out "/data/bgpdumps/bgp-all-out-%y_%m_%d-%H_%M" 300
dump all in "/data/bgpdumps/bgp-all-in-%y_%m_%d-%H_%M" 300

If I want to read bgp-rib-dump with
bgpctl show mrt file /data/bgpdumps/bgp-rib-dump-24_06_03-10_46

everything seems fine

But if I try to read bgp-all-in or bgp-all-out, I get:
Segmentation fault (core dumped)


rs1# gdb bgpctl bgpctl.core
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-unknown-openbsd7.5"...(no debugging
symbols found)

Core was generated by `bgpctl'.
Program terminated with signal 11, Segmentation fault.
(no debugging symbols found)
Loaded symbols for /usr/sbin/bgpctl
Reading symbols from /usr/lib/libutil.so.18.0...done.
Loaded symbols for /usr/lib/libutil.so.18.0
Reading symbols from /usr/lib/libm.so.10.1...done.
Loaded symbols for /usr/lib/libm.so.10.1
Reading symbols from /usr/lib/libc.so.99.0...done.
Loaded symbols for /usr/lib/libc.so.99.0
Reading symbols from /usr/libexec/ld.so...Error while reading shared library symbols:
Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/libexec/ld.so]
#0  ibuf_get_n8 (buf=0x751ce250b5d0, value=0x751ce250b6dd "u") at /usr/src/lib/libutil/imsg-buffer.c:412
412 /usr/src/lib/libutil/imsg-buffer.c: No such file or directory.
in /usr/src/lib/libutil/imsg-buffer.c




I can read all three files normally with bgpdump.
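The backtrace places the fault inside ibuf_get_n8() in libutil's imsg-buffer.c, i.e. while bgpctl was pulling single bytes out of an MRT record. Without the bgpctl source at hand, the following is only a generic sketch of the defensive pattern such a reader relies on: every fixed-size read first checks how many bytes remain and reports a short record instead of reading past its end. `struct reader`, `reader_need()`, `reader_get_u8()` and `reader_get_u16()` are hypothetical names for illustration; they are not libutil's ibuf API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read cursor over one MRT record; not libutil's struct ibuf. */
struct reader {
	const uint8_t	*data;	/* start of the record */
	size_t		 len;	/* total bytes in the record */
	size_t		 off;	/* current read offset */
};

/*
 * Fail with EBADMSG unless at least n bytes remain, so a truncated or
 * corrupt record is reported instead of being read past its end.
 */
static int
reader_need(const struct reader *r, size_t n)
{
	assert(r != NULL && r->off <= r->len);
	if (r->len - r->off < n) {
		errno = EBADMSG;
		return -1;
	}
	return 0;
}

/* Read one byte. */
static int
reader_get_u8(struct reader *r, uint8_t *value)
{
	if (reader_need(r, 1) == -1)
		return -1;
	*value = r->data[r->off++];
	return 0;
}

/* Read a 16-bit field in network byte order, e.g. an MRT subtype. */
static int
reader_get_u16(struct reader *r, uint16_t *value)
{
	if (reader_need(r, 2) == -1)
		return -1;
	*value = (uint16_t)((r->data[r->off] << 8) | r->data[r->off + 1]);
	r->off += 2;
	return 0;
}
```

A caller that gets -1 can skip or reject the record; a crash inside the read helper itself usually points at a bad buffer pointer or length handed to it, not at the bounds check.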

panic when forwarding high amount of traffic over mcx - kernel diagnostic assertion "((flags & PGO_LOCKED)

2024-06-02 Thread Hrvoje Popovski

Hi all,

In the lab I have a two-socket box with a lot of interfaces: ix, ixl, mcx,
bnxt, em and bge. When sending a high volume of traffic over mcx, the whole
machine becomes almost unresponsive, even to commands typed on the console.
In that state pagedaemon is at 100%, sometimes even higher, and the mcl12k
Fail counter keeps rising. In sysctl.conf there is kern.maxclusters=1048576,
and NET_TASKQ=16 is set in if.c.

While sending traffic over ix or ixl on the same machine, everything
seems fine. bnxt is fishy, but that is for another bug report :)
I saw this mcx behaviour before mpi@'s diff "Add per-CPU caches to the
pmemrange allocator", but didn't manage to trigger a panic.

It seems that this mcx behaviour only shows up under high traffic, because
I have a few mcx in production and they behave excellently.



You can find the ddb output in the attachment.

Just one question: is it possible to add something to ddb like "mach all
ddbcpu" or "mach ddbcpu all"? :) With 32 or more cores, when converting the
decimal CPU number to hex, one can easily make a mistake.
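The mistake described above comes from ddb reading numbers in hexadecimal by default, so on a large box `machine ddbcpu 10` selects cpu16, not cpu10. Until ddb grows an "all" form, a throwaway helper like the one below can print the hex argument to type for each decimal CPU index. It is only an illustration, not part of ddb; `ddbcpu_arg()` and `print_ddbcpu_table()` are hypothetical names, and the code assumes ddb's input radix has been left at its default of 16.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a decimal CPU index as the hex string to type after
 * "machine ddbcpu", assuming ddb's default input radix of 16.
 * Returns the number of characters written (excluding the NUL).
 */
static int
ddbcpu_arg(unsigned int cpu, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%x", cpu);
}

/* Print a decimal-to-hex lookup table for ncpu CPUs. */
static void
print_ddbcpu_table(unsigned int ncpu)
{
	char arg[16];
	unsigned int i;

	for (i = 0; i < ncpu; i++) {
		ddbcpu_arg(i, arg, sizeof(arg));
		printf("cpu%-3u -> machine ddbcpu %s\n", i, arg);
	}
}
```

For example, calling print_ddbcpu_table(32) lists cpu16 as `machine ddbcpu 10` and cpu31 as `machine ddbcpu 1f`.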


dmesg
OpenBSD 7.5-current (GENERIC.MP) #2: Sat Jun  1 22:36:05 CEST 2024
hrvoje@bigi.netlab:/sys/arch/amd64/compile/GENERIC.MP
real mem = 410826829824 (391794MB)
avail mem = 398354190336 (379900MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.2 @ 0x68e36000 (80 entries)
bios0: vendor Dell Inc. version "2.21.2" date 02/19/2024
bios0: Dell Inc. PowerEdge R740xd
efi0 at bios0: UEFI 2.7
efi0: Dell Inc. rev 0x2150201
acpi0 at bios0: ACPI 6.1
acpi0: sleep states S0 S5
acpi0: tables DSDT FACP SSDT MCEJ WD__ SLIC HPET APIC MCFG MIGT MSCT
PCAT PCCT RASF SLIT SRAT SVOS WSMT OEM4 SSDT SSDT SSDT SPCR DMAR HEST
BERT ERST EINJ
acpi0: wakeup devices XHCI(S4) RP17(S4) PXSX(S4) RP18(S4) PXSX(S4)
RP19(S4) PXSX(S4) RP20(S4) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4)
RP03(S4) PXSX(S4) RP04(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.55 MHz, 06-55-04,
patch 02007108
cpu0: cpuid 1
edx=bfebfbff
ecx=77fefbff
cpu0: cpuid 6 eax=77 ecx=9
cpu0: cpuid 7.0
ebx=d39b
ecx=8 edx=bc002400
cpu0: cpuid a vers=4, gp=8, gpwidth=48, ff=3, ffwidth=48
cpu0: cpuid d.1 eax=f
cpu0: cpuid 8001 edx=2c100800
ecx=121
cpu0: cpuid 8007 edx=100
cpu0: msr 10a=2000c04
cpu0: MELTDOWN
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB
64b/line 16-way L2 cache, 22MB 64b/line 11-way L3 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.2, IBE
cpu1 at mainbus0: apid 32 (application processor)
cpu1: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2791.39 MHz, 06-55-04,
patch 02007108
cpu1: smt 0, core 0, package 1
cpu2 at mainbus0: apid 14 (application processor)
cpu2: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.65 MHz, 06-55-04,
patch 02007108
cpu2: smt 0, core 7, package 0
cpu3 at mainbus0: apid 46 (application processor)
cpu3: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.76 MHz, 06-55-04,
patch 02007108
cpu3: smt 0, core 7, package 1
cpu4 at mainbus0: apid 2 (application processor)
cpu4: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.59 MHz, 06-55-04,
patch 02007108
cpu4: smt 0, core 1, package 0
cpu5 at mainbus0: apid 34 (application processor)
cpu5: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2794.11 MHz, 06-55-04,
patch 02007108
cpu5: smt 0, core 1, package 1
cpu6 at mainbus0: apid 12 (application processor)
cpu6: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.88 MHz, 06-55-04,
patch 02007108
cpu6: smt 0, core 6, package 0
cpu7 at mainbus0: apid 44 (application processor)
cpu7: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.84 MHz, 06-55-04,
patch 02007108
cpu7: smt 0, core 6, package 1
cpu8 at mainbus0: apid 4 (application processor)
cpu8: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2794.95 MHz, 06-55-04,
patch 02007108
cpu8: smt 0, core 2, package 0
cpu9 at mainbus0: apid 36 (application processor)
cpu9: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2794.84 MHz, 06-55-04,
patch 02007108
cpu9: smt 0, core 2, package 1
cpu10 at mainbus0: apid 10 (application processor)
cpu10: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2795.34 MHz, 06-55-04,
patch 02007108
cpu10: smt 0, core 5, package 0
cpu11 at mainbus0: apid 42 (application processor)
cpu11: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.85 MHz, 06-55-04,
patch 02007108
cpu11: smt 0, core 5, package 1
cpu12 at mainbus0: apid 6 (application processor)
cpu12: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2795.03 MHz, 06-55-04,
patch 02007108
cpu12: smt 0, core 3, package 0
cpu13 at mainbus0: apid 38 (application processor)
cpu13: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2794.46 MHz, 06-55-04,
patch 02007108
cpu13: smt 0, core 3, package 1
cpu14 at mainbus0: apid 8 (application processor)
cpu14: Intel(R) Xeon(R) Gold 6130

Dell HBA330 problem

2024-03-21 Thread Hrvoje Popovski

Hi all,

I have a Dell R740xd with an HBA330 (non-RAID) as the disk controller. It
seems that after somewhat more intensive disk work on the HBA330, like a
local cvs checkout, the box freezes with lots of logs like

mpii0: mpii_scsi_cmd_tmo (0x4000265d)


After that nothing can be done except power cycling the box from the IPMI
console or via the power button; reboot does not work from the console.
I've compiled a kernel with MPII_DEBUG, and here are the last few logs
before the box freezes:


Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_read 0 0x4000265d
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d)
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_get_ccb 0x818fe490
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_start 0x2c87c00
Mar 20 19:53:11 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x00c0) write 0x00760006
Mar 20 19:53:11 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x00c4) write 0x0286
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_read 0 0x4000265d
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d)
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_get_ccb 0x818fe580
Mar 20 19:53:11 r740xd /bsd: mpii0: mpii_start 0x2c88200
Mar 20 19:53:11 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x00c0) write 0x00790006
Mar 20 19:53:11 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x00c4) write 0x0286
Mar 20 19:53:18 r740xd /bsd: mpii0: mpii_read 0 0x4000265d
Mar 20 19:53:18 r740xd /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d)
Mar 20 19:53:18 r740xd /bsd: mpii0: mpii_get_ccb 0x818fe710
Mar 20 19:53:18 r740xd /bsd: mpii0: mpii_start 0x2c88c00
Mar 20 19:53:18 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x00c0) write 0x007e0006
Mar 20 19:53:18 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x00c4) write 0x0286
Mar 20 19:58:19 r740xd /bsd: mpii0: mpii_get_ccb 0x818fe350
Mar 20 19:58:19 r740xd /bsd: mpii0: mpii_scsi_cmd
Mar 20 19:58:19 r740xd /bsd: mpii0: ccb_smid: 114 xs->flags: 0x1001
Mar 20 19:58:19 r740xd /bsd: mpii0: mpii_start 0x2c87400
Mar 20 19:58:19 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x00c0) write 0xa0072
Mar 20 19:58:19 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x00c4) write 0x0286
Mar 20 19:59:19 r740xd /bsd: mpii0: mpii_read 0 0x4000265d
Mar 20 19:59:19 r740xd /bsd: mpii0: mpii_scsi_cmd_tmo (0x4000265d)
Mar 20 19:59:19 r740xd /bsd: mpii0: mpii_get_ccb 0x818fe670
Mar 20 19:59:19 r740xd /bsd: mpii0: mpii_start 0x2c88800
Mar 20 19:59:19 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_LOW (0x00c0) write 0x007c0006
Mar 20 19:59:19 r740xd /bsd: mpii0:   MPII_REQ_DESCR_POST_HIGH (0x00c4) write 0x0286


If someone is willing to look at this problem, I will gladly give access
to this box.



dmesg without MPII_DEBUG

OpenBSD 7.5 (GENERIC.MP) #78: Sun Mar 17 21:55:24 MDT 2024
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 410826829824 (391794MB)
avail mem = 398354268160 (379900MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.2 @ 0x68e36000 (80 entries)
bios0: vendor Dell Inc. version "2.21.2" date 02/19/2024
bios0: Dell Inc. PowerEdge R740xd
efi0 at bios0: UEFI 2.7
efi0: Dell Inc. rev 0x2150201
acpi0 at bios0: ACPI 6.1
acpi0: sleep states S0 S5
acpi0: tables DSDT FACP SSDT MCEJ WD__ SLIC HPET APIC MCFG MIGT MSCT
PCAT PCCT RASF SLIT SRAT SVOS WSMT OEM4 SSDT SSDT SSDT SPCR DMAR HEST
BERT ERST EINJ
acpi0: wakeup devices XHCI(S4) RP17(S4) PXSX(S4) RP18(S4) PXSX(S4)
RP19(S4) PXSX(S4) RP20(S4) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4)
RP03(S4) PXSX(S4) RP04(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.56 MHz, 06-55-04,
patch 02007108
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PT,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,TSXFA,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,RSBA,MISC_PKG_CT,ENERGY_FILT,GDS_CTRL,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB
64b/line 16-way L2 cache, 22MB 64b/line 11-way L3 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.2, IBE
cpu1 at mainbus0: apid 32 (application processor)
cpu1: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 2793.60 MHz, 06-55-04,
patch 02007108
cpu1:

Re: HA IPSec with AWS - no second flow

2024-03-11 Thread Hrvoje Popovski

On 11.3.2024. 10:22, Rafał Ramocki wrote:
> Hello,
> 
> I'm not sure if I'm doing something wrong or if it is a common
> problem. I have iked.conf set up in the following way:
> 
> ikev2 active from 10.2.15.0/24 to 172.31.0.0/20 from 10.2.15.0/24 to 
> 172.31.16.0/20 from 10.2.15.0/24 to 172.31.32.0/20 from 169.254.74.238 to 
> 169.254.74.237 local X.X.X.X peer 16.170.59.81 ikesa auth hmac-sha2-256 enc 
> aes-256 prf hmac-sha2-256 group modp4096 childsa auth hmac-sha2-256 enc 
> aes-256 group modp4096 srcid X.X.X.X ikelifetime 28800 lifetime 3600 psk 
> '_REMOVED_' 
> 
> ikev2 active from 10.2.15.0/24 to 172.31.0.0/20 from 10.2.15.0/24 to 
> 172.31.16.0/20 from 10.2.15.0/24 to 172.31.32.0/20 from 169.254.21.38 to 
> 169.254.21.37 local X.X.X.X peer 51.21.86.8 ikesa auth hmac-sha2-256 enc 
> aes-256 prf hmac-sha2-256 group modp4096 childsa auth hmac-sha2-256 enc 
> aes-256 group modp4096 srcid X.X.X.X ikelifetime 28800 lifetime 3600 psk 
> '_REMOVED_' 
> 
> 
> Both tunnels are up from AWS perspective. Both tunnels have SAD's: 
> 
> # ipsecctl -ss 
> esp tunnel from 51.21.86.8 to X.X.X.X spi 0x02c0ae3a auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from 16.170.59.81 to X.X.X.X spi 0x09ef0398 auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from 16.170.59.81 to X.X.X.X spi 0x324ceca5 auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from 51.21.86.8 to X.X.X.X spi 0xa9672a52 auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from X.X.X.X to 16.170.59.81 spi 0xc08c4de5 auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from X.X.X.X to 16.170.59.81 spi 0xc2e0efe9 auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from X.X.X.X to 51.21.86.8 spi 0xc3e8a0e0 auth hmac-sha2-256 enc 
> aes-256 
> esp tunnel from X.X.X.X to 51.21.86.8 spi 0xccb3250e auth hmac-sha2-256 enc 
> aes-256 
> 
> 
> But flows with overlapping from-to pairs are set up for only one of the tunnels:
> 
> # ipsecctl -sf 
> flow esp in from 169.254.21.37 to 169.254.21.38 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp in from 169.254.74.237 to 169.254.74.238 peer 16.170.59.81 srcid 
> IPV4/X.X.X.X dstid IPV4/16.170.59.81 type require 
> flow esp in from 172.31.0.0/20 to 10.2.15.0/24 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp in from 172.31.16.0/20 to 10.2.15.0/24 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp in from 172.31.32.0/20 to 10.2.15.0/24 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> 
> flow esp out from 10.2.15.0/24 to 172.31.0.0/20 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp out from 10.2.15.0/24 to 172.31.16.0/20 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp out from 10.2.15.0/24 to 172.31.32.0/20 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp out from 169.254.21.38 to 169.254.21.37 peer 51.21.86.8 srcid 
> IPV4/X.X.X.X dstid IPV4/51.21.86.8 type require 
> flow esp out from 169.254.74.238 to 169.254.74.237 peer 16.170.59.81 srcid 
> IPV4/X.X.X.X dstid IPV4/16.170.59.81 type require 
> 
> I think iked may detect that a flow is already set for this from-to pair 
> and does not set up an additional one, but it should also take the remote 
> endpoint into account, as those are different. Having no flow set up means 
> that data received on the second tunnel, which has no flows, are discarded 
> and no longer forwarded, probably due to the RPF policy. 
> 
> I tried to figure out how those are set up by analysing the code, but I 
> think it may be beyond my capabilities, as I'm only a sysadmin, not a 
> developer. 
> 
> OpenBSD version: 7.3 
> 
> best regards 
> Rafal Ramocki 
> 



Hi,

I don't think you can have two identical IPsec tunnels with policy-based
VPNs in OpenBSD, but you can do something like this:
https://www.linuxquestions.org/questions/blog/rocket357-328529/openbsd-etc-ipsec-conf-for-aws-vpc-vpn-36423/

The good news is that OpenBSD supports route-based IPsec tunnels starting
with 7.4:
https://www.undeadly.org/cgi?action=article;sid=20230704094238
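Rafał's guess matches how a policy-based setup behaves: a flow is identified by its traffic selectors, so a second flow with the same from/to pair but a different peer collides with the first. That is why in the `ipsecctl -sf` output above the 169.254.x.x flows exist for both peers (their selectors differ), while the overlapping 172.31.x.x flows point at only one peer. The toy table below is a hypothetical model, not iked or kernel code; `struct flow`, `flow_add()` and the fixed-size array are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_FLOWS	16

/* Hypothetical policy entry: selectors plus the peer they point to. */
struct flow {
	const char	*src;	/* e.g. "10.2.15.0/24" */
	const char	*dst;	/* e.g. "172.31.0.0/20" */
	const char	*peer;	/* e.g. "16.170.59.81" */
};

static struct flow	table[MAX_FLOWS];
static int		nflows;

/*
 * Add a flow. As in a policy-based setup, the lookup key is only the
 * (src, dst) selector pair; the peer is data, not part of the key.
 * In this toy model the first flow is kept and a duplicate is refused;
 * whichever entry a real implementation keeps, only one peer can own
 * a given selector pair.
 */
static bool
flow_add(const char *src, const char *dst, const char *peer)
{
	int i;

	assert(nflows < MAX_FLOWS);
	for (i = 0; i < nflows; i++) {
		if (strcmp(table[i].src, src) == 0 &&
		    strcmp(table[i].dst, dst) == 0)
			return false;	/* same selectors: second peer loses */
	}
	table[nflows].src = src;
	table[nflows].dst = dst;
	table[nflows].peer = peer;
	nflows++;
	return true;
}
```

Route-based tunnels sidestep this because the routing table, not the selector pair, decides which tunnel interface a packet leaves through.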

Re: TSO em(4) problem

2024-02-01 Thread Hrvoje Popovski

On 1.2.2024. 18:42, Alexander Bluhm wrote:
> On Tue, Jan 30, 2024 at 02:32:24PM +0100, Hrvoje Popovski wrote:
>> yes, and forwarding only without pf.
>> I'm sending traffic from host connected to vlan/ix0 and forward through
>> em5 to other host.
>> I'm sending 1Gbps of traffic with cisco t-rex
> I cannot reproduce.
> 
> ix0 at pci6 dev 0 function 0 "Intel 82599" rev 0x01, msix, 8 queues, address 
> 90:e2:ba:d6:23:68
> em1 at pci7 dev 0 function 1 "Intel I350" rev 0x01: msi, address 
> a0:36:9f:0a:4a:c5
> 
> root@ot42:.../~# ifconfig ix0 hwfeatures
> ix0: flags=2008843 mtu 1500
> 
> hwfeatures=71b7
>  hardmtu 9198
> lladdr 90:e2:ba:d6:23:68
> description: Intel 82599
> index 5 priority 0 llprio 3
> media: Ethernet autoselect (10GSFP+Cu full-duplex,rxpause,txpause)
> status: active
> 
> root@ot42:.../~# ifconfig em1 hwfeatures
> em1: flags=8c43 mtu 1500
> 
> hwfeatures=31b7
>  hardmtu 9216
> lladdr a0:36:9f:0a:4a:c5
> description: Intel I350
> index 8 priority 0 llprio 3
> media: Ethernet autoselect (1000baseT full-duplex,master)
> status: active
> inet 10.10.22.3 netmask 0xff00 broadcast 10.10.22.255
> 
> root@ot42:.../~# ifconfig vlan0 hwfeatures
> vlan0: flags=8843 mtu 1500
> 
> hwfeatures=3187
>  hardmtu 9198
> lladdr 90:e2:ba:d6:23:68
> index 24 priority 0 llprio 3
> encap: vnetid 221 parent ix0 txprio packet rxprio outer
> groups: vlan
> media: Ethernet autoselect (10GSFP+Cu full-duplex,rxpause,txpause)
> status: active
> inet 10.10.21.2 netmask 0xff00 broadcast 10.10.21.255
> 
> root@ot42:.../~# pfctl -si
> Status: Disabled for 0 days 00:03:42 Debug: err
> 
> Running tcpbench -n100 from Linux via OpenBSD forwarding to Linux.
> Simultaneous udpbench to create traffic mixture.
> 
> root@ot42:.../~# netstat -ss | egrep 'TSO|LRO'
> 1188 output TSO packets software chopped
> 33086906 output TSO packets hardware processed
> 265855748 output TSO packets generated
> 31090975 input LRO generated packets from hardware
> 176482178 input LRO coalesced packets by network device
> 
> Lot of LRO and TSO.  Running diff below, which reverts em TSO backout
> and adds sparc64 fix.
> 
> Hrvoje: What is different in your lab?


I think I found it: it's LLDP. With lldpd enabled I get watchdogs on em;
with it disabled, only one watchdog at the beginning of testing.

Re: TSO em(4) problem

2024-01-30 Thread Hrvoje Popovski

On 30.1.2024. 13:33, Alexander Bluhm wrote:
> On Tue, Jan 30, 2024 at 12:07:08PM +0100, Hrvoje Popovski wrote:
>> On 30.1.2024. 9:27, Hrvoje Popovski wrote:
>>> I will prepare one box for this kind of traffic and will contact you and
>>> marcus
>>>
>>>> In theory when going through vlan interface it should remove
>>>> M_VLANTAG.  But something must be wrong and I wonder what.
>>>>
>>>> bluhm
>>
>> Hi,
>>
>> I've managed to trigger watchdog in lab. It couldn't be possible without
>> bluhm@ information about ix vlan, thank you.
> 
> Great, now we can debug the details.
> 
> I have to know how ix and em are connected.
> 
> Do you have any bridge or veb?  Where are your vlan trunks?
> Any aggr, trunk, carp?

No, only a vlan on ix0.


> Is my understanding of your setup correct?
> 
> ix -> vlan -> forward -> em

Yes, and forwarding only, without pf.
I'm sending traffic from a host connected to vlan/ix0 and forwarding it
through em5 to another host.
I'm sending 1Gbps of traffic with Cisco TRex.

> Can something more happen, like
> 
> ix -> forward -> em
> 

In the setup without vlan on ix I got only one watchdog at the beginning
of testing, and that's it.
With vlan I'm getting around 6 or 7 watchdogs per minute, which means 6
or 7 link up/down events.


without vlan
smc4# netstat -sp tcp | grep TSO
0 output TSO packets software chopped
268 output TSO packets hardware processed
0 output TSO packets generated
0 output TSO packets dropped
smc4# netstat -sp tcp | grep LRO
0 input LRO packets passed through pseudo device
7666573 input LRO generated packets from hardware
21667579 input LRO coalesced packets by network device
0 input bad LRO packets dropped

Re: TSO em(4) problem

2024-01-30 Thread Hrvoje Popovski

On 30.1.2024. 9:27, Hrvoje Popovski wrote:
> I will prepare one box for this kind of traffic and will contact you and
> marcus
> 
>> In theory when going through vlan interface it should remove
>> M_VLANTAG.  But something must be wrong and I wonder what.
>>
>> bluhm

Hi,

I've managed to trigger the watchdog in the lab. It wouldn't have been
possible without bluhm@'s information about ix vlan, thank you.



Jan 30 12:01:09 smc4 /bsd: em5: watchdog: head 123 tail 187 TDH 187 TDT 123
Jan 30 12:01:18 smc4 /bsd: em5: watchdog: head 243 tail 307 TDH 307 TDT 243
Jan 30 12:01:28 smc4 /bsd: em5: watchdog: head 463 tail 15 TDH 15 TDT 463
Jan 30 12:01:37 smc4 /bsd: em5: watchdog: head 413 tail 477 TDH 477 TDT 413
Jan 30 12:01:46 smc4 /bsd: em5: watchdog: head 195 tail 259 TDH 259 TDT 195
Jan 30 12:01:55 smc4 /bsd: em5: watchdog: head 259 tail 323 TDH 323 TDT 259
Jan 30 12:02:05 smc4 /bsd: em5: watchdog: head 333 tail 397 TDH 397 TDT 333
Jan 30 12:02:14 smc4 /bsd: em5: watchdog: head 33 tail 97 TDH 97 TDT 33
Jan 30 12:02:24 smc4 /bsd: em5: watchdog: head 459 tail 11 TDH 11 TDT 459
Jan 30 12:02:33 smc4 /bsd: em5: watchdog: head 447 tail 511 TDH 511 TDT 447
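In every em5 watchdog line above, tail minus head is exactly 64 modulo the ring size, including the two lines where the index wraps (463 to 15 and 459 to 11). The helper below is a sketch for checking that pattern across a whole log. It assumes a 512-entry TX ring, which is what the wraparound in these messages implies; `ring_distance()` and `watchdog_distance()` are hypothetical names, not functions in em(4), and the code makes no claim about what the distance means inside the driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 512 TX descriptors, as implied by the index wraparound. */
#define TX_RING_SIZE	512

/* Distance from head forward to tail on a ring of TX_RING_SIZE slots. */
static unsigned int
ring_distance(unsigned int head, unsigned int tail)
{
	assert(head < TX_RING_SIZE && tail < TX_RING_SIZE);
	return (tail + TX_RING_SIZE - head) % TX_RING_SIZE;
}

/*
 * Parse an em(4) watchdog log line and return the head-to-tail
 * distance, or -1 if the line does not match the expected format.
 */
static int
watchdog_distance(const char *line)
{
	const char *p = strstr(line, "watchdog:");
	unsigned int head, tail, tdh, tdt;

	if (p == NULL || sscanf(p, "watchdog: head %u tail %u TDH %u TDT %u",
	    &head, &tail, &tdh, &tdt) != 4)
		return -1;
	if (head >= TX_RING_SIZE || tail >= TX_RING_SIZE)
		return -1;
	return (int)ring_distance(head, tail);
}
```

Run over the bcbnfw1 em0 watchdog lines elsewhere in this thread, the same check gives 64 or 65.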


em0 at pci7 dev 0 function 0 "Intel 82576" rev 0x01: msi, address
00:1b:21:61:8a:94
em1 at pci7 dev 0 function 1 "Intel 82576" rev 0x01: msi, address
00:1b:21:61:8a:95
em2 at pci8 dev 0 function 0 "Intel I210" rev 0x03: msi, address
00:25:90:5d:c9:98
em3 at pci9 dev 0 function 0 "Intel I210" rev 0x03: msi, address
00:25:90:5d:c9:99
em4 at pci12 dev 0 function 0 "Intel I350" rev 0x01: msi, address
00:25:90:5d:c9:9a
em5 at pci12 dev 0 function 1 "Intel I350" rev 0x01: msi, address
00:25:90:5d:c9:9b
em6 at pci12 dev 0 function 2 "Intel I350" rev 0x01: msi, address
00:25:90:5d:c9:9c
em7 at pci12 dev 0 function 3 "Intel I350" rev 0x01: msi, address
00:25:90:5d:c9:9d


smc4# netstat -sp tcp | grep LRO
0 input LRO packets passed through pseudo device
4696315 input LRO generated packets from hardware
13205047 input LRO coalesced packets by network device
0 input bad LRO packets dropped
smc4# netstat -sp tcp | grep TSO
0 output TSO packets software chopped
3672 output TSO packets hardware processed
0 output TSO packets generated
0 output TSO packets dropped




smc4# ifconfig em5 hwfeatures
em5: flags=8c43 mtu 1500
 
hwfeatures=31b7
 hardmtu 9216
lladdr 00:25:90:5d:c9:9b
index 8 priority 0 llprio 3
media: Ethernet autoselect (1000baseT
full-duplex,master,rxpause,txpause)
status: active
inet 192.168.20.1 netmask 0xff00 broadcast 192.168.20.255

Re: TSO em(4) problem

2024-01-30 Thread Hrvoje Popovski

On 29.1.2024. 15:29, Alexander Bluhm wrote:
> On Sat, Jan 27, 2024 at 08:08:35AM +0100, Hrvoje Popovski wrote:
>> On 26.1.2024. 22:47, Alexander Bluhm wrote:
>>> On Fri, Jan 26, 2024 at 11:41:49AM +0100, Hrvoje Popovski wrote:
>>>> I've managed to reproduce the TSO em problem on another setup,
>>>> unfortunately in production.
>>> What helped debugging a similar issue with ixl(4) and TSO was to
>>> remove all TSO specific code from the driver.  Then only this part
>>> remains from the original em(4) TSO diff.
>>>
>>> error = bus_dmamap_create(sc->sc_dmat, EM_TSO_SIZE,
>>> EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
>>> EM_TSO_SEG_SIZE, 0, BUS_DMA_NOWAIT, >pkt_map);
>>>
>>> The parameters that changed when adding TSO are:
>>>
>>> bus_size_t size:     MAX_JUMBO_FRAME_SIZE 16128 -> EM_TSO_SIZE 65535
>>> bus_size_t maxsegsz: MAX_JUMBO_FRAME_SIZE 16128 -> EM_TSO_SEG_SIZE 4096
>>>
>>> I suspect that this is the cause for the regression as disabling
>>> TSO did not help.  Would it be possible to run the diff below?  I
>>> expect that the problem will still be there.  But then we know it
>>> must be the change of one of the bus_dmamap_create() arguments.
>>>
>>> bluhm
>>
>> Hi,
>>
>> with this diff em0 seems happy and em watchdog is gone.
> 
> This is very interesting.  That means that the bus_dmamap_create()
> argument does not cause the regression.
> 
> Did you see "output TSO packets hardware processed" anywhere in
> netstat -s?  In some iteration of testing you turned TSO off with
> sysctl net.inet.tcp.tso=0, but it did not help.  So no TSO packets
> from the stack.
> 
> In another mail you mentioned
> 
>> Setup is very simple
>> em0 - carp <- uplink
>> em1 - pfsync
>> ix1 - vlans - carp
> 
> ix supports LRO.  If you forward from ix1 to em0 the LRO packets
> from ix hardware are split by TSO on em hardware.  And the ix does
> vlan offloading + LRO, so em must do vlan offloading properly with
> TSO.  Or do you use a vlan interface?
> 
> Does it help to disable LRO, ifconfig ix1 -tcplro ?


Yes, it helps... Thank you

uplink
em0: flags=8b43
mtu 1500

hwfeatures=31b7
hardmtu 9216
lladdr 0c:c4:7a:da:cd:5a
index 3 priority 0 llprio 3
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,master,rxpause)
status: active


The vlans are on ix1; I've disabled LRO there:
ix1: flags=8b43
mtu 1500
lladdr 90:e2:ba:d7:1b:f5
index 2 priority 0 llprio 3
media: Ethernet autoselect (10GbaseSR full-duplex,rxpause,txpause)
status: active


Before I disabled LRO on ix1, I got a lot of watchdogs on em0:

bcbnfw1# uptime
 9:25AM  up 8 mins, 1 user, load averages: 0.14, 0.13, 0.06
bcbnfw1# cat /var/log/messages| grep watchdog
Jan 30 09:18:51 bcbnfw1 /bsd: em0: watchdog: head 148 tail 213 TDH 213 TDT 148
Jan 30 09:19:01 bcbnfw1 /bsd: em0: watchdog: head 160 tail 224 TDH 224 TDT 160
Jan 30 09:19:12 bcbnfw1 /bsd: em0: watchdog: head 163 tail 228 TDH 228 TDT 163
Jan 30 09:19:22 bcbnfw1 /bsd: em0: watchdog: head 128 tail 192 TDH 192 TDT 128
Jan 30 09:19:32 bcbnfw1 /bsd: em0: watchdog: head 309 tail 373 TDH 373 TDT 309
Jan 30 09:19:41 bcbnfw1 /bsd: em0: watchdog: head 113 tail 177 TDH 177 TDT 113
Jan 30 09:19:51 bcbnfw1 /bsd: em0: watchdog: head 402 tail 466 TDH 466 TDT 402
Jan 30 09:20:01 bcbnfw1 /bsd: em0: watchdog: head 114 tail 178 TDH 178 TDT 114
Jan 30 09:20:16 bcbnfw1 /bsd: em0: watchdog: head 111 tail 175 TDH 175 TDT 111
Jan 30 09:20:26 bcbnfw1 /bsd: em0: watchdog: head 199 tail 263 TDH 263 TDT 199



Without LRO on ix1, everything seems to work just fine...
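bluhm's explanation above is that a large packet coalesced by LRO on ix is cut back into MSS-sized segments by TSO on em when it is forwarded, so disabling LRO on ix1 also removes the TSO work on em0. The arithmetic is just a ceiling division. The helper below is an illustrative sketch; `tso_segments()` is a hypothetical name, and the 1460-byte MSS in the example assumes a 1500-byte MTU with 20-byte IPv4 and TCP headers and no TCP options.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of TCP segments TSO must emit for a coalesced payload of
 * payload_len bytes when the outgoing MSS is mss bytes.
 */
static size_t
tso_segments(size_t payload_len, size_t mss)
{
	assert(mss > 0);
	return (payload_len + mss - 1) / mss;
}
```

For example, a 64240-byte coalesced payload becomes 44 full 1460-byte segments, while one extra byte would need a 45th.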


> 
> I see this vlan code with mac_type checks.  Can we end in a
> configuration where we enable TSO but cannot do VLAN offloading?
> 
> #if NVLAN > 0
> 	/* Find out if we are in VLAN mode */
> 	if (m->m_flags & M_VLANTAG && (sc->hw.mac_type < em_82575 ||
> 	    sc->hw.mac_type > em_i210)) {
> 		/* Set the VLAN id */
> 		desc->upper.fields.special = htole16(m->m_pkthdr.ether_vtag);
> 
> 		/* Tell hardware to add tag */
> 		desc->lower.data |= htole32(E1000_TXD_CMD_VLE);
> 	}
> #endif
> 
> Hrvoje, I know you do great tests in your lab.  Did you try this
> setup:
> 
> Send bulk TCP traffic in vlan that will trigger LRO.
> Do VLAN + LRO offloading in ix.
> Forward it to em with TSO.
> 

I will prepare one box for this kind of traffic and will contact you and
marcus

> In theory when going through vlan interface it should remove
> M_VLANTAG.  But something must be wrong and I wonder what.
> 
> bluhm
>
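bluhm's question can be checked mechanically. The VLAN branch quoted above fires only for `mac_type < em_82575 || mac_type > em_i210`, which is the exact complement of the range em_setup_interface() uses to advertise TSO in Marcus's diff in this thread (`mac_type >= em_82575 && mac_type <= em_i210`). So for every TSO-capable chip this legacy-descriptor path never sets E1000_TXD_CMD_VLE, and VLAN tagging for those chips must be handled elsewhere. The enum below is a hypothetical stand-in with made-up ordering; only the two predicates, and the fact that em_i350 lies inside the TSO range, come from the driver snippets in this thread. The M_VLANTAG half of the condition is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical chip ordering. The real em(4) mac_type enum has many
 * more members; CHIP_OLDER and CHIP_NEWER stand for any type below
 * em_82575 and above em_i210 respectively.
 */
enum chip {
	CHIP_OLDER,
	CHIP_82575,
	CHIP_I350,
	CHIP_I210,
	CHIP_NEWER,
	CHIP_COUNT
};

/* Range from em_setup_interface() that advertises TSO. */
static bool
advertises_tso(enum chip c)
{
	return c >= CHIP_82575 && c <= CHIP_I210;
}

/* mac_type test from the #if NVLAN block that sets E1000_TXD_CMD_VLE. */
static bool
uses_legacy_vle(enum chip c)
{
	return c < CHIP_82575 || c > CHIP_I210;
}

/* True if every chip satisfies exactly one of the two predicates. */
static bool
predicates_complementary(void)
{
	enum chip c;

	for (c = CHIP_OLDER; c < CHIP_COUNT; c++)
		if (advertises_tso(c) == uses_legacy_vle(c))
			return false;
	return true;
}
```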

Re: TSO em(4) problem

2024-01-28 Thread Hrvoje Popovski

On 28.1.2024. 10:44, Marcus Glocker wrote:
> On Sun, Jan 28, 2024 at 12:16:20AM +0100, Hrvoje Popovski wrote:
> 
>> On 27.1.2024. 21:01, Marcus Glocker wrote:
>>> On Sat, Jan 27, 2024 at 08:01:09AM +0100, Hrvoje Popovski wrote:
>>>
>>>> On 26.1.2024. 21:56, Marcus Glocker wrote:
>>>>> On Fri, Jan 26, 2024 at 11:41:49AM +0100, Hrvoje Popovski wrote:
>>>>>
>>>>>> I've managed to reproduce the TSO em problem on another setup,
>>>>>> unfortunately in production.
>>>>>>
>>>>>> Setup is very simple
>>>>>>
>>>>>> em0 - carp <- uplink
>>>>>> em1 - pfsync
>>>>>> ix1 - vlans - carp
>>>>> Would it be possible that you also share an "ifconfig -a hwfeatures" of
>>>>> that box?  You can mask the IPs if it's too sensitive.
>>>>>
>>>>> I still try to reproduce the issue here, and for now I can't.
>>>>> Maybe in your full ifconfig output I can see some specifics about your
>>>>> configuration, which makes it more likely to reproduce the issue here.
>>>>>
>>>> Hi,
>>>>
>>>> here's the ifconfig from the second setup, where the watchdog is triggered
>>>> much faster. Originally the uplink in this setup is ix0; I've changed that
>>>> to em0 to see whether the problem would be the same as in the other setup,
>>>> and it is. That's good, because this is a pfsync setup for students and I
>>>> can do whatever I want with it :)
>>> Thanks.
>>>
>>> But still, I can do whatever I want on my em(4) I210 box, carp(4),
>>> vlan(4), creating a lot of traffic, I can't reproduce the watchdog which
>>> you are seeing :-(  I'm not sure if this is something related to your
>>> I350.
>>>
>>> Also, I can't understand why the watchdog still triggers when you disable
>>> TSO by setting net.inet.tcp.tso=0.
>>>
>>> Just to rule out that you're receiving a MAXMCLBYTES (65536) packet,
>>> while EM_TSO_SIZE (65535) is one byte less, can you please apply this
>>> diff to -current and test it?  I doubt it will make a difference, but
>>> I'm running a bit out of ideas here.
>>
>>
>> Hi,
>>
>> with this diff I'm still getting em watchdog
>>
>> Jan 28 00:14:12 bcbnfw1 /bsd: em0: watchdog: head 120 tail 185 TDH 185 TDT 120
> 
> Thanks for testing again.
> 
> I think we might have a generic problem with TSO with the current em(4)
> code and some chips.  Referring to this recent FreeBSD commit.
> 
> e1000: disable TSO on lem(4) and em(4):
> Disable TSO on lem(4) and em(4) until a ring stall can be debugged.
> https://github.com/freebsd/freebsd-src/commit/797e480cba8834e584062092c098e60956d28180
> 
> Can you try this diff to specifically disable TSO for I350 please?
> 
> We will need to discuss internally which way to go.  I see those
> options currently:
> 
> - Entirely pull out the TSO diff.
> - Leave the TSO code in but disable TSO for now (what FreeBSD did).
> - Leave the TSO code in but disable TSO only for chips we see issues
>   with (this diff).
> 


Hi,


With this diff I still see TSOv4 and TSOv6 on the I350; is this OK?
The em0 watchdog is triggered whether net.inet.tcp.tso is 1 or 0.

em0: flags=8b43
mtu 1500
hwfeatures=31b7
hardmtu 9216
lladdr 0c:c4:7a:da:cd:5a
index 3 priority 0 llprio 3
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,master,rxpause)
status: active


em0 at pci7 dev 0 function 0 "Intel I350" rev 0x01: msi, address


Jan 28 13:18:45 bcbnfw1 /bsd: em0: watchdog: head 89 tail 153 TDH 153 TDT 89
Jan 28 13:41:19 bcbnfw1 /bsd: em0: watchdog: head 336 tail 400 TDH 400 TDT 336
Jan 28 13:58:13 bcbnfw1 /bsd: em0: watchdog: head 172 tail 236 TDH 236 TDT 172





> 
> Index: if_em.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> diff -u -p -u -p -r1.370 if_em.c
> --- if_em.c   31 Dec 2023 08:42:33 -  1.370
> +++ if_em.c   28 Jan 2024 09:30:59 -
> @@ -2013,7 +2013,9 @@ em_setup_interface(struct em_softc *sc)
>  	if (sc->hw.mac_type >= em_82575 && sc->hw.mac_type <= em_i210) {
>  		ifp->if_capabilities |= IFCAP_CSUM_IPv4;
>  		ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
> -		ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
> +		/* XXX: Enabling TSO on I350 causes watchdogs */
> +		if (sc->hw.mac_type != em_i350)
> +			ifp->if_capabilities |= IFCAP_TSOv4 | IFCAP_TSOv6;
>  	}
>  
>  	/* 
>

Re: TSO em(4) problem

2024-01-27 Thread Hrvoje Popovski

On 27.1.2024. 21:01, Marcus Glocker wrote:
> On Sat, Jan 27, 2024 at 08:01:09AM +0100, Hrvoje Popovski wrote:
> 
>> On 26.1.2024. 21:56, Marcus Glocker wrote:
>>> On Fri, Jan 26, 2024 at 11:41:49AM +0100, Hrvoje Popovski wrote:
>>>
>>>> I've managed to reproduce the TSO em problem on another setup,
>>>> unfortunately in production.
>>>>
>>>> Setup is very simple
>>>>
>>>> em0 - carp <- uplink
>>>> em1 - pfsync
>>>> ix1 - vlans - carp
>>> Would it be possible that you also share an "ifconfig -a hwfeatures" of
>>> that box?  You can mask the IPs if it's too sensitive.
>>>
>>> I still try to reproduce the issue here, and for now I can't.
>>> Maybe in your full ifconfig output I can see some specifics about your
>>> configuration, which makes it more likely to reproduce the issue here.
>>>
>> Hi,
>>
>> here's the ifconfig from the second setup, where the watchdog is triggered
>> much faster. Originally the uplink in this setup is ix0; I've changed that
>> to em0 to see whether the problem would be the same as in the other setup,
>> and it is. That's good, because this is a pfsync setup for students and I
>> can do whatever I want with it :)
> Thanks.
> 
> But still, I can do whatever I want on my em(4) I210 box, carp(4),
> vlan(4), creating a lot of traffic, I can't reproduce the watchdog which
> you are seeing :-(  I'm not sure if this is something related to your
> I350.
> 
> Also, I can't understand why the watchdog still triggers when you disable
> TSO by setting net.inet.tcp.tso=0.
> 
> Just to rule out that you're receiving a MAXMCLBYTES (65536) packet,
> while EM_TSO_SIZE (65535) is one byte less, can you please apply this
> diff to -current and test it?  I doubt it will make a difference, but
> I'm running a bit out of ideas here.
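The one-byte gap Marcus describes is easy to see in isolation. bus_dma(9) describes the size argument of bus_dmamap_create() as the maximum DMA transfer the map can describe. The minimal C sketch below uses only the two constants quoted above and a hypothetical fits_in_dmamap() helper. It is not driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Constants as quoted in this thread. */
#define MAXMCLBYTES	(64 * 1024)	/* 65536 */
#define EM_TSO_SIZE	65535

/*
 * Hypothetical helper: a DMA map created with a total size of mapsize
 * bytes can describe a packet only if the packet is no longer than that.
 */
static bool
fits_in_dmamap(unsigned long pktlen, unsigned long mapsize)
{
	return pktlen <= mapsize;
}
```

A maximum-size 65536-byte packet would therefore not fit into a map created with EM_TSO_SIZE. Whether such packets actually reach em(4) here is what the test diff is meant to rule out.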


Hi,

With this diff I'm still getting the em watchdog:

Jan 28 00:14:12 bcbnfw1 /bsd: em0: watchdog: head 120 tail 185 TDH 185 TDT 120

Re: TSO em(4) problem

2024-01-26 Thread Hrvoje Popovski

On 26.1.2024. 22:47, Alexander Bluhm wrote:
> On Fri, Jan 26, 2024 at 11:41:49AM +0100, Hrvoje Popovski wrote:
>> I've managed to reproduce the TSO em problem on another setup, unfortunately
>> a production one.
> What helped debugging a similar issue with ixl(4) and TSO was to
> remove all TSO specific code from the driver.  Then only this part
> remains from the original em(4) TSO diff.
> 
> error = bus_dmamap_create(sc->sc_dmat, EM_TSO_SIZE,
>   EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
>   EM_TSO_SEG_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
> 
> The parameters that changed when adding TSO are:
> 
> bus_size_t size:  MAX_JUMBO_FRAME_SIZE 16128 -> EM_TSO_SIZE 65535
> bus_size_t maxsegsz:  MAX_JUMBO_FRAME_SIZE 16128 -> EM_TSO_SEG_SIZE 4096
> 
> I suspect that this is the cause for the regression as disabling
> TSO did not help.  Would it be possible to run the diff below?  I
> expect that the problem will still be there.  But then we know it
> must be the change of one of the bus_dmamap_create() arguments.
> 
> bluhm
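Here are the two changed bus_dmamap_create() arguments in numbers. A buffer of size bytes needs at least ceil(size / maxsegsz) DMA segments. The minimal C sketch below computes that lower bound with a hypothetical min_dma_segs() helper and the constants from bluhm's mail. Real mbuf chains and page boundaries can need more segments than this bound:

```c
#include <assert.h>

/* Values quoted in bluhm's mail. */
#define MAX_JUMBO_FRAME_SIZE	16128
#define EM_TSO_SIZE		65535
#define EM_TSO_SEG_SIZE		4096

/*
 * Hypothetical helper: lower bound on the number of DMA segments needed
 * to map size bytes when no single segment may exceed maxsegsz bytes.
 */
static unsigned long
min_dma_segs(unsigned long size, unsigned long maxsegsz)
{
	assert(maxsegsz > 0);
	return (size + maxsegsz - 1) / maxsegsz;	/* ceiling division */
}
```

With these values a 16128-byte buffer goes from at least 1 segment to at least 4, and a full EM_TSO_SIZE buffer needs at least 16. All of them must fit within the nsegments argument, EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1).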

Hi,

With this diff em0 seems happy and the em watchdog is gone.

bcbnfw1# uptime
 8:06AM  up 44 mins, 2 users, load averages: 0.00, 0.00, 0.00

bcbnfw1# ifconfig em0 hwfeatures
em0: flags=8b43
mtu 1500

hwfeatures=1b7
hardmtu 9216
lladdr 0c:c4:7a:da:cd:5a
index 3 priority 0 llprio 3
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,master,rxpause)
status: active
inet 10.10.155.234 netmask 0xfff8 broadcast 10.10.155.239


This morning, without the diff:
bcbnfw1# cat /var/log/messages | grep watchdog
Jan 27 07:12:03 bcbnfw1 /bsd: em0: watchdog: head 50 tail 114 TDH 114 TDT 50
Jan 27 07:15:29 bcbnfw1 /bsd: em0: watchdog: head 370 tail 434 TDH 434 TDT 370
Jan 27 07:15:43 bcbnfw1 /bsd: em0: watchdog: head 219 tail 283 TDH 283 TDT 219
Jan 27 07:15:54 bcbnfw1 /bsd: em0: watchdog: head 322 tail 386 TDH 386 TDT 322
Jan 27 07:16:08 bcbnfw1 /bsd: em0: watchdog: head 115 tail 179 TDH 179 TDT 115
Jan 27 07:16:21 bcbnfw1 /bsd: em0: watchdog: head 364 tail 428 TDH 428 TDT 364
Jan 27 07:16:35 bcbnfw1 /bsd: em0: watchdog: head 473 tail 26 TDH 26 TDT 473

Re: TSO em(4) problem

2024-01-26 Thread Hrvoje Popovski

On 26.1.2024. 21:56, Marcus Glocker wrote:
> On Fri, Jan 26, 2024 at 11:41:49AM +0100, Hrvoje Popovski wrote:
> 
>> I've managed to reproduce the TSO em problem on another setup, unfortunately
>> a production one.
>>
>> Setup is very simple
>>
>> em0 - carp <- uplink
>> em1 - pfsync
>> ix1 - vlans - carp
> 
> Would it be possible that you also share an "ifconfig -a hwfeatures" of
> that box?  You can mask the IPs if it's too sensitive.
> 
> I still try to reproduce the issue here, and for now I can't.
> Maybe in your full ifconfig output I can see some specifics about your
> configuration, which makes it more likely to reproduce the issue here.
> 

Hi,

here's the ifconfig from the second setup, where the watchdog is triggered
much faster. Originally the uplink in this setup is ix0; I've changed that to
em0 to see whether the problem would be the same as in the other setup, and
it is. That's good, because this is a pfsync setup for students and I can do
whatever I want with it :)



bcbnfw1# ifconfig -a hwfeatures
lo0: flags=2008049 mtu 32768

hwfeatures=7187
index 6 priority 0 llprio 3
groups: lo
inet 127.0.0.1 netmask 0xff00
ix0: flags=2008802 mtu 1500

hwfeatures=71b7
hardmtu 9198
lladdr 90:e2:ba:d7:1b:f4
index 1 priority 0 llprio 3
media: Ethernet autoselect (10GbaseSR full-duplex)
status: active
ix1:
flags=2008b43
mtu 1500

hwfeatures=71b7
hardmtu 9198
lladdr 90:e2:ba:d7:1b:f5
index 2 priority 0 llprio 3
media: Ethernet autoselect (10GbaseSR full-duplex,rxpause,txpause)
status: active
em0: flags=8b43
mtu 1500

hwfeatures=31b7
hardmtu 9216
lladdr 0c:c4:7a:da:cd:5a
index 3 priority 0 llprio 3
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,rxpause)
status: active
inet 10.10.155.234 netmask 0xfff8 broadcast 10.10.155.239
em1: flags=8843 mtu 1500

hwfeatures=31b7
hardmtu 9216
lladdr 0c:c4:7a:da:cd:5b
index 4 priority 0 llprio 3
media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause)
status: active
inet 192.168.0.77 netmask 0xfffc broadcast 192.168.0.79
enc0: flags=0<>
hwfeatures=0<>
index 5 priority 0 llprio 3
groups: enc
status: active
carp0: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:01
index 7 priority 15 llprio 3
carp: MASTER carpdev em0 vhid 1 advbase 1 advskew 10
groups: carp
status: master
inet 10.10.155.236 netmask 0x
carp1100: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:12
index 8 priority 15 llprio 3
carp: MASTER carpdev vlan1100 vhid 18 advbase 1 advskew 10
groups: carp
status: master
inet 10.30.16.1 netmask 0x
carp1101: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:16
index 9 priority 15 llprio 3
carp: MASTER carpdev vlan1101 vhid 22 advbase 1 advskew 10
groups: carp
status: master
inet 10.31.16.1 netmask 0x
carp1102: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:19
index 10 priority 15 llprio 3
carp: MASTER carpdev vlan1102 vhid 25 advbase 1 advskew 10
groups: carp
status: master
inet 10.32.16.1 netmask 0x
carp1103: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:1c
index 11 priority 15 llprio 3
carp: MASTER carpdev vlan1103 vhid 28 advbase 1 advskew 10
groups: carp
status: master
inet 10.33.16.1 netmask 0x
carp1130: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:13
index 12 priority 15 llprio 3
carp: MASTER carpdev vlan1130 vhid 19 advbase 1 advskew 10
groups: carp
status: master
inet 10.30.0.1 netmask 0x
carp1131: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:17
index 13 priority 15 llprio 3
carp: MASTER carpdev vlan1131 vhid 23 advbase 1 advskew 10
groups: carp
status: master
inet 10.31.0.1 netmask 0x
carp1132: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:1a
index 14 priority 15 llprio 3
carp: MASTER carpdev vlan1132 vhid 26 advbase 1 advskew 10
groups: carp
status: master
inet 10.32.0.1 netmask 0x
carp1133: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500
lladdr 00:00:5e:00:01:1d
index 15 priority 15 llprio 3
carp: MASTER carpdev vlan1133 vhid 29 advbase 1 advskew 10
groups: carp
status: master
inet 10.33.0.1 netmask 0x
carp1150: flags=8843 mtu 1500

hwfeatures=3187
hardmtu 1500

Re: TSO em(4) problem

2024-01-26 Thread Hrvoje Popovski

ev
2.00/1.00 addr 1
pcib0 at pci1 dev 31 function 0 "Intel C610 LPC" rev 0x05
ahci1 at pci1 dev 31 function 2 "Intel C610 AHCI" rev 0x05: msi, AHCI 1.3
ahci1: port 0: 6.0Gb/s
ahci1: port 1: 6.0Gb/s
scsibus2 at ahci1: 32 targets
sd0 at scsibus2 targ 0 lun 0: 
naa.5002538d417c7a2b
sd0: 244198MB, 512 bytes/sector, 500118192 sectors, thin
sd1 at scsibus2 targ 1 lun 0: 
naa.5002538d417cc12c
sd1: 244198MB, 512 bytes/sector, 500118192 sectors, thin
ichiic0 at pci1 dev 31 function 3 "Intel C610 SMBus" rev 0x05: apic 1 int 18
iic0 at ichiic0
isa0 at pcib0
isadma0 at isa0
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
vmm0 at mainbus0: VMX/EPT
uhub3 at uhub0 port 12 configuration 1 interface 0 "ATEN International
product 0x7000" rev 2.00/0.00 addr 2
uhidev0 at uhub3 port 1 configuration 1 interface 0 "ATEN International
product 0x2419" rev 1.10/1.00 addr 3
uhidev0: iclass 3/1
ukbd0 at uhidev0: 8 variable keys, 6 key codes
wskbd0 at ukbd0 mux 1
uhidev1 at uhub3 port 1 configuration 1 interface 1 "ATEN International
product 0x2419" rev 1.10/1.00 addr 3
uhidev1: iclass 3/1
ums0 at uhidev1: 3 buttons, Z dir
wsmouse0 at ums0 mux 0
uhub4 at uhub1 port 1 configuration 1 interface 0 "Intel Rate Matching
Hub" rev 2.00/0.05 addr 2
uhub5 at uhub2 port 1 configuration 1 interface 0 "Intel Rate Matching
Hub" rev 2.00/0.05 addr 2
vscsi0 at root
scsibus3 at vscsi0: 256 targets
softraid0 at root
scsibus4 at softraid0: 256 targets
root on sd0a (06e397d0f983db15.a) swap on sd0b dump on sd0b



On 24.1.2024. 0:48, Hrvoje Popovski wrote:
> Hi all,
> 
> in production I have simple carp pfsync setup with
> em0 - carp <- uplink
> em1 - pfsync
> ix0 - vlan - carp <- internal networks
> ix1 - not used
> and for vpn I have wireguard, and people connect to the em0 carp address.
> There are no bridges, tunnels, or exotic pf features in this setup.
> 
> Until this snapshot
> OpenBSD 7.4-current (GENERIC.MP) #1587: Sat Dec 30 22:44:51 MST 2023
> everything was fine,
> but with and after
> OpenBSD 7.4-current (GENERIC.MP) #1588: Thu Jan  4 20:58:35 MST 2024
> em0 started going up/down spontaneously and em0 watchdog logs started to
> appear in messages:
> 
> em0: watchdog: head 113 tail 178 TDH 178 TDT 113
> carp1: state transition: BACKUP -> MASTER
> 
> even with net.inet.tcp.tso=0
> 
> 
> When I revert the em TSO diffs (if_em.c to r1.369 and if_em.h to r1.80),
> the firewall starts to work normally and em0 is fine.
> 
> After rebooting the firewall and promoting it to carp master, I started to
> collect kstat em0::: output after em0 watchdog log entries:
> 
> 
> 1) Jan 22 08:01:01 fw2 /bsd: em0: watchdog: head 473 tail 25 TDH 25 TDT 473
>   kstat em0::: - em0-1.txt
> 2) Jan 22 08:07:11 fw2 /bsd: em0: watchdog: head 114 tail 178 TDH 178 TDT 114
> 3) Jan 22 08:08:16 fw2 /bsd: em0: watchdog: head 61 tail 126 TDH 126 TDT 61
>   kstat em0::: - em0-3.txt
> 4) Jan 22 08:21:23 fw2 /bsd: em0: watchdog: head 452 tail 5 TDH 5 TDT 452
> 5) Jan 22 08:33:48 fw2 /bsd: em0: watchdog: head 352 tail 416 TDH 416 TDT 352
> 6) Jan 22 08:36:20 fw2 /bsd: em0: watchdog: head 446 tail 510 TDH 510 TDT 446
>   kstat em0::: - em0-6.txt
> 7) Jan 22 08:42:16 fw1 /bsd: em0: watchdog: head 385 tail 450 TDH 450 TDT 385
>   kstat em0::: - em0-7.txt
> 
> 
> In the attachment you can find the em0 kstat output files (em0-1.txt,
> em0-3.txt, em0-6.txt, em0-7.txt) and kstat-all.txt, which is the kstat of
> all interfaces with the TSO diff, taken after the 7th em0 watchdog log entry.
> 
> From the logs it seems that the em0:0:txq:0 oactives counter, the em0
> watchdog and em0 going up/down are somehow connected, because every time I
> see an em0 watchdog, the oactives counter is increased by one.
> 
> 
> log on switch
> I 01/22/24 08:01:01 00077 ports: port 2 is now off-line
> I 01/22/24 08:01:05 00076 ports: port 2 is now on-line
> 
> I 01/22/24 08:07:11 00077 ports: port 2 is now off-line
> I 01/22/24 08:07:14 00076 ports: port 2 is now on-line
> 
> I 01/22/24 08:08:16 00077 ports: port 2 is now off-line
> I 01/22/24 08:08:20 00076 ports: port 2 is now on-line
> 
> I 01/22/24 08:21:23 00077 ports: port 2 is now off-line
> I 01/22/24 08:21:26 00076 ports: port 2 is now on-line
> 
> I 01/22/24 08:33:47 00077 ports: port 2 is now off-line
> I 01/22/24 08:33:51 00076 ports: port 2 is now on-line
> 
> I 01/22/24 08:36:20 00077 ports: port 2 is now off-line
> I 01/22/24 08:36:24 00076 ports: port 2 is now on-line
> 
> I 01/22/24 08:42:16 00077 ports: port 2 is now off-line
> I 01/22/24 08:42:20 00076 ports: port 2 is now on-line
> 
> em0 is connected to port 2
> ix0 is connected to port 6, and it's up the whole time...
> 
> 
> Packet processing and some light load over em0 are needed to trigger the
> em0 watchdog, and only the carp master is affected. Over night th

TSO em(4) problem

2024-01-23 Thread Hrvoje Popovski

Hi all,

in production I have simple carp pfsync setup with
em0 - carp <- uplink
em1 - pfsync
ix0 - vlan - carp <- internal networks
ix1 - not used
and for vpn I have wireguard, and people connect to the em0 carp address.
There are no bridges, tunnels, or exotic pf features in this setup.

Until this snapshot
OpenBSD 7.4-current (GENERIC.MP) #1587: Sat Dec 30 22:44:51 MST 2023
everything was fine,
but with and after
OpenBSD 7.4-current (GENERIC.MP) #1588: Thu Jan  4 20:58:35 MST 2024
em0 started going up/down spontaneously and em0 watchdog logs started to
appear in messages:

em0: watchdog: head 113 tail 178 TDH 178 TDT 113
carp1: state transition: BACKUP -> MASTER

even with net.inet.tcp.tso=0


When I revert the em TSO diffs (if_em.c to r1.369 and if_em.h to r1.80),
the firewall starts to work normally and em0 is fine.

After rebooting the firewall and promoting it to carp master, I started to
collect kstat em0::: output after em0 watchdog log entries:


1) Jan 22 08:01:01 fw2 /bsd: em0: watchdog: head 473 tail 25 TDH 25 TDT 473
kstat em0::: - em0-1.txt
2) Jan 22 08:07:11 fw2 /bsd: em0: watchdog: head 114 tail 178 TDH 178 TDT 114
3) Jan 22 08:08:16 fw2 /bsd: em0: watchdog: head 61 tail 126 TDH 126 TDT 61
kstat em0::: - em0-3.txt
4) Jan 22 08:21:23 fw2 /bsd: em0: watchdog: head 452 tail 5 TDH 5 TDT 452
5) Jan 22 08:33:48 fw2 /bsd: em0: watchdog: head 352 tail 416 TDH 416 TDT 352
6) Jan 22 08:36:20 fw2 /bsd: em0: watchdog: head 446 tail 510 TDH 510 TDT 446
kstat em0::: - em0-6.txt
7) Jan 22 08:42:16 fw1 /bsd: em0: watchdog: head 385 tail 450 TDH 450 TDT 385
kstat em0::: - em0-7.txt


In the attachment you can find the em0 kstat output files (em0-1.txt,
em0-3.txt, em0-6.txt, em0-7.txt) and kstat-all.txt, which is the kstat of
all interfaces with the TSO diff, taken after the 7th em0 watchdog log entry.

From the logs it seems that the em0:0:txq:0 oactives counter, the em0
watchdog and em0 going up/down are somehow connected, because every time I
see an em0 watchdog, the oactives counter is increased by one.


log on switch
I 01/22/24 08:01:01 00077 ports: port 2 is now off-line
I 01/22/24 08:01:05 00076 ports: port 2 is now on-line

I 01/22/24 08:07:11 00077 ports: port 2 is now off-line
I 01/22/24 08:07:14 00076 ports: port 2 is now on-line

I 01/22/24 08:08:16 00077 ports: port 2 is now off-line
I 01/22/24 08:08:20 00076 ports: port 2 is now on-line

I 01/22/24 08:21:23 00077 ports: port 2 is now off-line
I 01/22/24 08:21:26 00076 ports: port 2 is now on-line

I 01/22/24 08:33:47 00077 ports: port 2 is now off-line
I 01/22/24 08:33:51 00076 ports: port 2 is now on-line

I 01/22/24 08:36:20 00077 ports: port 2 is now off-line
I 01/22/24 08:36:24 00076 ports: port 2 is now on-line

I 01/22/24 08:42:16 00077 ports: port 2 is now off-line
I 01/22/24 08:42:20 00076 ports: port 2 is now on-line

em0 is connected to port 2
ix0 is connected to port 6, and it's up the whole time...


Packet processing and some light load over em0 are needed to trigger the
em0 watchdog, and only the carp master is affected. Overnight there are 2
or 3 em0 watchdogs. The firewalls are heavily underutilized: about 5k
states and under 100Mbps.

To rule out an em hardware problem, I sysupgraded the second firewall, and
the problem was the same as on the first one.


I am willing to debug this further, but I don't know what else to look
at...

And of course, thank you guys for carp and pfsync; without them this would
be a problem, but it's not :)




kstat em0::: after a day without the TSO diffs:

fw2# uptime
12:36AM  up 1 day, 13:38, 2 users, load averages: 0.35, 0.23, 0.23
fw2# kstat em0:::
em0:0:em-stats:0
 rx crc errs: 0 packets
   rx align errs: 0 packets
   rx align errs: 0 packets
 rx errs: 0 packets
   rx missed: 0 packets
  tx single coll: 0 packets
  tx excess coll: 0 packets
   tx multi coll: 0 packets
tx late coll: 0 packets
 tx coll: 0
   tx defers: 0
   tx no CRS: 0 packets
seq errs: 0
   carr ext errs: 0 packets
 rx len errs: 0 packets
  rx xon: 0 packets
  tx xon: 0 packets
 rx xoff: 0 packets
 tx xoff: 0 packets
  FC unsupported: 0 packets
  rx 64B: 6361422 packets
  rx 65-127B: 19106140 packets
 rx 128-255B: 4430154 packets
 rx 256-511B: 5116503 packets
rx 512-1023B: 7665843 packets
rx 1024-maxB: 86778341 packets
 rx good: 129458403 packets
rx bcast: 147 packets
rx mcast: 112979 packets
 tx good: 67968976 packets
 rx good: 134077827262 bytes
 tx good: 32314161469 bytes
   rx no buffers: 4 packets
rx undersize: 0 packets
rx fragments: 0 packets
 rx oversize: 0 packets
  rx jabbers: 0 packets
 rx mgmt: 0 packets
   rx mgmt drops: 0 packets
 tx mgmt: 0 packets
rx total: 134077827262 bytes
tx total: 32314161469 bytes
rx total: 129458403 packets
tx total: 67968976 packets
  tx 64B: 8932448 packets
  tx 65-127B: 31092764 packets
 tx 128-255B: 3930861 packets
 tx 256-511B: 2126737 packets
tx 512-1023B: 4009214

Re: bnxt panic - HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.

2024-01-08 Thread Hrvoje Popovski

On 9.1.2024. 3:04, Jonathan Matthew wrote:
> On Wed, Jan 03, 2024 at 10:14:12AM +0100, Hrvoje Popovski wrote:
>> On 3.1.2024. 7:51, Jonathan Matthew wrote:
>>> On Wed, Jan 03, 2024 at 01:50:06AM +0100, Alexander Bluhm wrote:
>>>> On Wed, Jan 03, 2024 at 12:26:26AM +0100, Hrvoje Popovski wrote:
>>>>> While testing the kettenis@ ipl diff from tech@, running iperf3 to a bnxt
>>>>> interface and doing ifconfig bnxt0 down/up at the same time, I can trigger
>>>>> a panic. The panic can also be triggered without the kettenis@ diff...
>>>> It is easy to reproduce.  ifconfig bnxt1 down/up a few times while
>>>> receiving TCP traffic with iperf3.  Machine still has kettenis@ diff.
>>>> My panic looks different.
>>> It looks like I wasn't trying very hard when I wrote bnxt_down().
>>> I think there's also a problem with bnxt_up() unwinding after failure
>>> in various places, but that's a different issue.
>>>
>>> This makes it more resilient for me, though it still logs
>>> 'bnxt0: unexpected completion type 3' a lot if I take the interface
>>> down while it's in use.  I'll look at that separately.
>>
>> Hi,
>>
>> with this diff I can still panic the box with ifconfig up/down, but not
>> as fast as without it
> 
> Right, this is the other problem where bnxt_up() wasn't cleaning up properly
> after failing part way through.  This diff should fix that, but I don't think
> it will fix the 'HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error'
> problem, so the interface will still stop working at that point.
> 


With this diff bnxt behaves exactly as you said.
After a lot of ifconfig down/up cycles, at some point I get:

smc24# ifconfig bnxt0 down
smc24# ifconfig bnxt0 up
bnxt0: attempt to re-allocate ring 0010
bnxt0: failed to allocate completion queue 0

and bnxt stops working.



> 
> Index: if_bnxt.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
> retrieving revision 1.39
> diff -u -p -r1.39 if_bnxt.c
> --- if_bnxt.c 10 Nov 2023 15:51:20 -  1.39
> +++ if_bnxt.c 9 Jan 2024 01:59:38 -
> @@ -1073,7 +1081,7 @@ bnxt_up(struct bnxt_softc *sc)
>   if (bnxt_hwrm_vnic_ctx_alloc(sc, &sc->sc_vnic.rss_id) != 0) {
>   printf("%s: failed to allocate vnic rss context\n",
>   DEVNAME(sc));
> - goto down_queues;
> + goto down_all_queues;
>   }
>  
>   sc->sc_vnic.id = (uint16_t)HWRM_NA_SIGNATURE;
> @@ -1139,8 +1147,11 @@ dealloc_vnic:
>   bnxt_hwrm_vnic_free(sc, &sc->sc_vnic);
>  dealloc_vnic_ctx:
>   bnxt_hwrm_vnic_ctx_free(sc, &sc->sc_vnic.rss_id);
> +
> +down_all_queues:
> + i = sc->sc_nqueues;
>  down_queues:
> - for (i = 0; i < sc->sc_nqueues; i++)
> + while (i-- > 0)
>   bnxt_queue_down(sc, &sc->sc_queues[i]);
>  
>   bnxt_dmamem_free(sc, sc->sc_rx_cfg);
>
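The diff follows the usual goto-unwind idiom. When queue setup fails part way, only the queues that were actually brought up are torn down, in reverse order. A failure after all queues are up jumps to a label that first sets i to the queue count. Below is a minimal, self-contained C sketch of that idiom. queue_up(), queue_down(), dev_up(), queues_up() and NQUEUES are hypothetical stand-ins, not bnxt(4) code. The assert in queue_down() stands for tearing down a queue that was never set up:

```c
#include <assert.h>
#include <stdbool.h>

#define NQUEUES	4		/* hypothetical queue count */

static bool queue_is_up[NQUEUES];

/* Hypothetical per-queue setup; fails when i == fail_at. */
static int
queue_up(int i, int fail_at)
{
	if (i == fail_at)
		return -1;
	queue_is_up[i] = true;
	return 0;
}

/* Hypothetical teardown; only valid for a queue that was brought up. */
static void
queue_down(int i)
{
	assert(queue_is_up[i]);
	queue_is_up[i] = false;
}

/*
 * Bring up all queues, then a later stage (standing in for the vnic/rss
 * context). fail_queue < 0 means every queue succeeds; fail_later != 0
 * makes the later stage fail.
 */
static int
dev_up(int fail_queue, int fail_later)
{
	int i;

	for (i = 0; i < NQUEUES; i++) {
		if (queue_up(i, fail_queue) != 0)
			goto down_queues;	/* only queues 0 .. i-1 are up */
	}

	if (fail_later)
		goto down_all_queues;

	return 0;

down_all_queues:
	i = NQUEUES;
down_queues:
	while (i-- > 0)			/* tear down in reverse order */
		queue_down(i);
	return -1;
}

/* Count how many queues are currently up. */
static int
queues_up(void)
{
	int i, n = 0;

	for (i = 0; i < NQUEUES; i++)
		n += queue_is_up[i];
	return n;
}
```

For example, dev_up(2, 0) fails at queue 2, tears down queues 1 and 0, and leaves no queue up. In this sketch the previous `for (i = 0; i < sc->sc_nqueues; i++)` style of teardown would also have called queue_down() on queues 2 and 3, which trips the assert.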

Re: bnxt panic - HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.

2024-01-03 Thread Hrvoje Popovski

On 3.1.2024. 7:51, Jonathan Matthew wrote:
> On Wed, Jan 03, 2024 at 01:50:06AM +0100, Alexander Bluhm wrote:
>> On Wed, Jan 03, 2024 at 12:26:26AM +0100, Hrvoje Popovski wrote:
>>> While testing the kettenis@ ipl diff from tech@, running iperf3 to a bnxt
>>> interface and doing ifconfig bnxt0 down/up at the same time, I can trigger
>>> a panic. The panic can also be triggered without the kettenis@ diff...
>> It is easy to reproduce.  ifconfig bnxt1 down/up a few times while
>> receiving TCP traffic with iperf3.  Machine still has kettenis@ diff.
>> My panic looks different.
> It looks like I wasn't trying very hard when I wrote bnxt_down().
> I think there's also a problem with bnxt_up() unwinding after failure
> in various places, but that's a different issue.
> 
> This makes it more resilient for me, though it still logs
> 'bnxt0: unexpected completion type 3' a lot if I take the interface
> down while it's in use.  I'll look at that separately.

Hi,

With this diff I can still panic the box with ifconfig up/down, but not
as fast as without it.

Panic with the diff:

bnxt0: HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.
bnxt0: failed to set up tx ring
uvm_fault(0xfd8e57e02460, 0xff0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  bnxt_queue_down+0x62:   movq0(%r12,%rax,1),%rsi
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
* 70181  53204  0 0x3  00K ifconfig
bnxt_queue_down(802c9000,802c9f88) at bnxt_queue_down+0x62
bnxt_up(802c9000) at bnxt_up+0x36b
bnxt_ioctl(802c9048,80206910,8000607fffd0) at bnxt_ioctl+0x162
ifioctl(fd8e417ab758,80206910,8000607fffd0,800060797aa8) at
ifioctl+0x726
sys_ioctl(800060797aa8,8000608000d0,800060800120) at
sys_ioctl+0x2af
syscall(800060800190) at syscall+0x3b4
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7e3d0a930430, count: 8
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.


ddb{0}> show reg
rdi   0x8244b950pci_bus_dma_tag
rsi   0x802c9f88
rbp   0x8000607ffe40
rbx0x101
rdx   0xc803
rcx0x206
rax0xff0
r8  0x3f
r9 0
r10   0xa14b312597c5ea6a
r11   0x819fac40_bus_dmamap_destroy
r120
r130x100
r14   0x802c9f88
r15   0x802c9000
rip   0x81b578e2bnxt_queue_down+0x62
cs   0x8
rflags   0x10216__ALIGN_SIZE+0xf216
rsp   0x8000607ffde0
ss  0x10
bnxt_queue_down+0x62:   movq0(%r12,%rax,1),%rsi



ddb{0}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*53204   70181  81971  0  7 0x3ifconfig
 57044  336864  81971  0  30x100083  kqreadiperf3
 57044  317909  81971  0  3   0x4100083  kqreadiperf3
 57044  253167  81971  0  3   0x4100083  kqreadiperf3
 57044  199984  81971  0  3   0x4100083  kqreadiperf3
 57044  343144  81971  0  3   0x4100083  kqreadiperf3
 81971  379109  1  0  30x10008b  sigsusp   ksh
 69236  410163  1  0  30x100098  kqreadcron
 28984  478747  27164 95  3   0x1100092  kqreadsmtpd
 75309  290569  27164103  3   0x1100092  kqreadsmtpd
  3782  175531  27164 95  3   0x1100092  kqreadsmtpd
 60089   38850  27164 95  30x100092  kqreadsmtpd
 72803  151501  27164 95  3   0x1100092  kqreadsmtpd
 88240  203086  27164 95  3   0x1100092  kqreadsmtpd
 27164  293957  1  0  30x100080  kqreadsmtpd
 51687  170066  1  0  30x88  kqreadsshd
 82716  114406  1  0  30x100080  kqreadntpd
 95469  439610  76144 83  30x100092  kqreadntpd
 76144  242283  1 83  3   0x1100092  kqreadntpd
 25275  206721  16938 73  3   0x1100090  kqreadsyslogd
 16938  424245  1  0  30x100082  netio syslogd
 92580  279098  0  0  3 0x14200  bored smr
 40549  159120  0  0  3 0x14200  pgzerozerothread
 12488  115575  0  0  3 0x14200  aiodoned  aiodoned
 91171  460632  0  0  3 0x14200  syncerupdate
 83952  275089  0  0  3 0x14200  cleaner   cleaner
  6394  148862  0  0  3 0x14200  reaperreaper
 60888  287201  0  0  3 0x14200  pgdaemon  pagedaemon
 25804  403088  0  0  3

bnxt panic - HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.

2024-01-02 Thread Hrvoje Popovski

Hi all,

While testing the kettenis@ ipl diff from tech@, running iperf3 to a bnxt
interface and doing ifconfig bnxt0 down/up at the same time, I can trigger
a panic. The panic can also be triggered without the kettenis@ diff...


bnxt0: HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.
bnxt0: failed to set up tx ring
uvm_fault(0xfd8e57f12a20, 0xff0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  bnxt_queue_down+0x62:   movq0(%r12,%rax,1),%rsi
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
*292054  36537  0 0x3  00K ifconfig
 163937  81780  0 0x14000 0x42006  sensors
bnxt_queue_down(802c9000,802c9f88) at bnxt_queue_down+0x62
bnxt_up(802c9000) at bnxt_up+0x36b
bnxt_ioctl(802c9048,80206910,8000607295f0) at bnxt_ioctl+0x162
ifioctl(fd8e442f2758,80206910,8000607295f0,8000607cf2b0) at
ifioctl+0x726
sys_ioctl(8000607cf2b0,8000607296f0,800060729740) at
sys_ioctl+0x2af
syscall(8000607297b0) at syscall+0x3b4
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x726ac871f790, count: 8
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.


ddb{0}> show reg
rdi   0x82485c78pci_bus_dma_tag
rsi   0x802c9f88
rbp   0x800060729460
rbx0x101
rdx   0xc803
rcx0x286
rax0xff0
r8  0x3f
r9 0
r10   0x96b31028f3e5d46c
r11   0x81825410_bus_dmamap_destroy
r120
r130x100
r14   0x802c9f88
r15   0x802c9000
rip   0x81db3da2bnxt_queue_down+0x62
cs   0x8
rflags   0x10216__ALIGN_SIZE+0xf216
rsp   0x800060729400
ss  0x10
bnxt_queue_down+0x62:   movq0(%r12,%rax,1),%rsi
ddb{0}>


ddb{0}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*36537  292054  47404  0  7 0x3ifconfig
 86797  280843  47404  0  30x100083  kqreadiperf3
 86797  429491  47404  0  3   0x4100083  kqreadiperf3
 86797  214299  47404  0  3   0x4100083  kqreadiperf3
 86797  368590  47404  0  3   0x4100083  kqreadiperf3
 86797  380965  47404  0  3   0x4100083  kqreadiperf3
 47404  299766  1  0  30x10008b  sigsusp   ksh
  7161  521423  1  0  30x100098  kqreadcron
 39740  121938  83604 95  3   0x1100092  kqreadsmtpd
 94839  467744  83604103  3   0x1100092  kqreadsmtpd
 31264  522699  83604 95  3   0x1100092  kqreadsmtpd
 94528  511199  83604 95  30x100092  kqreadsmtpd
 37502  123618  83604 95  3   0x1100092  kqreadsmtpd
 89306   15887  83604 95  3   0x1100092  kqreadsmtpd
 83604  206718  1  0  30x100080  kqreadsmtpd
   428   70010  1  0  30x88  kqreadsshd
 94146  379619  1  0  30x100080  kqreadntpd
 23446  401588  82414 83  30x100092  kqreadntpd
 82414  378350  1 83  3   0x1100092  kqreadntpd
 80891  252069  55631 73  3   0x1100090  kqreadsyslogd
 55631   62854  1  0  30x100082  netio syslogd
 60491  452354  0  0  3 0x14200  bored smr
 20945   92407  0  0  3 0x14200  pgzerozerothread
 369255987  0  0  3 0x14200  aiodoned  aiodoned
 55091  437847  0  0  3 0x14200  syncerupdate
 13970  164134  0  0  3 0x14200  cleaner   cleaner
 36841  522592  0  0  3 0x14200  reaperreaper
 93326  303752  0  0  3 0x14200  pgdaemon  pagedaemon
  7898  311095  0  0  3 0x14200  usbtskusbtask
  2747  192075  0  0  3 0x14200  usbatsk   usbatsk
 97645  203456  0  0  3  0x40014200  acpi0 acpi0
 57525   67008  0  0  7  0x40014200idle23
 51862  472206  0  0  7  0x40014200idle22
 60651  418998  0  0  7  0x40014200idle21
  3576  237393  0  0  7  0x40014200idle20
  6504  170181  0  0  7  0x40014200idle19
 207063186  0  0  7  0x40014200idle18
 78053  233580  0  0  7  0x40014200idle17
 29625   58284  0  0  7  0x40014200idle16
 94538  146456  0  0  7  0x40014200idle15
 84902  429192  0  0  7  0x40014200

Re: terminal is cleared when logging as root

2023-10-23 Thread Hrvoje Popovski

On 23.10.2023. 19:16, Daniel Jakots wrote:
> Hi, I installed a new machine on Saturday (with -current) and I noticed
> that when I logged in as root the terminal gets cleared but not cleanly.
> I upgraded an existing machine to a newer snapshot and then the problem
> appeared as well. This happens when using `doas su -`, `ssh root@` and I
> think I had it on console as well. For some reason, it doesn't happen
> with my regular user. Previous snapshot was from 2023-10-13. I guess
> it's since the libcurses update on the 17th? Cheers, Daniel

Hi,

I can confirm what Daniel said over ssh, and over the console I'm getting this:

OpenBSD/amd64 (r620-1.srce.hr) (tty01)

login: root
Password:
Last login: Mon Oct 23 10:18:24 on tty01
OpenBSD 7.4-current (GENERIC.MP) #1419: Mon Oct 23 10:14:12 MDT 2023

Welcome to OpenBSD: The proactively secure Unix-like operating system.

Please use the sendbug(1) utility to report bugs in the system.
Before reporting a bug, please try to reproduce it with the latest
version of the code.  With bug reports, please try to ensure that
enough information to reproduce the problem is enclosed, and if a
known fix for it exists, include that as well.

You have mail.
F

r620-1#

When I log in as a user over the console:

OpenBSD/amd64 (r620-2.srce.hr) (tty01)

login: hrvoje
Password:
Last login: Mon Oct 23 19:55:11 on ttyp2 from 161.53.255.123
OpenBSD 7.4-current (GENERIC.MP) #1419: Mon Oct 23 10:14:12 MDT 2023

Welcome to OpenBSD: The proactively secure Unix-like operating system.

Please use the sendbug(1) utility to report bugs in the system.
Before reporting a bug, please try to reproduce it with the latest
version of the code.  With bug reports, please try to ensure that
enough information to reproduce the problem is enclosed, and if a
known fix for it exists, include that as well.

You have new mail.
r620-2$ su -
Password:
F

r620-2#

Re: Dell R6515 with mpii HBA330 Mini - mpii_scsi_cmd_tmo (0x40005862)

2023-09-09 Thread Hrvoje Popovski

On 1.8.2023. 1:59, Hrvoje Popovski wrote:
> Hi all,
> 
> I've got 2 new Dell servers for VPNs and firewalling with the Dell non-RAID
> HBA330 Mini, and after install both firewalls freeze with this log.
> This is not the first time I've seen that log on Dell servers with the HBA330.

The nice thing, which I didn't know, is that when the SSD is replaced with an
NVMe disk in the same slot, the HBA330 is out of the game: the NVMe disk seems
to be connected directly to the motherboard, and it's very fast. But I need to
boot UEFI so that the NVMe disk can be recognized as a boot disk by the Dell
server.

vpn1# dmesg | grep nvme
nvme0 at pci13 dev 0 function 0 vendor "SK hynix", unknown product
0x2839 rev 0x21: msix, NVMe 1.3
nvme0: Dell DC NVMe PE8010 RI U.2 960GB, firmware 1.2.0, serial
SJC2N4257I34R2Q19
scsibus2 at nvme0: 17 targets, initiator 0

Dell R6515 with mpii HBA330 Mini - mpii_scsi_cmd_tmo (0x40005862)

2023-07-31 Thread Hrvoje Popovski

Hi all,

I've got 2 new Dell servers for VPNs and firewalling with the Dell non-RAID
HBA330 Mini, and after install both firewalls freeze with this log.
This is not the first time I've seen that log on Dell servers with the HBA330.


OpenBSD/amd64 (vpn2.lan) (tty00)

login: root
123
^Cmpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)


After that I can only reboot the boxes. Sometimes I'm able to log in over
iDRAC or ssh, but mostly they freeze and print the log above.

I've seen that mpii_scsi_cmd_tmo log even in the ramdisk during sysupgrade.

sysupgrade
Set name(s)? (or 'abort' or 'done') [done] done
Directory does not contain SHA256.sig. Continue without verification?
[no] yes
Installing bsd  100% |**| 24695 KB00:00
Installing bsd.mp   100% |**| 24787 KB00:00
Installing bsd.rd   100% |**|  4549 KB00:00
Installing base73.tgz   100% |**|   368 MB00:05
Installing comp73.tgz   100% |**| 75590 KB00:02
Installing man73.tgz100% |**|  7822 KB00:00
Installing game73.tgz   100% |**|  2748 KB00:00
Installing xbase73.tgz0% |  | 0
--:-- ETAmpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)
mpii0: mpii_scsi_cmd_tmo (0x40005862)


The one time I was able to run reposync and a checkout from the local
disk, I got:

mpii0: mpii_scsi_cmd_tmo (0x2400)
mpii0: mpii_scsi_cmd_tmo (0x2400)
sd0(mpii0:4:0): Check Condition (error 0x70) on opcode 0x2a
SENSE KEY: Not Ready
 ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable
sd0(mpii0:4:0): Check Condition (error 0x70) on opcode 0x2a
SENSE KEY: Not Ready
 ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable
sd0(mpii0:4:0): Check Condition (error 0x70) on opcode 0x2a
SENSE KEY: Not Ready
 ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable
sd0(mpii0:4:0): Check Condition (error 0x70) on opcode 0x28
SENSE KEY: Not Ready
 ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable
sd0(mpii0:4:0): Check Condition (error 0x70) on opcode 0x28
SENSE KEY: Not Ready
 ASC/ASCQ: Logical Unit Not Ready, Cause Not Reportable


kernel: protection fault trap, code=0
Stopped at  dounmount+0x52: movq0x28(%r13),%rax
ddb> trace
dounmount(802f6000,808,800025420828) at dounmount+0x52
vop_generic_revoke(800025435858) at vop_generic_revoke+0x7d
VOP_REVOKE(fdb8e9d931c0,1) at VOP_REVOKE+0x3b
vdevgone(4,0,f,3) at vdevgone+0xd5
disk_gone(816bd690,0) at disk_gone+0x7e
sddetach(802cfa00,1) at sddetach+0x3e
config_detach(802cfa00,1) at config_detach+0x140
scsi_detach_link(802f3e00,1) at scsi_detach_link+0x60
scsi_detach_target(801ec380,4,1) at scsi_detach_target+0x5d
mpii_event_sas(801e9c00) at mpii_event_sas+0x251
taskq_thread(8246ad10) at taskq_thread+0xf0
end trace frame: 0x0, count: -11



dmesg:

OpenBSD 7.3-current (GENERIC.MP) #1324: Mon Jul 31 14:48:11 MDT 2023
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 274435207168 (261721MB)
avail mem = 266098278400 (253771MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.3 @ 0x6989c000 (68 entries)
bios0: vendor Dell Inc. version "2.11.4" date 03/22/2023
bios0: Dell Inc. PowerEdge R6515
acpi0 at bios0: ACPI 6.3
acpi0: sleep states S0 S5
acpi0: tables DSDT FACP BERT ERST HEST HPET APIC MCFG WSMT SSDT SSDT
EINJ SSDT CRAT CDIT IVRS SPCR SSDT SSDT
acpi0: wakeup devices PC00(S5) XHCI(S3) PC01(S5) XHCI(S3) PC02(S5)
XHCI(S3) PC03(S5) XHCI(S3)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
ioapic0 at mainbus0: apid 240 pa 0xfec0, version 21, 24 pins, can't
remap
ioapic1 at mainbus0: apid 241 pa 0xe010, version 21, 32 pins, can't
remap
ioapic2 at mainbus0: apid 242 pa 0xc510, version 21, 32 pins, can't
remap
ioapic3 at mainbus0: apid 243 pa 0xaa10, version 21, 32 pins, can't
remap
ioapic4 at mainbus0: apid 244 pa 0xfd10, version 21, 32 pins, can't
remap
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD EPYC 7313P 16-Core Processor, 3000.00 MHz, 19-01-01
cpu0:

Re: dvmrpd start causes kernel panic: assertion failed

2023-06-07 Thread Hrvoje Popovski

On 7.6.2023. 12:31, Why 42? The lists account. wrote:
> 
> Hi All,
> 
> Just FYI, in my attempts to route multicast traffic I started a daemon
> called "dvmrpd", and the kernel panicked immediately. See attached photo.
> 
> Prior to the panic the XFCE Desktop was running. After the panic I could
> not find any combination of keys that would allow me to enter a debugger
> or gather more info.
> 
> I then modified the dvmrpd config file e.g. to change the interface
> configuration and also changed the configured interface IP addresses
> themselves. A second attempt to start the daemon resulted in the same
> immediate panic. So it could be that I don't know what I'm doing, but
> apparently pretty reproducible :-/
> 
> This occurred using the 7.3 AMD64 release on a Lenovo ThinkPad with an
> 11th gen i7 CPU.
> 
> Cheers,
> Robb.
> 

If dvmrpd is enabled, you should see this in the boot messages:
starting network daemons: sshd dvmrpd smtpd.

I don't see that in your screenshot.

Copied from the boot messages when multicast forwarding and dvmrpd are enabled:

ddb.console: 0 -> 1
kern.pool_debug: 1 -> 0
kern.maxclusters: 262144 -> 1048576
net.inet.ip.mforwarding: 0 -> 1
starting network
reordering: ld.so libc libcrypto sshd.
starting early daemons: syslogd ntpd.
starting RPC daemons:.
savecore: no core dump
checking quotas: done.
clearing /tmp
kern.securelevel: 0 -> 1
creating runtime link editor directory cache.
preserving editor files.
starting network daemons: sshd dvmrpd smtpd.
starting local daemons: cron.
Wed Jun  7 19:47:52 CEST 2023

Re: pfsync_bulk_update panic

2023-05-10 Thread Hrvoje Popovski

On 10.5.2023. 0:24, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> On Tue, May 09, 2023 at 06:26:43PM +, mabi wrote:
>> Hi,
>>
>> On a brand new OpenBSD 7.3 firewall (amd64) I get a kernel panic every few
>> days and was wondering if this panic I get is related to this issue/bug?
>>
> 
> your panic got fixed by recent commit [1]
> 
> Hrvoje was/is hitting very close to that KASSERT() (now removed) at line 2274.
> in Hrvoje's case the TAILQ_REMOVE() macro complains we attempt to remove state
> which is removed already:
> 
> 2273 atomic_sub_long(&sc->sc_len, pfsync_qs[q].len);
> 2274 TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
> 2275 if (TAILQ_EMPTY(&sc->sc_qs[q]))
> 2276 atomic_sub_long(&sc->sc_len, sizeof (struct pfsync_subheader));
> 2277 st->sync_state = PFSYNC_S_NONE;
> 2278 mtx_leave(&sc->sc_st_mtx);
> 2279
> 2280 pf_state_unref(st);
> 
> The cause is very similar: pfsync relies on a volatile value in the
> ->sync_state member. The ->sync_state member must be modified under
> protection of ->mtx.
> 
> The issue was pointed out by bluhm@ during the m2k23 hackathon, where I shared
> my pfsync(4) headache with him. The diff below is my attempt to fix it. I had no
> chance to test it. I'd appreciate it if you could give it a try and let me know
> how things look.
> 
> thanks and
> regards
> sashan
> 
> [1] https://marc.info/?l=openbsd-cvs=168269695603160=2
> 
> 

Hi,

With this diff I can't trigger the panic below:


r620-1# uvm_fault(0x82598710, 0x17, 0, 2) -> e
kernel: page fault trap, code=2
Stopped at  pfsync_q_del+0x8d:  movq%rdx,0x8(%rax)
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
*254020  58090  0 0x14000  0x2000K systq
pfsync_q_del(fd8360adf6e0) at pfsync_q_del+0x8d
pfsync_delete_state(fd8360adf6e0) at pfsync_delete_state+0x118
pf_remove_state(fd8360adf6e0) at pf_remove_state+0x156
pf_purge_expired_states(2e9c1) at pf_purge_expired_states+0x273
pf_purge(8258caa0) at pf_purge+0x2c
taskq_thread(824a8150) at taskq_thread+0x100
end trace frame: 0x0, count: 9
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>


I was only able to trigger it in the lab, and there it happened quite fast.
Now, after a few hours, it's still stable.



> 8<---8<---8<--8<
> diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
> index 822b4211d0f..811d9d59666 100644
> --- a/sys/net/if_pfsync.c
> +++ b/sys/net/if_pfsync.c
> @@ -1362,14 +1362,17 @@ pfsync_grab_snapshot(struct pfsync_snapshot *sn, struct pfsync_softc *sc)
>  
>   while ((st = TAILQ_FIRST(&sc->sc_qs[q])) != NULL) {
>   TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
> + mtx_enter(&st->mtx);
>   if (st->snapped == 0) {
>   TAILQ_INSERT_TAIL(&sn->sn_qs[q], st, sync_snap);
>   st->snapped = 1;
> + mtx_leave(&st->mtx);
>   } else {
>   /*
>* item is on snapshot list already, so we can
>* skip it now.
>*/
> + mtx_leave(&st->mtx);
>   pf_state_unref(st);
>   }
>   }
> @@ -1422,11 +1425,13 @@ pfsync_drop_snapshot(struct pfsync_snapshot *sn)
>   continue;
>  
>   while ((st = TAILQ_FIRST(&sn->sn_qs[q])) != NULL) {
> + mtx_enter(&st->mtx);
>   KASSERT(st->sync_state == q);
>   KASSERT(st->snapped == 1);
>   TAILQ_REMOVE(&sn->sn_qs[q], st, sync_snap);
>   st->sync_state = PFSYNC_S_NONE;
>   st->snapped = 0;
> + mtx_leave(&st->mtx);
>   pf_state_unref(st);
>   }
>   }
> @@ -1665,6 +1670,7 @@ pfsync_sendout(void)
>  
>   count = 0;
>   while ((st = TAILQ_FIRST(&sn.sn_qs[q])) != NULL) {
> + mtx_enter(&st->mtx);
>   TAILQ_REMOVE(&sn.sn_qs[q], st, sync_snap);
>   KASSERT(st->sync_state == q);
>   KASSERT(st->snapped == 1);
> @@ -1672,6 +1678,7 @@ pfsync_sendout(void)
>   st->snapped = 0;
>   pfsync_qs[q].write(st, m->m_data + offset);
>   offset += pfsync_qs[q].len;
> + mtx_leave(&st->mtx);
>  
>   pf_state_unref(st);
>   count++;
> @@ -1725,8 +1732,6 @@ pfsync_insert_state(struct pf_state *st)
>   ISSET(st->state_flags, PFSTATE_NOSYNC))
>   return;
>  
> - KASSERT(st->sync_state == PFSYNC_S_NONE);
> -
>   if (sc->sc_len == PFSYNC_MINPKT)
>
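The point of the diff above is that reading st->sync_state and acting on it must happen under one lock hold; otherwise two threads can both see the state as queued and both try to TAILQ_REMOVE() it, which is what the quoted panic complains about. A minimal userland sketch of that pattern follows, assuming POSIX threads. struct sync_state_demo, demo_enqueue() and demo_dequeue() are hypothetical names and this is not pfsync code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for PFSYNC_S_NONE: "not on any queue". */
#define S_NONE (-1)

/* Hypothetical, simplified state: only the fields the race is about. */
struct sync_state_demo {
	pthread_mutex_t mtx;	/* protects sync_state and snapped */
	int sync_state;		/* queue index, or S_NONE */
	int snapped;		/* 1 while the state sits on a snapshot list */
};

void
demo_init(struct sync_state_demo *st)
{
	pthread_mutex_init(&st->mtx, NULL);
	st->sync_state = S_NONE;
	st->snapped = 0;
}

/*
 * Put the state on queue q unless it is already queued.
 * The check and the update happen under the same lock hold.
 */
int
demo_enqueue(struct sync_state_demo *st, int q)
{
	int queued = 0;

	pthread_mutex_lock(&st->mtx);
	if (st->sync_state == S_NONE) {
		st->sync_state = q;
		queued = 1;
	}
	pthread_mutex_unlock(&st->mtx);
	return queued;
}

/*
 * Take the state off its queue. Returns 0 if another thread already
 * removed it, so the caller must not remove (or unref) it a second time.
 */
int
demo_dequeue(struct sync_state_demo *st)
{
	int removed = 0;

	pthread_mutex_lock(&st->mtx);
	if (st->sync_state != S_NONE) {
		st->sync_state = S_NONE;
		st->snapped = 0;
		removed = 1;
	}
	pthread_mutex_unlock(&st->mtx);
	return removed;
}
```

The second demo_dequeue() call on the same state returns 0 instead of touching the queue again, which is the "already removed" case the KASSERT/TAILQ checks trip over when the test is done without the lock.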

Re: pfsync_bulk_update panic

2023-02-08 Thread Hrvoje Popovski

On 8.2.2023. 8:53, Alexandr Nedvedicky wrote:
> Hello,
> 
> On Tue, Feb 07, 2023 at 09:12:38PM +0100, Hrvoje Popovski wrote:
> 
>>
>> Hi,
>>
>> this panic is with a plain snapshot and I didn't do anything. I will leave
>> the box in ddb in case anything else is needed.
>>
> It does not look like there is more data to gather in ddb.
> Maybe I'm quick in my judgment. This is the relevant part
> of the pfsync_bulk_update() function:
> 2456 int i = 0;
>   /* `i` seems to be kept in %r12 */
> 2457
> 2458 NET_LOCK();
> 2459 sc = pfsyncif;
> 2460 if (sc == NULL)
> 2461 goto out;
> 2462
> 2463 rw_enter_read(&pf_state_list.pfs_rwl);
> 2464 st = sc->sc_bulk_next;
>   /* `st` is kept in %r15 */
> 2465 sc->sc_bulk_next = NULL;
> 2466
> 2467 for (;;) {
> 2468 if (st->sync_state == PFSYNC_S_NONE &&
> 2469 st->timeout < PFTM_MAX &&
> 2470 st->pfsync_time <= sc->sc_ureq_received) {
> 2471 pfsync_update_state_req(st);
> 2472 i++;
> 2473 }
> 
> 
> 
> 
>> ddb{0}> dmesg
>> OpenBSD 7.2-current (GENERIC.MP) #1021: Sun Feb  5 09:52:50 MST 2023
>> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>>
>>
>> r620-2# uvm_fault(0x824fb2f8, 0x14e, 0, 1) -> e
>> kernel: page fault trap, code=0
>> Stopped at  pfsync_bulk_update+0x60:cmpb$0xff,0x14e(%r15)
>> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>> *109809  58944  0 0x14000 0x42000K softclock
>> pfsync_bulk_update(0) at pfsync_bulk_update+0x60
> we seem to be dying at line 2468 due to a NULL pointer dereference
> 
>> softclock_thread(8000f050) at softclock_thread+0x13b
>> end trace frame: 0x0, count: 13
>> https://www.openbsd.org/ddb.html describes the minimum info required in
>> bug reports.  Insufficient info makes it difficult to find and fix bugs.
>> ddb{0}>
>>
> 
> 
>> r11   0xfbec2dfc846efdb5
>> r120
>> r13   0x82503f80timeout_proc
>> r14   0x809d8000
>> r150
>> rip   0x8101aea0pfsync_bulk_update+0x60
> r12 (`i`) is 0, which suggests the loop is most likely in its first
> iteration.
> r15 (`st`) is 0 ... so it looks like a trivial bug: we try to send
> a bulk, but there is nothing to send. This makes me wonder if the diff below
> makes your test box more stable.
> 
> 
> Can you give the diff below a try?
> 
> thanks a lot for your help
> 
> regards
> sashan

Hi,

With this diff I can't trigger the panic as before. I've been trying the whole
day and should have seen a panic or two, but there hasn't been any...

Thank you...
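The diff sashan refers to is not included in this archive. As a rough illustration of the check that r15 (`st`) == 0 points to, here is a userland sketch that bails out before the first dereference when there is nothing to send. bulk_update() and struct bulk_state are hypothetical and only mimic the shape of the quoted loop, whose body reads st->sync_state before any NULL test.

```c
#include <assert.h>
#include <stddef.h>

#define S_NONE (-1)	/* hypothetical stand-in for PFSYNC_S_NONE */

/* Hypothetical, simplified state list entry. */
struct bulk_state {
	struct bulk_state *next;
	int sync_state;
	int queued;		/* set when an update would be sent */
};

/*
 * Walk the list starting at *bulk_next and mark every idle state.
 * The loop body dereferences st before any NULL test, so the
 * empty case has to be handled up front.
 */
int
bulk_update(struct bulk_state **bulk_next)
{
	struct bulk_state *st = *bulk_next;
	int i = 0;

	*bulk_next = NULL;
	if (st == NULL)		/* nothing to send (st == 0, as in r15) */
		return 0;

	for (;;) {
		if (st->sync_state == S_NONE) {
			st->queued = 1;
			i++;
		}
		st = st->next;
		if (st == NULL)
			break;
	}
	return i;
}
```

Without the early return, calling bulk_update() with an empty list faults on the first st->sync_state read, matching the cmpb on 0x...(%r15) with r15 = 0 in the trace.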

pfsync_bulk_update panic

2023-02-07 Thread Hrvoje Popovski

Hi all,

In the lab I'm playing around with an ip4/ip6 sasyncd setup, which requires
carp, pf, pfsync, isakmpd and sasyncd.
I'm sending ip4/ip6 traffic through IPsec tunnels and at the same time
sending ip4 traffic over the firewall just to activate all cores. I'm running
with NET_TASKQ=6 on 6-core firewalls.

ix2 is the pfsync interface, and when sending traffic and doing ifconfig ix2
down && ifconfig ix2 up, from time to time I'm able to trigger a panic.

This panic is with WITNESS, and when doing mach ddbcpu X the box freezes.


r620-1# ifconfig ix2 down
r620-1# ifconfig ix2 up
uvm_fault(0x8251ee68, 0x16e, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  pfsync_bulk_update+0x60:cmpb$0xff,0x16e(%r15)
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
 270521  35272 68   0x110  01  sasyncd
 489979  76548  0 0x14000  0x2003  reaper
 164092  74224  0 0x14000  0x2004  softnet
 112060  78126  0 0x14000  0x2002  systq
*372775  98656  0 0x14000 0x42000  softclock
pfsync_bulk_update(0) at pfsync_bulk_update+0x60
timeout_run(81942978) at timeout_run+0x93
softclock_thread(8000f050) at softclock_thread+0x11d
end trace frame: 0x0, count: 12
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>


ddb{0}> show panic
*cpu0: uvm_fault(0x8251ee68, 0x16e, 0, 1) -> e
 cpu3: kernel diagnostic assertion "!_kernel_lock_held()" failed: file
"/sys/uvm/uvm_map.c", line 2539
ddb{0}>

ddb{0}> show reg
rdi  0x4
rsi0
rbp   0x800022d53bb0
rbx0
rdx   0xde007fffc240
rcx0x206
rax  0xd
r80x7fff
r90x800022d53c40
r10   0x82084c2bcmd0646_9_tim_udma+0x485f1
r11   0xbeeb38867a1c691d
r120
r13   0x8000f050
r14   0x81942000
r150
rip   0x814e71e0pfsync_bulk_update+0x60
cs   0x8
rflags   0x10246__ALIGN_SIZE+0xf246
rsp   0x800022d53b70
ss 0
pfsync_bulk_update+0x60:cmpb$0xff,0x16e(%r15)
ddb{0}>


ddb{0}> show locks
shared rwlock pfstates r = 0 (0x8245cc00)
#0  witness_lock+0x311
#1  pfsync_bulk_update+0x45
#2  timeout_run+0x93
#3  softclock_thread+0x11d
#4  proc_trampoline+0x1c
exclusive rwlock netlock r = 0 (0x82454b38)
#0  witness_lock+0x311
#1  rw_enter+0x292
#2  pfsync_bulk_update+0x29
#3  timeout_run+0x93
#4  softclock_thread+0x11d
#5  proc_trampoline+0x1c
exclusive kernel_lock _lock r = 1 (0x8252b258)
#0  witness_lock+0x311
#1  __mp_acquire_count+0x38
#2  mi_switch+0x28b
#3  sleep_finish+0xfe
#4  rw_enter+0x232
#5  pfsync_bulk_update+0x29
#6  timeout_run+0x93
#7  softclock_thread+0x11d
#8  proc_trampoline+0x1c
shared rwlock timeout r = 0 (0x8244c9c8)
#0  witness_lock+0x311
#1  timeout_run+0x88
#2  softclock_thread+0x11d
#3  proc_trampoline+0x1c
ddb{0}>


ddb{0}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
 75873  445724  20843 68  3   0x190  kqreadisakmpd
 20843   31033  1  0  30x80  netio isakmpd
 76865  283351  1  0  30x10008b  sigsusp   ksh
 43091  324769  1  0  30x100098  kqreadcron
 92254  264061  28601 95  3   0x1100092  kqreadsmtpd
 80520  324180  28601103  3   0x1100092  kqreadsmtpd
 12107  295529  28601 95  3   0x1100092  kqreadsmtpd
 89174  344742  28601 95  30x100092  kqreadsmtpd
 50810  389490  28601 95  3   0x1100092  kqreadsmtpd
 75581  433356  28601 95  3   0x1100092  kqreadsmtpd
 28601  432136  1  0  30x100080  kqreadsmtpd
 67099   85178  1  0  30x88  kqreadsshd
 35272  270521  29963 68  7   0x110sasyncd
 29963  124841  1  0  30x80  kqreadsasyncd
 27546  425204  1  0  30x100080  kqreadntpd
 88920  144011  29553 83  30x100092  kqreadntpd
 295532629  1 83  3   0x1100092  kqreadntpd
 25414  252219  66731 73  3   0x1100090  kqreadsyslogd
 667319587  1  0  30x100082  netio syslogd
 13849  243057  0  0  3 0x14200  bored smr
 15866  463556  0  0  3 0x14200  pgzerozerothread
 29043  244190  0  0  3 0x14200  aiodoned  aiodoned
 50284  435047  0  0  3 0x14200  syncerupdate
 91848  147363  0  0  3

Re: pf_state_export crash

2022-12-25 Thread Hrvoje Popovski

On 26.12.2022. 5:06, Csillag Tamas wrote:
> hi,
> 
> the crash repeated again
> 
> uvm_fault(0x823ed470, 0x0, 0, 1) -> e
> fatal page fault in supervisor mode
> trap type 6 code 0 rip 81e4c208 cs 8 rflags 10246 cr2 0 cpl 0 rsp 
> 8000225a4060
> gsbase 0x80001e119ff0  kgsbase 0x0
> panic: trap type 6, code=0, pc=81e4c208
> Starting stack trace...
> panic(81f27c9e) at panic+0x12c
> kerntrap(8000225a3fb0) at kerntrap+0x114
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> pf_state_export(fd8002e8a1c0,fd952aa907e0) at pf_state_export+0x38
> pfsync_sendout() at pfsync_sendout+0x5e4
> pfsync_update_state(fd9526f4f3f0) at pfsync_update_state+0x15b
> pf_test(2,1,81eb5800,8000225a4438) at pf_test+0x117a
> ip_input_if(8000225a4438,8000225a,4,0,81eb5800) at 
> ip_input_if+0xcd
> ipv4_input(81eb5800,fd805f671000) at ipv4_input+0x39
> ether_input(81eb5800,fd805f671000) at ether_input+0x3b1
> carp_input(81ecd800,fd805f671000,5e000101) at carp_input+0x196
> ether_input(81ecd800,fd805f671000) at ether_input+0x1d9
> if_input_process(814e3050,8000225a4618) at if_input_process+0x6f
> ifiq_process(814e5700) at ifiq_process+0x69
> taskq_thread(80037200) at taskq_thread+0x100
> end trace frame: 0x0, count: 242
> End of stack trace.
> 
> Regards,
>  Tamas

Hi,

Can you upgrade to the latest snapshot with sysupgrade?
If that doesn't solve your panic, can you try this diff:
https://www.mail-archive.com/tech@openbsd.org/msg72582.html

This was my panic:
https://www.mail-archive.com/bugs@openbsd.org/msg18583.html
and that diff solved it.

Re: Random kernel panic on 7.2

2022-11-22 Thread Hrvoje Popovski

On 22.11.2022. 18:48, Josmar Pierri wrote:
> I upgraded to 7.2 snapshot #849 early this morning, but it crashed
> twice in a few hours.
> This time, however, the panic message is different:
> 

Could you compile a kernel with this diff?
https://www.mail-archive.com/tech@openbsd.org/msg72582.html

At least for me, that diff makes my firewall stable.




> uvm_fault(0x8236dcb8, 0x17, 0, 2) -> e
> kernel: page fault trap, code=0
> Stopped at pfsync_q_del+0x96:movq  %rdx,0x8(%rax)
> TID   PID  UID  PRFLAGS  PFLAGS   CPU   COMMAND
>  436110  83038  0   0x14000  0x200  3 softnet
>  395295  39926  0   0x14000  0x200  0 softnet
>  189958   2208  0   0x14000  0x200  2 softnet
> * 658395423  0   0x14000  0x200  1 systqmp
> pfsync_q_del(fd8401d63890) at pfsync_q_del+0x96
> pfsync_delete_state(fd8401d63890) at pfsync_delete_state+0x118
> pf_remove_state(fd8401d63890) at pf_remove_state+0x14b
> pf_purge_expired_states(4031,40) at pf_purge_expired_states+0x242
> pf_purge_states(0) at pf_purge_states+0x1c
> taskq_thread(822a1a10) at taskq_thread+0x100
> end trace frame: 0x0, count: 9
> 
> This is all I could manage to get since the crash happened when I was
> away (and that stupid Dell console timeout when idle, removing the USB
> keyboard)
> 
> I observed a thing that may or may not be related to this issue: The
> "output fail" counter keeps steadily increasing both on aggregate and
> the two member interfaces:
> 
> :~# netstat -i -I aggr0
> Name    Mtu   Network     Address              Ipkts Ifail     Opkts Ofail Colls
> aggr0   9200              fe:e1:ba:d0:91:13 224426940     0 200785282   357     0
> 
> At first I thought it could be something related to the switches but I
> still haven't found anything wrong with them.
> 
> 
> 
> On Mon, Nov 21, 2022 at 1:22 PM Hrvoje Popovski  wrote:
>>
>> On 21.11.2022. 16:04, Josmar Pierri wrote:
>>> Hi,
>>>
>>> I managed to get screenshots of a random kernel panic that we are
>>> having on a server here.
>>> They were taken using a console management tool embedded into the
>>> server (Dell IDRAC) and are PNG images of the panic itself, trace of
>>> all cpus and ps.
>>> I'm not attaching them here right now because I don't know how the
>>> list would react to them.
>>>
>>> I attached the output of:
>>> 1 - sendbug -P
>>> 2 - dmesg right after reboot
>>> 3 - dmesg-boot
>>>
>>> This server has an aggr0 grouping bnxt0 and bnxt1, both at 10 Gbps.
>>> Its task is to load-balance RDP traffic (TCP 3389) among 2 large pools
>>> (more than 50 servers on each one) and 3 small ones using pf (tables)
>>> for that.
>>>
>>> These panics happen at random times without an apparent cause.
>>>
>>> The panic message reads:
>>>
>>> ddb{3}> show panic
>>> *cpu3: kernel diagnostic assertion "st->snapped == 0" failed: file
>>> "/usr/src/sys/net/if_pfsync.c", line 1591
>>>  cpu2: kernel diagnostic assertion "st->snapped == 0" failed: file
>>> "/usr/src/sys/net/if_pfsync.c", line 1591
>>>  cpu1: kernel diagnostic assertion "st->snapped == 0" failed: file
>>> "/usr/src/sys/net/if_pfsync.c", line 1591
>>> ddb{3}>
>>>
>>> Please advise how I should proceed to submit the screenshots.
>>
>> Hi,
>>
>> I have a similar setup with aggr grouping ix0 and ix1 and pfsync. If you
>> have two firewalls, can you sysupgrade this one to the latest snapshot?
>>
>> I'm running a snapshot from after the last hackathon with this diff
>> https://www.mail-archive.com/tech@openbsd.org/msg72582.html
>>
>> and for now the firewall seems to work just fine.
>>
>>
>>
>

Re: pfsync panic in pfsync_insert_state - syspatch?

2022-11-21 Thread Hrvoje Popovski

On 21.11.2022. 14:26, Damjan Dimitrov wrote:
> One thing I forgot to mention, these clusters also run ipsec.
> I attach another stack-trace from a different node.
> Thx.

I know this is not a solution, but could you increase the number of
CPUs to more than 4, for example 6 or 8? I think it could make the
panics a little less frequent...

I'm just curious...

Re: Random kernel panic on 7.2

2022-11-21 Thread Hrvoje Popovski

On 21.11.2022. 16:04, Josmar Pierri wrote:
> Hi,
> 
> I managed to get screenshots of a random kernel panic that we are
> having on a server here.
> They were taken using a console management tool embedded into the
> server (Dell IDRAC) and are PNG images of the panic itself, trace of
> all cpus and ps.
> I'm not attaching them here right now because I don't know how the
> list would react to them.
> 
> I attached the output of:
> 1 - sendbug -P
> 2 - dmesg right after reboot
> 3 - dmesg-boot
> 
> This server has an aggr0 grouping bnxt0 and bnxt1, both at 10 Gbps.
> Its task is to load-balance RDP traffic (TCP 3389) among 2 large pools
> (more than 50 servers on each one) and 3 small ones using pf (tables)
> for that.
> 
> These panics happen at random times without an apparent cause.
> 
> The panic message reads:
> 
> ddb{3}> show panic
> *cpu3: kernel diagnostic assertion "st->snapped == 0" failed: file
> "/usr/src/sys/net/if_pfsync.c", line 1591
>  cpu2: kernel diagnostic assertion "st->snapped == 0" failed: file
> "/usr/src/sys/net/if_pfsync.c", line 1591
>  cpu1: kernel diagnostic assertion "st->snapped == 0" failed: file
> "/usr/src/sys/net/if_pfsync.c", line 1591
> ddb{3}>
> 
> Please advise how I should proceed to submit the screenshots.

Hi,

I have a similar setup with aggr grouping ix0 and ix1 and pfsync. If you
have two firewalls, can you sysupgrade this one to the latest snapshot?

I'm running a snapshot from after the last hackathon with this diff
https://www.mail-archive.com/tech@openbsd.org/msg72582.html

and for now the firewall seems to work just fine.

panic with OpenBSD 7.2-current (GENERIC.MP) #846

2022-11-21 Thread Hrvoje Popovski

Hi all,

I've sysupgraded a 64-core box and I'm getting the kernel fault trap below:


OpenBSD 7.2-current (GENERIC.MP) #846: Sun Nov 20 09:43:16 MST 2022
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 549312065536 (523864MB)
avail mem = 532646780928 (507971MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
kernel: protection fault trap, code=0
Stopped at  memcpy+0x15:repe movsq  (%rsi),%es:(%rdi)
ddb{0}> trace
memcpy(fe007e5c2010,2,2c6c28baa7240e66,fe007e5c2000,2,ff000) at memcpy+0x15
pmap_randomize_level(fe007e7dafe0,3,2c6c28baa7241299,8272d000,fe007e7da000,fe0) at pmap_randomize_level+0x215
pmap_randomize(dc9b195a5a0185e0,0,0,0,0,0) at pmap_randomize+0x1ca
cpu_configure(bd8245a77341df16,0,0,8002d000,816fdfb0,82733f00) at cpu_configure+0x20
main(0,0,0,0,0,1) at main+0x3a3
end trace frame: 0x0, count: -5
ddb{0}>


ddb{0}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*0   0 -1  0  7 0x10200swapper


I will leave the box in ddb in case anything else is needed.

Re: pf panic

2022-10-04 Thread Hrvoje Popovski

On 29.8.2022. 20:01, Hrvoje Popovski wrote:
> On 9.8.2022. 21:32, Hrvoje Popovski wrote:
>> On 9.8.2022. 19:56, Alexandr Nedvedicky wrote:
>>> this is a NULL pointer dereference panic. I think we saw it a few months
>>> back. The patch below was applied to one of your test machines, if I remember
>>> correctly. Can you give it a try again to see if it helps?
>>>
>>> The change adds a mutex to the pf_state structure to protect references
>>> to the keys attached to the state.
>>>
>>> We also have to take into account the fact that pf_state_export() may be
>>> presented with a state whose keys got detached. Hence we have to
>>> skip such a state when doing the export. Therefore pf_state_export()
>>> indicates success or failure so the caller knows whether data was written
>>> (success), in which case we move to the next free slot in the output buffer,
>>> or nothing got written (failure), in which case the current slot is still free.
>>
>> Hi,
>>
>> This diff is applied to the firewall and I will monitor it.
>>
>> Thank you ...
>>
> 
> Hi,
> 
> After 20 days with this diff, the firewall seems stable. The problem is that last
> time the firewall was also up for a long time, and I'm not sure what triggered
> that panic. I will update that firewall to the latest snapshot, apply the
> diff and wait...
> 
> 

Hi,

After a month or so with this diff, the firewall hasn't panicked.
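The export rule from the quoted description (skip a state whose keys were detached, and advance the output slot only when something was actually written) can be illustrated with a small userland sketch. It assumes a hypothetical demo_state_export() that returns 0 when it wrote a record and -1 when the state's key is missing; it leaves out the per-state mutex the real diff adds and is not the pf code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types: a state whose key can be detached. */
struct demo_key { int addr; };
struct demo_state { const struct demo_key *key; };	/* key may be NULL */
struct demo_rec { int addr; };

/* Returns 0 if a record was written, -1 if the state has no key. */
int
demo_state_export(struct demo_rec *out, const struct demo_state *st)
{
	if (st->key == NULL)
		return -1;
	out->addr = st->key->addr;
	return 0;
}

/*
 * Fill buf with exported states. The output slot only advances on
 * success, so a skipped state leaves the current slot free for the
 * next one. Returns the number of records written.
 */
size_t
demo_export_all(struct demo_rec *buf, size_t nbuf,
    const struct demo_state *states, size_t n)
{
	size_t slot = 0;

	for (size_t i = 0; i < n && slot < nbuf; i++) {
		if (demo_state_export(&buf[slot], &states[i]) == 0)
			slot++;
	}
	return slot;
}
```

With three states where the middle one has lost its key, two records are written back to back and no half-filled slot is sent to the peer.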

Re: pf panic

2022-08-29 Thread Hrvoje Popovski

On 9.8.2022. 21:32, Hrvoje Popovski wrote:
> On 9.8.2022. 19:56, Alexandr Nedvedicky wrote:
>> this is a NULL pointer dereference panic. I think we saw it a few months
>> back. The patch below was applied to one of your test machines, if I remember
>> correctly. Can you give it a try again to see if it helps?
>>
>> The change adds a mutex to the pf_state structure to protect references
>> to the keys attached to the state.
>>
>> We also have to take into account the fact that pf_state_export() may be
>> presented with a state whose keys got detached. Hence we have to
>> skip such a state when doing the export. Therefore pf_state_export()
>> indicates success or failure so the caller knows whether data was written
>> (success), in which case we move to the next free slot in the output buffer,
>> or nothing got written (failure), in which case the current slot is still free.
> 
> Hi,
> 
> This diff is applied to the firewall and I will monitor it.
> 
> Thank you ...
> 

Hi,

After 20 days with this diff, the firewall seems stable. The problem is that last
time the firewall was also up for a long time, and I'm not sure what triggered
that panic. I will update that firewall to the latest snapshot, apply the
diff and wait...

Re: pflow - kernel: protection fault trap

2022-08-10 Thread Hrvoje Popovski

On 10.8.2022. 15:44, Vitaliy Makkoveev wrote:
> On Wed, Aug 10, 2022 at 12:52:06AM +0200, Hrvoje Popovski wrote:
>> On 9.8.2022. 22:22, Vitaliy Makkoveev wrote:
>>> Hi,
>>>
>>> The kernel lock within pflow_output_process() doesn't help because the
>>> following sosend() has sleep points. So, at least pflow_clone_destroy()
>>> should wait until pflow_output_process() has finished. We should use
>>> taskq_del_barrier(9) instead of task_del(9).
>>>
>>> Also, we need to unlink the dying pflow(4) interface from the stack before
>>> starting destruction.
>>>
>>> This diff should help. Please keep in mind that this diff is incomplete,
>>> because it doesn't fix the race between pflowioctl() and
>>> pflow_output_process(). This race is much more complicated, because we
>>> need to introduce a new lock to protect `so' and take it before calling
>>> sosend(), but sosend() takes the netlock, which is taken before
>>> pflowioctl() where we modify `so'. This introduces re-locking games in the
>>> pflowioctl() path, so I want to do this in a separate diff, since
>>> this potential panic was not triggered.
>>>
>> Hi,
>>
>> With this diff I'm getting this protection fault trap:
>>
> taskq_del_barrier(9) has a bug and doesn't work as expected. This diff
> uses taskq_barrier(9).
> 
> According to Hrvoje's private report, it fixes the problem.

Hi,

I was running

ifconfig pflow0 destroy
sleep 120
sh /etc/netstart pflow0
sleep 120

the whole night and the firewall didn't break.
Without this diff, if I run ifconfig pflow0 destroy while the firewall is
under load, the box gets a kernel fault trap immediately.

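The teardown order Vitaliy describes (unlink the interface, wait until a pflow_output_process() that is already running has finished, and only then destroy it) can be illustrated in userland with POSIX threads, where pthread_join() plays the role of the barrier. This is a hypothetical sketch: demo_pflow_create() and demo_pflow_destroy() are invented names and this is not the pflow(4) diff. It only shows why removing pending work without waiting for running work is not enough.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pflow(4) softc with an output worker. */
struct demo_pflow {
	pthread_t   worker;	/* plays the role of the queued task */
	atomic_int  dying;	/* set once destruction has started */
	atomic_long sent;	/* counts fake sosend() calls */
};

/* Worker loop: keeps "sending" until the object is marked dying. */
static void *
demo_output_process(void *arg)
{
	struct demo_pflow *sc = arg;

	while (!atomic_load(&sc->dying))
		atomic_fetch_add(&sc->sent, 1);
	return NULL;
}

struct demo_pflow *
demo_pflow_create(void)
{
	struct demo_pflow *sc = malloc(sizeof(*sc));

	if (sc == NULL)
		return NULL;
	atomic_init(&sc->dying, 0);
	atomic_init(&sc->sent, 0);
	if (pthread_create(&sc->worker, NULL, demo_output_process, sc) != 0) {
		free(sc);
		return NULL;
	}
	return sc;
}

/*
 * Tear down in the order described in the mail:
 * 1. unlink: mark the object dying so no new work starts;
 * 2. barrier: wait until a worker that is already running has returned;
 * 3. only then free the memory the worker was using.
 */
long
demo_pflow_destroy(struct demo_pflow *sc)
{
	long sent;

	atomic_store(&sc->dying, 1);
	pthread_join(sc->worker, NULL);
	sent = atomic_load(&sc->sent);
	free(sc);
	return sent;
}
```

If step 2 is dropped, free() can run while the worker is still inside its loop; that use-after-free is roughly the shape of the sblock()/sosend() faults in the traces in this thread.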
Re: pflow - kernel: protection fault trap

2022-08-10 Thread Hrvoje Popovski

On 10.8.2022. 0:49, Vitaliy Makkoveev wrote:
> That's strange, because after we the only timeout handlers can reschedule
> pflow_output_process to run, but they have no sleep points. However, the
> task handler is still running after taskq_del_barrier(9).
> 
> Does this help?

Hi,

This diff doesn't help. Here's the output:


r620-1# ifconfig pflow0 destroy
kernel: protection fault trap, code=0
Stopped at  sblock+0x35:movq0x8(%rax),%rax

ddb{4}> show panic
the kernel did not panic

ddb{4}> trace
sblock(fd83b34818e8,fd83b3481a10,1) at sblock+0x35
sosend(fd83b34818e8,fd80cd292d00,0,fd80a3b4e200,0,0) at
sosend+0x163
pflow_output_process(808ca000) at pflow_output_process+0x67
taskq_thread(80030100) at taskq_thread+0x100
end trace frame: 0x0, count: -4
ddb{4}>

ddb{4}> show reg
rdi   0xfd83b34818e8
rsi   0xfd83b3481a10
rbp   0x800022d66160
rbx0x501
rdx  0x1
rcx   0x8000e004
rax   0x12197a31cb9f19c7
r8   0x1
r90x821f4240rw_ops+0x10
r10   0x
r11   0xd76a5b1e376e
r120
r13  0x1
r14   0xfd83b3481a60
r15   0xfd83b34818e8
rip   0x8188cbf5sblock+0x35
cs   0x8
rflags   0x10246__ALIGN_SIZE+0xf246
rsp   0x800022d66110
ss  0x10
sblock+0x35:movq0x8(%rax),%rax
ddb{4}>

ddb{4}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
 57222   30708  52109  0  7 0x3ifconfig
 52109  165317  1  0  30x10008b  sigsusp   ksh
 49245  507782  1  0  30x100098  kqreadcron
  6217   82510  20690 95  3   0x1100092  kqreadsmtpd
 82506  376421  20690103  3   0x1100092  kqreadsmtpd
 82613  290075  20690 95  3   0x1100092  kqreadsmtpd
  4815  308602  20690 95  30x100092  kqreadsmtpd
 12941  472567  20690 95  3   0x1100092  kqreadsmtpd
 23744  467673  20690 95  3   0x1100092  kqreadsmtpd
 20690   84561  1  0  30x100080  kqreadsmtpd
 76380   94838  1  0  30x88  kqreadsshd
 86280  347923  1  0  30x100080  kqreadntpd
 14359  243801  59957 83  30x100092  kqreadntpd
 59957  263943  1 83  3   0x1100092  kqreadntpd
 52207  492049  48201 73  3   0x1100090  kqreadsyslogd
 48201  424791  1  0  30x100082  netio syslogd
 25023  493390  0  0  3 0x14200  bored smr
 49475  241893  0  0  3 0x14200  pgzerozerothread
 35733  465768  0  0  3 0x14200  aiodoned  aiodoned
 44819  211641  0  0  3 0x14200  syncerupdate
 12802  139258  0  0  3 0x14200  cleaner   cleaner
 77815   78998  0  0  3 0x14200  reaperreaper
 97772  253526  0  0  3 0x14200  pgdaemon  pagedaemon
 20567  420970  0  0  3 0x14200  usbtskusbtask
 81765  348189  0  0  3 0x14200  usbatsk   usbatsk
 58744  470980  0  0  3  0x40014200  acpi0 acpi0
 42832   77958  0  0  7  0x40014200idle5
 40468  474721  0  0  3  0x40014200idle4
 98228  394491  0  0  7  0x40014200idle3
 13842   58745  0  0  3  0x40014200idle2
 87447   45776  0  0  7  0x40014200idle1
 14520  516279  0  0  3 0x14200  bored sensors
 20057  421224  0  0  3 0x14200  netlock   softnet
 681204487  0  0  3 0x14200  netlock   softnet
*15557  167519  0  0  7 0x14200softnet
 57471  116257  0  0  3 0x14200  netlock   softnet
 21894  328074  0  0  3 0x14200  bored systqmp
 36959   61819  0  0  3 0x14200  bored systq
 29261  452739  0  0  3  0x40014200  bored softclock
 26163  383919  0  0  7  0x40014200idle0
 1   14140  0  0  30x82  wait  init
 0   0 -1  0  3 0x10200  scheduler swapper
ddb{4}>

ddb{4}> ps /o
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
  30708  57222  0 0x3  02  ifconfig
*167519  15557  0 0x14000  0x2004K softnet

ddb{4}> trace /t 0t30708
sleep_finish(800022e25200,1) at sleep_finish+0xfe
rw_enter(822c58d8,1) at rw_enter+0x1cb
if_detach(808ca000) at if_detach+0xda
pflow_clone_destroy(808ca000) at pflow_clone_destroy+0x1a0
if_clone_destroy(800022e253c0) at

Re: pflow - kernel: protection fault trap

2022-08-09 Thread Hrvoje Popovski

On 9.8.2022. 22:22, Vitaliy Makkoveev wrote:
> Hi,
> 
> The kernel lock within pflow_output_process() doesn't help because the
> following sosend() has sleep points. So, at least pflow_clone_destroy()
> should wait until pflow_output_process() has finished. We should use
> taskq_del_barrier(9) instead of task_del(9).
> 
> Also, we need to unlink the dying pflow(4) interface from the stack before
> starting destruction.
> 
> This diff should help. Please keep in mind that this diff is incomplete,
> because it doesn't fix the race between pflowioctl() and
> pflow_output_process(). This race is much more complicated, because we
> need to introduce a new lock to protect `so' and take it before calling
> sosend(), but sosend() takes the netlock, which is taken before
> pflowioctl() where we modify `so'. This introduces re-locking games in the
> pflowioctl() path, so I want to do this in a separate diff, since
> this potential panic was not triggered.
> 

Hi,

with this diff I'm getting this protection fault trap

r620-1# ifconfig pflow0 destroy
kernel: protection fault trap, code=0
Stopped at  sblock+0x35:    movq    0x8(%rax),%rax

ddb{0}> show panic
the kernel did not panic

ddb{0}> trace
sblock(fd842c34d8e8,fd842c34da10,1) at sblock+0x35
sosend(fd842c34d8e8,fd80cd292800,0,fd80a3f37c00,0,0) at sosend+0x163
pflow_output_process(808ca000) at pflow_output_process+0x67
taskq_thread(80030100) at taskq_thread+0x100
end trace frame: 0x0, count: -4
ddb{0}>

ddb{0}> show reg
rdi     0xfd842c34d8e8
rsi     0xfd842c34da10
rbp     0x800022d66710
rbx     0x501
rdx     0x1
rcx     0x8000ea84
rax     0x9f3ebe5199894262
r8      0x1
r9      0x821c7080    rw_ops+0x10
r10     0x
r11     0x6db1a912181c98f1
r12     0
r13     0x1
r14     0xfd842c34da60
r15     0xfd842c34d8e8
rip     0x81d71565    sblock+0x35
cs      0x8
rflags  0x10246    __ALIGN_SIZE+0xf246
rsp     0x800022d666c0
ss      0x10
sblock+0x35:    movq    0x8(%rax),%rax
ddb{0}>

ddb{0}> ps
   PID     TID   PPID  UID  S       FLAGS  WAIT      COMMAND
  1364  367790  19987  0  7  0x3                ifconfig
 19987  130981  1  0  3  0x10008b  sigsusp   ksh
 74340  115416  1  0  3  0x100098  kqread    cron
 68578  240636   2156 95  3   0x1100092  kqread    smtpd
 86507  443747   2156 103  3   0x1100092  kqread    smtpd
 47223  261838   2156 95  3   0x1100092  kqread    smtpd
 38121  503884   2156 95  3  0x100092  kqread    smtpd
 29539  133065   2156 95  3   0x1100092  kqread    smtpd
 83786  266601   2156 95  3   0x1100092  kqread    smtpd
  2156  411192  1  0  3  0x100080  kqread    smtpd
 62749   20828  1  0  3  0x88  kqread    sshd
 85488  424702  1  0  3  0x100080  kqread    ntpd
  4633  197093  51224 83  3  0x100092  kqread    ntpd
 51224  139274  1 83  7   0x1100012            ntpd
 19966  136109  61788 73  3   0x1100090  kqread    syslogd
 61788   27725  1  0  3  0x100082  netio     syslogd
 31851  123130  0  0  3 0x14200  bored     smr
 12870  490593  0  0  3 0x14200  pgzero    zerothread
 51010  283420  0  0  3 0x14200  aiodoned  aiodoned
 69180  131489  0  0  3 0x14200  syncer    update
 36711  165342  0  0  3 0x14200  cleaner   cleaner
 75263  504085  0  0  3 0x14200  reaper    reaper
 72069  133609  0  0  3 0x14200  pgdaemon  pagedaemon
 99378  234898  0  0  3 0x14200  usbtsk    usbtask
 30200  405105  0  0  3 0x14200  usbatsk   usbatsk
 96366  324880  0  0  3  0x40014200  acpi0     acpi0
 24969  140748  0  0  7  0x40014200            idle5
 95045  386153  0  0  3  0x40014200            idle4
 72849  289914  0  0  7  0x40014200            idle3
 49815  213569  0  0  3  0x40014200            idle2
 39848   84701  0  0  3  0x40014200            idle1
 43651  137149  0  0  7  0x40014200            sensors
 10764  419906  0  0  3 0x14200  netlock   softnet
 51829  300708  0  0  3 0x14200  netlock   softnet
*58674  303202  0  0  7 0x14200            softnet
 60899  100126  0  0  3 0x14200  netlock   softnet
 49625  511441  0  0  3 0x14200  bored     systqmp
  5435   16476  0  0  3 0x14200  bored     systq
  8069  217014  0  0  2  0x40014200
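
The ordering Vitaliy proposes above (unlink the dying interface first, then wait for a running pflow_output_process() instead of only cancelling a pending task, and only then tear down) can be illustrated with a small userland analogy. This is a sketch under stated assumptions, not kernel code: `struct fake_pflow`, `output_task()`, `fake_pflow_create()`, `fake_pflow_schedule()` and `fake_pflow_destroy()` are hypothetical names, a POSIX thread stands in for the taskq, and `pthread_join()` plays the role that taskq_del_barrier(9) plays in the proposed diff.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for a pflow(4) softc; not the kernel structure. */
struct fake_pflow {
	pthread_mutex_t mtx;
	bool            linked;     /* still reachable, new work allowed */
	bool            task_live;  /* task started and not yet joined */
	pthread_t       task;
	int             sent;       /* "packets" sent by the task */
};

/* Plays the part of pflow_output_process(); it may sleep like sosend(). */
static void *
output_task(void *arg)
{
	struct fake_pflow *sc = arg;
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 10000000L };  /* 10 ms */

	nanosleep(&ts, NULL);
	pthread_mutex_lock(&sc->mtx);
	sc->sent++;
	pthread_mutex_unlock(&sc->mtx);
	return NULL;
}

static struct fake_pflow *
fake_pflow_create(void)
{
	struct fake_pflow *sc = calloc(1, sizeof(*sc));

	if (sc == NULL)
		return NULL;
	pthread_mutex_init(&sc->mtx, NULL);
	sc->linked = true;
	return sc;
}

/* Start the output task unless the interface is going away. */
static bool
fake_pflow_schedule(struct fake_pflow *sc)
{
	bool started = false;

	assert(sc != NULL);
	pthread_mutex_lock(&sc->mtx);
	if (sc->linked && !sc->task_live &&
	    pthread_create(&sc->task, NULL, output_task, sc) == 0) {
		sc->task_live = true;
		started = true;
	}
	pthread_mutex_unlock(&sc->mtx);
	return started;
}

/*
 * Unlink first so no new task can start, then wait for an in-flight
 * task (the barrier), and only then free.  Returns completed sends.
 */
static int
fake_pflow_destroy(struct fake_pflow *sc)
{
	bool live;
	int sent;

	assert(sc != NULL);
	pthread_mutex_lock(&sc->mtx);
	sc->linked = false;                      /* 1: unlink */
	live = sc->task_live;
	pthread_mutex_unlock(&sc->mtx);

	if (live)
		pthread_join(sc->task, NULL);    /* 2: barrier */

	sent = sc->sent;
	pthread_mutex_destroy(&sc->mtx);
	free(sc);                                /* 3: nothing references sc */
	return sent;
}
```

In this sketch, replacing the join with a plain "cancel if still pending" step (the task_del(9) analogue) would let free() run while the task is still inside its sleeping section, which is the kind of race described in the quoted message.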

Re: pf panic

2022-08-09 Thread Hrvoje Popovski

On 9.8.2022. 19:56, Alexandr Nedvedicky wrote:
> this is a NULL pointer dereference panic. I think we've seen it a few months
> back. the patch below was applied to one of your test machines, if I remember
> correctly. can you give it a try again to see if it helps?
> 
> the change adds a mutex to the pf_state structure to protect references
> to the keys attached to a state.
> 
> we also have to take into account the fact that pf_state_export() may be
> presented with a state whose keys got detached. Hence we have to
> skip such a state when doing the export. Therefore pf_state_export()
> returns a status telling the caller whether data were written (success),
> in which case we move to the next free slot in the output buffer, or nothing
> got written (failure), in which case the current slot in the output buffer
> is still free.

Hi,

this diff is applied to the firewall and I will monitor it.

Thank you ...
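
The mechanism described in the quoted message (a per-state mutex guarding the key references, and an export routine that reports failure so the caller does not advance its output slot) can be modeled in userland C. This is a simplified sketch under stated assumptions, not the actual pf(4)/pfsync(4) code: `struct fake_state`, `struct fake_key`, `struct fake_rec`, `fake_state_init()`, `fake_state_detach()`, `fake_state_export()` and `fake_sendout()` are hypothetical, and a pthread mutex stands in for pf_state::mtx.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real pf structures. */
struct fake_key {
	int id;
};

struct fake_state {
	pthread_mutex_t  mtx;     /* protects key[] (role of pf_state::mtx) */
	struct fake_key *key[2];  /* wire and stack keys; NULL once detached */
};

struct fake_rec {
	int wire_id;
	int stack_id;
};

/*
 * Every path that creates a state must initialize the mutex; the
 * witness report later in this thread came from an import path that
 * did not.
 */
static void
fake_state_init(struct fake_state *st, struct fake_key *wire,
    struct fake_key *stack)
{
	pthread_mutex_init(&st->mtx, NULL);
	st->key[0] = wire;
	st->key[1] = stack;
}

/* Purge side: invalidate the key references under the state mutex. */
static void
fake_state_detach(struct fake_state *st)
{
	pthread_mutex_lock(&st->mtx);
	st->key[0] = NULL;
	st->key[1] = NULL;
	pthread_mutex_unlock(&st->mtx);
}

/* Export: 0 if *out was written, -1 if the keys were already detached. */
static int
fake_state_export(struct fake_rec *out, struct fake_state *st)
{
	int rv = -1;

	pthread_mutex_lock(&st->mtx);
	if (st->key[0] != NULL && st->key[1] != NULL) {
		out->wire_id = st->key[0]->id;
		out->stack_id = st->key[1]->id;
		rv = 0;
	}
	pthread_mutex_unlock(&st->mtx);
	return rv;
}

/* Sendout: move to the next output slot only when something was written. */
static size_t
fake_sendout(struct fake_rec *buf, size_t cap, struct fake_state **q,
    size_t n)
{
	size_t slot = 0;

	assert(buf != NULL || cap == 0);
	for (size_t i = 0; i < n && slot < cap; i++) {
		if (fake_state_export(&buf[slot], q[i]) == 0)
			slot++;         /* slot consumed */
		/* on failure buf[slot] is still free; skip this state */
	}
	return slot;
}
```

In this sketch a detached state is simply skipped and the next exported state lands in the same slot, so the output buffer has no holes.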

pflow - kernel: protection fault trap

2022-08-09 Thread Hrvoje Popovski

Hi all,

when sending a lot of traffic over the firewall with pflow, if I run
ifconfig pflow0 destroy I get kernel: protection fault trap.


This is latest snapshot:
OpenBSD 7.2-beta (GENERIC.MP) #677: Mon Aug  8 18:58:49 MDT 2022
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP


r620-1# ifconfig pflow0 destroy
kernel: protection fault trap, code=0
Stopped at  in_nam2sin+0x29:    cmpb    $0x2,0x1(%rdx)

ddb{2}> show panic
the kernel did not panic

ddb{2}> trace
in_nam2sin(fd80cd292b00,800022d66028) at in_nam2sin+0x29
udp_output(fd83b2c1ba00,fd80a3abf800,fd80cd292b00,0) at udp_output+0xcc
sosend(fd83b2c1c558,fd80cd292b00,0,fd80a3abf800,0,0) at sosend+0x385
pflow_output_process(808ca000) at pflow_output_process+0x67
taskq_thread(80030100) at taskq_thread+0x100
end trace frame: 0x0, count: -5
ddb{2}>

ddb{2}> show reg
rdi     0xfd80cd292b00
rsi     0x800022d66028
rbp     0x800022d65ff0
rbx     0
rdx     0x4a1336b5a404c64e
rcx     0xce2fdf4a
rax     0x2f
r8      0x5b8
r9      0
r10     0x
r11     0x3b190b40737cbe31
r12     0xfd80a3abf800
r13     0x28
r14     0x5b8
r15     0xfd83b2c1ba00
rip     0x81e494f9    in_nam2sin+0x29
cs      0x8
rflags  0x10286    __ALIGN_SIZE+0xf286
rsp     0x800022d65fe0
ss      0x10
in_nam2sin+0x29:    cmpb    $0x2,0x1(%rdx)
ddb{2}>

ddb{2}> ps
   PID     TID   PPID  UID  S       FLAGS  WAIT      COMMAND
 97584  114469  17291  0  7  0x3                ifconfig
 17291  147440  1  0  3  0x10008b  sigsusp   ksh
 61667  371523  1  0  3  0x100098  kqread    cron
 62419  388523  46193 95  3   0x1100092  kqread    smtpd
 43433  312290  46193 103  3   0x1100092  kqread    smtpd
 45389  509524  46193 95  3   0x1100092  kqread    smtpd
 68113  112694  46193 95  3  0x100092  kqread    smtpd
 12544   45817  46193 95  3   0x1100092  kqread    smtpd
 35310  168879  46193 95  3   0x1100092  kqread    smtpd
 46193  474443  1  0  3  0x100080  kqread    smtpd
 66976  365265  1  0  3  0x88  kqread    sshd
 45262  438619  1  0  3  0x100080  kqread    ntpd
 23411  270550  91687 83  3  0x100092  kqread    ntpd
 91687  425806  1 83  3   0x1100092  kqread    ntpd
 87999  345906 43 73  3   0x1100090  kqread    syslogd
    43  197785  1  0  3  0x100082  netio     syslogd
 53263  391295  0  0  3 0x14200  bored     smr
 53027  160140  0  0  3 0x14200  pgzero    zerothread
 93436  395928  0  0  3 0x14200  aiodoned  aiodoned
  6422  376977  0  0  3 0x14200  syncer    update
 12666  145796  0  0  3 0x14200  cleaner   cleaner
  5339  104878  0  0  3 0x14200  reaper    reaper
 18437  379590  0  0  3 0x14200  pgdaemon  pagedaemon
 95609   15815  0  0  3 0x14200  usbtsk    usbtask
 34720  188775  0  0  3 0x14200  usbatsk   usbatsk
 28283  197132  0  0  3  0x40014200  acpi0     acpi0
 32308  129369  0  0  7  0x40014200            idle5
 91423  465223  0  0  7  0x40014200            idle4
 82830  201537  0  0  7  0x40014200            idle3
 72849  294469  0  0  3  0x40014200            idle2
 82591  160582  0  0  3  0x40014200            idle1
 19010   51380  0  0  3 0x14200  bored     sensors
 46387  318985  0  0  3 0x14200  netlock   softnet
 72266  368671  0  0  3 0x14200  netlock   softnet
*31740  217354  0  0  7 0x14200            softnet
 63482  377439  0  0  3 0x14200  netlock   softnet
 66088   38816  0  0  3 0x14200  bored     systqmp
 72341  421031  0  0  3 0x14200  bored     systq
 43727   54109  0  0  3  0x40014200  bored     softclock
  4948  138264  0  0  7  0x40014200            idle0
     1  135757  0  0  3  0x82  wait      init
     0       0 -1  0  3 0x10200  scheduler swapper

ddb{2}> ps /o
    TID    PID    UID  PRFLAGS  PFLAGS  CPU  COMMAND
 114469  97584  0 0x3  0  1  ifconfig
*217354  31740  0 0x14000  0x200  2K softnet
ddb{2}>

ddb{2}> trace /t 0t114469
sleep_finish(800022e258d0,1) at sleep_finish+0xfe
rw_enter(822b5b90,1) at rw_enter+0x1cb
soclose(fd83b2c1c558,80) at soclose+0x27

pf panic

2022-08-09 Thread Hrvoje Popovski

Hi all,

I'm running
OpenBSD 7.2-beta (GENERIC.MP) #651: Tue Jul 26 23:11:26 MDT 2022
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

on a production firewall and it was stable for a few weeks. The firewall
panicked today and I will sysupgrade it, but this panic message may be
interesting, so I'm sending it here.


bcbnfw1# uvm_fault(0x823a1a20, 0x0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  pf_state_export+0x38:   movq    0(%rax),%rcx
    TID    PID    UID  PRFLAGS  PFLAGS  CPU  COMMAND
 309438  83954  0 0x14000  0x200  1  softnet
 486408  53515  0 0x14000  0x200  3  softnet
* 80122  54608  0 0x14000  0x200  2  softnet
pf_state_export(fd806152f9dc,fd8664eb12b0) at pf_state_export+0x38
pfsync_sendout() at pfsync_sendout+0x5e4
pfsync_update_state(fd8728968d40) at pfsync_update_state+0x15b
pf_test(2,1,80bbb000,800020c336d8) at pf_test+0x117a
ip_input_if(800020c336d8,800020c336e4,4,0,80bbb000) at ip_input_if+0xcd
ipv4_input(80bbb000,fd80661d5300) at ipv4_input+0x39
ether_input(80bbb000,fd80661d5300) at ether_input+0x3b1
carp_input(80bd2000,fd80661d5300,5e000101) at carp_input+0x196
ether_input(80bd2000,fd80661d5300) at ether_input+0x1d9
vlan_input(80b9d000,fd80661d5300,800020c3390c) at vlan_input+0x23d
ether_input(80b9d000,fd80661d5300) at ether_input+0x85
if_input_process(8048b048,800020c339a8) at if_input_process+0x6f
ifiq_process(8048ea00) at ifiq_process+0x69
taskq_thread(80035080) at taskq_thread+0x100
end trace frame: 0x0, count: 1
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.


ddb{2}> show reg
rdi     0xfd806152fae4
rsi     0
rbp     0x800020c33340
rbx     0x19c
rdx     0x4
rcx     0
rax     0
r8      0x104
r9      0x7d788a8c5153bdc
r10     0x92a5ce4f38be8823
r11     0xfd806152f9dc
r12     0xfd8664eb12b0
r13     0
r14     0xfd806152f9dc
r15     0xfd8664eb12b0
rip     0x81387678    pf_state_export+0x38
cs      0x8
rflags  0x10246    __ALIGN_SIZE+0xf246
rsp     0x800020c33300
ss      0x10
pf_state_export+0x38:   movq    0(%rax),%rcx



ddb{2}> ps
   PID     TID   PPID  UID  S       FLAGS  WAIT      COMMAND
  5515  239236  1  0  3  0x100083  ttyin     ksh
 46351  180485  1  0  3  0x100098  kqread    cron
 670259485  68290720  3   0x190  kqread    lldpd
 68290  377807  1  0  3  0x80  netio     lldpd
 74149   64334  55708 95  3   0x1100092  kqread    smtpd
 77756  107926  55708 103  3   0x1100092  kqread    smtpd
 96682  419793  55708 95  3   0x1100092  kqread    smtpd
 95361  134736  55708 95  3  0x100092  kqread    smtpd
 17548   16395  55708 95  3   0x1100092  kqread    smtpd
  9493  444926  55708 95  3   0x1100092  kqread    smtpd
 55708  424253  1  0  3  0x100080  kqread    smtpd
  3986  219916  1 77  3   0x1100090  kqread    dhcpd
 29833  112637  1  0  3  0x100080  kqread    snmpd
 99415  374613  1 91  3   0x192  kqread    snmpd
 94378  355183  1  0  3  0x88  kqread    sshd
 95447  307241  1  0  3  0x100080  kqread    ntpd
 55599  503746   7240 83  3  0x100092  kqread    ntpd
  7240  502064  1 83  3   0x1100092  kqread    ntpd
 96225  207770  58673 74  3   0x1100092  bpf       pflogd
 58673  266584  1  0  3  0x80  netio     pflogd
 56880  475875  37876 73  3   0x1100090  kqread    syslogd
 37876  114860  1  0  3  0x100082  netio     syslogd
 77675  225215  0  0  3 0x14200  bored     smr
 24420   32069  0  0  3 0x14200  pgzero    zerothread
 40785  164275  0  0  3 0x14200  aiodoned  aiodoned
  3250   15093  0  0  3 0x14200  syncer    update
 71159  338127  0  0  3 0x14200  cleaner   cleaner
 45614  132741  0  0  3 0x14200  reaper    reaper
 17965  161362  0  0  3 0x14200  pgdaemon  pagedaemon
 70681   34263  0  0  3 0x14200  usbtsk    usbtask
 30654  291134  0  0  3 0x14200  usbatsk   usbatsk
 22566  258438  0  0  3  0x40014200  acpi0     acpi0
 65828   69579  0  0  7  0x40014200            idle5
 61839   98119  0  0  7  0x40014200            idle4

Re: [External] : Re: PF hangs when doing NAT round-robin

2022-07-18 Thread Hrvoje Popovski

On 18.7.2022. 10:40, Alexandr Nedvedicky wrote:
> hello,
> 
> On Sun, Jul 17, 2022 at 11:52:21PM +0200, Hrvoje Popovski wrote:
>> On 17.7.2022. 20:19, Alexandr Nedvedicky wrote:
>>> So in case 49/27 we are supposed to be selecting addresses:
>>> 49.0.0.1, 49.0.0.2, ..., 49.0.0.30, 49.0.0.1
>>> we need to make sure selection mechanism skips network
>>> address (49.0.0.0) and network broadcast address (49.0.0.31).
>>
>> Hi,
>>
>> Am I understanding you correctly? If doing NAT to some route, let's say
>> a /30 (4 addresses), with this diff will I be doing NAT to only 2 addresses?
> 
> yes. I believe this should be correct, but I would like to
> get it confirmed by someone who is stronger in network protocols.
> 
> if we have a /30 prefix then the hosts we can address are:
>   .1
>   .2
> the host part .3 should be the network broadcast. let's assume
> we have something like 192.168.1.8/30; then the network
> broadcast address will be 192.168.1.11
> 
> to be honest I'm not sure if we can assign the network address
> 192.168.1.8 to any host in that network.


In my carp/pfsync setups I have a /30 route for NAT and I would like to
NAT to all 4 addresses. If that's not possible, is the right way to do NAT
to create a table and list the IPs there one by one?

Re: PF hangs when doing NAT round-robin

2022-07-17 Thread Hrvoje Popovski

On 17.7.2022. 20:19, Alexandr Nedvedicky wrote:
> So in case 49/27 we are supposed to be selecting addresses:
> 49.0.0.1, 49.0.0.2, ..., 49.0.0.30, 49.0.0.1
> we need to make sure selection mechanism skips network
> address (49.0.0.0) and network broadcast address (49.0.0.31).
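
The selection rule in the quoted text (skip the network and broadcast addresses of the prefix and wrap from the last host back to the first) can be checked with a small C sketch. It is a hypothetical illustration, not pf's address-pool code: `usable_hosts()`, `rr_next()`, `struct host_range` and the `IPV4()` macro are invented for this example, addresses are kept in host byte order, and only prefixes up to /30 are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Build a host-byte-order IPv4 address (hypothetical helper). */
#define IPV4(a, b, c, d) \
	((uint32_t)(a) << 24 | (uint32_t)(b) << 16 | \
	 (uint32_t)(c) << 8 | (uint32_t)(d))

struct host_range {
	uint32_t first;   /* network address + 1 */
	uint32_t last;    /* broadcast address - 1 */
};

/* Usable hosts of addr/plen; this sketch only handles prefixes up to /30. */
static struct host_range
usable_hosts(uint32_t addr, unsigned plen)
{
	uint32_t mask, net, bcast;
	struct host_range r;

	assert(plen <= 30);
	mask = plen == 0 ? 0 : UINT32_MAX << (32 - plen);
	net = addr & mask;
	bcast = net | ~mask;
	r.first = net + 1;
	r.last = bcast - 1;
	return r;
}

/* Round-robin step: next usable host after cur, wrapping last -> first. */
static uint32_t
rr_next(struct host_range r, uint32_t cur)
{
	if (cur < r.first || cur >= r.last)
		return r.first;
	return cur + 1;
}
```

For 49.0.0.0/27 the sketch yields 49.0.0.1 through 49.0.0.30 and then wraps back to 49.0.0.1; for 192.168.1.8/30 it yields only 192.168.1.9 and 192.168.1.10, which is the two-address result asked about in the replies.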

Hi,

Am I understanding you correctly? If doing NAT to some route, let's say
a /30 (4 addresses), with this diff will I be doing NAT to only 2 addresses?

ure - ure0: usb errors on rx: IOERROR

2022-07-11 Thread Hrvoje Popovski

Hi all,

I have a Supermicro server with a USB
ure0: RTL8153 (0x5c30), address 00:e0:4b:68:84:de
used for ssh and management.

When doing ifconfig ure0 down / up I always get this error

smc24# ifconfig ure0 down
smc24# ifconfig ure0 up
smc24# ure0: usb errors on rx: IOERROR
ure0: usb error on tx: IOERROR
ure0: usb error on tx: IN_PROGRESS
ure0: usb error on tx: TIMEOUT
ure0: usb error on tx: IN_PROGRESS
usb_insert_transfer: xfer=0xfd904e5cc7a8 not free


after that ure0 is unusable and I need to reboot the server, or maybe just
remove the USB device and attach it again, but the server is not near me.

I've compiled a kernel with
option  URE_DEBUG
option  USB_DEBUG
option  UHUB_DEBUG

ure0: flags=8807 mtu 1500
lladdr 00:e0:4b:68:84:de
index 11 priority 0 llprio 3
groups: egress
media: Ethernet autoselect (1000baseT full-duplex)
status: active
inet X netmask 0xffe0 broadcast X

kstat
ure0:0:rxq:0
 packets: 554 packets
   bytes: 45661 bytes
  qdrops: 0 packets
  errors: 0 packets
qlen: 0 packets
ure0:0:txq:0
 packets: 144 packets
   bytes: 10962 bytes
  qdrops: 0 packets
  errors: 0 packets
qlen: 0 packets
 maxqlen: 256 packets
 oactive: false


dmesg

smc24$ dmesg
OpenBSD 7.1-current (GENERIC.MP) #5: Mon Jul 11 18:29:35 CEST 2022
r...@smc24.srce.hr:/sys/arch/amd64/compile/GENERIC.MP
real mem = 68497002496 (65323MB)
avail mem = 66403713024 (63327MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.3 @ 0xa9d1c000 (71 entries)
bios0: vendor American Megatrends Inc. version "2.3" date 10/20/2021
bios0: Supermicro AS -1114S-WTRT
acpi0 at bios0: ACPI 6.0
acpi0: sleep states S0 S5
acpi0: tables DSDT FACP SSDT SPMI SSDT FIDT MCFG SSDT SSDT BERT HPET
IVRS PCCT SSDT CRAT CDIT SSDT WSMT APIC ERST HEST
acpi0: wakeup devices B000(S3) C000(S3) B010(S3) C010(S3) B030(S3)
C030(S3) B020(S3) C020(S3) B100(S3) C100(S3) B110(S3) C110(S3) B130(S3)
C130(S3) B120(S3) C120(S3)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpimcfg0 at acpi0
acpimcfg0: addr 0xe000, bus 0-255
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD EPYC 7413 24-Core Processor, 2650.33 MHz, 19-01-01
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,INVPCID,PQM,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA,UMIP,PKU,IBPB,IBRS,STIBP,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 512KB
64b/line 8-way L2 cache, 32MB 64b/line 16-way L3 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 100MHz
cpu0: mwait min=64, max=64, C-substates=1.1, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD EPYC 7413 24-Core Processor, 2650.01 MHz, 19-01-01
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,INVPCID,PQM,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA,UMIP,PKU,IBPB,IBRS,STIBP,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu1: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 512KB
64b/line 8-way L2 cache, 32MB 64b/line 16-way L3 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD EPYC 7413 24-Core Processor, 2650.00 MHz, 19-01-01
cpu2:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,INVPCID,PQM,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA,UMIP,PKU,IBPB,IBRS,STIBP,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu2: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 512KB
64b/line 8-way L2 cache, 32MB 64b/line 16-way L3 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 3 (application processor)
cpu3: AMD EPYC 7413 24-Core Processor, 2650.00 MHz, 19-01-01
cpu3:

PF hangs when doing NAT round-robin

2022-07-08 Thread Hrvoje Popovski

Hi all,

Chris Cappuccio kindly asked me if I can reproduce his PF NAT problem in
the lab.

His mail:
On my NAT cluster I recently switched from nat-to source-hash $key to
nat-to round-robin and started getting hangs. The boxes would hang where
the network would stop responding and one core would be at 100% with
softnet.
Commands at the console would hang like ifconfig or even reboot. I'm
curious if you might be able to reproduce this in your own testing. It
seems to take around 24 hours in my case but I also have pfsync and
pflow turned on.



I've managed to reproduce his problem; the pf.conf with which I can
trigger the hang is

set skip on { lo ix2 }
set limit states 500
match out on ix1 nat-to 49/27 round-robin
#match out on ix1 nat-to 49/27
block
pass

I'm sending traffic from a host connected to ix0 to a host connected to ix1,
and that host is the default gateway.
Traffic is 100 Kpps (very low) of UDP from random 16/16 port 9 to random
48/28 port 9, and I'm sniffing traffic on a Linux port connected to ix1.
The box behaves exactly as Chris described.

vmstat, top and ddb output are inline and in the attachment.


After 10 seconds of traffic the box stops doing NAT or forwarding traffic,
and at that point I immediately stop the traffic.

immediately after traffic is stopped
r620-1# vmstat -m | egrep "Name|pf"
NameSize Requests FailInUse Pgreq Pgrel Npage Hiwat Minpg
Maxpg Idle
pfrule  1360303 1 0 1 1 0
 80
pfstate  336   4614590   461398 38450 0 38450 38450 0
 80
pfstkey  120   6921840   692097 20973 0 20973 20973 0
 80
pfstitem  24   6921840   692097  4170 0  4170  4170 0
 80
pfruleitem16   2307260   230700   927 0   927   927 0
 80
pfosfpen 112  7140  71421 02121 0
 80
pfosfp40  7140  423 5 0 5 5 0
 80


after 10 minutes
r620-1# vmstat -m | egrep "Name|pf"
NameSize Requests FailInUse Pgreq Pgrel Npage Hiwat Minpg
Maxpg Idle
pfrule  1360303 1 0 1 1 0
 80
pfstate  336   4614590   461398 38450 0 38450 38450 0
 80
pfstkey  120   6921840   692097 20973 0 20973 20973 0
 80
pfstitem  24   6921840   692097  4170 0  4170  4170 0
 80
pfruleitem16   2307260   230700   927 0   927   927 0
 80
pfosfpen 112  7140  71421 02121 0
 80
pfosfp40  7140  423 5 0 5 5 0
 80

It seems that pf states are never cleared, and what I can see with
tcpdump on the host connected to ix1 is that pf NATs only to 49.0.0.0.


immediately after traffic is stopped
  PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
43797   410290  54    0    0K  900K onproc/3  -         0:03 91.21% softnet
40810   512893  10    0    0K  900K idle      pf_lock   0:04  0.88% softnet
82004   472857  10    0    0K  900K idle      pf_lock   0:03  0.68% softnet
 7921   447270  10    0    0K  900K idle      netlock   0:02  0.24% softnet


after 10 minutes
  PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
43797   410290  64    0    0K  900K onproc/3  -         0:03 99.02% softnet
17097   135479 -22    0    0K  900K sleep/5   -        14:38  0.00% idle5
 8526   296669  28    0    0K  900K onproc/0  -        14:35  0.00% idle0
66780   349165  28    0    0K  900K onproc/4  -        14:34  0.00% idle4
26781   248132  28    0    0K  900K onproc/1  -        14:33  0.00% idle1
21078   476731  28    0    0K  900K onproc/2  -        14:04  0.00% idle2
39533   504239 -22    0    0K  900K sleep/3   -        11:47  0.00% idle3
40810   512893  10    0    0K  900K idle      pf_lock   0:04  0.00% softnet
82004   472857  10    0    0K  900K idle      pf_lock   0:03  0.00% softnet
 7921   447270  10    0    0K  900K idle      netlock   0:02  0.00% softnet


at this point, if one runs ifconfig, pfctl -vsi or some other
network command, that command hangs and the box needs to be rebooted from
iDRAC, or you can drop to ddb :)

r620-1# Stopped at  db_enter+0x10:  popq    %rbp
ddb{0}> trace
db_enter() at db_enter+0x10
comintr(80082000) at comintr+0x2de
intr_handler(800022d4d620,8007a080) at intr_handler+0x6e
Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x18f
acpicpu_idle() at acpicpu_idle+0x203
sched_idle(822a4ff0) at sched_idle+0x280
end trace frame: 0x0, count: -6
ddb{0}>


ddb{0}> show reg
rdi     0x2f8
rsi     0
rbp     0x800022d4d550
rbx     0x82252bf9    __kernel_virt_to_phys+0x2252bf9
rdx     0x2f8
rcx     0x286
rax     0x82252b00    kstat_pv_tree_RBT_INFO+0x10
r8      0x82364040    w_locklistdata+0x330
r9

Re: bnxt panic

2022-06-27 Thread Hrvoje Popovski

On 17.3.2022. 21:31, Alexander Bluhm wrote:
> On Thu, Mar 17, 2022 at 01:01:11AM +0100, Hrvoje Popovski wrote:
>> On 16.3.2022. 20:00, Hrvoje Popovski wrote:
>>> Hi all,
>>>
>>> While the OpenBSD box is under pressure, if at that moment I run ifconfig
>>> bnxt0 down I get a panic... it's not every time, and it's that easy to
>>> trigger the panic
>>>
>>> I'm sending traffic over ix interfaces and bnxt is for ssh and nothing
>>> else.
>>>
>>> I've compiled a kernel with "option BNXT_DEBUG" and put debug in
>>> hostname.bnxt0, but I didn't see any log regarding bnxt interfaces.
>>>
>>> I will try to trigger the panic a few more times and will post them here.
>>
>> this is the same panic, but with a snapshot kernel without debug options
>>
>> uvm_fault(0xfd904e3a9440, 0x0, 0, 1) -> e
>> kernel: page fault trap, code=0
>> Stopped at  bnxt_intr+0x195:    movq    0(%r14,%r12,1),%rbx
>>     TID    PID    UID  PRFLAGS  PFLAGS  CPU  COMMAND
>> *465591  26407  0 0x3  0  0K ifconfig
>> bnxt_intr(802bc7c0) at bnxt_intr+0x195
>> intr_handler(800027d3d7b0,80269880) at intr_handler+0x6e
>> Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
>> Xspllower() at Xspllower+0x19
>> softintr_dispatch(0) at softintr_dispatch+0xdc
>> Xsoftclock() at Xsoftclock+0x1f
>> bnxt_ioctl(802bc048,80206910,800027d3dae0) at bnxt_ioctl+0x165
>> ifioctl(fd8e5a4f13a8,80206910,800027d3dae0,800027d8da50) at ifioctl+0x92b
>> soo_ioctl(fd904cc0b2e8,80206910,800027d3dae0,800027d8da50) at soo_ioctl+0x161
>> sys_ioctl(800027d8da50,800027d3dbf0,800027d3dc40) at sys_ioctl+0x2c4
>> syscall(800027d3dcb0) at syscall+0x374
>> Xsyscall() at Xsyscall+0x128
>> end of kernel
>> end trace frame: 0x7f7faf30, count: 3
>> https://www.openbsd.org/ddb.html describes the minimum info required in
>> bug reports.  Insufficient info makes it difficult to find and fix bugs.
> 
> I don't have the device and don't know the code.  But other drivers
> don't process rx and tx interrupts when the interface is not running.
> 
> Maybe this helps.  dlg@ and jmatthew@ should know better than me.
> 
> bluhm
> 

Hi all,

is it a good time to commit this diff?

Thank you




> Index: dev/pci/if_bnxt.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/pci/if_bnxt.c,v
> retrieving revision 1.36
> diff -u -p -r1.36 if_bnxt.c
> --- dev/pci/if_bnxt.c 14 Mar 2022 23:41:42 -  1.36
> +++ dev/pci/if_bnxt.c 17 Mar 2022 20:26:49 -
> @@ -1543,6 +1543,7 @@ bnxt_intr(void *xq)
>  {
>   struct bnxt_queue *q = (struct bnxt_queue *)xq;
>   struct bnxt_softc *sc = q->q_sc;
> + struct ifnet *ifp = &sc->sc_ac.ac_if;
>   struct bnxt_cp_ring *cpr = &q->q_cp;
>   struct bnxt_rx_queue *rx = &q->q_rx;
>   struct bnxt_tx_queue *tx = &q->q_tx;
> @@ -1565,10 +1566,13 @@ bnxt_intr(void *xq)
>   bnxt_handle_async_event(sc, cmpl);
>   break;
>   case CMPL_BASE_TYPE_RX_L2:
> - rollback = bnxt_rx(sc, rx, cpr, , , , cmpl);
> + if (ISSET(ifp->if_flags, IFF_RUNNING))
> + rollback = bnxt_rx(sc, rx, cpr, , ,
> + , cmpl);
>   break;
>   case CMPL_BASE_TYPE_TX_L2:
> - bnxt_txeof(sc, tx, , cmpl);
> + if (ISSET(ifp->if_flags, IFF_RUNNING))
> + bnxt_txeof(sc, tx, , cmpl);
>   break;
>   default:
>   printf("%s: unexpected completion type %u\n",
>
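
bluhm's point above, that other drivers don't process RX and TX completions while the interface is not running, can be modeled in a few lines. This is a hypothetical userland sketch, not if_bnxt.c: `enum fake_cmpl`, `struct fake_if` and `fake_intr()` are invented, and the `running` field stands in for `ISSET(ifp->if_flags, IFF_RUNNING)`. In this sketch every completion is still consumed when the interface is down; only the RX/TX work is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion types, loosely modeled on the quoted diff. */
enum fake_cmpl {
	FAKE_CMPL_ASYNC,
	FAKE_CMPL_RX,
	FAKE_CMPL_TX
};

struct fake_if {
	bool   running;     /* stands in for ISSET(ifp->if_flags, IFF_RUNNING) */
	size_t rx_done;
	size_t tx_done;
	size_t async_done;
};

/*
 * Consume every completion on the ring, but only do RX/TX work while
 * the interface is running.  Returns the number of completions consumed.
 */
static size_t
fake_intr(struct fake_if *ifp, const enum fake_cmpl *ring, size_t n)
{
	size_t i;

	assert(ring != NULL || n == 0);
	for (i = 0; i < n; i++) {
		switch (ring[i]) {
		case FAKE_CMPL_ASYNC:
			ifp->async_done++;
			break;
		case FAKE_CMPL_RX:
			if (ifp->running)
				ifp->rx_done++;
			break;
		case FAKE_CMPL_TX:
			if (ifp->running)
				ifp->tx_done++;
			break;
		}
	}
	return i;
}
```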

Re: pf panic with clean snapshot (GENERIC.MP) #570

2022-06-12 Thread Hrvoje Popovski

On 8.6.2022. 7:33, Hrvoje Popovski wrote:
> On 8.6.2022. 0:42, Alexandr Nedvedicky wrote:
>> Hello Hrvoje,
>>
>> 
>>> Hi,
>>>
>>> while booting with this diff I got this log:
>>>
>>> starting early daemons: syslogd pflogd ntpd
>>> witness: lock_object uninitialized: 0xfd8785c81a90
>>> Starting stack trace...
>>> witness_checkorder(fd8785c81a90,9,0) at witness_checkorder+0xad
>>> mtx_enter(fd8785c81a80) at mtx_enter+0x34
>>> pf_remove_state(fd8785c81988) at pf_remove_state+0x1da
>>> pfsync_in_del_c(fd80028977b0,c,2,2) at pfsync_in_del_c+0x9f
>>> pfsync_input(800020b056e8,800020b056f4,f0,2) at pfsync_input+0x33c
>>> ip_deliver(800020b056e8,800020b056f4,f0,2) at ip_deliver+0x103
>>> ip_local(800020b056e8,800020b056f4,fe007fff0220,0) at ip_local+0x1b7
>>> ipintr() at ipintr+0x5f
>>> if_netisr(0) at if_netisr+0xca
>>> taskq_thread(80036000) at taskq_thread+0x11a
>> thanks for the quick test with pfsync. it turned out I forgot to
>> initialize a pf_state::mtx in the pfsync_state_import() function.
>>
>> below is an updated diff, which should fix the stack trace reported by
>> witness.
> 
> Hi,
> 
> yes, the stack trace is gone with this diff. I will leave it running for a
> while to see if the panic goes away ...
> 
> Thank you ...
> 

Hi,

after 4 days of running this diff the firewall seems stable. It should have
panicked by now ..

Re: pf panic with clean snapshot (GENERIC.MP) #570

2022-06-07 Thread Hrvoje Popovski

On 8.6.2022. 0:42, Alexandr Nedvedicky wrote:
> Hello Hrvoje,
> 
> 
>> Hi,
>>
>> while booting with this diff I got this log:
>>
>> starting early daemons: syslogd pflogd ntpd
>> witness: lock_object uninitialized: 0xfd8785c81a90
>> Starting stack trace...
>> witness_checkorder(fd8785c81a90,9,0) at witness_checkorder+0xad
>> mtx_enter(fd8785c81a80) at mtx_enter+0x34
>> pf_remove_state(fd8785c81988) at pf_remove_state+0x1da
>> pfsync_in_del_c(fd80028977b0,c,2,2) at pfsync_in_del_c+0x9f
>> pfsync_input(800020b056e8,800020b056f4,f0,2) at pfsync_input+0x33c
>> ip_deliver(800020b056e8,800020b056f4,f0,2) at ip_deliver+0x103
>> ip_local(800020b056e8,800020b056f4,fe007fff0220,0) at ip_local+0x1b7
>> ipintr() at ipintr+0x5f
>> if_netisr(0) at if_netisr+0xca
>> taskq_thread(80036000) at taskq_thread+0x11a
> thanks for the quick test with pfsync. it turned out I forgot to
> initialize a pf_state::mtx in the pfsync_state_import() function.
> 
> below is an updated diff, which should fix the stack trace reported by witness.

Hi,

yes, the stack trace is gone with this diff. I will leave it running for a
while to see if the panic goes away ...

Thank you ...

Re: [External] : pf panic with clean snapshot (GENERIC.MP) #570

2022-06-07 Thread Hrvoje Popovski

On 7.6.2022. 2:16, Alexandr Nedvedicky wrote:
> Hello,
> 
> below is a diff which hopes to fix the issue. Although the diff is fairly
> large, the change itself is fairly straightforward. Let me briefly
> explain what's going on here. The diff introduces a mutex to pf_state,
> which protects the array of keys (pf_state::key) bound to the state.
> 
> The panic which the diff below hopes to fix is caused by a race between the
> timer thread, which expires states, and the pfsync dispatch task, which
> updates a peer. According to data provided by Hrvoje we panic due to a NULL
> pointer dereference in pf_state_export(), which finds sk->key[] to be NULL.
> This may happen because the purge state mechanism detaches the state key
> from the state under protection of PF_STATE_LOCK, while the pfsync dispatch
> task just keeps a reference to the state without using PF_STATE_LOCK to
> access the state instance.
> 
> In order to synchronize access to pf_state::key between the purge thread
> and the pfsync dispatch task, the diff below introduces pf_state::mtx.
> pfsync uses pf_state::mtx to attempt to grab references to the keys bound
> to the state, while the purge task uses the mtx to safely invalidate state
> keys in pf_detach_state().
> 
> Such a change requires pfsync(4) to deal with the situation where a state
> got detached while waiting in the dispatch queue to update a peer.
> We have to make the .write() operation on the sync queue indicate a failure,
> so pfsync_sendout() will just skip the state when processing the dispatch
> queue.
> 
> Also, the diff changes pf_state_key_detach() so that the caller must pass a
> pointer to the state key instead of the index of the key to be detached from
> the state. It also requires the caller to invalidate the state key entry in
> the pf_state::key member.
> 
> I've just smoke tested the diff _without_ pfsync.

Hi,

while booting with this diff I got this log:

starting early daemons: syslogd pflogd ntpd
witness: lock_object uninitialized: 0xfd8785c81a90
Starting stack trace...
witness_checkorder(fd8785c81a90,9,0) at witness_checkorder+0xad
mtx_enter(fd8785c81a80) at mtx_enter+0x34
pf_remove_state(fd8785c81988) at pf_remove_state+0x1da
pfsync_in_del_c(fd80028977b0,c,2,2) at pfsync_in_del_c+0x9f
pfsync_input(800020b056e8,800020b056f4,f0,2) at pfsync_input+0x33c
ip_deliver(800020b056e8,800020b056f4,f0,2) at ip_deliver+0x103
ip_local(800020b056e8,800020b056f4,fe007fff0220,0) at ip_local+0x1b7
ipintr() at ipintr+0x5f
if_netisr(0) at if_netisr+0xca
taskq_thread(80036000) at taskq_thread+0x11a
end trace frame: 0x0, count: 247
End of stack trace.
witness: lock_object uninitialized: 0xfd8786d61d50
Starting stack trace...
witness_checkorder(fd8786d61d50,9,0) at witness_checkorder+0xad
mtx_enter(fd8786d61d40) at mtx_enter+0x34
pf_remove_state(fd8786d61c48) at pf_remove_state+0x1da
pfsync_in_del_c(fd80028d04e0,c,2,2) at pfsync_in_del_c+0x9f
pfsync_input(800020b056e8,800020b056f4,f0,2) at pfsync_input+0x33c
ip_deliver(800020b056e8,800020b056f4,f0,2) at ip_deliver+0x103
ip_local(800020b056e8,800020b056f4,fe03,0) at ip_local+0x1b7
ipintr() at ipintr+0x5f
if_netisr(0) at if_netisr+0xca
taskq_thread(80036000) at taskq_thread+0x11a
end trace frame: 0x0, count: 247
End of stack trace.
witness: lock_object uninitialized: 0xfd8786d61bd0
Starting stack trace...
witness_checkorder(fd8786d61bd0,9,0) at witness_checkorder+0xad
mtx_enter(fd8786d61bc0) at mtx_enter+0x34
pf_remove_state(fd8786d61ac8) at pf_remove_state+0x1da
pfsync_in_del_c(fd80028d04e0,c,2,2) at pfsync_in_del_c+0x9f
pfsync_input(800020b056e8,800020b056f4,f0,2) at pfsync_input+0x33c
ip_deliver(800020b056e8,800020b056f4,f0,2) at ip_deliver+0x103
ip_local(800020b056e8,800020b056f4,fe03,0) at ip_local+0x1b7
ipintr() at ipintr+0x5f
if_netisr(0) at if_netisr+0xca
taskq_thread(80036000) at taskq_thread+0x11a
end trace frame: 0x0, count: 247
End of stack trace.
witness: lock_object uninitialized: 0xfd87846cebc8
Starting stack trace...
witness_checkorder(fd87846cebc8,9,0) at witness_checkorder+0xad
mtx_enter(fd87846cebb8) at mtx_enter+0x34
pf_remove_state(fd87846ceac0) at pf_remove_state+0x1da
pfsync_in_del_c(fd8070be0450,c,2,2) at pfsync_in_del_c+0x9f
pfsync_input(800020b056e8,800020b056f4,f0,2) at pfsync_input+0x33c
ip_deliver(800020b056e8,800020b056f4,f0,2) at ip_deliver+0x103
ip_local(800020b056e8,800020b056f4,fe03,0) at ip_local+0x1b7
ipintr() at ipintr+0x5f
if_netisr(0) at if_netisr+0xca
taskq_thread(80036000) at taskq_thread+0x11a
end trace frame: 0x0, count: 247
End of stack trace.
witness: lock_object uninitialized: 0xfd87846ce748
Starting stack trace...
witness_checkorder(fd87846ce748,9,0) at witness_checkorder+0xad
mtx_enter(fd87846ce738) at mtx_enter+0x34
pf_remove_state(fd87846ce640) at pf_remove_state+0x1da
pfsync_in_del_c(fd8070be0450,c,2,2) at pfsync_in_del_c+0x9f

Re: pf panic with clean snapshot (GENERIC.MP) #570

2022-06-06 Thread Hrvoje Popovski

On 6.6.2022. 12:45, Alexandr Nedvedicky wrote:
> this is most likely identical to the crash you reported "two weeks ago"
> I can not find an email with it.

Oh yes, yes, it's on tech@ with the subject
"pf_state_export panic with NET_TASKQ=6 and stuff"

I'd totally forgotten about that report :)

The difference is that that panic was with a few diffs on top of
NET_TASKQ=6, but this one is a plain snapshot...

pf panic with clean snapshot (GENERIC.MP) #570

2022-06-06 Thread Hrvoje Popovski

Hi,

this is a follow-up to
https://marc.info/?l=openbsd-tech=165450511622133=2

panic log:

bcbnfw1# uvm_fault(0x822e5e48, 0x0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  pf_state_export+0x38:   movq 0(%rax),%rcx
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*186873  72386      0     0x14000      0x200    1  softnet
 177504   6658      0     0x14000      0x200    4  softnet
  39873  45066      0     0x14000      0x200    3  softnet
 212195  13588      0     0x14000      0x200    2  softnet
pf_state_export(fd80610b3bd4,fd87778f3010) at pf_state_export+0x38
pfsync_sendout() at pfsync_sendout+0x5e4
pfsync_update_state(fd874a5bd190) at pfsync_update_state+0x15b
pf_test(2,1,80bbe000,800020b45b18) at pf_test+0xd53
ip_input_if(800020b45b18,800020b45b24,4,0,80bbe000) at
ip_input_if+0xcd
ipv4_input(80bbe000,fd8061062300) at ipv4_input+0x39
ether_input(80bbe000,fd8061062300) at ether_input+0x3ad
carp_input(80bd5000,fd8061062300,5e000101) at carp_input+0x196
ether_input(80bd5000,fd8061062300) at ether_input+0x1d9
vlan_input(80ba1000,fd8061062300,800020b45d4c) at
vlan_input+0x23d
ether_input(80ba1000,fd8061062300) at ether_input+0x85
if_input_process(8048b048,800020b45de8) at if_input_process+0x6f
ifiq_process(8048e900) at ifiq_process+0x69
taskq_thread(80035200) at taskq_thread+0x100
end trace frame: 0x0, count: 1
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{1}>



ddb{1}> show reg
rdi     0xfd80610b3cdc
rsi     0
rbp     0x800020b457a0
rbx     0x394
rdx     0x4
rcx     0
rax     0
r8      0x104
r9      0x201641d4bc7bea8
r10     0xfa48834155c0359a
r11     0xfd80610b3bd4
r12     0xfd87778f3010
r13     0
r14     0xfd80610b3bd4
r15     0xfd87778f3010
rip     0x81768b08  pf_state_export+0x38
cs      0x8
rflags  0x10246  __ALIGN_SIZE+0xf246
rsp     0x800020b45760
ss      0x10
pf_state_export+0x38:   movq 0(%rax),%rcx





ddb{1}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 69799  138915      1      0  3    0x100083  ttyin         ksh
 30083      46      1      0  3    0x100098  kqread        cron
 56402  311171  96066    720  3       0x190  kqread        lldpd
 96066  275425      1      0  3        0x80  netio         lldpd
 82432  242124  96039     95  3   0x1100092  kqread        smtpd
 12516  216897  96039    103  3   0x1100092  kqread        smtpd
 19369   21427  96039     95  3   0x1100092  kqread        smtpd
 16547       5  96039     95  3    0x100092  kqread        smtpd
 40575  355715  96039     95  3   0x1100092  kqread        smtpd
 64566  206338  96039     95  3   0x1100092  kqread        smtpd
 96039  176140      1      0  3    0x100080  kqread        smtpd
 74078  507976      1     77  3   0x1100090  kqread        dhcpd
 22909  489517      1      0  3    0x100080  kqread        snmpd
 49177  112109      1     91  3       0x192  kqread        snmpd
 97916  230895      1      0  3        0x88  kqread        sshd
 16686  416523      1      0  3    0x100080  kqread        ntpd
 31840    5744  94041     83  3    0x100092  kqread        ntpd
 94041  139024      1     83  3   0x1100092  kqread        ntpd
 67241  440831  52217     74  3   0x1100092  bpf           pflogd
 52217  253016      1      0  3        0x80  netio         pflogd
 75377   97140  41241     73  3   0x1100090  kqread        syslogd
 41241  505035      1      0  3    0x100082  netio         syslogd
 33175  220087      0      0  3     0x14200  bored         smr
 59216   65103      0      0  3     0x14200  pgzero        zerothread
 93094  298208      0      0  3     0x14200  aiodoned      aiodoned
  4707  184791      0      0  3     0x14200  syncer        update
 13584  284481      0      0  3     0x14200  cleaner       cleaner
 86417  471845      0      0  3     0x14200  reaper        reaper
 78809   25532      0      0  3     0x14200  pgdaemon      pagedaemon
 32266  308574      0      0  3     0x14200  usbtsk        usbtask
  1400  353498      0      0  3     0x14200  usbatsk       usbatsk
 85069  436856      0      0  3  0x40014200  acpi0         acpi0
 73330  275126      0      0  7  0x40014200                idle5
 11953  135217      0      0  3  0x40014200                idle4
 20559  345946      0      0  3  0x40014200                idle3
 22822  186899      0      0  3  0x40014200                idle2
 52381  348951      0      0  3

Re: relayd panic

2022-06-01 Thread Hrvoje Popovski

On 1.6.2022. 9:16, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
>> r420-1# rcctl -f start relayd
>> relayd(ok)
>> r420-1# uvm_fault(0xfd862f82f990, 0x0, 0, 1) -> e
>> kernel: page fault trap, code=0
>> Stopped at  pf_find_or_create_ruleset+0x1c: movb 0(%rdi),%al
>>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>>  431388  19003      0         0x2          0    5  relayd
>>  174608  32253     89       0x112          0    2  relayd
>>  395415  12468      0         0x2          0    4  relayd
>>  493579  11904      0         0x2          0    3  relayd
>> *101082  14967     89   0x1100012          0    0K relayd
>> pf_find_or_create_ruleset(0) at pf_find_or_create_ruleset+0x1c
>> pfr_add_tables(832d7cca800,1,80eaf43c,1000) at
>> pfr_add_tables+0x6ae
>>
>> pfioctl(4900,c450443d,80eaf000,3,80002272e7f0) at pfioctl+0x1d9f
>> VOP_IOCTL(fd8551f82dd0,c450443d,80eaf000,3,fd862f7d60c0,80002272e7f0)
>> at VOP_IOCTL+0x5c
>> vn_ioctl(fd855ecec1e8,c450443d,80eaf000,80002272e7f0) at
>> vn_ioctl+0x75
>> sys_ioctl(80002272e7f0,8000227d9980,8000227d99d0) at
>> sys_ioctl+0x2c4
>> syscall(8000227d9a40) at syscall+0x374
>> Xsyscall() at Xsyscall+0x128
>> end of kernel
> it looks like we are dying here at line 239 due to a NULL pointer dereference:
> 
> 232  struct pf_ruleset *
> 233  pf_find_or_create_ruleset(const char *path)
> 234  {
> 235          char                    *p, *aname, *r;
> 236          struct pf_ruleset       *ruleset;
> 237          struct pf_anchor        *anchor;
> 238  
> 239          if (path[0] == 0)
> 240                  return (&pf_main_ruleset);
> 241  
> 242          while (*path == '/')
> 243                  path++;
> 244  
> 
> I've followed the same steps to reproduce the issue, to check whether
> the diff below resolves it. The bug was introduced by
> my recent change to pf_table.c [1] from May 10th:
> 
>   Modified files:
>   sys/net: pf_ioctl.c pf_table.c 
> 
>   Log message:
>   move memory allocations in pfr_add_tables() out of
>   NET_LOCK()/PF_LOCK() scope. bluhm@ helped a lot
>   to put this diff into shape.
> 
> Besides running a regression test, I've also done simple testing
> using 'load anchor':
> 
> netlock# cat /tmp/anchor.conf 
>  
> load anchor "test" from "/tmp/pf.conf"
> netlock#
> netlock# cat /tmp/pf.conf 
>  
> table <try> { 192.168.1.1 }
> pass from <try>
> netlock#
> netlock# pfctl -sA
>   test
> netlock# pfctl -a test -sT
> try
> netlock# pfctl -a test -t try -T show
>192.168.1.1
> 
> OK to commit fix below?


I can confirm that with this diff I can't trigger the panic...
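The fault in this report is the very first load of path[0] with a NULL path (%rdi is 0 at pf_find_or_create_ruleset+0x1c, "movb 0(%rdi),%al"). A minimal userland sketch of that entry check follows, assuming hypothetical names (struct ruleset, find_ruleset_sketch(), anchor_ruleset). The explicit NULL test is only an illustration of where the dereference happens; it is not the diff proposed in this thread, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pf_ruleset; only identity matters. */
struct ruleset {
	const char	*name;
};

static struct ruleset main_ruleset = { "main" };
static struct ruleset anchor_ruleset = { "anchor" };	/* placeholder */

/*
 * Sketch of the entry of pf_find_or_create_ruleset(). The quoted kernel
 * code reads path[0] unconditionally, so path == NULL faults right away.
 * Here NULL is rejected explicitly instead (hypothetical behaviour).
 */
static struct ruleset *
find_ruleset_sketch(const char *path)
{
	if (path == NULL)
		return (NULL);			/* caller bug: fail, don't fault */
	if (path[0] == '\0')
		return (&main_ruleset);		/* empty path: main ruleset */
	while (*path == '/')
		path++;				/* skip leading slashes */
	/* Real code would look up or create the anchor named by path. */
	return (&anchor_ruleset);
}
```

Usage: find_ruleset_sketch("") yields the main ruleset, while find_ruleset_sketch(NULL) returns NULL instead of taking the page fault seen above.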

Re: relayd panic

2022-06-01 Thread Hrvoje Popovski

On 1.6.2022. 7:01, Hrvoje Popovski wrote:
> Hi all,
> 
> while playing around with TCP Large Receive Offloading for ix, I
> configured httpd and relayd on a test box.
> The moment I started relayd, the box panicked.
> This is the latest snapshot and it's easily reproducible.

With WITNESS

r420-1# rcctl -f start relayd
relayd(ok)
uvm_fault(0xfd862f823730, 0x0, 0, 1) -> e
WARNING: SPL NOT LOWERED ON TRAP EXIT a 0
Stopped at  proc_trampoline+0xdc:   movl $0,%gs:0x538
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 434783  78195      0         0x2          0    4  relayd
 416901   1262     89       0x112          0    3  relayd
 290632  38913      0         0x2          0    2  relayd
 239447  37685      0         0x2          0    5  relayd
  72623   6837     89   0x1100012          0    0K relayd
*174940  41382      0        0x13          0    1  ksh
proc_trampoline() at proc_trampoline+0xdc
end of kernel
end trace frame: 0x7f7dd400, count: 14
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{1}>


ddb{1}> show panic
*cpu0: uvm_fault(0xfd862f823730, 0x0, 0, 1) -> e
ddb{1}>

ddb{1}> show reg
rdi     0x822c0d48  kprintf_mutex
rsi     0x5
rbp     0x8000227afea0
rbx     0
rdx     0xc000
rcx     0x286
rax     0x2a
r8      0
r9      0
r10     0xf417d734fa974b8
r11     0x7ea5978c0be9feb6
r12     0
r13     0
r14     0
r15     0
rip     0x8118b50c  proc_trampoline+0xdc
cs      0x8
rflags  0x246
rsp     0x8000227afe20
ss      0
proc_trampoline+0xdc:   movl $0,%gs:0x538
ddb{1}>


ddb{1}> show all locks
CPU 1:
exclusive mutex >pm_mtx r = 0 (0xfd862f8226d8)
#0  witness_lock+0x311
#1  mtx_enter_try+0x95
#2  mtx_enter+0x48
#3  pmap_enter+0xf8
#4  uvm_fault_upper+0x1e5
#5  uvm_fault+0xde
#6  upageflttrap+0x62
#7  usertrap+0x129
#8  recall_trap+0x8
Process 37685 (relayd) thread 0x80002273f508 (239447)
exclusive rwlock uobjlk r = 0 (0xfd8575064088)
#0  witness_lock+0x311
#1  rw_enter+0x292
#2  uvm_fault_lower_lookup+0x41
#3  uvm_fault_lower+0x45
#4  uvm_fault+0x1b3
#5  upageflttrap+0x62
#6  usertrap+0x129
#7  recall_trap+0x8
shared rwlock vmmaplk r = 0 (0xfd862f823a28)
#0  witness_lock+0x311
#1  uvmfault_lookup+0x8a
#2  uvm_fault_check+0x32
#3  uvm_fault+0xfb
#4  upageflttrap+0x62
#5  usertrap+0x129
#6  recall_trap+0x8
Process 6837 (relayd) thread 0x80002273f268 (72623)
exclusive rwlock pf_lock r = 0 (0x822ce1f8)
#0  witness_lock+0x311
#1  pfr_add_tables+0x384
#2  pfioctl+0x1daf
#3  VOP_IOCTL+0x5c
#4  vn_ioctl+0x75
#5  sys_ioctl+0x2c4
#6  syscall+0x374
#7  Xsyscall+0x128
exclusive rwlock netlock r = 0 (0x822adc60)
#0  witness_lock+0x311
#1  pfr_add_tables+0x342
#2  pfioctl+0x1daf
#3  VOP_IOCTL+0x5c
#4  vn_ioctl+0x75
#5  sys_ioctl+0x2c4
#6  syscall+0x374
#7  Xsyscall+0x128
exclusive rwlock pfioctl_rw r = 0 (0x822ce258)
#0  witness_lock+0x311
#1  pfioctl+0x21e
#2  VOP_IOCTL+0x5c
#3  vn_ioctl+0x75
#4  sys_ioctl+0x2c4
#5  syscall+0x374
#6  Xsyscall+0x128
exclusive kernel_lock _lock r = 1 (0x8247f570)
#0  witness_lock+0x311
#1  vn_ioctl+0x3b
#2  sys_ioctl+0x2c4
#3  syscall+0x374
#4  Xsyscall+0x128
Process 41382 (ksh) thread 0x80002273f7a8 (174940)
exclusive rwlock amaplk r = 0 (0xfd857123cad0)
#0  witness_lock+0x311
#1  uvm_fault_check+0x3f7
#2  uvm_fault+0xfb
#3  upageflttrap+0x62
#4  usertrap+0x129
#5  recall_trap+0x8
shared rwlock vmmaplk r = 0 (0xfd857136d758)
#0  witness_lock+0x311
#1  uvmfault_lookup+0x8a
#2  uvm_fault_check+0x32
#3  uvm_fault+0xfb
#4  upageflttrap+0x62
#5  usertrap+0x129
#6  recall_trap+0x8
exclusive mutex >pm_mtx r = 0 (0xfd862f8226d8)
#0  witness_lock+0x311
#1  mtx_enter_try+0x95
#2  mtx_enter+0x48
#3  pmap_enter+0xf8
#4  uvm_fault_upper+0x1e5
#5  uvm_fault+0xde
#6  upageflttrap+0x62
#7  usertrap+0x129
#8  recall_trap+0x8
ddb{1}>



ddb{1}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 11599  104649      1      0  3        0x80  kqread        relayd
 61284  290693      1      0  2         0x2                relayd
 78195  434783      1      0  7         0x2                relayd
 51529   52072      1     89  2       0x112                relayd
  1262  416901      1     89  7       0x112                relayd
 38913  290632      1      0  7         0x2                relayd
 37685  239447      1      0  7         0x2                relayd
 59481  105452      1      0  2         0x2

relayd panic

2022-05-31 Thread Hrvoje Popovski

Hi all,

while playing around with TCP Large Receive Offloading for ix, I
configured httpd and relayd on a test box.
The moment I started relayd, the box panicked.
This is the latest snapshot and it's easily reproducible.


r420-1# cat /etc/httpd.conf
prefork 4

server "default" {
listen on 127.0.0.1 port 80
}

r420-1# cat /etc/relayd.conf
table  { 127.0.0.1 }
redirect www {
listen on 192.168.100.205 port http
forward to  check icmp
}


panic
r420-1# rcctl -f start relayd
relayd(ok)
r420-1# uvm_fault(0xfd862f82f990, 0x0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  pf_find_or_create_ruleset+0x1c: movb 0(%rdi),%al
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 431388  19003      0         0x2          0    5  relayd
 174608  32253     89       0x112          0    2  relayd
 395415  12468      0         0x2          0    4  relayd
 493579  11904      0         0x2          0    3  relayd
*101082  14967     89   0x1100012          0    0K relayd
pf_find_or_create_ruleset(0) at pf_find_or_create_ruleset+0x1c
pfr_add_tables(832d7cca800,1,80eaf43c,1000) at
pfr_add_tables+0x6ae

pfioctl(4900,c450443d,80eaf000,3,80002272e7f0) at pfioctl+0x1d9f
VOP_IOCTL(fd8551f82dd0,c450443d,80eaf000,3,fd862f7d60c0,80002272e7f0)
at VOP_IOCTL+0x5c
vn_ioctl(fd855ecec1e8,c450443d,80eaf000,80002272e7f0) at
vn_ioctl+0x75
sys_ioctl(80002272e7f0,8000227d9980,8000227d99d0) at
sys_ioctl+0x2c4
syscall(8000227d9a40) at syscall+0x374
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7eca80, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>

ddb{0}> show reg
rdi     0
rsi     0x80eb2a01
rbp     0x8000227d8f70
rbx     0
rdx     0
rcx     0x3
rax     0x72
r8      0x101010101010101
r9      0x8080808080808080
r10     0xe7c5ac49b5b31a3e
r11     0xd36a3af6ec2034e3
r12     0x1
r13     0x80eb2e00
r14     0x80eb39c0
r15     0x80eb2a00
rip     0x8147bffc  pf_find_or_create_ruleset+0x1c
cs      0x8
rflags  0x10282  __ALIGN_SIZE+0xf282
rsp     0x8000227d8f30
ss      0x10
pf_find_or_create_ruleset+0x1c: movb 0(%rdi),%al


ddb{0}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 70374  260289      1      0  3        0x80  kqread        relayd
 19003  431388      1      0  7         0x2                relayd
 32253  174608      1     89  7       0x112                relayd
 12468  395415      1      0  7         0x2                relayd
 11904  493579      1      0  7         0x2                relayd
 71579  177053      1     89  3   0x1100092  kqread        relayd
 52250  384601      1     89  3   0x1100092  kqread        relayd
 78736  288537      1     89  3   0x1100092  kqread        relayd
*14967  101082      1     89  7   0x1100012                relayd
 97366   28376  48265      0  3    0x100083  nanoslp       sleep
 48265  148003      1      0  3    0x100089  sigsusp       ksh
 72597  317981      1      0  3    0x100083  ttyin         ksh
 14238  266586      1      0  3    0x100098  kqread        cron
 88363  270212  39693     95  3   0x1100092  kqread        smtpd
 43072  155751  39693    103  3   0x1100092  kqread        smtpd
 20968  329586  39693     95  3   0x1100092  kqread        smtpd
 61100  508858  39693     95  3    0x100092  kqread        smtpd
 98465  158391  39693     95  3   0x1100092  kqread        smtpd
 12045  461090  39693     95  3   0x1100092  kqread        smtpd
 39693  153086      1      0  3    0x100080  kqread        smtpd
  2297  255527      1      0  3        0x88  kqread        sshd
 73816   88254      1      0  3    0x100080  kqread        ntpd
 74329  300888  70971     83  3    0x100092  kqread        ntpd
 70971  124726      1     83  3   0x1100092  kqread        ntpd
 31879  226513  51900     74  3   0x1100092  bpf           pflogd
 51900  452501      1      0  3        0x80  netio         pflogd
 84934  332753  55410     73  3   0x1100090  kqread        syslogd
 55410  338332      1      0  3    0x100082  netio         syslogd
 42399  151525      0      0  3     0x14200  bored         smr
 51084   48313      0      0  3     0x14200  pgzero        zerothread
 55543  234427      0      0  3     0x14200  aiodoned      aiodoned
 38843  197586      0      0  3     0x14200  syncer        update
 39156   69723      0      0  3     0x14200  cleaner       cleaner
 28960  522155      0      0  3     0x14200  reaper        reaper
 98774  330824      0      0  3

Re: -current crash

2022-05-31 Thread Hrvoje Popovski

On 1.6.2022. 0:27, Stuart Henderson wrote:
> I accidentally updated a router to -current instead of 7.1 and hit this.
> (Thanks sysupgrade - it was running a 7.0-stable kernel before...)
> 
> Unfortunately it runs with ddb.panic=0 and this time it hanged, I won't
> have time to figure anything out with it when I get it back online, but
> might be able to do so later in the week.
> 
> Thought I'd send it out now as a heads-up as much as anything (and maybe
> someone has an idea). Boot messages below.

Hi,

I think this is the relayd panic. I'm seeing it too while testing TCP
Large Receive Offloading. I will send a proper bug report in my next mail...

r420-1# rcctl -f start relayd
relayd(ok)
r420-1# uvm_fault(0xfd8571260e70, 0x0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  pf_find_or_create_ruleset+0x1c: movb 0(%rdi),%al
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*307542  67712     89   0x1100012          0    2K relayd
pf_find_or_create_ruleset(0) at pf_find_or_create_ruleset+0x1c
pfr_add_tables(b0caf51,1,8104343c,1000) at
pfr_add_tables+0x6ae
pfioctl(4900,c450443d,81043000,3,80002271e550) at pfioctl+0x1daf
VOP_IOCTL(fd857631f1f8,c450443d,81043000,3,fd862f7d6960,80002271e550)
at VOP_IOCTL+0x5c
vn_ioctl(fd854c7a2308,c450443d,81043000,80002271e550) at
vn_ioctl+0x75
sys_ioctl(80002271e550,8000227f5b40,8000227f5b90) at
sys_ioctl+0x2c4
syscall(8000227f5c00) at syscall+0x374
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7d3d40, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.

Re: [External] : Re: ip6 forwarding with pf and pfsync over veb/vport

2022-05-24 Thread Hrvoje Popovski

On 24.5.2022. 9:01, Alexandr Nedvedicky wrote:
> interesting. I went through the mbuf handling in if_veb.c.
> I could find just a single nit, which is most likely unrelated;
> however, I think it's still worth giving the diff below a try.
> 
> basically all calls to veb_pf() read as follows:
>   m = veb_pf(ifp, ..., m);
> except the one in veb_broadcast(), which reads as:
>   m = veb_pf(ifp, ..., m0);
> I think it is a bug: the caller of veb_pf() should continue to run
> with the packet returned by veb_pf().
> 
> thanks and
> regards
> sashan
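The contract described above (the caller continues with whatever veb_pf() returns, never with the pointer it passed in) can be sketched in plain C. struct pkt and filter() are hypothetical stand-ins for struct mbuf and veb_pf(), assumed here to either pass the packet through, drop and free it, or hand back a different allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical packet type standing in for struct mbuf. */
struct pkt {
	int	len;
};

/*
 * Hypothetical filter: consumes its argument and returns the packet the
 * caller must continue with. That may be the same packet, a different
 * allocation (the old one freed), or NULL when the packet was dropped.
 */
static struct pkt *
filter(struct pkt *m, int drop, int replace)
{
	struct pkt *n;

	if (drop) {
		free(m);
		return (NULL);
	}
	if (replace) {
		n = malloc(sizeof(*n));
		if (n != NULL)
			n->len = m->len;
		free(m);		/* the argument pointer is now dangling */
		return (n);
	}
	return (m);
}

/* Correct pattern: after the call only the returned pointer is used. */
static int
deliver(struct pkt *m0, int drop, int replace)
{
	struct pkt *m;
	int len;

	m = filter(m0, drop, replace);
	/* m0 must not be touched from here on; m is the live packet. */
	if (m == NULL)
		return (-1);
	len = m->len;
	free(m);
	return (len);
}
```

Continuing with m0 instead of m after the call would read freed memory in the "replace" and "drop" cases, which is the kind of use-after-free suspected in this thread.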


Hi,

and with this diff I can still panic the box the same way as before... ip6
forwarding, pf and veb/vport

panic:
r620-1# panic: pool_cache_item_magic_check: mbufpl cpu free list modified: item addr 0xfd80a420e500+24 0x6ab2245961ee985c!=0x6ab22459cd0af85c
uvm_fault(0x822f13a8, 0x17, 0, 2) -> e
kernel: page fault trap, code=0
Stopped at  db_enter+0x10:  popq %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 418374  46077      0     0x14000      0x200    3  softnet
 355064  80120      0     0x14000      0x200    2K softnet
*401307  69853      0     0x14000      0x200    5  softnet
db_enter() at db_enter+0x10
panic(81f3c6f5) at panic+0xbf
pool_cache_get(82483608) at pool_cache_get+0x25b
pool_get(82483608,2) at pool_get+0x61
m_get(2,1) at m_get+0x3f
m_copym(fd80a3b50900,0,40,2) at m_copym+0xd8
ip6_forward(fd80a3b50900,fd842ce9c708,0) at ip6_forward+0x1cc
ip6_input_if(800022c6b728,800022c6b734,29,0,8074b000) at
ip6_input_if+0x80a
ipv6_input(8074b000,fd80a3b50900) at ipv6_input+0x39
ether_input(8074b000,fd80a3b50900) at ether_input+0x3ad
vport_if_enqueue(8074b000,fd80a3b50900) at vport_if_enqueue+0x19
veb_port_input(80095048,fd80a3b50900,ecf4bbdaf7f8,80747300)
 at veb_port_input+0x5b0
ether_input(80095048,fd80a3b50900) at ether_input+0x100
if_input_process(80095048,800022c6b938) at if_input_process+0x6f
end trace frame: 0x800022c6b980, count: 0
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.



ddb{5}> show panic
*cpu5: pool_cache_item_magic_check: mbufpl cpu free list modified: item
addr 0x
fd80a420e500+24 0x6ab2245961ee985c!=0x6ab22459cd0af85c
 cpu2: uvm_fault(0x822f13a8, 0x17, 0, 2) -> e
ddb{5}>
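The panic message means that a word inside an mbuf sitting on the per-CPU free list no longer holds the value the cache expected when the mbuf is handed out again; a write into an already-freed mbuf (a use-after-free) shows up exactly this way, with the item address, the byte offset and both values. A simplified model of such a check follows. The secret, the single checked word and every name are assumptions of this sketch; OpenBSD's real pool cache code differs in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define ITEM_WORDS	8
#define CHECK_WORD	3	/* hypothetical: word 3 is byte offset 24 */

struct item {
	uint64_t	w[ITEM_WORDS];
};

/* Hypothetical per-pool secret; a real allocator would randomize it. */
static const uint64_t pool_secret = 0x5bd1e9955bd1e995ULL;

/* Expected value of the check word, tied to the item's address. */
static uint64_t
item_magic(const struct item *it)
{
	return ((uint64_t)(uintptr_t)it ^ pool_secret);
}

/* Putting an item on the free list stamps the check word. */
static void
item_put(struct item *it)
{
	it->w[CHECK_WORD] = item_magic(it);
}

/* Taking it back verifies the word; -1 means a free item was written to. */
static int
item_get(const struct item *it)
{
	uint64_t found = it->w[CHECK_WORD];
	uint64_t expect = item_magic(it);

	if (found != expect) {
		printf("free list modified: item addr %p+%zu 0x%llx!=0x%llx\n",
		    (const void *)it, (size_t)CHECK_WORD * sizeof(uint64_t),
		    (unsigned long long)found, (unsigned long long)expect);
		return (-1);
	}
	return (0);
}
```

Such a check only catches the damage when the item is reused, so the stack trace points at the allocator (pool_cache_get() from m_get()/m_gethdr()), not at the code that wrote to freed memory.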

Re: ip6 forwarding with pf and pfsync over veb/vport

2022-05-23 Thread Hrvoje Popovski

On 23.5.2022. 10:41, Hrvoje Popovski wrote:
> On 23.5.2022. 8:34, Alexandr Nedvedicky wrote:
>> looks like kind of memory corruption. my bet is use-after-free.
>> will try to get to it later today.
>>
>> does it mean there is no such panic, when we handle IPv4 traffic only?
> 
> Hi,
> 
> yes, it seems that I can't trigger the panic with ip4-only traffic, at least
> not the same way I can with ip6 traffic
> 

I've been trying to trigger the panic with ip4 all day and I just can't.

Re: ip6 forwarding with pf and pfsync over veb/vport

2022-05-23 Thread Hrvoje Popovski

On 23.5.2022. 10:41, Hrvoje Popovski wrote:
> On 23.5.2022. 8:34, Alexandr Nedvedicky wrote:
>> looks like kind of memory corruption. my bet is use-after-free.
>> will try to get to it later today.
>>
>> does it mean there is no such panic, when we handle IPv4 traffic only?
> 
> Hi,
> 
> yes, it seems that I can't trigger the panic with ip4-only traffic, at least
> not the same way I can with ip6 traffic
> 

Here's another one, but this time I ran tcpdump on the outgoing ix interface.
I've tried the same stuff with ip4 traffic and couldn't trigger the panic.



10:53:59.682513 a192:a168:a100::111.9 > b192:b168:b111::bfbf.9: udp
panic: pool_cache_item_magic_check: mbufpl cpu free list modified: item addr 0xfd80a37fda00+16 0xfd80a37fdaf2!=0xcf189becdf59b00b
uvm_fault(0x822f6268, 0x17, 0, 2) -> e
Stopped at  db_enter+0x10:  popq %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
  32710  85256      0     0x14000      0x200    4K softnet
  97437  83157      0     0x14000      0x200    1  softnet
 212200  25091      0     0x14000      0x200    3  softnet
 510395  50985      0     0x14000      0x200    5  softnet
 417502  88838      0     0x14000      0x200    0  systq
db_enter() at db_enter+0x10
panic(81f34fe0) at panic+0xbf
pool_cache_get(82474c48) at pool_cache_get+0x25b
pool_get(82474c48,2) at pool_get+0x61
m_clget(0,2,802) at m_clget+0xdd
ixgbe_get_buf(800973a0,b2) at ixgbe_get_buf+0xa3
ixgbe_rxfill(800973a0) at ixgbe_rxfill+0xaa
ixgbe_queue_intr(80024d00) at ixgbe_queue_intr+0x4f
intr_handler(800022c89380,80081e00) at intr_handler+0x6e
Xintr_ioapic_edge0_untramp() at Xintr_ioapic_edge0_untramp+0x18f
acpicpu_idle() at acpicpu_idle+0x203
sched_idle(800022412ff0) at sched_idle+0x280
end trace frame: 0x0, count: 3
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{2}>


ddb{2}> show panic
 cpu4: uvm_fault(0x822f6268, 0x17, 0, 2) -> e
*cpu2: pool_cache_item_magic_check: mbufpl cpu free list modified: item
addr 0x
fd80a37fda00+16 0xfd80a37fdaf2!=0xcf189becdf59b00b
ddb{2}>


ddb{2}> show reg
rdi     0
rsi     0x14
rbp     0x800022c88ff0
rbx     0xfd842f835c00
rdx     0xc800
rcx     0x206
rax     0x8a
r8      0x101010101010101
r9      0
r10     0xe6540fc793a8e615
r11     0x4860824aa7540a0c
r12     0x800022413a60
r13     0
r14     0
r15     0x81f34fe0  cmd0646_9_tim_udma+0x2acb1
rip     0x817b4d90  db_enter+0x10
cs      0x8
rflags  0x206
rsp     0x800022c88ff0
ss      0x10
db_enter+0x10:  popq %rbp


ddb{2}> show mbuf
mbuf 0x817b4d90
m_type: -13108  m_flags:
c3cc
m_next: 0x1d3b4c241c334c5d  m_nextpkt: 0x117400ae525c
m_data: 0x  m_len: 3435973836
m_dat: 0x817b4db0   m_pktdat: 0x817b4e00


ddb{2}> show all locks
Process 85256 (softnet) thread 0x8000e7e0 (32710)
shared rwlock netlock r = 0 (0x822e9990)
shared rwlock softnet r = 0 (0x80031370)
Process 83157 (softnet) thread 0x8000ea80 (97437)
shared rwlock netlock r = 0 (0x822e9990)
shared rwlock softnet r = 0 (0x80031270)
Process 25091 (softnet) thread 0x8000ed20 (212200)
shared rwlock netlock r = 0 (0x822e9990)
shared rwlock softnet r = 0 (0x80031170)
Process 50985 (softnet) thread 0x8000efc0 (510395)
shared rwlock softnet r = 0 (0x80031070)
Process 88838 (systq) thread 0x8000f500 (417502)
shared rwlock systq r = 0 (0x822eaf08)
Process 59744 (softclock) thread 0x8000f7a0 (200127)
exclusive kernel_lock _lock r = 0 (0x824b03c0)
shared rwlock timeout r = 0 (0x822b2fe8)


ddb{2}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 81137  105065  65725     76  3    0x100093  netio         tcpdump
 65725  227707  17816     76  3   0x1100093  ttyout        tcpdump
 17816  349982      1      0  3    0x10008b  sigsusp       ksh
 96985  429538      1      0  3    0x100098  kqread        cron
 95498  144368  28860     95  3   0x1100092  kqread        smtpd
 43714  295842  28860    103  3   0x1100092  kqread        smtpd
 80683  116687  28860     95  3   0x1100092  kqread        smtpd
 35950  130878  28860     95  3    0x100092  kqread        smtpd
 27765   48615  28860     95  3   0x1100092  kqread        smtpd
 554

Re: ip6 forwarding with pf and pfsync over veb/vport

2022-05-23 Thread Hrvoje Popovski

On 23.5.2022. 8:34, Alexandr Nedvedicky wrote:
> looks like kind of memory corruption. my bet is use-after-free.
> will try to get to it later today.
> 
> does it mean there is no such panic, when we handle IPv4 traffic only?

Hi,

yes, it seems that I can't trigger the panic with ip4-only traffic, at least
not the same way I can with ip6 traffic

ip6 forwarding with pf and pfsync over veb/vport

2022-05-22 Thread Hrvoje Popovski

Hi all,

I can reproduce a panic when sending ip6 traffic over vport and destroying
the pfsync interface. It is reproducible with veb and vport, but I couldn't
trigger the panic when forwarding ip6 over physical interfaces.

I've compiled a kernel from source fetched half an hour ago, just to enable
WITNESS.

r620-1# ifconfig pfsync0 destroy
panic: pool_cache_item_magic_check: mbufpl cpu free list modified: item addr 0xfd80a41c3f00+24 0xaf551e6f8f90255f!=0xaf551e6f8f35fd5f
uvm_fault(0x823ba618, 0x17, 0, 2) -> e
kernel: page fault trap, code=0
Stopped at  db_enter+0x10:  popq %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 317552  39553      0     0x14000      0x200    2K softnet
 504828  12606      0     0x14000      0x200    4  softnet
*283345  81494      0     0x14000      0x200    3  softnet
db_enter() at db_enter+0x10
panic(81f39222) at panic+0xbf
pool_cache_get(82323228) at pool_cache_get+0x25b
pool_get(82323228,2) at pool_get+0x61
m_gethdr(2,1) at m_gethdr+0x3f
pfsync_sendout() at pfsync_sendout+0xe9
pfsync_update_state(fd839f1f8950) at pfsync_update_state+0x15b
pf_test(18,1,80095048,800022c6ae30) at pf_test+0xd53
veb_pf(80095048,1,fd80a3594900) at veb_pf+0xbf
veb_port_input(80095048,fd80a3594900,ecf4bbdaf7f8,80747300)
 at veb_port_input+0x2ce
ether_input(80095048,fd80a3594900) at ether_input+0x100
if_input_process(80095048,800022c6afc8) at if_input_process+0x6f
ifiq_process(80099800) at ifiq_process+0x69
taskq_thread(80031100) at taskq_thread+0x11a
end trace frame: 0x0, count: 1
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.


ddb{3}> show panic
*cpu3: pool_cache_item_magic_check: mbufpl cpu free list modified: item
addr 0xfd80a41c3f00+24 0xaf551e6f8f90255f!=0xaf551e6f8f35fd5f
 cpu2: uvm_fault(0x823ba618, 0x17, 0, 2) -> e
ddb{3}>

ddb{3}> show reg
rdi     0
rsi     0x14
rbp     0x800022c6a960
rbx     0xfd842f835c00
rdx     0xc800
rcx     0x282
rax     0x8a
r8      0x101010101010101
r9      0
r10     0xedcd3183c339b665
r11     0xb0f0eb58b1d2563
r12     0x80002241ca60
r13     0
r14     0
r15     0x81f39222  cmd0646_9_tim_udma+0x314d8
rip     0x8118e200  db_enter+0x10
cs      0x8
rflags  0x206
rsp     0x800022c6a960
ss      0x10
db_enter+0x10:  popq %rbp


ddb{3}> show all locks
Process 39553 (softnet) thread 0x8000ed20 (317552)
shared rwlock netlock r = 0 (0x822c6550)
shared rwlock softnet r = 0 (0x80031370)
Process 12606 (softnet) thread 0x8000e000 (504828)
shared rwlock netlock r = 0 (0x822c6550)
shared rwlock softnet r = 0 (0x80031270)
Process 81494 (softnet) thread 0x8000e2a0 (283345)
shared rwlock netlock r = 0 (0x822c6550)
shared rwlock softnet r = 0 (0x80031170)
Process 96881 (softnet) thread 0x8000e540 (159803)
shared rwlock softnet r = 0 (0x80031070)
Process 26865 (systq) thread 0x8000ea80 (449324)
shared rwlock systq r = 0 (0x822dd728)
Process 93339 (softclock) thread 0x8000efc0 (160018)
shared rwlock timeout r = 0 (0x822b6000)
ddb{3}>


ddb{3}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 39455  263128  42512      0  3         0x3  netlock       ifconfig
 42512  258140      1      0  3    0x10008b  sigsusp       ksh
 34696  282706      1      0  3    0x100098  kqread        cron
 86943  298932  81565     95  3   0x1100092  kqread        smtpd
 34037  448643  81565    103  3   0x1100092  kqread        smtpd
 17802  340759  81565     95  3   0x1100092  kqread        smtpd
 54979  438478  81565     95  3    0x100092  kqread        smtpd
 29724  438684  81565     95  3   0x1100092  kqread        smtpd
  3110  313509  81565     95  3   0x1100092  kqread        smtpd
 81565  137591      1      0  3    0x100080  kqread        smtpd
 81008  204817      1      0  3        0x88  kqread        sshd
 72442  275002      1      0  3    0x100080  kqread        ntpd
 97406  453489  91190     83  3    0x100092  kqread        ntpd
 91190  488051      1     83  3   0x1100012  netlock       ntpd
 31521   42595   4468     73  3   0x1100090  kqread        syslogd
  4468   43476      1      0  3    0x100082  netio         syslogd
 66713  499933      0      0  3     0x14200  bored         smr

Re: [External] : Re: 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-13 Thread Hrvoje Popovski

On 13.5.2022. 4:19, David Gwynne wrote:
> sorry i'm late to the party. can you try this diff?
> 
> this diff replaces the list of ports with an array/map of ports.
> the map takes references to all the ports, so the forwarding paths
> just have to hold a reference to the map to be able to use all the
> ports. the forwarding path uses smr to get hold of a map, takes a map
> ref, and then leaves the smr crit section before iterating over the map
> and pushing packets.
> 
> this means we should only take and release a single refcnt when
> we're pushing packets out any number of ports.
> 
> if no span ports are configured, then there's no span port map and
> we don't try and take a ref, we can just return early.
> 
> we also only take and release a single refcnt when we forward the
> actual packet. forwarding to a single port provided by an etherbridge
> lookup already takes/releases the single port ref. if it falls
> through that for unknown unicast or broadcast/multicast, then it's
> a single refcnt for the current map of all ports.



Hi,

and with this diff I can't trigger the panic...
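The single-reference idea in David's description can be sketched as a refcounted map that itself pins every port, so flooding a packet over N ports costs one take/release pair on the map instead of N port references. This is a userland model under stated assumptions: SMR is left out entirely, and struct port, struct port_map and the helpers are hypothetical names, not the if_veb.c code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PORTS	8

/* Hypothetical port; `sent` counts packets pushed out of it. */
struct port {
	atomic_uint	refs;
	int		sent;
};

/* Hypothetical map: one reference on the map covers all its ports. */
struct port_map {
	atomic_uint	refs;
	size_t		nports;
	struct port	*ports[MAX_PORTS];
};

static void
map_take(struct port_map *map)
{
	atomic_fetch_add(&map->refs, 1);
}

static void
map_rele(struct port_map *map)
{
	unsigned int old = atomic_fetch_sub(&map->refs, 1);

	assert(old > 0);
	(void)old;
}

/* The map holds one reference on each port for as long as it exists. */
static void
map_init(struct port_map *map, struct port **ports, size_t n)
{
	assert(n <= MAX_PORTS);
	atomic_init(&map->refs, 1);	/* the owner's reference */
	map->nports = n;
	for (size_t i = 0; i < n; i++) {
		map->ports[i] = ports[i];
		atomic_fetch_add(&ports[i]->refs, 1);
	}
}

/* Flood: a single map reference for the whole walk over the ports. */
static void
flood(struct port_map *map)
{
	if (map == NULL)
		return;		/* no map configured: return early */
	map_take(map);
	for (size_t i = 0; i < map->nports; i++)
		map->ports[i]->sent++;
	map_rele(map);
}
```

The design choice is to pay the per-port reference cost once, when the map is built, rather than on every forwarded packet.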

Re: [External] : Re: 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-12 Thread Hrvoje Popovski

On 12.5.2022. 20:04, Hrvoje Popovski wrote:
> On 12.5.2022. 16:22, Hrvoje Popovski wrote:
>> On 12.5.2022. 14:48, Claudio Jeker wrote:
>>> I think the diff below may be enough to fix this issue. It drops the SMR
>>> critical section around the enqueue operation but uses a reference on the
>>> port instead to ensure that the device can't be removed during the
>>> enqueue. Once the enqueue is finished we enter the SMR critical section
>>> again and drop the reference.
>>>
>>> To make it clear that the SMR_TAILQ remains intact while a refcount is
>>> held I moved refcnt_finalize() above SMR_TAILQ_REMOVE_LOCKED(). This is
>>> not strictly needed since the next pointer remains valid up until the
>>> smr_barrier() call but I find this a bit easier to understand.
>>> First make sure nobody else holds a reference then remove the port from
>>> the list.
>>>
>>> I currently have no test setup to verify this but I hope someone else can
>>> give this a spin.
>> Hi,
>>
>> for now veb seems stable: I can't panic the box, although I'd expect to
>> be able to. Please give me a few more hours to torture it properly.
> 
> 
> I can trigger panic in veb with this diff.
> 
> Thank you ..
> 
> 

I can't trigger ... :))) sorry ..

Re: [External] : Re: 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-12 Thread Hrvoje Popovski

On 12.5.2022. 16:22, Hrvoje Popovski wrote:
> On 12.5.2022. 14:48, Claudio Jeker wrote:
>> I think the diff below may be enough to fix this issue. It drops the SMR
>> critical section around the enqueue operation but uses a reference on the
>> port instead to ensure that the device can't be removed during the
>> enqueue. Once the enqueue is finished we enter the SMR critical section
>> again and drop the reference.
>>
>> To make it clear that the SMR_TAILQ remains intact while a refcount is
>> held I moved refcnt_finalize() above SMR_TAILQ_REMOVE_LOCKED(). This is
>> not strictly needed since the next pointer remains valid up until the
>> smr_barrier() call but I find this a bit easier to understand.
>> First make sure nobody else holds a reference then remove the port from
>> the list.
>>
>> I currently have no test setup to verify this but I hope someone else can
>> give this a spin.
> Hi,
> 
> for now veb seems stable: I can't panic the box, although I'd expect to
> be able to. Please give me a few more hours to torture it properly.


I can trigger panic in veb with this diff.

Thank you ..

Re: [External] : Re: 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-12 Thread Hrvoje Popovski

On 12.5.2022. 14:48, Claudio Jeker wrote:
> I think the diff below may be enough to fix this issue. It drops the SMR
> critical section around the enqueue operation but uses a reference on the
> port instead to ensure that the device can't be removed during the
> enqueue. Once the enqueue is finished we enter the SMR critical section
> again and drop the reference.
> 
> To make it clear that the SMR_TAILQ remains intact while a refcount is
> held I moved refcnt_finalize() above SMR_TAILQ_REMOVE_LOCKED(). This is
> not strictly needed since the next pointer remains valid up until the
> smr_barrier() call but I find this a bit easier to understand.
> First make sure nobody else holds a reference then remove the port from
> the list.
> 
> I currently have no test setup to verify this but I hope someone else can
> give this a spin.

Hi,

for now veb seems stable and I can't panic the box, although it should
panic; please give me a few more hours to torture it properly.

I'm doing this in a loop:
ifconfig veb1 destroy
sh /etc/netstart
ifconfig veb0 destroy
sh /etc/netstart
ifconfig vport1 destroy
sh /etc/netstart
ifconfig vport0 destroy
sh /etc/netstart



My config:

veb1: flags=a843
index 25 llprio 3
groups: veb
ix1 flags=3
port 2 ifpriority 0 ifcost 0
vport1 flags=3
port 27 ifpriority 0 ifcost 0
veb0: flags=a843
index 26 llprio 3
groups: veb
ix0 flags=3
port 1 ifpriority 0 ifcost 0
vport0 flags=3
port 28 ifpriority 0 ifcost 0
ix2 flags=100
vport1: flags=8943 mtu 1500
lladdr ec:f4:bb:da:f7:fa
index 27 priority 0 llprio 3
groups: vport
inet 192.168.111.11 netmask 0xff00 broadcast 192.168.111.255
vport0: flags=8943 mtu 1500
lladdr ec:f4:bb:da:f7:f8
index 28 priority 0 llprio 3
groups: vport
inet 192.168.100.11 netmask 0xff00 broadcast 192.168.100.255
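The scheme Claudio describes in the quoted diff description (take a reference on the port, leave the SMR critical section for the enqueue, re-enter it afterwards and drop the reference) can be illustrated with a small userspace C sketch. It is not the veb code: every name below is hypothetical, the SMR read section is modelled by a plain depth counter standing in for `spc_smrdepth`, and the assertion in `port_enqueue()` plays the role of the `assertwaitok()` check that fires in the panic reported below, since the enqueue path can sleep in `pf_test()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the per-CPU SMR read-section depth
 * (curcpu()->ci_schedstate.spc_smrdepth in the kernel).
 */
static int smr_depth;

static void sketch_smr_enter(void) { smr_depth++; }

static void
sketch_smr_leave(void)
{
	assert(smr_depth > 0);
	smr_depth--;
}

/* Hypothetical port: only the fields this sketch needs. */
struct sketch_port {
	atomic_uint	refs;		/* references held by in-flight enqueues */
	unsigned int	enqueued;	/* packets handed to this port */
};

static void
port_take(struct sketch_port *p)
{
	atomic_fetch_add(&p->refs, 1);
}

static void
port_rele(struct sketch_port *p)
{
	unsigned int old = atomic_fetch_sub(&p->refs, 1);

	assert(old > 0);
}

/*
 * The enqueue path may sleep (e.g. pf taking its rwlock), so it must
 * never run inside an SMR read section.
 */
static void
port_enqueue(struct sketch_port *p)
{
	assert(smr_depth == 0);
	p->enqueued++;
}

/* Deliver one packet to each port of an (array-based) port list. */
static void
sketch_forward(struct sketch_port **ports, int nports)
{
	sketch_smr_enter();
	for (int i = 0; i < nports; i++) {
		struct sketch_port *p = ports[i];

		port_take(p);		/* pin the port before leaving SMR */
		sketch_smr_leave();
		port_enqueue(p);	/* may sleep: no SMR section held */
		sketch_smr_enter();	/* back in SMR before touching the list */
		port_rele(p);
	}
	sketch_smr_leave();
}

/*
 * Removal side: a port may only be unlinked once nobody holds a
 * reference. Kernel code would sleep in refcnt_finalize() here
 * instead of just checking.
 */
static bool
port_can_unlink(struct sketch_port *p)
{
	return atomic_load(&p->refs) == 0;
}
```

After `sketch_forward()` returns, every port has been enqueued to once, no reference is left behind and the SMR depth is back to zero. Checking the reference count before unlinking corresponds to moving refcnt_finalize() above SMR_TAILQ_REMOVE_LOCKED(), as described in the quoted text.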

Re: 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-11 Thread Hrvoje Popovski

On 10.5.2022. 22:55, Alexander Bluhm wrote:
> Yes.  It is similar.
> 
> I have read the whole mail thread and the final fix got committed.
> But it looks incomplete, pf is still sleeping.
> 
> Hrvoje, can you run the tests again that triggered the panics a
> year ago?

Hi,

A year ago the panics were triggered when veb or tpmr bridged traffic. I've
tried that right now and I couldn't trigger those panics.
If I add a vport and route traffic over veb, then I can trigger a panic
with or without vlans as child interfaces of veb.
To get the panic I need to have pf enabled and to run
ifconfig veb destroy or ifconfig vlan destroy and sh netstart in a loop.

This is the panic without vlans:

panic: kernel diagnostic assertion "curcpu()->ci_schedstate.spc_smrdepth
== 0" failed: file "/sys/kern/subr_xxx.c", line 163
Stopped at      db_enter+0x10:  popq    %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
  57981  52408      0     0x14000      0x200    3  softnet
  18952  62179      0     0x14000      0x200    5  softnet
db_enter() at db_enter+0x10
panic(81f36a34) at panic+0xbf
__assert(81faa7fa,81fd47a7,a3,81fe7c9d) at
__assert+0x25
assertwaitok() at assertwaitok+0xcc
mi_switch() at mi_switch+0x40
sleep_finish(800022c707a0,1) at sleep_finish+0x10b
rw_enter(822b3ad8,2) at rw_enter+0x232
pf_test(2,3,800c6048,800022c70a58) at pf_test+0xcf0
ip_output(fd80a32f1f00,0,800022c70be8,1,0,0,e8e0f1a7c10273fe) at
ip_output+0x6b7
ip_forward(fd80a32f1f00,814ee000,fd83a8657078,0) at
ip_forward+0x2da
ip_input_if(800022c70d28,800022c70d34,4,0,814ee000) at
ip_input_if+0x353
ipv4_input(814ee000,fd80a32f1f00) at ipv4_input+0x39
ether_input(814ee000,fd80a32f1f00) at ether_input+0x3ad
vport_if_enqueue(814ee000,fd80a32f1f00) at vport_if_enqueue+0x19
end trace frame: 0x800022c70e70, count: 0
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{4}>

ddb{4}> show reg
rdi     0
rsi     0x14
rbp     0x800022c705f0
rbx     0x800022424ff0
rdx     0x8000
rcx     0x286
rax     0x7d
r8      0x101010101010101
r9      0
r10     0x5b4ef42a9c796b43
r11     0xada7e964a691819f
r12     0x800022425a60
r13     0x800022c450a0
r14     0
r15     0x81f36a34    cmd0646_9_tim_udma+0x2d9d2
rip     0x81c01c30    db_enter+0x10
cs      0x8
rflags  0x286
rsp     0x800022c705f0
ss      0x10
db_enter+0x10:  popq    %rbp
ddb{4}>

ddb{4}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
 14129  480066  92457  0  3 0x3  netlock   ifconfig
 92457  504149   2002  0  30x10008b  sigsusp   sh
  2002   26492   1517  0  30x10008b  sigsusp   sh
  1517  131574  1  0  30x10008b  sigsusp   ksh
 26252  251094  1  0  30x100098  kqreadcron
 20251  457205  97875 95  3   0x1100092  kqreadsmtpd
 62139  255853  97875103  3   0x1100092  kqreadsmtpd
 29505   64154  97875 95  3   0x1100092  kqreadsmtpd
 20035  471489  97875 95  30x100092  kqreadsmtpd
 91114   73268  97875 95  3   0x1100092  kqreadsmtpd
 78396  414422  97875 95  3   0x1100092  kqreadsmtpd
 97875  113010  1  0  30x100080  kqreadsmtpd
 21916  226987  1  0  30x88  kqreadsshd
  90174247  1  0  30x100080  kqreadntpd
 72358  391459  38133 83  30x100092  kqreadntpd
 38133  355054  1 83  3   0x1100012  netlock   ntpd
 91824  285625  60194 73  3   0x1100090  kqreadsyslogd
 60194  367623  1  0  30x100082  netio syslogd
 73270  113983  0  0  3 0x14200  bored smr
 51379  478537  0  0  3 0x14200  pgzerozerothread
 85386   54454  0  0  3 0x14200  aiodoned  aiodoned
 10937  491268  0  0  3 0x14200  syncerupdate
 85008  360847  0  0  3 0x14200  cleaner   cleaner
 76642  501363  0  0  3 0x14200  reaperreaper
 32934  257878  0  0  3 0x14200  pgdaemon  pagedaemon
 48583  371156  0  0  3 0x14200  usbtskusbtask
 53660  310701  0  0  3 0x14200  usbatsk   usbatsk
 19211   31258  0  0  3  0x40014200  acpi0 acpi0
 11856  305318  0  0  3  0x40014200idle5
  9933  290633  0  0  3  0x40014200idle4
 41570   94891  0  0  3  0x40014200

Re: 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-10 Thread Hrvoje Popovski

On 9.5.2022. 22:04, Alexander Bluhm wrote:
> Can some veb or smr hacker explain how this is supposed to work?
> 
> Sleeping in pf is also not ideal as it is in the hot path and slows
> down packets.  But that is not easy to fix as we have to refactor
> the memory allocations before converting pf lock to a mutex.  sashan@
> is working on that.


Hi,

Isn't this similar to, or the same as, the panic discussed in the "parallel
forwarding vs. bridges" mail thread on tech@ started by sashan@?

https://www.mail-archive.com/tech@openbsd.org/msg64040.html

Re: bnxt panic

2022-03-19 Thread Hrvoje Popovski

On 18.3.2022. 18:59, Alexander Bluhm wrote:
> On Fri, Mar 18, 2022 at 11:21:15AM +0100, Hrvoje Popovski wrote:
>> On 17.3.2022. 21:31, Alexander Bluhm wrote:
>>> I don't have the device and don't know the code.  But other drivers
>>> don't process rx and tx interrupts when the interface is not running.
>>>
>>> Maybe this helps.  dlg@ and jmatthew@ should know better than me.
>>>
>>> bluhm
>>
>> Hi,
>>
>> I didn't manage to trigger the same panic by hand, maybe I didn't try hard
>> enough, so I put "ifconfig bnxt0 down && sleep 2 && ifconfig bnxt0 up
>> && sleep 2" in a loop and got the panic below...
> 

Hi,

Sorry for hijacking your diff with a different panic... I thought the
panics were similar.
I've tried everything one more time and now I can answer your questions :)

> Does my diff make things better?

Yes. With this diff I can't panic the box, whereas without it I can.


> Does the diff just replace one panic with another?

No.


> Do you need more effort to trigger crashes now?

I've tried all day to reproduce the panic with your diff, but I can't...


> Does the panic below also happend without my diff?

Yes, the panic below is easy to trigger with or without your diff.




> bluhm
> 
>>
>> With this loop the box panics quite fast, even if it doesn't have any
>> interface configured and is totally idle.
>>
>>
>> bnxt0: HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.
>> bnxt0: failed to set up tx ring
>> uvm_fault(0xfd904e3ac880, 0xff0, 0, 1) -> e
>> kernel: page fault trap, code=0
>> Stopped at      bnxt_queue_down+0x61:   movq    0(%r13,%rax,1),%rsi
>>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>> *186511  37451      0         0x3          0   2K  ifconfig
>> bnxt_queue_down(802ba000,802bb080) at bnxt_queue_down+0x61
>> bnxt_up(802ba000) at bnxt_up+0x3fb
>> bnxt_ioctl(802ba048,80206910,800027d1bc90) at bnxt_ioctl+0x15b
>> ifioctl(fd8e5b1a98e8,80206910,800027d1bc90,800027d22d28) at
>> ifioctl+0x92b
>> soo_ioctl(fd8e5b6e3cb8,80206910,800027d1bc90,800027d22d28)
>> at soo_ioctl+0x161
>> sys_ioctl(800027d22d28,800027d1bda0,800027d1bdf0) at
>> sys_ioctl+0x2c4
>> syscall(800027d1be60) at syscall+0x374
>> Xsyscall() at Xsyscall+0x128
>> end of kernel
>> end trace frame: 0x7f7d3130, count: 7
>> https://www.openbsd.org/ddb.html describes the minimum info required in
>> bug reports.  Insufficient info makes it difficult to find and fix bugs.
>> ddb{2}>
>>
>>
>> ddb{2}> show reg
>> rdi     0x8226e7b8    pci_bus_dma_tag
>> rsi     0x802bb080
>> rbp     0x800027d1ba80
>> rbx     0xff
>> rdx     0xc800
>> rcx     0x202
>> rax     0xff0
>> r8      0x3f
>> r9      0x800027d1b9b8
>> r10     0x8f1758fa1ca4c280
>> r11     0x814e96f0    _bus_dmamap_destroy
>> r12     0x100
>> r13     0
>> r14     0x101
>> r15     0x802ba000
>> rip     0x815385c1    bnxt_queue_down+0x61
>> cs      0x8
>> rflags  0x10216       __ALIGN_SIZE+0xf216
>> rsp     0x800027d1ba20
>> ss      0x10
>> bnxt_queue_down+0x61:   movq    0(%r13,%rax,1),%rsi
>>
>>
>> ddb{2}> ps
>>PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
>> *37451  186511  30450  0  7 0x3ifconfig
>>  30450  100025  92193  0  30x10008b  sigsusp   sh
>>  92193  391163  1  0  30x10008b  sigsusp   ksh
>>  29430  255996  1  0  30x100098  kqreadcron
>>  83509   36913  52718 95  3   0x1100092  kqreadsmtpd
>>  85854  120678  52718103  3   0x1100092  kqreadsmtpd
>>  10057  425077  52718 95  3   0x1100092  kqreadsmtpd
>>   8903  266150  52718 95  30x100092  kqreadsmtpd
>>  10952   25497  52718 95  3   0x1100092  kqreadsmtpd
>>  10277  273205  52718 95  3   0x1100092  kqreadsmtpd
>>  52718  225011  1  0  30x100080  kqreadsmtpd
>>  10965   74402  1  0  30x88  kqreadsshd
>>  92646   92606  1  0  30x100080  kqread

Re: bnxt panic

2022-03-18 Thread Hrvoje Popovski

On 17.3.2022. 21:31, Alexander Bluhm wrote:
> I don't have the device and don't know the code.  But other drivers
> don't process rx and tx interrupts when the interface is not running.
> 
> Maybe this helps.  dlg@ and jmatthew@ should know better than me.
> 
> bluhm

Hi,

I didn't manage to trigger the same panic by hand, maybe I didn't try hard
enough, so I put "ifconfig bnxt0 down && sleep 2 && ifconfig bnxt0 up
&& sleep 2" in a loop and got the panic below...

With this loop the box panics quite fast, even if it doesn't have any
interface configured and is totally idle.

bnxt0: HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.
bnxt0: failed to set up tx ring
uvm_fault(0xfd904e3ac880, 0xff0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      bnxt_queue_down+0x61:   movq    0(%r13,%rax,1),%rsi
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*186511  37451      0         0x3          0   2K  ifconfig
bnxt_queue_down(802ba000,802bb080) at bnxt_queue_down+0x61
bnxt_up(802ba000) at bnxt_up+0x3fb
bnxt_ioctl(802ba048,80206910,800027d1bc90) at bnxt_ioctl+0x15b
ifioctl(fd8e5b1a98e8,80206910,800027d1bc90,800027d22d28) at
ifioctl+0x92b
soo_ioctl(fd8e5b6e3cb8,80206910,800027d1bc90,800027d22d28)
at soo_ioctl+0x161
sys_ioctl(800027d22d28,800027d1bda0,800027d1bdf0) at
sys_ioctl+0x2c4
syscall(800027d1be60) at syscall+0x374
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7d3130, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{2}>

ddb{2}> show reg
rdi     0x8226e7b8    pci_bus_dma_tag
rsi     0x802bb080
rbp     0x800027d1ba80
rbx     0xff
rdx     0xc800
rcx     0x202
rax     0xff0
r8      0x3f
r9      0x800027d1b9b8
r10     0x8f1758fa1ca4c280
r11     0x814e96f0    _bus_dmamap_destroy
r12     0x100
r13     0
r14     0x101
r15     0x802ba000
rip     0x815385c1    bnxt_queue_down+0x61
cs      0x8
rflags  0x10216       __ALIGN_SIZE+0xf216
rsp     0x800027d1ba20
ss      0x10
bnxt_queue_down+0x61:   movq    0(%r13,%rax,1),%rsi

ddb{2}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*37451  186511  30450  0  7 0x3ifconfig
 30450  100025  92193  0  30x10008b  sigsusp   sh
 92193  391163  1  0  30x10008b  sigsusp   ksh
 29430  255996  1  0  30x100098  kqreadcron
 83509   36913  52718 95  3   0x1100092  kqreadsmtpd
 85854  120678  52718103  3   0x1100092  kqreadsmtpd
 10057  425077  52718 95  3   0x1100092  kqreadsmtpd
  8903  266150  52718 95  30x100092  kqreadsmtpd
 10952   25497  52718 95  3   0x1100092  kqreadsmtpd
 10277  273205  52718 95  3   0x1100092  kqreadsmtpd
 52718  225011  1  0  30x100080  kqreadsmtpd
 10965   74402  1  0  30x88  kqreadsshd
 92646   92606  1  0  30x100080  kqreadntpd
 69044   66045  48912 83  30x100092  kqreadntpd
 48912  346342  1 83  3   0x1100092  kqreadntpd
 94844  373900  21297 73  3   0x1100090  kqreadsyslogd
 21297   85879  1  0  30x100082  netio syslogd
 35205  481579  0  0  3 0x14200  bored smr
 12575  275960  0  0  3 0x14200  pgzerozerothread
 24927  447870  0  0  3 0x14200  aiodoned  aiodoned
 80994  339930  0  0  3 0x14200  syncerupdate
 44042  109239  0  0  3 0x14200  cleaner   cleaner
 96179  166822  0  0  3 0x14200  reaperreaper
 15452   36252  0  0  3 0x14200  pgdaemon  pagedaemon
 25413  304120  0  0  3 0x14200  usbtskusbtask
 98205  375271  0  0  3 0x14200  usbatsk   usbatsk
 69469  523327  0  0  3 0x14200  bored sensors
 88463  437912  0  0  3  0x40014200  acpi0 acpi0
 22995  343696  0  0  7  0x40014200idle23
 34936  338429  0  0  7  0x40014200idle22
 55150   49423  0  0  7  0x40014200idle21
 17888  189165  0  0  7  0x40014200idle20
 96685  487854  0  0  7  0x40014200idle19
 13313  506501  0  0  7  0x40014200idle18
 88567  261311  0  0  7  0x40014200idle17
 56026  316512  0  0  7
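bluhm's remark quoted at the top of this message, that other drivers don't process rx and tx interrupts while the interface is not running, can be pictured with the userspace C sketch below. It is hypothetical and is neither bnxt(4) code nor the diff discussed in this thread: the softc, ring and flag names are made up, and the note about waiting for handlers already running on other CPUs (OpenBSD provides intr_barrier(9) for that) is an assumption about how a real driver would close the remaining race.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RUNNING	0x1u	/* hypothetical "interface running" flag */

struct sketch_ring {
	int	*slots;		/* NULL once the ring has been torn down */
	size_t	 nslots;
};

struct sketch_softc {
	unsigned int		flags;
	struct sketch_ring	rx;
	unsigned long		rx_done;	/* completed rx slots seen */
};

/* Interrupt handler: returns 1 if it did work, 0 if it ignored the interrupt. */
static int
sketch_intr(struct sketch_softc *sc)
{
	/* Interface down or being reconfigured: rings may already be gone. */
	if ((sc->flags & SKETCH_RUNNING) == 0)
		return 0;

	assert(sc->rx.slots != NULL);
	for (size_t i = 0; i < sc->rx.nslots; i++)
		if (sc->rx.slots[i] != 0)
			sc->rx_done++;
	return 1;
}

static void
sketch_down(struct sketch_softc *sc)
{
	/* Clear the flag first so that new interrupts are ignored... */
	sc->flags &= ~SKETCH_RUNNING;
	/*
	 * ...a real driver would also wait here for handlers already
	 * running on other CPUs before freeing anything.
	 */
	sc->rx.slots = NULL;
	sc->rx.nslots = 0;
}
```

With the flag cleared before the rings are released, an interrupt arriving during `ifconfig bnxt0 down` returns early instead of dereferencing a freed or NULL ring, which is the kind of access the `bnxt_intr+0x195` page faults above suggest.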

Re: bnxt panic

2022-03-16 Thread Hrvoje Popovski

On 16.3.2022. 20:00, Hrvoje Popovski wrote:
> Hi all,
> 
> While the OpenBSD box is under load, if I run ifconfig bnxt0 down at
> that moment I get a panic... It doesn't happen every time, but it is
> that easy to trigger.
> 
> I'm sending traffic over the ix interfaces; bnxt is used for ssh and
> nothing else.
> 
> I compiled a kernel with "option BNXT_DEBUG" and put debug in
> hostname.bnxt0, but I didn't see any log messages about the bnxt interfaces.
> 
> I will try to trigger the panic a few more times and will post the results here.

This is the same panic, but with a snapshot kernel without debug options:

uvm_fault(0xfd904e3a9440, 0x0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      bnxt_intr+0x195:        movq    0(%r14,%r12,1),%rbx
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*465591  26407      0         0x3          0   0K  ifconfig
bnxt_intr(802bc7c0) at bnxt_intr+0x195
intr_handler(800027d3d7b0,80269880) at intr_handler+0x6e
Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
Xspllower() at Xspllower+0x19
softintr_dispatch(0) at softintr_dispatch+0xdc
Xsoftclock() at Xsoftclock+0x1f
bnxt_ioctl(802bc048,80206910,800027d3dae0) at bnxt_ioctl+0x165
ifioctl(fd8e5a4f13a8,80206910,800027d3dae0,800027d8da50) at
ifioctl+0x92b
soo_ioctl(fd904cc0b2e8,80206910,800027d3dae0,800027d8da50)
at soo_ioctl+0x161
sys_ioctl(800027d8da50,800027d3dbf0,800027d3dc40) at
sys_ioctl+0x2c4
syscall(800027d3dcb0) at syscall+0x374
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7faf30, count: 3
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.

bnxt panic

2022-03-16 Thread Hrvoje Popovski

Hi all,

While the OpenBSD box is under load, if I run ifconfig bnxt0 down at
that moment I get a panic... It doesn't happen every time, but it is
that easy to trigger.

I'm sending traffic over the ix interfaces; bnxt is used for ssh and
nothing else.

I compiled a kernel with "option BNXT_DEBUG" and put debug in
hostname.bnxt0, but I didn't see any log messages about the bnxt interfaces.

I will try to trigger the panic a few more times and will post the results here.


smc24# ifconfig bnxt0 down
uvm_fault(0xfd8e5b9b0aa8, 0xc80, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      bnxt_intr+0x195:        movq    0(%r14,%r12,1),%rbx
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*112449  43646      0         0x3          0   0K  ifconfig
bnxt_intr(802ba7c0) at bnxt_intr+0x195
intr_handler(800027d819d0,80262200) at intr_handler+0x6e
Xintr_ioapic_edge28_untramp() at Xintr_ioapic_edge28_untramp+0x18f
Xspllower() at Xspllower+0x19
softintr_dispatch(0) at softintr_dispatch+0xdc
Xsoftclock() at Xsoftclock+0x1f
bnxt_ioctl(802ba048,80206910,800027d81d00) at bnxt_ioctl+0x165
ifioctl(fd8e587008f0,80206910,800027d81d00,800027d23ce8) at
ifioctl+0x92b
soo_ioctl(fd8e5ba589e8,80206910,800027d81d00,800027d23ce8)
at soo_ioctl+0x161
sys_ioctl(800027d23ce8,800027d81e10,800027d81e60) at
sys_ioctl+0x2c4
syscall(800027d81ed0) at syscall+0x374
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7d2b40, count: 3
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>


ddb{0}> show reg
rdi     0xfd8008d639b0
rsi     0xfd8008d639b0
rbp     0x800027d81970
rbx     0
rdx     0
rcx     0x802ba7c0
rax     0xc8
r8      0
r9      0x400
r10     0xa08d02d9a679a6aa
r11     0xa5908789619e5597
r12     0xc80
r13     0
r14     0
r15     0x1e
rip     0x81891255    bnxt_intr+0x195
cs      0x8
rflags  0x10212       __ALIGN_SIZE+0xf212
rsp     0x800027d818c0
ss      0x10
bnxt_intr+0x195:        movq    0(%r14,%r12,1),%rbx


ddb{0}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*43646  112449   2064  0  7 0x3ifconfig
  3020  522978  71359  0  30x100083  ttyin ksh
 71359   83904  70315   1000  30x10008b  sigsusp   ksh
 70315   76327  33688   1000  30x98  kqreadsshd
 33688  239932  19353  0  30x82  kqreadsshd
  2064   31814  1  0  30x10008b  sigsusp   ksh
 15593  505847  1  0  30x100098  kqreadcron
 53746   29436  19867 95  3   0x1100092  kqreadsmtpd
 70674  201429  19867103  3   0x1100092  kqreadsmtpd
 13668  378996  19867 95  3   0x1100092  kqreadsmtpd
  7348  364416  19867 95  30x100092  kqreadsmtpd
 28469  342601  19867 95  3   0x1100092  kqreadsmtpd
 148556717  19867 95  3   0x1100092  kqreadsmtpd
 19867  370251  1  0  30x100080  kqreadsmtpd
 19353  418536  1  0  30x88  kqreadsshd
 78442  277045  1  0  30x100080  kqreadntpd
 59273  418917  71217 83  30x100092  kqreadntpd
 71217  402451  1 83  3   0x1100092  kqreadntpd
  2942   84526  89760 73  3   0x1100090  kqreadsyslogd
 89760   90249  1  0  30x100082  netio syslogd
 92039   93123  0  0  3 0x14200  bored smr
 36298  398102  0  0  3 0x14200  pgzerozerothread
 71408   18325  0  0  3 0x14200  aiodoned  aiodoned
 96775  209072  0  0  3 0x14200  syncerupdate
 99374   29019  0  0  3 0x14200  cleaner   cleaner
 93530  452277  0  0  3 0x14200  reaperreaper
 92427  520070  0  0  3 0x14200  pgdaemon  pagedaemon
 11949  151046  0  0  3 0x14200  usbtskusbtask
 14625  109693  0  0  3 0x14200  usbatsk   usbatsk
   884  415337  0  0  3 0x14200  bored sensors
 27132  374599  0  0  3  0x40014200  acpi0 acpi0
 44604  508777  0  0  7  0x40014200idle23
 50524  223281  0  0  7  0x40014200idle22
   382  384761  0  0  7  0x40014200idle21
  3240  505687  0  0  7  0x40014200idle20
 45727  351159  0  0  7  0x40014200idle19
  3827   15512  0  0  7  0x40014200

Re: ix - failed to allocate interrupt slot for PIC msix pin -2143223537

2021-08-30 Thread Hrvoje Popovski

On 21.8.2021. 21:50, Hrvoje Popovski wrote:
> Hi all,
> 
> on a new AMD 24-core server I've put in two dual-port ix cards, and
> while booting, the ix3 interface throws this error:
> 
> ix3 at pci21 dev 0 function 1 "Intel 82599" rev 0x01failed to allocate
> interrupt slot for PIC msix pin -2143223537
> ixgbe_allocate_msix: pci_intr_establish vec 15 failed
> 
> and I can't see ix3 with ifconfig.
> I thought that with option IX_DEBUG I would see something more, but
> there's nothing...


Hi all,

Now, after a few reboots, the ix problem is gone, but I see similar errors on
xhci and ahci...


smc24# dmesg | grep failed
xhci3 at pci23 dev 0 function 3 vendor "AMD", unknown product 0x148c rev
0x00failed to allocate interrupt slot for PIC msi pin -2143091968
ahci2 at pci24 dev 0 function 0 "AMD FCH AHCI" rev 0x51: msi,failed to
allocate interrupt slot for PIC msi pin -2143027200
ahci3 at pci25 dev 0 function 0 "AMD FCH AHCI" rev 0x51: msi,failed to
allocate interrupt slot for PIC msi pin -2142961664

smc24# vmstat -iz
interrupt   total rate
irq0/clock1859139 4754
irq0/ipi   320887  820
irq96/amdgpio0  00
irq97/acpi0 00
irq98/ppb0  00
irq99/xhci0 00
irq100/ppb1 00
irq101/xhci100
irq102/ppb2 00
irq103/ppb4 00
irq104/ahci000
irq105/ppb5 00
irq114/bnxt010
irq115/bnxt0:0   15243
irq116/bnxt0:1  00
irq117/bnxt0:2  20
irq118/bnxt0:3 170
irq119/bnxt0:4  00
irq120/bnxt0:5 360
irq121/bnxt0:6 230
irq122/bnxt0:77031
irq123/bnxt100
irq124/bnxt1:0  00
irq125/bnxt1:1  00
irq126/bnxt1:2  00
irq127/bnxt1:3  00
irq128/bnxt1:4  00
irq129/bnxt1:5  00
irq130/bnxt1:6  00
irq131/bnxt1:7  00
irq106/ppb8 00
irq132/ix0:0   280
irq133/ix0:100
irq134/ix0:200
irq135/ix0:300
irq136/ix0:400
irq137/ix0:500
irq138/ix0:600
irq139/ix0:700
irq140/ix0:800
irq141/ix0:900
irq142/ix0:10   00
irq143/ix0:11   00
irq144/ix0:12   00
irq145/ix0:13   00
irq146/ix0:14   00
irq147/ix0:15   00
irq148/ix0  10
irq149/ix1:0   280
irq150/ix1:100
irq151/ix1:200
irq152/ix1:300
irq153/ix1:400
irq154/ix1:500
irq155/ix1:600
irq156/ix1:700
irq157/ix1:800
irq158/ix1:900
irq159/ix1:10   00
irq160/ix1:11   00
irq161/ix1:12   00
irq162/ix1:13   00
irq163/ix1:14   00
irq164/ix1:15   00
irq165/ix1  10
irq107/ppb9 00
irq108/ppb1000
irq109/ahci128532   72
irq110/ppb1400
irq166/mcx0   1250
irq167/mcx0:0  280
irq168/mcx0:1   00
irq169/mcx0:2   00
irq170/mcx0:3   00
irq171/mcx0:4   00
irq172/mcx0:5   00
irq173/mcx0:6   00
irq174/mcx0:7   00
irq175/mcx0:8   00
irq176/mcx0:9   00
irq177/mcx0:10  00
irq178/mcx0:11

Re: failed to allocate interrupt slot for PIC msix

2021-08-21 Thread Hrvoje Popovski

On 27.5.2021. 17:04, w...@null0.nl wrote:
> ixl13 at pci15 dev 0 function 3 "Intel X710 SFP+" rev 0x02: port 1, FW 
> 8.3.64775 API 1.13, msix, 8 queues, address 40:a6:b7:51:bf:13
> failed to allocate interrupt slot for PIC msix pin -2136014080
> ixl13: unable to establish interrupt handler
> ppb9 at pci14 dev 2 function 0 "Intel Xeon Scalable PCIE" rev 0x07: msi
> pci16 at ppb9 bus 176
> ixl14 at pci16 dev 0 function 0 "Intel X710 SFP+" rev 0x02: port 3, FW 
> 8.3.64775 API 1.13, msix, 8 queues, address 40:a6:b7:51:b7:10
> failed to allocate interrupt slot for PIC msix pin -2135949312
> ixl14: unable to establish interrupt handler
> ixl15 at pci16 dev 0 function 1 "Intel X710 SFP+" rev 0x02: port 2, FW 
> 8.3.64775 API 1.13, msix, 8 queues, address 40:a6:b7:51:b7:11
> failed to allocate interrupt slot for PIC msix pin -2135949056
> ixl15: unable to establish interrupt handler
> ixl16 at pci16 dev 0 function 2 "Intel X710 SFP+" rev 0x02: port 0, FW 
> 8.3.64775 API 1.13, msix, 8 queues, address 40:a6:b7:51:b7:12
> failed to allocate interrupt slot for PIC msix pin -2135948800
> ixl16: unable to establish interrupt handler
> ixl17 at pci16 dev 0 function 3 "Intel X710 SFP+" rev 0x02: port 1, FW 
> 8.3.64775 API 1.13, msix, 8 queues, address 40:a6:b7:51:b7:13
> failed to allocate interrupt slot for PIC msix pin -2135948544
> ixl17: unable to establish interrupt handler

Hi,

I think I have the same problem, but with ix interfaces:

ix3 at pci21 dev 0 function 1 "Intel 82599" rev 0x01failed to allocate
interrupt slot for PIC msix pin -2143223537
ixgbe_allocate_msix: pci_intr_establish vec 15 failed

Re: ipsec - panic: malloc: out of space in kmem_map

2021-07-19 Thread Hrvoje Popovski

On 19.7.2021. 0:08, Hrvoje Popovski wrote:
> On the other hand, isakmpd is stable at 150Kpps even when sending 12Mpps
> through the tunnel...

And of course I forgot to stop the generator, and isakmpd is still
forwarding traffic...

:)

Re: ipsec - panic: malloc: out of space in kmem_map

2021-07-18 Thread Hrvoje Popovski

On 18.7.2021. 20:39, Hrvoje Popovski wrote:
> On 18.7.2021. 20:11, Alexander Bluhm wrote:
>> On Sat, Jul 17, 2021 at 06:32:59PM +0200, Hrvoje Popovski wrote:
>>> with this diff I'm getting very stable traffic over the tunnel, and it's
>>> a little faster.
>>
>> This is expected.  Too much queueing creates oscillating behavior
>> and suboptimal throughput.
>>
>>> Even with your last diff on tech@
>>> https://marc.info/?l=openbsd-tech=162645141414262=2
>>> I'm still seeing traffic drops; they're less frequent, but I'm seeing them...
>>
>> There is another reason for traffic drops.  iked(8) is not clever
>> when rekeying.  The idea is to have SAs with old key and new key
>> simultaneously.  After both machines have new SA, the old should
>> be removed.  But currently we have a window when sender uses new
>> SA, but receiver only has old SA and cannot decrypt the packets.
>> This is a temporary problem, I see drops for a short time.  tobhe@
>> wants to fix this.
>>
>> I think you use isakmpd(8), I don't know how rekeying works there.
> 
> Yes, I'm using isakmpd, but I can test both iked and isakmpd, no problem...
> 
> 
>>> Do you want me to test this diff combined with your ipsec diff
>>> on tech@ ?
>>
>> I have committed the replay diff.  This fixes permanent packet drop.
>> Do you see permanent traffic stalls with current?
> 
> With isakmpd, yes; I haven't tested iked yet, but I will now.
> But with your diff from bugs@ everything seems smooth and stable, without
> drops or panics, even with isakmpd :)

I've set up isakmpd and iked tunnels and tested both daemons with and
without the replay diff from tech@, and they behave very similarly.

I have two identical boxes with 6 x E5-2643 v2 @ 3.50GHz, 3600.44 MHz, 06-3e-04,
and I'm sending traffic (1000-byte UDP) through the tunnel between them.

ipsec.conf
ike active esp from 192.168.232.0/24 to 192.168.123.0/24 \
local 192.168.42.1 peer 192.168.42.111 \
main auth hmac-sha1 enc aes group modp1024 \
quick enc aes-128-gcm group modp1024 \
psk "123"

iked.conf
ikev2 active esp from 192.168.232.0/24 to 192.168.123.0/24 \
local 192.168.42.1 peer 192.168.42.111 \
ikesa enc aes-128-gcm group modp1024 prf hmac-sha1 \
childsa enc aes-128-gcm group modp1024 \
psk "123"

In this config, with these boxes, if I send up to 150Kpps through the iked or
ipsec tunnel, the tunnel is stable with or without the diff. If I send 200Kpps
through the ipsec tunnel, traffic drops and won't come back; the iked
tunnel does come back, but after a few minutes it stops forwarding
traffic completely.
If I send 250Kpps or more, the ipsec or iked tunnel stops forwarding
traffic after a few seconds, with or without the replay diff...

So I think the replay diff only delays the permanent traffic stalls.

Interestingly, when traffic completely stops I need to run ifconfig
ix1 down && ifconfig ix1 up (ix1 is the interface where my generator is
connected) to make traffic flow through the tunnel again. Stopping the
generator and running it again didn't help; only a down/up of ix1 did.
When traffic stops, the mcl2k2 Fail counter increases...

r620-1# vmstat -m | egrep "Fail|mcl"
NameSize Requests FailInUse Pgreq Pgrel Npage Hiwat Minpg
Maxpg Idle
mcl2k   2048 1225 39570 3 0 3 3 0
  80
mcl2k2  2112 285844285 198  183 58658 1 58657 58657 0
  80
mcl4k   4096  112 13270 1 0 1 1 0
  80
mcl8k   8192  406  3020 1 0 1 1 0
  80

With the queuing diff that you sent here on bugs@, iked behaves the same as
before; I just need to send about 600Kpps or more through the tunnel, and I'm
not seeing any mcl Fails when traffic stops.
On the other hand, isakmpd is stable at 150Kpps even when sending 12Mpps
through the tunnel...

I'm sorry if this mail is not clear enough; I'm testing, repeating
tests, and writing here what I'm seeing.

Re: ipsec - panic: malloc: out of space in kmem_map

2021-07-18 Thread Hrvoje Popovski

On 18.7.2021. 20:11, Alexander Bluhm wrote:
> On Sat, Jul 17, 2021 at 06:32:59PM +0200, Hrvoje Popovski wrote:
>> with this diff I'm getting very stable traffic over the tunnel, and it's
>> a little faster.
> 
> This is expected.  Too much queueing creates oscillating behavior
> and suboptimal throughput.
> 
>> Even with your last diff on tech@
>> https://marc.info/?l=openbsd-tech=162645141414262=2
>> I'm still seeing traffic drops; they're less frequent, but I'm seeing them...
> 
> There is another reason for traffic drops.  iked(8) is not clever
> when rekeying.  The idea is to have SAs with old key and new key
> simultaneously.  After both machines have new SA, the old should
> be removed.  But currently we have a window when sender uses new
> SA, but receiver only has old SA and cannot decrypt the packets.
> This is a temporary problem, I see drops for a short time.  tobhe@
> wants to fix this.
> 
> I think you use isakmpd(8), I don't know how rekeying works there.

Yes, I'm using isakmpd, but I can test both iked and isakmpd, no problem...


>> Do you want me to test this diff combined with your ipsec diff
>> on tech@ ?
> 
> I have committed the replay diff.  This fixes permanent packet drop.
> Do you see permanent traffic stalls with current?

With isakmpd, yes; I haven't tested iked yet, but I will now.
But with your diff from bugs@ everything seems smooth and stable, without
drops or panics, even with isakmpd :)

> Temporary drops are still possible.  The rekey problem is known.
> The crypto queuing problem is known.  You could disable iked lifetime
> bytes rekeying and try my no crypto queue diff.
> Do you see traffic drops with that?
> 
>> And this diff with parallel forwarding?
> 
> Parallel forwarding still crashes with IPsec.  We must commit fixes
> step by step until we get it stable.  Of course you can try it, but
> currently I can reproduce problems myself.

OK, great, now I will concentrate on testing iked and isakmpd.
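The rekeying window bluhm describes in the quoted text above (the sender already uses the new SA while the receiver still only has the old one, so it cannot decrypt those packets) can be illustrated with a small C simulation. It is a hypothetical model, not iked(8), isakmpd(8) or kernel code: SAs are reduced to bare SPI numbers, time is counted in packets, and the removal of the old SA once both sides have switched is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_SA	2	/* old and new SA side by side */

/* Receiver's set of installed inbound SAs, identified by SPI only. */
struct sketch_rx {
	uint32_t	spi[SKETCH_MAX_SA];
	int		n;
};

static void
rx_add(struct sketch_rx *rx, uint32_t spi)
{
	assert(rx->n < SKETCH_MAX_SA);
	rx->spi[rx->n++] = spi;
}

static bool
rx_has(const struct sketch_rx *rx, uint32_t spi)
{
	for (int i = 0; i < rx->n; i++)
		if (rx->spi[i] == spi)
			return true;
	return false;
}

/*
 * Send npkts packets. The sender switches from old_spi to new_spi at
 * packet switch_at; the receiver installs new_spi at packet install_at.
 * Returns how many packets the receiver cannot decrypt.
 */
static int
sketch_rekey_drops(uint32_t old_spi, uint32_t new_spi, int npkts,
    int switch_at, int install_at)
{
	struct sketch_rx rx = { { 0 }, 0 };
	int dropped = 0;

	rx_add(&rx, old_spi);
	for (int i = 0; i < npkts; i++) {
		if (i == install_at)
			rx_add(&rx, new_spi);
		uint32_t spi = (i < switch_at) ? old_spi : new_spi;
		if (!rx_has(&rx, spi))
			dropped++;
	}
	return dropped;
}
```

If the receiver installs the new inbound SA at or before the packet where the sender switches, nothing is lost; if it installs it later, every packet in between is dropped. For example, switching at packet 4 and installing at packet 7 loses packets 4, 5 and 6, which corresponds to the short, temporary drops bluhm mentions.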

Re: ipsec - panic: malloc: out of space in kmem_map

2021-07-17 Thread Hrvoje Popovski

On 17.7.2021. 1:02, Alexander Bluhm wrote:
> On Fri, Jul 16, 2021 at 10:57:24PM +0200, Alexander Bluhm wrote:
>> All I said is more or less theory, I did not test it yet.
> I should not send untested diffs.  This new version does not crash
> immediately.  I removed a netlock that is already taken due to not
> queuing.  This also fixes the tdb->tdb_odrops++ spotted by mvs@.
> 
> Note that avoiding queues is the fastest way to do IPsec.
> http://bluhm.genua.de/perform/results/2021-07-15T23:54:11Z/perform.html
> 
> This diff is the middle column.
> http://bluhm.genua.de/perform/results/2021-07-15T23:54:11Z/gnuplot/ipsec.html
> 
> bluhm

Hi,

with this diff I'm getting very stable traffic over the tunnel, and it's a
little faster.
Even with your last diff on tech@
https://marc.info/?l=openbsd-tech=162645141414262=2
I'm still seeing traffic drops; they're less frequent, but I'm seeing them...

Do you want me to test this diff combined with your ipsec diff
on tech@?
And this diff with parallel forwarding?

Thanks for the stable ipsec :)

Re: ipsec - panic: malloc: out of space in kmem_map

2021-07-16 Thread Hrvoje Popovski

On 16.7.2021. 20:02, Hrvoje Popovski wrote:
> Hi all,
> 
> with source fetched a few minutes ago I wanted to test bluhm@'s diff
> https://marc.info/?l=openbsd-tech=162645141414262=2
> I've found out that I'm getting a panic with or without bluhm@'s diff. The
> panic is in the attachment.
> 
> I'm sending traffic over an ipsec tunnel from the r620-1 box to the x3550m4 box. If
> at some point while sending traffic over the tunnel I hit "enter" on the serial
> console of x3550m4, that box panics with this log:


With WITNESS:


x3550m4# panic: malloc: out of space in kmem_map
Stopped at      db_enter+0x10:  popq    %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*289105  63905      0        0x13          0   1K  ksh
db_enter() at db_enter+0x10
panic(81ea2c88) at panic+0xbf
malloc(1,7f,1) at malloc+0x7b4
ufs_readdir(800026e84c10) at ufs_readdir+0xf8
VOP_READDIR(fd887f03d460,800026e84c78,fd887f7d7c60,800026e84cbc)
at VOP_READDIR+0x50
sys_getdents(800026f2b508,800026e84d30,800026e84d90) at
sys_getdents+0x161
syscall(800026e84e00) at syscall+0x3a9
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7d7000, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{1}>


ddb{1}> show locks
exclusive rrwlock inode r = 0 (0xfd887ec1dc48)
#0  witness_lock+0x333
#1  rw_enter+0x27f
#2  rrw_enter+0x56
#3  VOP_LOCK+0x5b
#4  vn_lock+0xad
#5  sys_getdents+0xfe
#6  syscall+0x3a9
#7  Xsyscall+0x128
exclusive kernel_lock _lock r = 0 (0x82306b40)
#0  witness_lock+0x333
#1  syscall+0x29e
#2  Xsyscall+0x128



ddb{1}> show reg
rdi     0
rsi     0x14
rbp     0x800026e848d0
rbx     0x1__ALIGN_SIZE+0xf000
rdx     0xfe03
rcx     0x282
rax     0x28
r8      0x101010101010101
r9      0
r10     0
r11     0xe254fbd981f43f47
r12     0x8000219baa00
r13     0x1__ALIGN_SIZE+0xf000
r14     0
r15     0x81ea2c88    apollo_udma100_tim+0xc745
rip     0x8139d200    db_enter+0x10
cs      0x8
rflags  0x286
rsp     0x800026e848d0
ss      0x10
db_enter+0x10:  popq    %rbp


ddb{1}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT      COMMAND
 38747  389002  68287     68  3        0x90  select    isakmpd
 68287  379577      1      0  3        0x80  netio     isakmpd
*63905  289105      1      0  7        0x13            ksh
 51457  163250      1      0  3    0x100098  poll      cron
 77909  402925  35691     95  3    0x100092  kqread    smtpd
 78088   37152  35691    103  3    0x100092  kqread    smtpd
 62475  202424  35691     95  3    0x100092  kqread    smtpd
 33104  393361  35691     95  3    0x100092  kqread    smtpd
 99817  462713  35691     95  3    0x100092  kqread    smtpd
 49139  379482  35691     95  3    0x100092  kqread    smtpd
 35691  127563      1      0  3    0x100080  kqread    smtpd
 60575  345518      1      0  3        0x88  select    sshd
 42089  129781      1      0  3    0x100080  poll      ntpd
 13806  436163  95467     83  3    0x100092  poll      ntpd
 95467  117285      1     83  3    0x100092  poll      ntpd
 68954  146234  99233     74  3    0x100092  bpf       pflogd
 99233   26243      1      0  3        0x80  netio     pflogd
 46731   44450  83208     73  3    0x100090  kqread    syslogd
 83208  298754      1      0  3    0x100082  netio     syslogd
 24108  326233      1      0  3    0x100080  kqread    resolvd
 86229  481448      0      0  3     0x14200  bored     smr
 90881   31129      0      0  3     0x14200  pgzero    zerothread
  3369  196147      0      0  3     0x14200  aiodoned  aiodoned
 91637   24402      0      0  3     0x14200  syncer    update
 10524  225039      0      0  3     0x14200  cleaner   cleaner
 98418  474258      0      0  3     0x14200  reaper    reaper
  4887  253001      0      0  3     0x14200  pgdaemon  pagedaemon
 72437  269608      0      0  3     0x14200  bored     crynlk
 21023  435965      0      0  3     0x14200  bored     crypto
  3751  454766      0      0  3     0x14200  usbtsk    usbtask
 10715  377531      0      0  3     0x14200  usbatsk   usbatsk
 60934   18965      0      0  3  0x40014200  acpi0     acpi0
 43187  352600      0      0  7  0x40014200            idle11
 21339   40638      0      0  7  0x40014200            idle10
 97154  350372      0      0  7  0x40014200            idle9
 10084  1122

ipsec - panic: malloc: out of space in kmem_map

2021-07-16 Thread Hrvoje Popovski

Hi all,

With source fetched a few minutes ago I wanted to test bluhm@'s diff
https://marc.info/?l=openbsd-tech=162645141414262=2
I've found that I get a panic with or without bluhm@'s diff. The panic
is in the attachment.

I'm sending traffic over an ipsec tunnel from the r620-1 box to the
x3550m4 box. If at some point while sending traffic over the tunnel I
hit "enter" on the serial console of x3550m4, that box panics with this log:

x3550m4# panic: malloc: out of space in kmem_map
Stopped at  db_enter+0x10:  popq    %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*153853  51636      0        0x13          0    3  ksh
 194352  78666      0     0x14000      0x200    4  crynlk
  43942  98148      0     0x14000      0x200    1  softnet
db_enter() at db_enter+0x10
panic(81ea2c7a) at panic+0xbf
malloc(1,7f,1) at malloc+0x7b4
ufs_readdir(800026ff6a50) at ufs_readdir+0xf8
VOP_READDIR(fd887f03e380,800026ff6ab8,fd887f7d7600,800026ff6afc)
at VOP_READDIR+0x50
sys_getdents(800026ed57a8,800026ff6b70,800026ff6bd0) at
sys_getdents+0x161
syscall(800026ff6c40) at syscall+0x3a9
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7cc780, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{3}>



ddb{3}> show reg
rdi     0
rsi     0x14
rbp     0x800026ff6710
rbx     0x1  __ALIGN_SIZE+0xf000
rdx     0xfe03
rcx     0x206
rax     0x28
r8      0x101010101010101
r9      0
r10     0
r11     0x49e3f089d74cd5c1
r12     0x8000219dca00
r13     0x1  __ALIGN_SIZE+0xf000
r14     0
r15     0x81ea2c7a  apollo_udma100_tim+0xd4a7
rip     0x8117d3a0  db_enter+0x10
cs      0x8
rflags  0x202
rsp     0x800026ff6710
ss      0x10
db_enter+0x10:  popq    %rbp



ddb{3}> show malloc
           Type  InUse  MemUse  HighUse   Limit  Requests  Type Lim
         devbuf  98112  50978K   50979K  78643K     99556         0
            pcb     13      8K       8K  78643K        13         0
         rtable    155      5K       5K  78643K       257         0
         ifaddr    177     22K      22K  78643K       185         0
       counters    322    212K     212K  78643K       322         0
       ioctlops      0      0K       4K  78643K      1567         0
            iov      0      0K       0K  78643K        19         0
          mount      9      9K       9K  78643K         9         0
            log      0      0K      13K  78643K     49397         0
         vnodes   1194     75K      75K  78643K      1204         0
      UFS quota      1     32K      32K  78643K         1         0
      UFS mount     37     87K      87K  78643K        37         0
            shm      2      1K       1K  78643K         2         0
         VM map      2      1K       1K  78643K         2         0
            sem      2      0K       0K  78643K         2         0
        dirhash     27      5K       5K  78643K        54         0
           ACPI   3991    479K     519K  78643K     15335         0
      file desc      3      1K       1K  78643K         4         0
           proc     69     87K     100K  78643K       669         0
    NFS srvsock      1      0K       0K  78643K         1         0
     NFS daemon      1     16K      16K  78643K         1         0
       in_multi     28      1K       1K  78643K        28         0
    ether_multi      7      0K       0K  78643K         7         0
    ISOFS mount      1     32K      32K  78643K         1         0
  MSDOSFS mount      1     16K      16K  78643K         1         0
           ttys     37    175K     175K  78643K        37         0

switch splassert

2021-02-15 Thread Hrvoje Popovski

Hi all,

I'm not much of a switch user; I just tried to play with it. With the most
recent snapshot I'm getting this splassert:

r620-1# ifconfig switch0 create && ifconfig switch0 add ix0 && ifconfig switch0 destroy
splassert: ifpromisc: want 2 have 0
Starting stack trace...
ifpromisc(80082048,0) at ifpromisc+0x5b
switch_port_detach(80082048) at switch_port_detach+0x4f
switch_clone_destroy(8169b000) at switch_clone_destroy+0x10d
if_clone_destroy(80002490b310) at if_clone_destroy+0xd8
ifioctl(fd839bc04dc8,80206979,80002490b310,8000248ed260) at
ifioctl+0x1d2
soo_ioctl(fd84209fbac8,80206979,80002490b310,8000248ed260)
at soo_ioctl+0x171
sys_ioctl(8000248ed260,80002490b420,80002490b480) at
sys_ioctl+0x2d4
syscall(80002490b4f0) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7d9c80, count: 248
End of stack trace.



dmesg
OpenBSD 6.9-beta (GENERIC.MP) #335: Sun Feb 14 21:12:08 MST 2021
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 17115840512 (16322MB)
avail mem = 16581775360 (15813MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xcf42c000 (99 entries)
bios0: vendor Dell Inc. version "2.9.0" date 12/06/2019
bios0: Dell Inc. PowerEdge R620
acpi0 at bios0: ACPI 3.0
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP APIC SPCR HPET DMAR MCFG WDAT SLIC ERST HEST
BERT EINJ TCPA PC__ SRAT SSDT
acpi0: wakeup devices PCI0(S5)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 4 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.50 MHz, 06-3e-04
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 2, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 100MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE
cpu1 at mainbus0: apid 6 (application processor)
cpu1: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.00 MHz, 06-3e-04
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 3, package 0
cpu2 at mainbus0: apid 8 (application processor)
cpu2: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.00 MHz, 06-3e-04
cpu2:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 4, package 0
cpu3 at mainbus0: apid 16 (application processor)
cpu3: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.00 MHz, 06-3e-04
cpu3:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 0, core 8, package 0
cpu4 at mainbus0: apid 18 (application processor)
cpu4: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.00 MHz, 06-3e-04
cpu4:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu4: 256KB 64b/line 8-way L2 cache
cpu4: smt 0, core 9, package 0
cpu5 at mainbus0: apid 20 (application processor)
cpu5: Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz, 3600.00 MHz, 06-3e-04
cpu5:

Re: splassert w/ add/del vlan on bridge

2020-04-11 Thread Hrvoje Popovski

On 11.4.2020. 5:26, Visa Hankala wrote:
> On Fri, Apr 10, 2020 at 11:39:59PM +0200, Hrvoje Popovski wrote:
>> hostname.tpmr20
>> trunkport vxlan20
>> trunkport vlan20
>> up
>>
>>
>> x3550m4# ifconfig tpmr20 destroy
>>
>> splassert: vlan_ioctl: want 2 have 0
>> Starting stack trace...
>> vlan_ioctl(8129d800,80206910,800021d048a8) at vlan_ioctl+0x65
>> ifpromisc(8129d800,0) at ifpromisc+0xbb
>> tpmr_p_dtor(80b0e800,81288100,5ea751037d06af69) at
>> tpmr_p_dtor+0xa0
>> tpmr_clone_destroy(80b0e800) at tpmr_clone_destroy+0xba
>> ifioctl(fd8784ae41c8,80206979,800021d04ab0,800021c0cd90) at
>> ifioctl+0x1c2
>> soo_ioctl(fd877da53e10,80206979,800021d04ab0,800021c0cd90)
>> at soo_ioctl+0x171
>> sys_ioctl(800021c0cd90,800021d04bc0,800021d04c20) at
>> sys_ioctl+0x2df
>> syscall(800021d04c90) at syscall+0x389
>> Xsyscall() at Xsyscall+0x128
>> end of kernel
>> end trace frame: 0x7f7c1250, count: 248
>> End of stack trace.
> The diff below should fix that.


Hi,

With the bridge and tpmr diffs I can't reproduce the splassert.

tnx ..

Re: splassert w/ add/del vlan on bridge

2020-04-10 Thread Hrvoje Popovski

On 10.4.2020. 21:30, Theo de Raadt wrote:
> Why did it take almost a year to find this?
> 
> Or is this bug due to ioctl(2) becoming UNLOCKED on 2020/02/22?

Hi guys,

I think this splassert is not related only to bridge ..

hostname.bridge1242
add vxlan1242
add vlan1242
up

x3550m4# ifconfig bridge1242 destroy

splassert: vlan_ioctl: want 2 have 0
Starting stack trace...
vlan_ioctl(80b21000,80206910,800021d04818) at vlan_ioctl+0x65
ifpromisc(80b21000,0) at ifpromisc+0xbb
bridge_ifremove(80b23e00) at bridge_ifremove+0xa4
bridge_clone_destroy(80b1c000) at bridge_clone_destroy+0xa5
ifioctl(fd8784ae41c8,80206979,800021d04a20,800021bad010) at
ifioctl+0x1c2
soo_ioctl(fd8784bb0f18,80206979,800021d04a20,800021bad010)
at soo_ioctl+0x171
sys_ioctl(800021bad010,800021d04b30,800021d04b90) at
sys_ioctl+0x2df
syscall(800021d04c00) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7d7900, count: 248
End of stack trace.

hostname.tpmr20
trunkport vxlan20
trunkport vlan20
up

x3550m4# ifconfig tpmr20 destroy

splassert: vlan_ioctl: want 2 have 0
Starting stack trace...
vlan_ioctl(8129d800,80206910,800021d048a8) at vlan_ioctl+0x65
ifpromisc(8129d800,0) at ifpromisc+0xbb
tpmr_p_dtor(80b0e800,81288100,5ea751037d06af69) at
tpmr_p_dtor+0xa0
tpmr_clone_destroy(80b0e800) at tpmr_clone_destroy+0xba
ifioctl(fd8784ae41c8,80206979,800021d04ab0,800021c0cd90) at
ifioctl+0x1c2
soo_ioctl(fd877da53e10,80206979,800021d04ab0,800021c0cd90)
at soo_ioctl+0x171
sys_ioctl(800021c0cd90,800021d04bc0,800021d04c20) at
sys_ioctl+0x2df
syscall(800021d04c90) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7c1250, count: 248
End of stack trace.

Re: Kernel crash in OpenBSD 6.5

2019-07-30 Thread Hrvoje Popovski

On 30.7.2019. 13:34, illya.me...@wiesan.de wrote:
> Am 30.07.19 um 13:17 schrieb Hrvoje Popovski:
>> 2) - download install.iso, burn it on cd or usb disk
>> 3) - boot from cd or usb
> 
> That's not so easy, I don't have a monitor on this machine.
> I'll try the „manual update process“ via ssh, and if I crash the machine,
> it's crashed and I have to walk over there ;-)
> 
> [10 minutes later]
> 
> The machine is up in -current (at least, I hope) and "ifconfig -A"
> doesn't crash it.
> 
> So, how can I update our other "normal" 6.5 machines? Is it possible to
> provide a patch for the problem on the errata-page?
> 
> Thank you very much and kind regards
> Illya


Sorry, I forgot to put bugs@ in the mail ..

Try to update both boxes to the latest snapshot, if only because
snapshots include an excellent tool called sysupgrade ... you will love it :)

With this tool you can upgrade the OS to the latest snapshot over ssh
without any problem :)

Re: ifconfig bridge crashes host

2019-07-23 Thread Hrvoje Popovski

On 23.7.2019. 17:03, obs...@high5.nl wrote:
>> Synopsis:ifconfig bridge crashes host
>> Category:
>> Environment:
>   System  : OpenBSD 6.5
>   Details : OpenBSD 6.5 (GENERIC.MP) #1: Mon May 27 18:27:59 CEST 2019
>
> r...@syspatch-65-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
>> Description:
>   After running the command "ifconfig bridge" twice on the host, the host
>   became unresponsive. I was able to capture the trace from the console.
>> How-To-Repeat:
>   The host was running for some time so I am uncertain if it's related to time,
>   but I have seen this happening a couple of times now, and it seems running the
>   "ifconfig bridge" command multiple times triggers this.

Hi,

Can you update your box to the latest snapshot?
There were some problems with the "ifconfig bridge" command a few months ago ..

Re: Fujitsu RX 2530 M4 - Segmentation fault - ttyflags.core

2019-06-04 Thread Hrvoje Popovski

On 4.6.2019. 17:47, Otto Moerbeek wrote:
> Any idea how you ended up with this malformed file?

No, I simply don't know how that happened ...

Re: Fujitsu RX 2530 M4 - Segmentation fault - ttyflags.core

2019-06-04 Thread Hrvoje Popovski

On 4.6.2019. 16:54, Otto Moerbeek wrote:
> On Tue, Jun 04, 2019 at 04:50:28PM +0200, Hrvoje Popovski wrote:
> 
>> On 4.6.2019. 16:38, Otto Moerbeek wrote:
>>> On Tue, Jun 04, 2019 at 03:41:02PM +0200, Hrvoje Popovski wrote:
>>>
>>>> Hi all,
>>>>
>>>> After upgrading a Fujitsu RX 2530 M4 from 6.4 to 6.5, I saw
>>>> "Segmentation fault (core dumped)" in dmesg instead of "setting tty flags".
>>>> The installation and upgrade were standard, but they were done over
>>>> Fujitsu AVR, their remote console similar to Dell iDRAC.
>>>> The machine is working normally; I just wanted to report this.
>>>>
>>>> core dump:
>>>> http://kosjenka.srce.hr/~hrvoje/openbsd/ttyflags.core
>>>>
>>>> sendbug is in attachment
>>>>
>>>> dmesg -s
>>>> Automatic boot in progress: starting file system checks.
>>>> /dev/sd0a (208b7bf77955eee9.a): file system is clean; not checking
>>>> /dev/sd0k (208b7bf77955eee9.k): file system is clean; not checking
>>>> /dev/sd0d (208b7bf77955eee9.d): file system is clean; not checking
>>>> /dev/sd0f (208b7bf77955eee9.f): file system is clean; not checking
>>>> /dev/sd0g (208b7bf77955eee9.g): file system is clean; not checking
>>>> /dev/sd0h (208b7bf77955eee9.h): file system is clean; not checking
>>>> /dev/sd0j (208b7bf77955eee9.j): file system is clean; not checking
>>>> /dev/sd0i (208b7bf77955eee9.i): file system is clean; not checking
>>>> /dev/sd0e (208b7bf77955eee9.e): file system is clean; not checking
>>>> Segmentation fault (core dumped)
>>>> starting network
>>>> reordering libraries: done.
>>>> starting early daemons: syslogd ntpd.
>>>> starting RPC daemons:.
>>>> savecore: no core dump
>>>> checking quotas: done.
>>>> clearing /tmp
>>>> kern.securelevel: 0 -> 1
>>>> creating runtime link editor directory cache.
>>>> preserving editor files.
>>>> starting network daemons: sshd snmpd bgpd smtpd.
>>>> starting package daemons: zabbix_agentd.
>>>> starting local daemons: cron.
>>>> Tue Jun  4 06:44:04 CEST 2019
>>>
>>> Does this also happen when you run ttyflags -a by hand?
>>>
>>
>> Yes, it happens:
>>
>> rs2# ttyflags -a
>> Segmentation fault (core dumped)
>>
>> http://kosjenka.srce.hr/~hrvoje/openbsd/ttyflags2.core
>>
>>
>>> Do you have a malformed entry in /etc/ttys ?
>>
>> I don't think I have. I haven't touched /etc/ttys.
>>
>>
>>> Please share your /etc/ttys file.
>>
>> http://kosjenka.srce.hr/~hrvoje/openbsd/ttys
>>
> 
> there's a stray t at the end.

Ohh .. for god's sake .. I'm sorry for taking your time .. thank you ..
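Otto's diagnosis (a stray `t` at the end of /etc/ttys) suggests a quick sanity check for similar cases. The sketch below is a hypothetical lint, not part of OpenBSD and not the parser ttyflags(8) uses. It assumes, following ttys(5), that every non-comment entry has at least a terminal name, a (possibly quoted) command and a terminal type, and it flags lines with fewer fields or unbalanced quotes.

```python
import shlex


def check_ttys(text):
    """Return (line_number, line) pairs that look malformed.

    Hypothetical lint, assuming a valid entry has at least three fields
    (name, command, type) and that '#' starts a comment.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and full-line comments are fine.
        if not stripped or stripped.startswith("#"):
            continue
        try:
            # shlex keeps a quoted command such as "/usr/libexec/getty std.9600"
            # together as one field and drops trailing '#' comments.
            fields = shlex.split(stripped, comments=True)
        except ValueError:
            # Unbalanced quotes: report the line as malformed.
            problems.append((lineno, line))
            continue
        if len(fields) < 3:
            problems.append((lineno, line))
    return problems
```

On an OpenBSD box it could be fed the real file with `check_ttys(open("/etc/ttys").read())`. An empty result only means nothing obviously malformed was found; it does not prove the file is valid.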

Re: Fujitsu RX 2530 M4 - Segmentation fault - ttyflags.core

2019-06-04 Thread Hrvoje Popovski

On 4.6.2019. 16:38, Otto Moerbeek wrote:
> On Tue, Jun 04, 2019 at 03:41:02PM +0200, Hrvoje Popovski wrote:
> 
>> Hi all,
>>
>> After upgrading a Fujitsu RX 2530 M4 from 6.4 to 6.5, I saw
>> "Segmentation fault (core dumped)" in dmesg instead of "setting tty flags".
>> The installation and upgrade were standard, but they were done over
>> Fujitsu AVR, their remote console similar to Dell iDRAC.
>> The machine is working normally; I just wanted to report this.
>>
>> core dump:
>> http://kosjenka.srce.hr/~hrvoje/openbsd/ttyflags.core
>>
>> sendbug is in attachment
>>
>> dmesg -s
>> Automatic boot in progress: starting file system checks.
>> /dev/sd0a (208b7bf77955eee9.a): file system is clean; not checking
>> /dev/sd0k (208b7bf77955eee9.k): file system is clean; not checking
>> /dev/sd0d (208b7bf77955eee9.d): file system is clean; not checking
>> /dev/sd0f (208b7bf77955eee9.f): file system is clean; not checking
>> /dev/sd0g (208b7bf77955eee9.g): file system is clean; not checking
>> /dev/sd0h (208b7bf77955eee9.h): file system is clean; not checking
>> /dev/sd0j (208b7bf77955eee9.j): file system is clean; not checking
>> /dev/sd0i (208b7bf77955eee9.i): file system is clean; not checking
>> /dev/sd0e (208b7bf77955eee9.e): file system is clean; not checking
>> Segmentation fault (core dumped)
>> starting network
>> reordering libraries: done.
>> starting early daemons: syslogd ntpd.
>> starting RPC daemons:.
>> savecore: no core dump
>> checking quotas: done.
>> clearing /tmp
>> kern.securelevel: 0 -> 1
>> creating runtime link editor directory cache.
>> preserving editor files.
>> starting network daemons: sshd snmpd bgpd smtpd.
>> starting package daemons: zabbix_agentd.
>> starting local daemons: cron.
>> Tue Jun  4 06:44:04 CEST 2019
> 
> Does this also happen when you run ttyflags -a by hand?
> 

Yes, it happens:

rs2# ttyflags -a
Segmentation fault (core dumped)

http://kosjenka.srce.hr/~hrvoje/openbsd/ttyflags2.core


> Do you have a malformed entry in /etc/ttys ?

I don't think I have. I haven't touched /etc/ttys.


> Please share your /etc/ttys file.

http://kosjenka.srce.hr/~hrvoje/openbsd/ttys

Re: witness report

2019-06-03 Thread Hrvoje Popovski

On 3.6.2019. 18:32, Philip Guenther wrote:
> On Mon, 3 Jun 2019, Hrvoje Popovski wrote:
>> I have a samba server, a transmission client and a GNOME desktop on one
>> box. From time to time I get the witness log below. The source is clean,
>> fetched a few hours ago and compiled with WITNESS. Userland and
>> packages are up to date ..
>> I put kern.witness.watch=3 in sysctl.conf, so now I'm in ddb and will
>> leave it like this in case something is needed.
> 
> From pguent...@proofpoint.com Sat Jun  1 13:25:04 2019
> Date: Sat, 1 Jun 2019 13:25:00 -0700
> From: Philip Guenther 
> To: Antoine Jacoutot 
> Cc: hack...@openbsd.org
> Subject: Re: witness and unveil
> 
> On Sat, 1 Jun 2019, Antoine Jacoutot wrote:
>> Running a WITNESS kernel, mpi@ told me to send this here.
>>
>> kern.version=OpenBSD 6.5-current (GENERIC.MP) #0: Sat Jun  1 18:29:16 CEST 
>> 2019
>>
>> witness: acquiring duplicate lock of same type: ">uv_lock"
>>  1st unveil
>>  2nd unveil
> 
> Give this diff a try.

Hi,

With this diff I can't reproduce the witness log. If something comes up I
will report back ..

witness report

2019-06-03 Thread Hrvoje Popovski

Hi all,

I have a samba server, a transmission client and a GNOME desktop on one
box. From time to time I get the witness log below. The source is clean,
fetched a few hours ago and compiled with WITNESS. Userland and
packages are up to date ..
I put kern.witness.watch=3 in sysctl.conf, so now I'm in ddb and will
leave it like this in case something is needed.

witness log from console:

witness: acquiring duplicate lock of same type: ">uv_lock"
 1st unveil
 2nd unveil
Starting stack trace...
witness_checkorder(80af9078,9,0) at witness_checkorder+0x826
rw_enter_write(80af9068) at rw_enter_write+0x43
unveil_copy() at unveil_copy+0x183
process_new(8000331b8278,800032f94d50,1) at process_new+0xde
fork1() at fork1+0x2d7
syscall(80003300b1f0) at syscall+0x389
Xsyscall(6,2,c781ea04749,2,7f7cd478,c7a81850500) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7cd040, count: 250
End of stack trace.
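The report above says that on the fork path (fork1() -> process_new() -> unveil_copy()), a second lock of the same witness lock type was taken while one was already held. Both locks are named "unveil" ("1st unveil" / "2nd unveil"), and the type name itself is garbled in this archive. The Python model below is a hypothetical toy meant only to show what "acquiring duplicate lock of same type" checks. It is not WITNESS: the real checker also tracks lock order across CPUs, may exempt some lock types, and prints a report (or enters ddb with kern.witness.watch=3) instead of raising an exception. The type label `uv_lock` in the code is an invented placeholder.

```python
import threading


class DuplicateLockError(RuntimeError):
    """Raised where the real WITNESS would print its report."""


class ToyWitness:
    """Toy per-thread record of held lock types; not the kernel's WITNESS."""

    def __init__(self):
        self._local = threading.local()

    def held(self):
        # Each thread keeps its own list of lock types it currently holds.
        if not hasattr(self._local, "types"):
            self._local.types = []
        return self._local.types

    def acquire(self, lock_type, lock_name):
        held = self.held()
        if lock_type in held:
            raise DuplicateLockError(
                f'acquiring duplicate lock of same type: "{lock_type}" '
                f"(2nd {lock_name})"
            )
        held.append(lock_type)

    def release(self, lock_type):
        self.held().remove(lock_type)


def toy_unveil_copy(witness):
    """Hypothetical stand-in for the fork path in the trace: the parent's and
    the child's locks are both named "unveil" and share one lock type."""
    lock_type = "uv_lock"  # hypothetical label; the real name is garbled above
    witness.acquire(lock_type, "unveil")  # 1st unveil (parent)
    try:
        witness.acquire(lock_type, "unveil")  # 2nd unveil (child): duplicate
        witness.release(lock_type)
    finally:
        witness.release(lock_type)
```

Acquiring locks of different types (for example an "unveil" type and a "netlock" type) passes this toy check; only a second lock of a type already held trips it.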




ddb output with kern.witness.watch=3


witness: acquiring duplicate lock of same type: ">uv_lock"
 1st unveil
 2nd unveil
Starting stack trace...
witness_checkorder(80af9078,9,0) at witness_checkorder+0x826
rw_enter_write(80af9068) at rw_enter_write+0x43
unveil_copy() at unveil_copy+0x183
process_new(8000331b8278,800032f94d50,1) at process_new+0xde
fork1() at fork1+0x2d7
syscall(80003300b1f0) at syscall+0x389
Xsyscall(6,2,c781ea04749,2,7f7cd478,c7a81850500) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7cd040, count: 250
End of stack trace.
Stopped at  db_enter+0x10:  popq    %rbp

ddb{1}> trace
db_enter() at db_enter+0x10
witness_checkorder(80af9078,9,0) at witness_checkorder+0x82b
rw_enter_write(80af9068) at rw_enter_write+0x43
unveil_copy() at unveil_copy+0x183
process_new(8000331b8278,800032f94d50,1) at process_new+0xde
fork1() at fork1+0x2d7
syscall(80003300b1f0) at syscall+0x389
Xsyscall(6,2,c781ea04749,2,7f7cd478,c7a81850500) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7cd040, count: -8
ddb{1}>

ddb{1}> mach ddbcpu 0
Stopped at  x86_ipi_db+0x12:  leave
ddb{0}> trace
x86_ipi_db(81d0fff0) at x86_ipi_db+0x12
x86_ipi_handler() at x86_ipi_handler+0x80
Xresume_lapic_ipi(6,81d0fff0,800c9b00,0,0,81e7da68)
at Xresume_lapic_ipi+0x23
__mp_lock(81e7da68) at __mp_lock+0xae
intr_handler(800022287240,800c9b00) at intr_handler+0x44
Xintr_ioapic_edge23_untramp(0,81d0fff0,fd81eef2edc0,0,0,81e7da68) at Xintr_ioapic_edge23_untramp+0x19f
__mp_lock(81e7da68) at __mp_lock+0xa9
sowakeup(fd81eef2edc0,fd81eef2ee48) at sowakeup+0x8f
sorwakeup(fd81eef2edc0) at sorwakeup+0x78
udp_sbappend(fd81eeafa7b0,fd800e7b4700,fd80131a7490,0,14,fd80131a74a4) at udp_sbappend+0x1c8
udp_input(800022287588,800022287594,11,2) at udp_input+0xd21
ip_deliver(800022287588,800022287594,11,2) at ip_deliver+0x223
ipintr() at ipintr+0x5f
if_netisr(0) at if_netisr+0x4e
taskq_thread(80026080) at taskq_thread+0x67
end trace frame: 0x0, count: -15
ddb{0}>


ddb{0}> mach ddbcpu 2
Stopped at  x86_ipi_db+0x12:  leave
ddb{2}> trace
x86_ipi_db(80002201aff0) at x86_ipi_db+0x12
x86_ipi_handler() at x86_ipi_handler+0x80
Xresume_lapic_ipi(0,0,1388,0,800cca80,80002201b6f8) at
Xresume_lapic_ipi+0x23
acpicpu_idle() at acpicpu_idle+0x271
sched_idle(80002201aff0) at sched_idle+0x225
end trace frame: 0x0, count: -5
ddb{2}>


ddb{3}> trace
x86_ipi_db(800022023ff0) at x86_ipi_db+0x12
x86_ipi_handler() at x86_ipi_handler+0x80
Xresume_lapic_ipi(c,800022023ff0,800022023ff0,0,3,81e7da68)
at Xresume_lapic_ipi+0x23
__mp_lock(81e7da68) at __mp_lock+0xa9
__mp_acquire_count(81e7da68,1) at __mp_acquire_count+0x38
mi_switch() at mi_switch+0x243
sleep_finish(80003345ce48,1) at sleep_finish+0x84
tsleep(fd80b7e19f28,118,81aefeef,485) at tsleep+0xc7
kqueue_scan(fd80b7e19f28,40,66acc36f000,80003345d208,800032f3e2a8,ffff80003345d248) at kqueue_scan+0x4ec
sys_kevent(800032f3e2a8,80003345d2b0,80003345d310) at
sys_kevent+0x28f
syscall(80003345d380) at syscall+0x389
Xsyscall(0,48,7f7bd0b0,48,0,66acc36f000) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7bd070, count: -12
ddb{3}>


ddb{1}> show locks
shared rwlock unveil r = 0 (0x80966078)
exclusive kernel_lock _lock r = 0 (0x81e7dc70)


ddb{1}> show all locks
Process 8789 (transmission-dae) thread 0x800032fb73b0 (223699)
exclusive rrwlock inode r = 0 (0xfd81ec938818)
Process 33527 (ntpd) thread 0x800032f6c500 (150820)
shared rwlock unveil r = 0 (0x80966078)
exclusive kernel_lock _lock r = 0 (0x81e7dc70)
Process 32311 (softnet) thread 0x800022260750 (481883)
exclusive rwlock netlock r = 0 (0x81d2d728)
shared rwlock softnet r = 0 (0x800260d8)
ddb{1}>


ddb{1}> show uvm
Current UVM status:
  pagesize=4096

Re: bridge - kernel: protection fault trap

2019-05-03 Thread Hrvoje Popovski

On 3.5.2019. 13:32, Alexander Bluhm wrote:
> On Fri, May 03, 2019 at 12:15:44PM +0200, Alexander Bluhm wrote:
>> 0  3082 39335  10  10   0   304   232 ifidxrm D+p00:00.00 
>> /sbin/ifconfig bridge12 destroy
> Looks like a missing if_put().

Hi,

With this diff I can't reproduce the panic with "ifconfig bridge0 destroy"
after removing stp from the interfaces in the bridge ..

Thank you guys ...
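bluhm@'s diagnosis ("Looks like a missing if_put()") fits the ps output, where the stuck `ifconfig ... destroy` sleeps on the "ifidxrm" wait channel. This sketch assumes that destroying an interface waits until every reference taken with if_get() has been dropped with if_put(). Under that assumption, a single leaked reference blocks the destroy indefinitely. The Python model below is a hypothetical illustration of that reference-count pattern, not kernel code. `ToyIfnet` and its methods are invented names, and the timeout exists only so the example terminates; the kernel would keep sleeping.

```python
import threading


class ToyIfnet:
    """Hypothetical reference-counted interface; not OpenBSD kernel code."""

    def __init__(self, name):
        self.name = name
        self._refs = 0
        self._cv = threading.Condition()

    def get(self):
        # Analogue of if_get(): take a reference.
        with self._cv:
            self._refs += 1
        return self

    def put(self):
        # Analogue of if_put(): drop the reference and wake a waiting destroy.
        with self._cv:
            self._refs -= 1
            if self._refs == 0:
                self._cv.notify_all()

    def destroy(self, timeout):
        """Wait until no references remain (cf. the "ifidxrm" wait).

        Returns True once the count reaches zero, or False if the timeout
        expires first.
        """
        with self._cv:
            return self._cv.wait_for(lambda: self._refs == 0, timeout=timeout)


balanced = ToyIfnet("bridge0")
balanced.get().put()  # every get paired with a put
leaky = ToyIfnet("bridge12")
leaky.get()           # reference taken, put forgotten
```

Here `balanced.destroy(timeout=0.1)` returns True immediately, while `leaky.destroy(timeout=0.1)` gives up and returns False. Adding the missing `leaky.put()` lets it complete, which is the effect the diff has on the real code path.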

Re: bridge - kernel: protection fault trap

2019-05-01 Thread Hrvoje Popovski

On 30.4.2019. 23:40, Martin Pieuchot wrote:
> On 30/04/19(Tue) 14:45, Hrvoje Popovski wrote:
>> Hi all,
>>
>> If I have a bridge with rstp on its interfaces and rstp on the switch,
>> and I want to disable rstp on the OpenBSD interfaces, I get a fault
>> trap. I can reproduce it on 6.4 and on -current.
>> I can't reproduce it if I don't have rstp on the switch.
> 
> Seems that `bs_root_port' isn't reset.  Does the diff below help?
> 

Hi,

Yes, it helps. I can't reproduce the trap with "ifconfig bridge0" after
removing stp from the interfaces in the bridge. But now, if I destroy
bridge0 after removing stp from the interfaces, the box freezes, and if I
execute reboot in a second terminal I get the same or a similar trap.
I didn't try "ifconfig bridge0 destroy" without this diff ..



bridge0: flags=41
index 18 llprio 3
groups: bridge
priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 proto rstp
ix1 flags=eb
port 6 ifpriority 128 ifcost 2000 learning role root
ix0 flags=eb
port 5 ifpriority 128 ifcost 2000 discarding role alternate
x3550m4# ifconfig bridge0 -stp ix0
x3550m4# ifconfig bridge0 -stp ix1
x3550m4# ifconfig bridge0
bridge0: flags=41
index 18 llprio 3
groups: bridge
priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 proto rstp
designated: id a0:36:9f:2e:96:a1 priority 32768
ix1 flags=e3
port 6 ifpriority 0 ifcost 0
ix0 flags=e3
port 5 ifpriority 0 ifcost 0
Addresses (max cache: 100, timeout: 240):
00:01:e8:8a:ea:53 ix1 1 flags=0<>
x3550m4# ifconfig bridge0 destroy

After this the box freezes, and when I try to reboot from another terminal
I get this:

uvm_fault(0xfd87845eae78, 0x50, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at  bridge_ioctl+0x25d: movq    0x10(%rax),%rax

ddb{5}> trace
bridge_ioctl(80aa1000,c0406958,800025c803c0) at
bridge_ioctl+0x25d
ifioctl(fd8784f154a8,c0406958,800025c803c0,8000fffef790) at
ifioctl+0x2e1
sys_ioctl(8000fffef790,800025c804e0,800025c80550) at
sys_ioctl+0x3c4
syscall(800025c805c0) at syscall+0x2d5
Xsyscall(6,36,7f7bdd60,36,7f7bd7e0,1120dda0c53f) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7bd840, count: -5
ddb{5}>

ddb{5}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
  8881  355476  58607  0  30x100080  piperdsh
*61948  478931  58607  0  7 0x2ifconfig
 58607  500143  60114  0  30x10008a  pause sh
 60114  118807     0  30x83  wait  reboot
    475759  36533  0  30x10008b  pause ksh
 36533  245986  54582   1000  30x10008b  pause ksh
 54582   88066  51228   1000  30x90  selectsshd
 51228  249575  10714  0  30x82  poll  sshd
 89349   12679  78679  0  3 0x3  ifidxrm   ifconfig
 78679  494362  1  0  30x10008b  pause ksh
 23063  361819  1  0  30x100083  ttyin getty
  5688  521523  1  0  30x100083  ttyin getty
 10811  485927  1  0  30x100083  ttyin getty
 53603  187259  1  0  30x100083  ttyin getty
 76136  329246  1  0  30x100083  ttyin getty
 37428   18304  1  0  30x100098  poll  cron
 35480   87192  93615 95  30x100092  kqreadsmtpd
 361385975  93615103  30x100092  kqreadsmtpd
 30067   12755  93615 95  30x100092  kqreadsmtpd
 93539  274871  93615 95  30x100092  kqreadsmtpd
 22439  508287  93615 95  30x100092  kqreadsmtpd
 72080  200916  93615 95  30x100092  kqreadsmtpd
 93615  356738  1  0  30x100080  kqreadsmtpd
 10714  370355  1  0  30x80  selectsshd
 28735  407225  44481 83  30x100092  poll  ntpd
 44481  159120  76739 83  30x100092  poll  ntpd
 76739  329789  1  0  30x100080  poll  ntpd
 65485  248241  32912 73  70x100090syslogd
 32912  421250  1  0  30x100082  netio syslogd
 96126  242404  0  0  3 0x14200  pgzerozerothread
 73505  492214  0  0  3 0x14200  aiodoned  aiodoned
 89628  391838  0  0  3 0x14200  syncerupdate
 33653  327764  0  0  3 0x14200  cleaner   cleaner
  7953  391928  0  0  3 0x14200  reaperreaper
 166285698  0  0  3 0x14200  pgdaemon  pagedaemon
 29741  351482  0  0  3 0x14200  bored crynlk
 44734  415647  0  0  3 0x14200  bored crypto
 36512  227354  0  0  3 0x14200  usbtskusbtask
 38180   58616  0  0  3 0x14200  usbats

Re: bridge - kernel: protection fault trap

2019-04-30 Thread Hrvoje Popovski

On 30.4.2019. 14:45, Hrvoje Popovski wrote:
> Hi all,
> 
> If I have a bridge with rstp on its interfaces and rstp on the switch,
> and I want to disable rstp on the OpenBSD interfaces, I get a fault trap.

I suck so much at describing the problem .. the problem is executing
"ifconfig bridge0" after removing stp from the interfaces in the bridge ...

Re: Strange (mis)behaviour of pf ruleset in combination with dhcpd

2019-04-10 Thread Hrvoje Popovski

On 10.4.2019. 11:19, illya.me...@wiesan.de wrote:
> Am 10.04.19 um 10:58 schrieb Otto Moerbeek:
>> On Wed, Apr 10, 2019 at 10:08:51AM +0200, illya.me...@wiesan.de wrote:
>>
>>>
>>> Am 10.04.19 um 07:34 schrieb Bruno Flückiger:
>>>> On 09.04., illya.me...@wiesan.de wrote:
>>>>> Dear all,
>>>>>
>>>>> I discovered a strange problem with OpenBSD 6.4 AMD64 (stable(?)-release
>>>>> with all 16 patches).
>>>>>
>>>>> When running dhcpd, some pf rules seem not to work.
>>>>> I'm pretty sure this behaviour is different from 6.3.
>>>>>
>>>>> Setup:
>>>>> +--------+  +--------+      +----------------------+
>>>>> | Client |--| Switch |--em0-| OpenBSD (with dhcpd) |
>>>>> +--------+  +--------+      +----------------------+
>>>>>
>>>>> I try to get an IP address for „Client“ via DHCP from „OpenBSD“, but in
>>>>> pf.conf I block traffic on ports 67+68 (see below).
>>>>>
>>>>> When dhcpd is NOT running, I get this from „tcpdump -nettti pflog0“, as
>>>>> expected:
>>>>> snip 8<
>>>>> Apr 09 16:29:05.165687 rule 3/(match) block in on em0: 0.0.0.0.68 > 255.255.255.255.67:  xid:0x3f51206f secs:5 [|bootp] [tos 0x10]
>>>>> snap 8<
>>>>>
>>>>> When dhcpd („dhcpd em0“) is running, I get these entries in
>>>>> /var/log/daemon.log:
>>>>> snip 8<
>>>>> Apr  9 16:30:40 feuerwand dhcpd[50668]: DHCPDISCOVER from 00:96:69:96:69:96 via em0
>>>>> Apr  9 16:30:41 feuerwand dhcpd[50668]: DHCPOFFER on 10.69.250.1 to 00:96:69:96:69:96 via em0
>>>>> Apr  9 16:30:41 feuerwand dhcpd[50668]: DHCPREQUEST for 10.69.250.1 from 00:96:69:96:69:96 via em0
>>>>> Apr  9 16:30:41 feuerwand dhcpd[50668]: DHCPACK on 10.69.250.1 to 00:96:69:96:69:96 via em0
>>>>> snap 8<
>>>>>
>>>>> .. and this entry via tcpdump:
>>>>> snip 8<
>>>>> Apr 09 16:30:40.450863 rule 5/(match) pass out on em0: 10.69.228.156 > 10.69.250.1: icmp: echo request
>>>>> snap 8<
>>>>>
>>>>> .. and „Client“ got an IP address!
>>>>>
>>>>> If you need further information, don't hesitate to contact me.
>>>>>
>>>>> Please also tell me if I'm too stupid to understand what happened ;-)
>>>>>
>>>>> If you want to know why I'm running dhcpd and want to block the traffic:
>>>>> we use OpenBSD as a bridge, and dhcpd should only offer IP addresses to
>>>>> one side. But this strange behaviour is also present without the bridge
>>>>> configuration.
>>>>>
>>>>> Thank you for your help and support
>>>>> Illya Meyer
>>>>>
>>>>
>>>> Hi Illya
>>>>
>>>> DHCP operates on layer 2 using bpf(4) to receive and send packets.
>>>> Packet filtering takes place on layers 3 and 4. This means that dhcpd(8)
>>>> has done its work before the packets get to pf(4). If you want to make
>>>> sure that dhcpd(8) hands out leases only on interface em0 you can tell
>>>> it to operate only on this interface:
>>>>
>>>> # rcctl set dhcpd flags em0
>>>>
>>>> Cheers,
>>>> Bruno

>>>
>>> Hi Bruno,
>>>
>>> thank you for the information.
>>>
>>> It's strange that a packet first reaches a daemon and only then the
>>> packet filter (whose job it is to protect the machine from unwanted
>>> packets!)
>>>
>>> Maybe it's a good idea to build a bpf filter for layer 2 :-)
>>>
>>> Thank you and kind regards,
>>> Illya
>>>
>>
>> What do you think dhcpd uses?
>>
>> -Otto
>>
> 
> Hm, sorry. What do you mean exactly?
> 
> In my opinion, it should be possible for a packet filter to block ALL
> packets that arrive from a network before a daemon (in this case
> dhcpd) does its work.
>
> But as Bruno said, dhcpd listens on layer 2 and answers first, before pf
> gets the packet on layer 3. That was my understanding. Please see my
> tests above: pf doesn't block the DHCP requests when dhcpd runs.
>
> In my scenario, I have a firewall which works as a bridge (so more a
> firebridge ;-)) with a dhcpd for „Good net“, blocking most things
> from „Bad net“ (especially DHCP requests).
>
> +---------+       +----------------+       +----------+
> | Bad net |---em0-| OpenBSD-Bridge |-em1---| Good net |
> +---------+       +----------------+       +----------+
>
> Only em0 had an IP address, so dhcpd had to listen on em0. But
> some PCs from „Bad net“ got IP addresses from the BSD box.
> My solution was to give the BSD box a second IP address on em1 and
> let dhcpd listen on em1 only. This works with the pf rules (see above).
>
> If I interpret this article correctly
> (https://www.linuxtopia.org/Linux_Firewall_iptables/c479.html), iptables
> on Linux works on layer 2, so it should be possible to block DHCP
> requests. Other articles say the same (e.g.
> https://serverfault.com/questions/873839/block-dhcp-traffic-for-one-device-mac-address).
> 
> But it seems, this is not possible with pf, which works on layer 3.
> 
> Kind regards,
> Illya
> 
> 

Maybe you could use the tcpdump -B fildrop feature, but you need -current
for that.
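For reference, Bruno's rcctl(8) suggestion is persisted as a flags line in /etc/rc.conf.local. For a setup like Illya's, where dhcpd should serve only the „Good net“ side on em1 (the interface name is taken from his description), the resulting entry would look like this minimal fragment:

```
# /etc/rc.conf.local, as written by "rcctl set dhcpd flags em1"
dhcpd_flags=em1
```

The daemon has to be restarted (for example with "rcctl restart dhcpd") for new flags to take effect. This only restricts which interface dhcpd(8) serves; it does not make pf(4) see DHCP packets before dhcpd does.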

Re: splassert: bstp_notify_rtage

2019-03-29 Thread Hrvoje Popovski

On 29.3.2019. 15:32, Martin Pieuchot wrote:
> Hello,
> 
> On 24/03/19(Sun) 01:00, Hrvoje Popovski wrote:
>> Hi all,
>>
>> While playing around with stp and pair interfaces, using exactly the
>> same example as in the pair(4) man page:
>>
>> ifconfig pair0 up
>> ifconfig pair1 rdomain 1 patch pair0 up
>> ifconfig pair2 up
>> ifconfig pair3 rdomain 1 patch pair2 up
>> ifconfig bridge0 add pair0 add pair2 stp pair0 stp pair2 up
>> ifconfig bridge1 add pair1 add pair3 stp pair1 stp pair3 up
>>
>> and while destroying/creating stp root pair interfaces with
>> kern.pool_debug=1 and kern.splassert=2, I'm getting these traces:
>>
>> splassert: bstp_notify_rtage: want 2 have 0
>> Starting stack trace...
>> bstp_update_tc(804e0c00) at bstp_update_tc+0x338
>> bstp_tick(80159700) at bstp_tick+0x357
>> softclock(0) at softclock+0x123
>> softintr_dispatch(0) at softintr_dispatch+0x11e
>> Xsoftclock(0,0,1388,0,80021800,81d0b6d0) at Xsoftclock+0x1f
>> acpicpu_idle() at acpicpu_idle+0x281
>> sched_idle(81d0aff0) at sched_idle+0x235
>> end trace frame: 0x0, count: 250
>> End of stack trace.
> 
> It's an incorrect assert.  What's currently protecting all the bridge
> data structures is the KERNEL_LOCK().  Does the diff below help?

Yes, it helps. With this diff I can't reproduce the traces.

Thanks.

splassert: bstp_notify_rtage

2019-03-23 Thread Hrvoje Popovski

Hi all,

While playing around with stp and pair interfaces, using exactly the
same example as in the pair(4) man page:

ifconfig pair0 up
ifconfig pair1 rdomain 1 patch pair0 up
ifconfig pair2 up
ifconfig pair3 rdomain 1 patch pair2 up
ifconfig bridge0 add pair0 add pair2 stp pair0 stp pair2 up
ifconfig bridge1 add pair1 add pair3 stp pair1 stp pair3 up

and while destroying/creating stp root pair interfaces with
kern.pool_debug=1 and kern.splassert=2, I'm getting these traces:

splassert: bstp_notify_rtage: want 2 have 0
Starting stack trace...
bstp_update_tc(804e0c00) at bstp_update_tc+0x338
bstp_tick(80159700) at bstp_tick+0x357
softclock(0) at softclock+0x123
softintr_dispatch(0) at softintr_dispatch+0x11e
Xsoftclock(0,0,1388,0,80021800,81d0b6d0) at Xsoftclock+0x1f
acpicpu_idle() at acpicpu_idle+0x281
sched_idle(81d0aff0) at sched_idle+0x235
end trace frame: 0x0, count: 250
End of stack trace.


splassert: bstp_notify_rtage: want 2 have 256
Starting stack trace...
bstp_set_port_tc(804e0600,5) at bstp_set_port_tc+0x1a6
bstp_update_tc(804e0600) at bstp_update_tc+0xfd
bstp_tick(80159000) at bstp_tick+0x357
softclock(0) at softclock+0x123
softintr_dispatch(0) at softintr_dispatch+0x11e
Xsoftclock(0,0,1388,0,80021800,81d0b6d0) at Xsoftclock+0x1f
acpicpu_idle() at acpicpu_idle+0x281
sched_idle(81d0aff0) at sched_idle+0x235
end trace frame: 0x0, count: 249
End of stack trace.


I don't know whether this is serious or not; I'm just sending the report
here for the record.

Originally from misc@: https://www.mail-archive.com/misc@openbsd.org/msg165596.html


Log with "option BRIDGESTP_DEBUG":

bstp: state changed to DISCARDING on pair0
bstp: pair0 -> TC_INACTIVE
bstp: state changed to DISCARDING on pair2
bstp: pair2 -> TC_INACTIVE
bstp: state changed to DISCARDING on pair1
bstp: pair1 -> TC_INACTIVE
bstp: state changed to DISCARDING on pair3
bstp: pair3 -> TC_INACTIVE
bstp: pair0 role -> DESIGNATED
bstp: pair0 -> DESIGNATED_SYNCED
bstp: pair0 -> DESIGNATED_PROPOSE
bstp: pair1 role -> DESIGNATED
bstp: pair1 -> DESIGNATED_SYNCED
bstp: pair1 -> DESIGNATED_PROPOSE
bstp: pair2 role -> DESIGNATED
bstp: pair2 -> DESIGNATED_SYNCED
bstp: pair2 -> DESIGNATED_PROPOSE
bstp: pair3 role -> DESIGNATED
bstp: pair3 -> DESIGNATED_SYNCED
bstp: pair3 -> DESIGNATED_PROPOSE
bstp: pair2 role -> ALT/BACK/DISABLED
bstp: pair3 role -> ALT/BACK/DISABLED
bstp: pair1 role -> ROOT
bstp: pair1 -> ROOT_REROOT
bstp: pair1 -> ROOT_AGREED
bstp: state changed to LEARNING on pair1
bstp: pair1 -> TC_LEARNING
bstp: pair1 -> ROOT_AGREED
bstp: state changed to FORWARDING on pair1
bstp: pair1 -> ROOT_REROOTED
bstp: pair1 -> TC_DETECTED
bstp: pair1 -> ROOT_AGREED
bstp: pair1 -> ROOT_AGREED
bstp: pair1 -> ROOT_AGREED
bstp: pair1 -> ROOT_AGREED
bstp: state changed to FORWARDING on pair0
bstp: pair0 -> TC_LEARNING
bstp: state changed to DISCARDING on pair2
bstp: pair2 -> TC_INACTIVE
bstp: pair3 role -> DESIGNATED
bstp: pair3 -> DESIGNATED_RETIRED
bstp: pair3 -> DESIGNATED_SYNCED
bstp: pair3 -> DESIGNATED_PROPOSE
bstp: pair2 role -> DESIGNATED
bstp: pair2 -> DESIGNATED_SYNCED
bstp: pair2 -> DESIGNATED_PROPOSE
bstp: pair0 -> TC_DETECTED
bstp: pair3 role -> ALT/BACK/DISABLED
bstp: pair3 -> ALTERNATE_AGREED
bstp: pair1 -> TC_TC
bstp: pair3 -> ALTERNATE_AGREED
bstp: pair1 -> TC_TC
bstp: pair3 -> ALTERNATE_AGREED
bstp: pair3 -> ALTERNATE_AGREED
bstp: pair3 -> ALTERNATE_AGREED
bstp: pair3 -> ALTERNATE_AGREED
bstp: state changed to FORWARDING on pair2
bstp: pair2 -> TC_LEARNING
bstp: pair0 -> TC_LEARNING
bstp: pair3 role -> ROOT
bstp: pair1 role -> ALT/BACK/DISABLED
bstp: state changed to DISCARDING on pair1
bstp: pair1 -> TC_LEARNING
bstp: pair3 -> ROOT_REROOT
bstp: state changed to LEARNING on pair3
bstp: pair3 -> TC_LEARNING
bstp: pair1 -> ALTERNATE_PORT
bstp: pair1 -> TC_INACTIVE
bstp: state changed to FORWARDING on pair3
bstp: pair3 -> ROOT_REROOTED
bstp: pair3 -> TC_DETECTED
bstp: state changed to DISCARDING on pair3
bstp: pair3 role -> ALT/BACK/DISABLED
bstp: pair3 -> TC_LEARNING
bstp: pair1 role -> ROOT
bstp: pair3 -> TC_INACTIVE
bstp: state changed to DISCARDING on pair2
bstp: pair2 role -> ALT/BACK/DISABLED
bstp: pair2 -> TC_INACTIVE
bstp: pair1 -> ROOT_REROOT
bstp: state changed to LEARNING on pair1
bstp: pair1 -> TC_LEARNING
bstp: state changed to FORWARDING on pair1
bstp: pair1 -> ROOT_REROOTED
bstp: pair1 -> TC_DETECTED
bstp: pair0 -> TC_DETECTED
bstp: pair1 -> TC_TC
bstp: pair0 -> TC_TC
bstp: pair1 -> TC_TC
bstp: pair0 -> TC_LEARNING
bstp: state changed to DISCARDING on pair3
bstp: pair3 -> TC_INACTIVE
bstp: pair2 role -> DESIGNATED
bstp: pair2 -> DESIGNATED_SYNCED
bstp: state changed to FORWARDING on pair2
bstp: pair2 -> TC_LEARNING
bstp: pair3 role -> DESIGNATED
bstp: pair3 -> DESIGNATED_SYNCED
bstp: pair3 -> DESIGNATED_PROPOSE
bstp: pair2 -> TC_DETECTED
bstp: pair0 -> TC_LEARNING
bstp: pair0 -> TC_DETECTED
bstp: pair3 role -> ALT/BACK/DISABLED
bstp: pair3 -> ALTERNATE_AGREED
bstp: pair3 role

Re: witness report

2018-11-05 Thread Hrvoje Popovski

On 3.6.2018. 13:38, Visa Hankala wrote:
> On Sat, Jun 02, 2018 at 03:08:14PM -0700, Philip Guenther wrote:
>> On Sat, 2 Jun 2018, Christophe Prévotaux wrote:
>>> This is a witness report I got on boot with the Jun 1st amd64 snapshot
>>>
>>> root on sd0a (9b49e3196b9bfae8.a) swap on sd0b dump on sd0b
>>> lock order reversal:
>>>  1st 0xff021cdac180 vmmaplk (>lock) @ 
>>> /usr/src/sys/uvm/uvm_map.c:4433
>>>  2nd 0xff01dc5f71a8 inode (>i_lock)
>> I believe uvm and the vnode layer handle this correctly, with lock tries 
>> that fall back to releasing the other lock and retrying so progress is 
>> made.  The fix for WITNESS complaining is to mark vmmaplk as a vnode lock.
> I think there is an actual issue because the locking calls are
> unconditional. FreeBSD appears to work around the problem by unlocking
> the vm_map when calling the pager. The diff below adapts that logic
> to OpenBSD.
> 
> Because the temporary unlocking may allow another thread to change the
> vm_map, the code has to check if the map has been altered since the
> unlocking, and if so, handle the case somehow. The patch uses a best
> effort approach where the code proceeds from the vm_map entry indicated
> by the end address of the current vm_map entry. The sanity checks that
> are done at the start of uvm_map_clean() are not rerun.
> 
> The system call that triggers the reversal is msync(2), and the
> reversal can be reproduced with the sys/kern/mmap regression test.
> sys/kern/mmap3 shows that there is another similar reversal with
> mlock(2) which is not covered by the patch.

Hi all,

With WITNESS I'm getting a similar log, and with visa@'s diff it's gone.


WITNESS log:

lock order reversal:
 1st 0xff01ef4eb2f0 vmmaplk (>lock) @
/usr/src/sys/uvm/uvm_map.c:4435
 2nd 0xff020f58f700 inode (>i_lock) @
/usr/src/sys/ufs/ufs/ufs_vnops.c:1544
lock order ">i_lock"(rrwlock) -> ">lock"(rwlock) first seen at:
#0  witness_checkorder+0x4c0
#1  _rw_enter+0x68
#2  vm_map_lock_ln+0xbc
#3  uvm_map+0x1a1
#4  km_alloc+0x16a
#5  pool_multi_alloc_ni+0xbb
#6  pool_p_alloc+0x56
#7  pool_do_get+0xe4
#8  pool_get+0xaf
#9  ufsdirhash_build+0x31e
#10 ufs_lookup+0x19d
#11 VOP_LOOKUP+0x4f
#12 vfs_lookup+0x2cf
#13 namei+0x2e3
#14 start_init+0xb2
lock order ">lock"(rwlock) -> ">i_lock"(rrwlock) first seen at:
#0  witness_checkorder+0x4c0
#1  _rw_enter+0x68
#2  _rrw_enter+0x3e
#3  VOP_LOCK+0x3d
#4  vn_lock+0x34
#5  uvn_io+0x1b8
#6  uvm_pager_put+0x109
#7  uvn_flush+0x424
#8  uvm_map_clean+0x3e7
#9  syscall+0x32a
#10 Xsyscall+0x128


OpenBSD 6.4-current (GENERIC.MP) #7: Mon Nov  5 22:09:07 CET 2018
r...@asd.srce.hr:/sys/arch/amd64/compile/GENERIC.MP
real mem = 8456089600 (8064MB)
avail mem = 8121876480 (7745MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xe87b1 (86 entries)
bios0: vendor Hewlett-Packard version "J01 v02.29" date 04/04/2016
bios0: Hewlett-Packard HP Compaq 8200 Elite CMT PC
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP APIC SSDT MCFG HPET SSDT SLIC TCPA
acpi0: wakeup devices PS2K(S3) PS2M(S3) BR20(S4) EUSB(S3) USBE(S3)
PEX0(S4) PEX1(S4) PEX2(S4) PEX3(S4) PEX4(S4) PEX5(S4) PEX6(S4) PEX7(S4)
P0P1(S4) P0P2(S4) P0P3(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 3293.52 MHz, 06-2a-07
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 3292.56 MHz, 06-2a-07
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 3292.56 MHz, 06-2a-07
cpu2:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu2: 256KB 64b/line 8-way L2

Re: IPv6 NDP Timeout

2018-10-23 Thread Hrvoje Popovski

On 23.10.2018. 21:41, Florian Obser wrote:
> I'm currently on vacation and can't look into this soon.
> 
> One thing that comes to mind: do these machines keep proper time or are they 
> having issues with timer interrupts stopping because of too new KVM version 
> and missing hypervisor flag (someone with access to a real computer please  
> chip in with a link to a thread where this has been discussed before and the 
> name of the KVM flag).
> 

This link?
https://marc.info/?l=openbsd-misc&m=151575775607633&w=2

Re: kernel page fault - uvm_fault - softclock_thread

2018-10-01 Thread Hrvoje Popovski

On 1.10.2018. 14:09, Visa Hankala wrote:
> On Mon, Oct 01, 2018 at 01:50:59PM +0200, Hrvoje Popovski wrote:
>> Hi all,
>>
>> While testing sasha's "pfsync: avoid a recursion on PF_LOCK" diff I
>> managed to get a panic. At first I thought that this panic had something
>> to do with sasha@'s work, but I can easily reproduce it on a clean -current.
>>
>> While the firewall is under stress and forwarding traffic, I'm doing this
>> in a loop
>>
>> ifconfig pfsync0 destroy && sleep 2 && sh netstart pfsync0 && sleep 2
>>
>> i'm getting this panic:
>>
>>
>> uvm_fault(0x81d51fe8, 0x8, 0, 2) -> e
>> kernel: page fault trap, code=0
>> Stopped at  softclock_thread+0xef:  movq    %rdx,0x8(%rcx)
>> ddb{0}>
> 
> pfsync_clone_destroy() lacks proper locking and its timeout cancellation
> is not robust. Please try the patch below.


I can't reproduce the panic with this diff. Thank you.

kernel page fault - uvm_fault - softclock_thread

2018-10-01 Thread Hrvoje Popovski

Hi all,

While testing sasha's "pfsync: avoid a recursion on PF_LOCK" diff I
managed to get a panic. At first I thought that this panic had something
to do with sasha@'s work, but I can easily reproduce it on a clean -current.

While the firewall is under stress and forwarding traffic, I'm doing this
in a loop

ifconfig pfsync0 destroy && sleep 2 && sh netstart pfsync0 && sleep 2

i'm getting this panic:


uvm_fault(0x81d51fe8, 0x8, 0, 2) -> e
kernel: page fault trap, code=0
Stopped at  softclock_thread+0xef:  movq    %rdx,0x8(%rcx)
ddb{0}>

ddb{0}> show panic
kernel page fault
uvm_fault(0x81d51fe8, 0x8, 0, 2) -> e
softclock_thread(0) at softclock_thread+0xef
end trace frame: 0x0, count: 1
ddb{0}>

ddb{0}> trace
softclock_thread(0) at softclock_thread+0xef
end trace frame: 0x0, count: -1
ddb{0}>

ddb{0}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT      COMMAND
 28648   44673  80585      0  3    0x10008b  pause     sh
 80585   91436  54828      0  3    0x10008b  pause     sh
 96033   29102  64150      0  3    0x100083  ttyin     ksh
 64150  347245  16720   1000  3    0x10008b  pause     ksh
 16720  469432  43690   1000  3        0x90  select    sshd
 43690  138411   7279      0  3        0x82  poll      sshd
 58828   27527  83398      0  3    0x100083  ttyin     ksh
 83398   15426  72476   1000  3    0x10008b  pause     ksh
 72476  428551  71468   1000  3        0x90  select    sshd
 71468  140851   7279      0  3        0x82  poll      sshd
 22446   93134      1      0  3    0x100083  ttyin     getty
  8308  230314      1      0  3    0x100083  ttyin     getty
 54828  192367      1      0  3    0x10008b  pause     ksh
 23569  400623      1      0  3    0x100083  ttyin     getty
 74599  179405      1      0  3    0x100083  ttyin     getty
 37720  157979      1      0  3    0x100083  ttyin     getty
 58655  119359      1      0  3    0x100098  poll      cron
 93238  265393  25253     95  3    0x100092  kqread    smtpd
 63459  395509  25253    103  3    0x100092  kqread    smtpd
 77659  387006  25253     95  3    0x100092  kqread    smtpd
 41450  383102  25253     95  3    0x100092  kqread    smtpd
 69474  212171  25253     95  3    0x100092  kqread    smtpd
  7791  518306  25253     95  3    0x100092  kqread    smtpd
 25253  181430      1      0  3    0x100080  kqread    smtpd
 51069  231291      1      0  3    0x100080  kqread    snmpd
 93117  241876      1     91  3    0x100092  kqread    snmpd
  8062  431841      1     91  3        0x92  kqread    snmpd
  7279   80889      1      0  3        0x80  select    sshd
  3849  318379  82126     83  3    0x100092  poll      ntpd
 82126   62754  70140     83  3    0x100092  poll      ntpd
 70140  349305      1      0  3    0x100080  poll      ntpd
  3780  405249   7687     74  3    0x100092  bpf       pflogd
  7687   19138      1      0  3        0x80  netio     pflogd
 11361  410273   7613     73  7    0x100090            syslogd
  7613    5709      1      0  3    0x100082  netio     syslogd
 83045  500797      0      0  3     0x14200  pgzero    zerothread
 96104  237415      0      0  3     0x14200  aiodoned  aiodoned
 95478  105584      0      0  3     0x14200  syncer    update
  7491  247419      0      0  3     0x14200  cleaner   cleaner
 14850  510159      0      0  3     0x14200  reaper    reaper
 90803  319870      0      0  3     0x14200  pgdaemon  pagedaemon
 92554  485066      0      0  3     0x14200  bored     crynlk
 19382  238999      0      0  3     0x14200  bored     crypto
 55351  397450      0      0  3     0x14200  usbtsk    usbtask
 94701  370298      0      0  3     0x14200  usbatsk   usbatsk
 25462   61976      0      0  3  0x40014200  acpi0     acpi0
 40756  454275      0      0  7  0x40014200            idle5
 62029  393821      0      0  3  0x40014200            idle4
 49221  298303      0      0  7  0x40014200            idle3
 35942  140312      0      0  3  0x40014200            idle2
 25913  384078      0      0  3  0x40014200            idle1
 67573  244162      0      0  3     0x14200  bored     sensors
 27303  343674      0      0  7     0x14200            softnet
 80804  305643      0      0  3     0x14200  bored     systqmp
 44322  381381      0      0  3     0x14200  bored     systq
*90433  384560      0      0  7  0x40014200            softclock
 73749  204855      0      0  3  0x40014200            idle0
     1  182696      0      0  3        0x82  wait      init
     0       0     -1      0  3     0x10200  scheduler swapper


ddb{0}> tr /p 0t384560
db_ktrap(75cac9a96611906e,8000227567b0,6) at db_ktrap+0xee
kerntrap(261d4113102b7957) at kerntrap+0xa0
alltraps_kern(6,804d6850,0,2,81785280,800022756860)
at alltraps_kern+0x7b
softclock_thread(0) at

Re: Fujitsu RX2530 M4 16 cores null acpi panic

2018-09-17 Thread Hrvoje Popovski

On 15.9.2018. 19:24, Mike Larkin wrote:
> On Fri, Sep 14, 2018 at 07:44:45PM +0200, Mark Kettenis wrote:
>>> Date: Fri, 14 Sep 2018 10:05:34 -0700
>>> From: Mike Larkin 
>>>
>>> On Thu, Sep 13, 2018 at 11:17:15AM +0200, Hrvoje Popovski wrote:
>>>> Hi all,
>>>>
>>>> I have a Fujitsu PRIMERGY RX2530 M4 server with an Intel Gold 6134 CPU
>>>> with 8/16 cores.
>>>> When booting the box with up to 14 cores everything seems fine, but with
>>>> 16 cores I'm getting a panic. In the attachment you can find a sendbug
>>>> report. The dmesg in the sendbug report is from the 14-core boot.
>>>>
>>>> 8 cores (HT disabled) are more than enough for me, but maybe this panic
>>>> is interesting to developers, so I'm reporting it.
>>>>
>>>>
>>>> root on sd0a (fa90dc9ea66a7e54.a) swap on sd0b dump on sd0b
>>>> panipanic: l anel pu
>>>> gnStopped at  db_enter+0x12:  popq    %r11
>>>>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>>>>  380736  12442      0     0x14000      0x200    5  zerothread
>>>>  196414  98064      0     0x14000      0x200    7  aiodoned
>>>> db_enter() at db_enter+0x12
>>>> panic() at panic+0x120
>>>> acpicpu_idle() at acpicpu_idle+0x2e8
>>>> sched_idle(0) at sched_idle+0x245
>>>> end trace frame: 0x0, count: 11
>>>> https://www.openbsd.org/ddb.html describes the minimum info required in
>>>> bug reports.  Insufficient info makes it difficult to find and fix bugs.
>>>>
>>>>
>>>> ddb{3}> trace
>>>> db_enter() at db_enter+0x12
>>>> panic() at panic+0x120
>>>> acpicpu_idle() at acpicpu_idle+0x2e8
>>>
>>> There are only 3 panics in acpicpu_idle. One at the very top:
>>> panic ("null acpicpu");
>>>
>>> and two much further down:
>>> panic ("idle with interrupts blocked");
>>>
>>> Based on the fact that it's called at acpicpu_idle+0x2e8, I'm
>>> inclined to believe it to be the latter, but the garbled
>>> panic string seems to more closely match the former.
>>>
>>> Can you put a plain printf before each panic and try to repro,
>>> to see which it is? Just printf the same panic string.
>>
>> Mike, look below:
>>
>>>> ddb{3}> show panic
>>>> null acpicpu
>>>> ddb{3}>
>>
>> So it's the first one.
>>
>> Hrvoje, can you boot this machine with 16 cores but acpicpu(4)
>> disabled and send us the acpidump output from /var/db/acpi?
> 
> Ah, oops. Missed that.
> 
> Based on later replies, I think you nailed it with _MAT.
> 
> -ml
> 

Thank you, guys, and sorry for the noise.

Re: Fujitsu RX2530 M4 16 cores null acpi panic

2018-09-14 Thread Hrvoje Popovski

On 14.9.2018. 19:44, Mark Kettenis wrote:
>> Date: Fri, 14 Sep 2018 10:05:34 -0700
>> From: Mike Larkin 
>>
>> On Thu, Sep 13, 2018 at 11:17:15AM +0200, Hrvoje Popovski wrote:
>>> Hi all,
>>>
>>> I have a Fujitsu PRIMERGY RX2530 M4 server with an Intel Gold 6134 CPU
>>> with 8/16 cores.
>>> When booting the box with up to 14 cores everything seems fine, but with
>>> 16 cores I'm getting a panic. In the attachment you can find a sendbug
>>> report. The dmesg in the sendbug report is from the 14-core boot.
>>>
>>> 8 cores (HT disabled) are more than enough for me, but maybe this panic
>>> is interesting to developers, so I'm reporting it.
>>>
>>>
>>> root on sd0a (fa90dc9ea66a7e54.a) swap on sd0b dump on sd0b
>>> panipanic: l anel pu
>>> gnStopped at  db_enter+0x12:  popq    %r11
>>>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>>>  380736  12442      0     0x14000      0x200    5  zerothread
>>>  196414  98064      0     0x14000      0x200    7  aiodoned
>>> db_enter() at db_enter+0x12
>>> panic() at panic+0x120
>>> acpicpu_idle() at acpicpu_idle+0x2e8
>>> sched_idle(0) at sched_idle+0x245
>>> end trace frame: 0x0, count: 11
>>> https://www.openbsd.org/ddb.html describes the minimum info required in
>>> bug reports.  Insufficient info makes it difficult to find and fix bugs.
>>>
>>>
>>> ddb{3}> trace
>>> db_enter() at db_enter+0x12
>>> panic() at panic+0x120
>>> acpicpu_idle() at acpicpu_idle+0x2e8
>>
>> There are only 3 panics in acpicpu_idle. One at the very top:
>>  panic ("null acpicpu");
>>
>> and two much further down:
>>  panic ("idle with interrupts blocked");
>>
>> Based on the fact that it's called at acpicpu_idle+0x2e8, I'm
>> inclined to believe it to be the latter, but the garbled
>> panic string seems to more closely match the former.
>>
>> Can you put a plain printf before each panic and try to repro,
>> to see which it is? Just printf the same panic string.
> 
> Mike, look below:
> 
>>> ddb{3}> show panic
>>> null acpicpu
>>> ddb{3}>
> 
> So it's the first one.
> 
> Hrvoje, can you boot this machine with 16 cores but acpicpu(4)
> disabled and send us the acpidump output from /var/db/acpi?
> 

Hi,

Here it is:

http://kosjenka.srce.hr/~hrvoje/zaprocvat/noacpi.tgz

Dmesg without acpicpu, just in case:

OpenBSD 6.4-beta (GENERIC.MP) #293: Tue Sep 11 20:16:57 MDT 2018
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 33731620864 (32168MB)
avail mem = 32700067840 (31185MB)
User Kernel Config
UKC> disable acpicpu
402 acpicpu* disabled
UKC> exit
Continuing...
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.0 @ 0x6f119000 (84 entries)
bios0: vendor FUJITSU // American Megatrends Inc. version "V5.0.0.12
R1.22.0 for D3383-A1x" date 06/04/2018
bios0: FUJITSU PRIMERGY RX2530 M4
acpi0 at bios0: rev 2
acpi0: sleep states S0 S5
acpi0: tables DSDT FACP FPDT FIDT SPMI UEFI UEFI MCEJ MCFG HPET APIC
MIGT MSCT NFIT PCAT PCCT RASF SLIT SRAT SVOS WDDT OEM4 OEM1 OEM2 SSDT
SSDT SSDT DMAR HEST BERT ERST EINJ
acpi0: wakeup devices PWRB(S0) XHCI(S0) PXSX(S0) RP17(S0) PXSX(S0)
RP18(S0) PXSX(S0) RP19(S0) PXSX(S0) RP20(S0) PXSX(S0) RP01(S0) PXSX(S0)
RP02(S0) PXSX(S0) RP03(S0) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimcfg0 at acpi0
acpimcfg0: addr 0x8000, bus 0-255
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, 3193.22 MHz, 06-55-04
cpu0:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PT,AVX512CD,AVX512BW,AVX512VL,PKU,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.2, IBE
cpu1 at mainbus0: apid 8 (application processor)
cpu1: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, 3192.53 MHz, 06-55-04
cpu1:
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,S

Re: Kernel Panic 6.3 and HP DL360 Gen9

2018-06-19 Thread Hrvoje Popovski

On 19.6.2018. 15:14, Albert Martinez wrote:
> Dear OpenBSD Team,
> 
> first of all, thanks for your time and effort in the OBSD project, we use it 
> daily.

Hi,

Please send a dmesg and, if you can, install the latest -current and report back.

Re: lock order reversal

2018-01-30 Thread Hrvoje Popovski

On 30.1.2018. 13:34, Martin Pieuchot wrote:
> On 30/01/18(Tue) 13:12, Hrvoje Popovski wrote:
>> Hi all,
>>
>> I checked out the CVS tree a few minutes ago on a desktop PC and enabled
>> WITNESS. While booting the PC with the new kernel I'm getting a "lock order reversal"
>>
>> http://kosjenka.srce.hr/~hrvoje/zaprocvat/IMG_20180130_125928.jpg
> 
> I'd say this is known.  With a working keyboard you could print the
> traces and run "show witness /b".
> 
>>
>> in ddb I can't do anything with the keyboard
> 
> Putting machdep.forceukbd=1 into sysctl.conf(5) should help with that.
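As a minimal sketch, the setting Martin suggests is a single line in sysctl.conf(5); /etc/sysctl.conf is applied by rc(8) at boot, so it takes effect from the next boot on:

```
# /etc/sysctl.conf
machdep.forceukbd=1
```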


Now, without UEFI, output to the serial port is working :)



login: lock order reversal:
 1st 0x81bf6aa8 _lock (_lock) @ /sys/kern/kern_synch.c:292
 2nd 0x80218240 _priv->irq_lock (_priv->irq_lock) @ /sys/dev/pci/drm/i915/intel_ringbuffer.c:1787
Stopped at  db_enter+0x5:   popq    %rbp
ddb{2}>


ddb{2}> trace
db_enter() at db_enter+0x5
witness_checkorder(81886033,6fb,80218240,80218230,0) at witness_checkorder+0xaf2
_mtx_enter(80211000,80218230,8021b000) at _mtx_enter+0x30
gen6_ring_put_irq(800033083370) at gen6_ring_put_irq+0x36
__i915_wait_request(0,ff0219e0d008,80ade850,0,80211000) at __i915_wait_request+0x330
i915_wait_request(0) at i915_wait_request+0x87
i915_gem_object_wait_rendering(0,80ade740) at i915_gem_object_wait_rendering+0x18e
i915_gem_object_set_cache_level(8000330834e0,80ade740) at i915_gem_object_set_cache_level+0x20d
i915_gem_object_pin_to_display_plane(80ade740,80211000,8000330834e0,0,8130d56d) at i915_gem_object_pin_to_display_plane+0x6d
intel_pin_and_fence_fb_obj() at intel_pin_and_fence_fb_obj+0x1c0
intel_crtc_page_flip(80a80a00,8021c000,800033083ab0,80000021b000) at intel_crtc_page_flip+0x4dd
drm_mode_page_flip_ioctl(8021b000,8171b440,c01864b0) at drm_mode_page_flip_ioctl+0x39b
drm_do_ioctl(0,8021b0d8,8021b000,8021b108) at drm_do_ioctl+0x201
drmioctl(ff01dfe55ae0,800033084030,ff01ddaf4e88,c01864b0,ff01ddaf4e88) at drmioctl+0xe8
VOP_IOCTL(6e8804fd365ed36d,800033084030,ff021e5d3ba0,3,800033083ab0,c01864b0) at VOP_IOCTL+0x3e
vn_ioctl(800033084030,800033083ba0,ff01ddaf4e88,18) at vn_ioctl+0x5d
sys_ioctl(360,800033084030,0) at sys_ioctl+0x343
syscall() at syscall+0x279
--- syscall (number 54) ---
end of kernel
end trace frame: 0x7f7f2e80, count: -18
0x1dc9e137593a:
ddb{2}>


ddb{2}> show witness /b
Number of known direct relationships is 300

Lock order reversal between "_lock"(sched_lock) and "_priv->irq_lock"(mutex)!
Lock order "_lock"(sched_lock) -> "_priv->irq_lock"(mutex) first seen at:
#0  witness_checkorder+0x466
#1  _mtx_enter+0x30
#2  gen6_ring_put_irq+0x36
#3  __i915_wait_request+0x330
#4  i915_wait_request+0x87
#5  i915_gem_object_wait_rendering+0x18e
#6  i915_gem_object_set_cache_level+0x20d
#7  i915_gem_object_pin_to_display_plane+0x6d
#8  intel_pin_and_fence_fb_obj+0x1c0
#9  intel_crtc_page_flip+0x4dd
#10 drm_mode_page_flip_ioctl+0x39b
#11 drm_do_ioctl+0x201
#12 drmioctl+0xe8
#13 VOP_IOCTL+0x3e
#14 vn_ioctl+0x5d
#15 sys_ioctl+0x343
#16 syscall+0x279

Lock order "_priv->irq_lock"(mutex) -> "_lock"(sched_lock) first seen at:
#0  witness_checkorder+0x466
#1  ___mp_lock+0x6f
#2  wakeup_n+0x39
#3  task_add+0x85
#4  gen6_rps_boost+0x110
#5  __i915_wait_request+0x13c
#6  i915_gem_object_wait_rendering__nonblocking+0x1c6
#7  i915_gem_set_domain_ioctl+0xce
#8  drm_do_ioctl+0x201
#9  drmioctl+0xe8
#10 VOP_IOCTL+0x3e
#11 vn_ioctl+0x5d
#12 sys_ioctl+0x343
#13 syscall+0x279


Lock order reversal between ">mnt_lock"(rwlock) and ">i_lock"(rrwlock)!

Lock order ">mnt_lock"(rwlock) -> ">i_lock"(rrwlock) first seen at:
#0  witness_checkorder+0x466
#1  _rw_enter+0x56
#2  _rrw_enter+0x32
#3  VOP_LOCK+0x31
#4  vn_lock+0x36
#5  vget+0xba
#6  cache_lookup+0x173
#7  ufs_lookup+0x10e
#8  VOP_LOOKUP+0x33
#9  vfs_lookup+0x26e
#10 namei+0x1eb
#11 ffs_mount+0x111
#12 sys_mount+0x33c
#13 syscall+0x279

Lock order ">i_lock"(rrwlock) -> ">mnt_lock"(rwlock) first seen at:
#0  witness_checkorder+0x466
#1  _rw_enter+0x56
#2  vfs_busy+0x64
#3  vfs_lookup+0x38b
#4  namei+0x1eb
#5  doreadlinkat+0x6d
#6  syscall+0x279

ddb{2}>


ddb{2}> mach ddbcpu 0
Stopped at  x86_ipi_db+0x5: popq    %rbp
ddb{0}> trace
x86_ipi_db(802119e0) at x86_ipi_db+0x5
x86_ipi_handler() at x86_ipi_handler+0x6a
Xresume_lapic_ipi() at Xresume_lapic_ipi+0x1f
--- interrupt ---
end of kernel
end trace frame: 0x33a79405c75250cc, count: -3
0x41cb8c419c524153:
ddb{0}> mach ddbcpu 1
Stopped at  x86_ipi_db+0x5: popq    %rbp
ddb{1}> trace
x86_i

Re: lock order reversal

2018-01-30 Thread Hrvoje Popovski

On 30.1.2018. 14:18, Hrvoje Popovski wrote:
> On 30.1.2018. 13:34, Martin Pieuchot wrote:
>> On 30/01/18(Tue) 13:12, Hrvoje Popovski wrote:
>>> Hi all,
>>>
>>> I checked out the CVS tree a few minutes ago on a desktop PC and enabled
>>> WITNESS. While booting the PC with the new kernel I'm getting a "lock order reversal"
>>>
>>> http://kosjenka.srce.hr/~hrvoje/zaprocvat/IMG_20180130_125928.jpg
>>
>> I'd say this is known.  With a working keyboard you could print the
>> traces and run "show witness /b".
>>
>>>
>>> in ddb I can't do anything with the keyboard
>>
>> Putting machdep.forceukbd=1 into sysctl.conf(5) should help with that.
> 
> 
> Thank you for the info on forceukbd=1.
> 
> trace
> http://kosjenka.srce.hr/~hrvoje/zaprocvat/IMG_20180130_140812.jpg
> 
> show witness /b: can I somehow scroll up/down in ddb? "show witness /b"
> doesn't print page by page like "show witness" does.
>
> This is so much easier over a serial console :)
> 
> 

I will reinstall the PC without UEFI boot and send proper ddb output.

Re: lock order reversal

2018-01-30 Thread Hrvoje Popovski

On 30.1.2018. 13:34, Martin Pieuchot wrote:
> On 30/01/18(Tue) 13:12, Hrvoje Popovski wrote:
>> Hi all,
>>
>> I checked out the CVS tree a few minutes ago on a desktop PC and enabled
>> WITNESS. While booting the PC with the new kernel I'm getting a "lock order reversal"
>>
>> http://kosjenka.srce.hr/~hrvoje/zaprocvat/IMG_20180130_125928.jpg
> 
> I'd say this is known.  With a working keyboard you could print the
> traces and run "show witness /b".
> 
>>
>> in ddb I can't do anything with the keyboard
> 
> Putting machdep.forceukbd=1 into sysctl.conf(5) should help with that.


Thank you for the info on forceukbd=1.

trace
http://kosjenka.srce.hr/~hrvoje/zaprocvat/IMG_20180130_140812.jpg

show witness /b: can I somehow scroll up/down in ddb? "show witness /b"
doesn't print page by page like "show witness" does.

This is so much easier over a serial console :)
