Re: acpicpu(4) and ACPI0007

2020-07-28 Thread Johan Huldtgren
hello,

On 2020-07-28 11:12, Mark Kettenis wrote:
> > Date: Tue, 28 Jul 2020 13:46:34 +1000
> > From: Jonathan Matthew 
> > 
> > On Mon, Jul 27, 2020 at 05:16:47PM +0200, Mark Kettenis wrote:
> > > > Date: Mon, 27 Jul 2020 17:02:41 +0200 (CEST)
> > > > From: Mark Kettenis 
> > > > 
> > > > Recent ACPI versions have deprecated "Processor()" nodes in favour of
> > > > "Device()" nodes with a _HID() method that returns "ACPI0007".  This
> > > > diff tries to support machines with firmware that implements this.  If
> > > > you see something like:
> > > > 
> > > >   "ACPI0007" at acpi0 not configured
> > > > 
> > > > please try the following diff and report back with an updated dmesg.
> > > > 
> > > > Cheers,
> > > > 
> > > > Mark
> > > 
> > > And now with the right diff...
> > 
> > On a dell r6415, it looks like this:
> > 
> > acpicpu0 at acpi0copyvalue: 6: C1(@1 halt!)
> > all the way up to
> > acpicpu127 at acpi0copyvalue: 6: no cpu matching ACPI ID 127
> > 
> > which I guess means aml_copyvalue() needs to learn how to copy 
> > AML_OBJTYPE_DEVICE.
> 
> Yes.  It is not immediately obvious how this should work.  Do we need
> to copy the aml_node pointer or not?  We don't do that for
> AML_OBJTYPE_PROCESSOR and AML_OBJTYPE_POWERRSRC types which are
> similar to AML_OBJTYPE_DEVICE.  But AML_OBJTYPE_DEVICE objects don't
> carry any additional information.  So we end up with just an empty
> case to avoid the warning.
> 
> Does this work on the Dell machines?
> 
> 
> Index: dev/acpi/dsdt.c
> ===
> RCS file: /cvs/src/sys/dev/acpi/dsdt.c,v
> retrieving revision 1.252
> diff -u -p -r1.252 dsdt.c
> --- dev/acpi/dsdt.c   21 Jul 2020 03:48:06 -  1.252
> +++ dev/acpi/dsdt.c   28 Jul 2020 09:04:15 -
> @@ -996,6 +996,8 @@ aml_copyvalue(struct aml_value *lhs, str
>           lhs->v_objref = rhs->v_objref;
>           aml_addref(lhs->v_objref.ref, "");
>           break;
> +     case AML_OBJTYPE_DEVICE:
> +         break;
>       default:
>           printf("copyvalue: %x", rhs->type);
>           break;
> 
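
For readers following along, here is a minimal sketch of why an empty
case is enough to silence the "copyvalue: 6" message. The types and
names below are simplified, hypothetical stand-ins and not the kernel's
struct aml_value or aml_copyvalue(); the value 6 for the device type is
taken from the log output quoted above.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
enum obj_type {
	OBJ_INTEGER = 1,
	OBJ_DEVICE = 6,		/* matches the "copyvalue: 6" seen in the log */
	OBJ_UNKNOWN = 99
};

struct value {
	enum obj_type	type;
	long long	v_integer;
};

/* Returns 1 if the type was handled, 0 if it fell through to the warning. */
static int
copy_value(struct value *lhs, const struct value *rhs)
{
	lhs->type = rhs->type;

	switch (rhs->type) {
	case OBJ_INTEGER:
		lhs->v_integer = rhs->v_integer;
		return 1;
	case OBJ_DEVICE:
		/* Nothing beyond the type tag to copy, so the case is empty. */
		return 1;
	default:
		printf("copyvalue: %x\n", (unsigned int)rhs->type);
		return 0;
	}
}
```

With the OBJ_DEVICE case present, copying a device-typed value no longer
reaches the default branch, which is the only effect the two added lines
in the diff are meant to have.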

before diffs I see:

"ACPI0004" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0004" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured
"ACPI0007" at acpi0 not configured

after applying both diffs I see:

"ACPI0004" at acpi0 not configured
acpicpu24 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu25 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu26 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu27 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu28 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu29 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu30 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu31 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu32 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu33 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu34 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu35 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
"ACPI0004" at acpi0 not configured
acpicpu36 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu37 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu38 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu39 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu40 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu41 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu42 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu43 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu44 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu45 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu46 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS
acpicpu47 at acpi0: C2(350@41 mwait.3@0x20), C1(1016@1 mwait.1), PSS

Full dmesg attached.

thanks,

.jh
OpenBSD 6.7-current (GENERIC.MP) #1: Tue Jul 28 11:35:11 EDT 2020

jo...@amd64-ports.home.huldtgren.net:/usr/src/sys/arch/amd64/compile/

Re: [patch] snmpd hrStorageSize negative values

2020-07-01 Thread Johan Huldtgren
On 2020-07-01 17:40, Martijn van Duren wrote:
> On Wed, 2020-07-01 at 10:59 -0400, Johan Huldtgren wrote:
> > hello,
> > 
> > On 2017-11-27 10:31, Gerhard Roth wrote:
> > > On Sat, 25 Nov 2017 11:42:07 -0700 Joel Knight  
> > > wrote:
> > > > On Thu, Mar 9, 2017 at 10:02 PM, Joel Knight  
> > > > wrote:
> > > > > Hi.
> > > > > 
> > > > > snmpd(8) uses unsigned ints internally to represent the size and used
> > > > > space of a file system. The HOST-RESOURCES-MIB defines the valid
> > > > > values for those OIDs as 0..2147483647. With sufficiently large file
> > > > > systems, this can cause negative numbers to be returned for the size
> > > > > and used space OIDs.
> > > > > 
> > > > > .1.3.6.1.2.1.25.2.3.1.5.36=-1573167768  
> > > > 
> > > > Hi. Just wanted to bump this again and see if anyone that cares about
> > > > snmp could take a look? Looking for oks and someone who wouldn't mind
> > > > committing it.
> > > > 
> > > > 
> > > > > At sthen's suggestion, do what net-snmp does and fiddle with the
> > > > > values to prevent wrapping. Yes this mucks with the actual values of
> > > > > size, used space, and block size, but it allows snmpd to convey the
> > > > > proper size and used space of the file system which is what most
> > > > > everybody is really interested in.
> > > > > 
> > > > > In case gmail hoses this diff, it's also here:
> > > > > https://www.packetmischief.ca/files/patches/snmpd.hrstorage2.diff  
> > > 
> > > Hi Joel,
> > > 
> > > I think this won't work unless you also change the type of 'size' and
> > > 'used' to u_int64_t.
> > 
> > I ran into an issue where my snmpd underreported my filesystem size.
> > 
> > $ df -h /ftp
> > Filesystem     Size    Used   Avail Capacity  Mounted on
> > /dev/sd0a     50.5T   13.7T   34.3T    29%    /ftp
> > 
> > However snmp reports something different.
> > 
> > $ snmp walk -v 2c -c public localhost hrStorage
> > 
> > hrStorageDescr.40 = STRING: /ftp
> > hrStorageAllocationUnits.40 = INTEGER: 8192 Bytes
> > hrStorageSize.40 = INTEGER: 2487209520
> > 
> > sthen@ pointed me to this thread but suggested 'size_t' as opposed to
> > 'u_int64_t'; making that change and applying the patch fixes the
> > issue for me.
> > 
> > hrStorageDescr.40 = STRING: /ftp
> > hrStorageAllocationUnits.40 = INTEGER: 32768
> > hrStorageUsed.40 = INTEGER: 459624840
> > 
> > Updated patch attached.
> > 
> > thanks,
> > 
> > .jh
> 
> Some minor tweaks:
> - Use u_int64_t instead of size_t
> - Put the calculation outside the switch, so memory can profit as well.
> 
> OK?

Tested and your updated patch works fine for me.

thanks,

.jh

> 
> martijn@
> 
> Index: mib.c
> ===
> RCS file: /cvs/src/usr.sbin/snmpd/mib.c,v
> retrieving revision 1.99
> diff -u -p -r1.99 mib.c
> --- mib.c 15 May 2020 00:56:03 -  1.99
> +++ mib.c 1 Jul 2020 15:39:37 -
> @@ -563,7 +563,7 @@ mib_hrstorage(struct oid *oid, struct be
>       u_int32_t        idx;
>       struct statfs   *mntbuf, *mnt;
>       int              mntsize, maxsize;
> -     u_int32_t        units, size, used, fail = 0;
> +     u_int64_t        units, size, used, fail = 0;
>       const char      *descr = NULL;
>       int              mib[] = { CTL_HW, 0 };
>       u_int64_t        physmem, realmem;
> @@ -645,6 +645,12 @@ mib_hrstorage(struct oid *oid, struct be
>           used = mnt->f_blocks - mnt->f_bfree;
>           sop = &so[3];
>           break;
> +     }
> +
> +     while (size > INT32_MAX) {
> +         units *= 2;
> +         size /= 2;
> +         used /= 2;
>       }
>  
>   /* Tables need to prepend the OID on their own */
> 



Re: [patch] snmpd hrStorageSize negative values

2020-07-01 Thread Johan Huldtgren
hello,

On 2017-11-27 10:31, Gerhard Roth wrote:
> On Sat, 25 Nov 2017 11:42:07 -0700 Joel Knight  wrote:
> > On Thu, Mar 9, 2017 at 10:02 PM, Joel Knight  wrote:
> > > Hi.
> > >
> > > snmpd(8) uses unsigned ints internally to represent the size and used
> > > space of a file system. The HOST-RESOURCES-MIB defines the valid
> > > values for those OIDs as 0..2147483647. With sufficiently large file
> > > systems, this can cause negative numbers to be returned for the size
> > > and used space OIDs.
> > >
> > > .1.3.6.1.2.1.25.2.3.1.5.36=-1573167768  
> > 
> > Hi. Just wanted to bump this again and see if anyone that cares about
> > snmp could take a look? Looking for oks and someone who wouldn't mind
> > committing it.
> > 
> > 
> > > At sthen's suggestion, do what net-snmp does and fiddle with the
> > > values to prevent wrapping. Yes this mucks with the actual values of
> > > size, used space, and block size, but it allows snmpd to convey the
> > > proper size and used space of the file system which is what most
> > > everybody is really interested in.
> > >
> > > In case gmail hoses this diff, it's also here:
> > > https://www.packetmischief.ca/files/patches/snmpd.hrstorage2.diff  
> 
> 
> Hi Joel,
> 
> I think this won't work unless you also change the type of 'size' and
> 'used' to u_int64_t.

I ran into an issue where my snmpd underreported my filesystem size.

$ df -h /ftp
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/sd0a     50.5T   13.7T   34.3T    29%    /ftp

However snmp reports something different.

$ snmp walk -v 2c -c public localhost hrStorage

hrStorageDescr.40 = STRING: /ftp
hrStorageAllocationUnits.40 = INTEGER: 8192 Bytes
hrStorageSize.40 = INTEGER: 2487209520
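
The reported value is consistent with a 64-bit block count being stored
in a 32-bit variable. A minimal sketch of that truncation follows. The
block count 6782176816 used in the note after it is an assumption
inferred from the df output (about 50.5T in 8192-byte blocks); it does
not appear anywhere in this thread.

```c
#include <stdint.h>

/*
 * Store a 64-bit block count in a 32-bit variable, as the unpatched
 * code effectively did. Only the low 32 bits survive the conversion.
 */
static uint32_t
truncate_blocks(uint64_t f_blocks)
{
	return (uint32_t)f_blocks;
}
```

Under that assumption, truncate_blocks(6782176816ULL) yields 2487209520,
the same number snmp printed for hrStorageSize.40: the count is reduced
by exactly 2^32 (4294967296).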

sthen@ pointed me to this thread but suggested 'size_t' as opposed to
'u_int64_t'; making that change and applying the patch fixes the
issue for me.

hrStorageDescr.40 = STRING: /ftp
hrStorageAllocationUnits.40 = INTEGER: 32768
hrStorageUsed.40 = INTEGER: 459624840

Updated patch attached.

thanks,

.jh
Index: usr.sbin/snmpd/mib.c
===
RCS file: /cvs/src/usr.sbin/snmpd/mib.c,v
retrieving revision 1.99
diff -u -p -u -p -r1.99 mib.c
--- usr.sbin/snmpd/mib.c15 May 2020 00:56:03 -  1.99
+++ usr.sbin/snmpd/mib.c1 Jul 2020 14:22:59 -
@@ -563,7 +563,7 @@ mib_hrstorage(struct oid *oid, struct be
        u_int32_t        idx;
        struct statfs   *mntbuf, *mnt;
        int              mntsize, maxsize;
-       u_int32_t        units, size, used, fail = 0;
+       size_t           units, size, used, fail = 0;
        const char      *descr = NULL;
        int              mib[] = { CTL_HW, 0 };
        u_int64_t        physmem, realmem;
@@ -643,6 +643,14 @@ mib_hrstorage(struct oid *oid, struct be
                units = mnt->f_bsize;
                size = mnt->f_blocks;
                used = mnt->f_blocks - mnt->f_bfree;
+
+               /* for large filesystems, do not overflow hrStorageSize */
+               while (size > INT32_MAX) {
+                       size = size >> 1;
+                       units = units << 1;
+                       used = used >> 1;
+               }
+
                sop = &so[3];
                break;
        }
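
As a standalone illustration of the scaling loop in the patch, here is
a small sketch that can be compiled outside of snmpd. The struct and
function names are hypothetical and exist only for this example; the
loop body is the one from the diff. The starting block count in the
note after it (6782176816) is an assumption inferred from the df
output, not a value printed anywhere in the thread.

```c
#include <stdint.h>

/* Hypothetical container for the three values the patch adjusts. */
struct hr_storage {
	uint64_t units;	/* hrStorageAllocationUnits, in bytes */
	uint64_t size;	/* hrStorageSize, counted in units */
	uint64_t used;	/* hrStorageUsed, counted in units */
};

/*
 * Halve size and used while doubling units until size fits in the
 * 0..2147483647 range. The product size * units stays the same, apart
 * from the low bits dropped by each right shift.
 */
static void
hr_storage_scale(struct hr_storage *s)
{
	while (s->size > INT32_MAX) {
		s->size >>= 1;
		s->units <<= 1;
		s->used >>= 1;
	}
}
```

Starting from units = 8192 and the assumed size of 6782176816, the loop
runs twice and ends with units = 32768 and size = 1695544204. The
resulting units value matches the hrStorageAllocationUnits shown in the
output after applying the patch.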


Re: 11n Tx aggregation for iwm(4)

2020-06-27 Thread Johan Huldtgren
On 2020-06-26 20:11, Johan Huldtgren wrote:
> hello,
> 
> On 2020-06-26 14:45, Stefan Sperling wrote:
> > It would be great to get at least one test for all the chipsets the driver
> > supports: 7260, 7265, 3160, 3165, 3168, 8260, 8265, 9260, 9560
> > The behaviour of the access point also matters a great deal. It won't
> > hurt to test the same chipset against several different access points.
> 
> tested on:
> 
> iwm0 at pci1 dev 0 function 0 "Intel Dual Band Wireless-AC 8265" rev 0x78, msi
> 
> AP is a Ruckus 7363.
> 
> $ netstat -W iwm0 | grep "output block"
> 6 new output block ack agreements
> 0 output block ack agreements timed out
> 
> Before:
> 
> bandwidth min/avg/max/std-dev = 16.780/18.325/19.939/1.235 Mbps
> 
> After:
> 
> bandwidth min/avg/max/std-dev = 0.000/15.559/51.631/19.548 Mbps

Testing against a slightly different AP (Ruckus 7372):

before patch:

bandwidth min/avg/max/std-dev = 0.092/14.665/22.589/9.992 Mbps

after patch:

bandwidth min/avg/max/std-dev = 7.020/24.596/41.121/11.300 Mbps

This is the reported mode:

media: IEEE802.11 autoselect (HT-MCS13 mode 11n)

.jh



Re: 11n Tx aggregation for iwm(4)

2020-06-26 Thread Johan Huldtgren
hello,

On 2020-06-26 14:45, Stefan Sperling wrote:
> It would be great to get at least one test for all the chipsets the driver
> supports: 7260, 7265, 3160, 3165, 3168, 8260, 8265, 9260, 9560
> The behaviour of the access point also matters a great deal. It won't
> hurt to test the same chipset against several different access points.

tested on:

iwm0 at pci1 dev 0 function 0 "Intel Dual Band Wireless-AC 8265" rev 0x78, msi

AP is a Ruckus 7363.

$ netstat -W iwm0 | grep "output block"
6 new output block ack agreements
0 output block ack agreements timed out

Before:

bandwidth min/avg/max/std-dev = 16.780/18.325/19.939/1.235 Mbps

After:

bandwidth min/avg/max/std-dev = 0.000/15.559/51.631/19.548 Mbps

.jh



vmm Linux guest crashes frequently

2020-03-17 Thread Johan Huldtgren
hello,

I am running vmd on current and one of my guests is an Ubuntu
18.04.4 LTS. I have two problems: first, this guest crashes
frequently, and second, it often crashes again while it is being
restarted.

Starting with the second issue: when the guest has crashed and is
started back up it will sometimes crash again, but only during fsck,
which makes me suspect disk I/O is the culprit. The output I get when
doing 'vmctl start -c 2' is:

[0.00] ACPI BIOS Error (bug): A valid RSDP was not found 
(20170831/tbxfroot-244) a command-line.   
[1.096305] mce: Unable to init MCE device (rc: -5) 
/dev/vda1: recovering journal  
/dev/vda1: clean, 291619/1966080 files, 4637993/7863808 blocks

[EOT]

I've attached the full log displayed during startup (with VMM_DEBUG).
The one thing I noticed is that this seems to happen just before each
crash:

Mar 17 11:08:40 absu /bsd: vmx_handle_wrmsr: wrmsr exit, msr=0xc90, discarding data written from guest=0x0:0xf
Mar 17 11:08:40 absu /bsd: vmx_handle_rdmsr: rdmsr exit, msr=0xc90, data returned to guest=0x0:0x0
Mar 17 11:08:40 absu vmd[74219]: vcpu_process_com_data: guest reading com1 when not ready
Mar 17 11:08:40 absu vmd[74219]: vcpu_process_com_data: guest reading com1 when not ready
Mar 17 11:08:40 absu /bsd: vmx_handle_rdmsr: rdmsr exit, msr=0x3b, data returned to guest=0x0:0x0
Mar 17 11:08:40 absu vmd[74219]: vioblk_notifyq: unsupported command 0x8
Mar 17 11:08:40 absu vmd[74219]: vioblk_notifyq: unsupported command 0x8
Mar 17 11:08:42 absu vmd[74219]: vmd: vm 3 event thread exited unexpectedly
Mar 17 11:08:42 absu /bsd: vmm_free_vpid: freed VPID/ASID 2

Generally after a few attempts it will get past this and then start
up normally.

The issue of the guest crashing is harder to pin down; it's random.
I've gone days with everything running fine, and sometimes it crashes
multiple times a day. Most common is about once a day or so. When
it did crash, even with VMM_DEBUG I didn't see much:

Mar 17 14:27:13 absu /bsd: vmx_handle_rdmsr: rdmsr exit, msr=0x3b, data returned to guest=0x0:0x0
Mar 17 14:27:46 absu last message repeated 6 times
Mar 17 14:29:46 absu last message repeated 22 times
Mar 17 14:39:44 absu last message repeated 109 times
Mar 17 14:49:46 absu last message repeated 110 times
Mar 17 14:56:53 absu last message repeated 78 times
Mar 17 15:09:45 absu last message repeated 63 times
Mar 17 15:19:46 absu last message repeated 78 times
Mar 17 15:29:48 absu last message repeated 110 times
Mar 17 15:39:50 absu last message repeated 110 times
Mar 17 15:49:47 absu last message repeated 109 times
Mar 17 15:59:49 absu last message repeated 110 times
Mar 17 16:09:49 absu last message repeated 109 times
Mar 17 16:19:49 absu last message repeated 74 times
Mar 17 16:29:45 absu last message repeated 107 times
Mar 17 16:39:47 absu last message repeated 109 times
Mar 17 16:49:51 absu last message repeated 110 times
Mar 17 16:59:47 absu last message repeated 108 times
Mar 17 17:00:59 absu last message repeated 12 times
Mar 17 17:01:20 absu vmd[55361]: vmd: vm 5 event thread exited unexpectedly
Mar 17 17:01:21 absu /bsd: vmm_free_vpid: freed VPID/ASID 2

any hints on further debugging this?

thanks,

.jh
Mar 17 11:07:59 absu /bsd: vm_impl_init_vmx: created vm_map @ 0xfd9db272edd8
Mar 17 11:08:00 absu /bsd: vm_resetcpu: resetting vm 3 vcpu 0 to power on defaults
Mar 17 11:08:00 absu /bsd: Guest EPTP = 0x207da5c01e
Mar 17 11:08:00 absu /bsd: vmm_alloc_vpid: allocated VPID/ASID 2
Mar 17 11:08:04 absu /bsd: vmx_handle_cr: mov to cr0 @ b4e0008b, data=0x80050033
Mar 17 11:08:04 absu /bsd: vmm_handle_cpuid: function 0x06 (thermal/power mgt) not supported
Mar 17 11:08:04 absu /bsd: vmm_handle_cpuid: function 0x0f (QoS info) not supported
Mar 17 11:08:04 absu /bsd: vmm_handle_cpuid: function 0x06 (thermal/power mgt) not supported
Mar 17 11:08:04 absu /bsd: vmm_handle_cpuid: function 0x06 (thermal/power mgt) not supported
Mar 17 11:08:04 absu /bsd: vmm_handle_cpuid: invalid cpuid input leaf 0x10, guest rip=0xb4e6cbbf - resetting to 0xf
Mar 17 11:08:04 absu /bsd: vmm_handle_cpuid: function 0x0f (QoS info) not supported
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: invalid cpuid input leaf 0x10, guest rip=0xb4e6cbbf - resetting to 0xf
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: function 0x0f (QoS info) not supported
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: invalid cpuid input leaf 0x10, guest rip=0xb4e6cbbf - resetting to 0xf
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: function 0x0f (QoS info) not supported
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: invalid cpuid input leaf 0x10, guest rip=0xb4e6cbbf - resetting to 0xf
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: function 0x0f (QoS info) not supported
Mar 17 11:08:05 absu /bsd: vmm_handle_cpuid: invalid cpuid input leaf 0x10, guest rip=0xb

Re: vmm disk unavailable after forceful vm termination

2019-12-07 Thread Johan Huldtgren
On 2019-11-01 12:40, Mike Larkin wrote:
> On Fri, Nov 01, 2019 at 09:20:58AM -0400, Johan Huldtgren wrote:
> > hello,
> > 
> > I have vmd running on -current, in it I have an Ubuntu vm (18.04.3 LTS),
> > every now and then the Ubuntu vm will hang hard, console is dead, only
> > option is to restart it. Now at that point a graceful restart won't
> > work, 'vmctl stop n' will run and return but nothing will happen. So
> > the only option is 'vmctl stop -f n', this will kill the vm. However
> > after that the vm will disappear from the list of vms.
> > 
> > Before killing:
> > 
> > $ vmctl status
> >    ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
> >     2 71924     1    2.0G    1.1G   ttyp2        johan  running plex.vm
> >     1 18308     1    8.0G    6.5G   ttyp0        johan  running monitor.vm
> >     3     -     1    1.0G       -       -        johan  stopped amd64-ports.vm
> >     4     -     1    512M       -       -        johan  stopped i386-ports.vm
> > 
> > After killing:
> > 
> > $ vmctl stop -f 2
> > stopping vm: forced to terminate vm 2
> > 
> > $ vmctl status
> >    ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
> >     1 18308     1    8.0G    6.5G   ttyp0        johan  running monitor.vm
> >     3     -     1    1.0G       -       -        johan  stopped amd64-ports.vm
> >     4     -     1    512M       -       -        johan  stopped i386-ports.vm
> > 
> > The vm is now gone, I can't just start it with 'vmctl start n',
> > but it's defined in vm.conf so let's try starting it by name:
> > 
> > $ vmctl start plex.vm
> > vmctl: start vm command failed: Operation already in progress
> > 
> > ok.. so let's restart vmd.
> > 
> > $ doas rcctl stop vmd
> > vmd(ok)
> > $ doas rcctl start vmd
> > vmd(ok)
> > 
> > but it's not actually running.
> > 
> > $ vmctl status
> > vmctl: connect: /var/run/vmd.sock: Connection refused
> > 
> > Let's try starting manually with debug
> > 
> > $ doas vmd -d
> > startup
> > warning: macro 'sets' not used
> > can't open disk /ftp/vm/plex.img: Resource temporarily unavailable
> > failed to start vm plex.vm
> > parent: configuration failed
> > priv exiting, pid 9509
> > control exiting, pid 63731
> > vmm exiting, pid 28106
> > 
> > So it seems the disk image is now in some state which won't let it be read 
> > again?
> > 
> > $ ls -al /ftp/vm/plex.img
> > -rw---  1 root  wheel  32212254720 Nov  1 08:07 /ftp/vm/plex.img
> > 
> > $ doas file /ftp/vm/plex.img
> > /ftp/vm/plex.img: x86 boot sector; partition 1: ID=0x83, active, starthead 32, startsector 2048, 62910464 sectors
> > 
> > At this point the only solution is rebooting the host running vmd.
> > After that everything will work just fine again until the next time
> > this happens. I don't know if this is known or a bug, googling and
> > scanning through marc.info I couldn't find any reports. I don't know
> > if this is relevant but when I restarted this vm yesterday (along with
> > hanging hard it sometimes just dies and is reported as stopped). I see
> > this in /var/log/daemon
> > 
> > Oct 31 07:58:25 absu vmd[17613]: plex.vm: started vm 2 successfully, tty /dev/ttyp2
> > Oct 31 07:58:29 absu vmd[71924]: vcpu_process_com_data: guest reading com1 when not ready
> > Oct 31 07:58:35 absu vmd[71924]: vioblk_notifyq: unsupported command 0x8
> > Oct 31 07:58:52 absu vmd[71924]: vcpu_process_com_data: guest reading com1 when not ready
> > 
> > Full dmesg of the vmd host below, let me know if I can provide any further 
> > details.
> > 
> > thanks,
> > 
> > .jh
> > 
> 
> Sometimes vmd gets really stuck like this and even vmctl stop -f won't stop
> it and rcctl stop vmd also fails. You probably have a vmd spinning at 100%,
> kill that one manually and it should free things up.
> 
> I know about the problem, but have not had a chance to fix it yet.
> 
> -ml

Just for the archives: this happened again, and just as you stated there was
a vmd process still spinning at 100% from the forcefully killed vm. Killing
it let me restart the vm again without any of the other steps.

thanks,

.jh

> 
> > ---
> > 
> > OpenBSD 6.6-current (GENERIC.MP) #407: Mon Oct 28 00:42:58 MDT 2019
> >  dera...@amd

vmm disk unavailable after forceful vm termination

2019-11-01 Thread Johan Huldtgren
hello,

I have vmd running on -current, in it I have an Ubuntu vm (18.04.3 LTS),
every now and then the Ubuntu vm will hang hard, console is dead, only
option is to restart it. Now at that point a graceful restart won't
work, 'vmctl stop n' will run and return but nothing will happen. So
the only option is 'vmctl stop -f n', this will kill the vm. However
after that the vm will disappear from the list of vms.

Before killing:

$ vmctl status
   ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
    2 71924     1    2.0G    1.1G   ttyp2        johan  running plex.vm
    1 18308     1    8.0G    6.5G   ttyp0        johan  running monitor.vm
    3     -     1    1.0G       -       -        johan  stopped amd64-ports.vm
    4     -     1    512M       -       -        johan  stopped i386-ports.vm

After killing:

$ vmctl stop -f 2
stopping vm: forced to terminate vm 2

$ vmctl status
   ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
    1 18308     1    8.0G    6.5G   ttyp0        johan  running monitor.vm
    3     -     1    1.0G       -       -        johan  stopped amd64-ports.vm
    4     -     1    512M       -       -        johan  stopped i386-ports.vm

The vm is now gone, I can't just start it with 'vmctl start n',
but it's defined in vm.conf so let's try starting it by name:

$ vmctl start plex.vm
vmctl: start vm command failed: Operation already in progress

ok.. so let's restart vmd.

$ doas rcctl stop vmd
vmd(ok)
$ doas rcctl start vmd
vmd(ok)

but it's not actually running.

$ vmctl status
vmctl: connect: /var/run/vmd.sock: Connection refused

Let's try starting manually with debug

$ doas vmd -d
startup
warning: macro 'sets' not used
can't open disk /ftp/vm/plex.img: Resource temporarily unavailable
failed to start vm plex.vm
parent: configuration failed
priv exiting, pid 9509
control exiting, pid 63731
vmm exiting, pid 28106

So it seems the disk image is now in some state which won't let it be read 
again?

$ ls -al /ftp/vm/plex.img
-rw---  1 root  wheel  32212254720 Nov  1 08:07 /ftp/vm/plex.img

$ doas file /ftp/vm/plex.img
/ftp/vm/plex.img: x86 boot sector; partition 1: ID=0x83, active, starthead 32, startsector 2048, 62910464 sectors

At this point the only solution is rebooting the host running vmd.
After that everything will work just fine again until the next time
this happens. I don't know if this is known or a bug, googling and
scanning through marc.info I couldn't find any reports. I don't know
if this is relevant but when I restarted this vm yesterday (along with
hanging hard it sometimes just dies and is reported as stopped). I see
this in /var/log/daemon

Oct 31 07:58:25 absu vmd[17613]: plex.vm: started vm 2 successfully, tty /dev/ttyp2
Oct 31 07:58:29 absu vmd[71924]: vcpu_process_com_data: guest reading com1 when not ready
Oct 31 07:58:35 absu vmd[71924]: vioblk_notifyq: unsupported command 0x8
Oct 31 07:58:52 absu vmd[71924]: vcpu_process_com_data: guest reading com1 when not ready

Full dmesg of the vmd host below, let me know if I can provide any further 
details.

thanks,

.jh

---

OpenBSD 6.6-current (GENERIC.MP) #407: Mon Oct 28 00:42:58 MDT 2019
 dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 137318658048 (130957MB)
avail mem = 133144268800 (126976MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.8 @ 0x7db4c000 (134 entries)
bios0: vendor American Megatrends Inc. version "3703" date 04/24/2018
bios0: ASUSTeK COMPUTER INC. Z10PE-D16 Series
acpi0 at bios0: ACPI 5.0
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP APIC FPDT FIDT MCFG EINJ UEFI HPET MSCT SLIT SRAT WDDT 
SSDT SPMI SSDT SSDT PRAD DMAR HEST BERT ERST
acpi0: wakeup devices IP2P(S3) EHC1(S4) BR1A(S4) BR1B(S4) BR2A(S4) BR2B(S4) 
BR2C(S4) BR2D(S4) BR3A(S4) BR3B(S4) BR3C(S4) BR3D(S4) RP01(S4) RP02(S4) 
RP03(S4) RP04(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2394.84 MHz, 06-3f-02
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2394.47 MHz, 06-3f-02
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,P

Re: bsd.rd failure in VirtualBox

2018-09-16 Thread Johan Huldtgren
On 2018/09/16 10:52, David Higgs wrote:
> On Sun, Sep 16, 2018 at 10:17 AM, David Higgs  wrote:
>> On Sat, Sep 15, 2018 at 10:05 PM, Philip Guenther  wrote:
>>> On Sat, Sep 15, 2018 at 11:59 AM David Higgs  wrote:

>>>> I often use VirtualBox (version 5.2.18 on OS X) to familiarize myself
>>>> with new features in snapshots, before upgrading my physical hardware.
>>>>
>>>> This afternoon, I tried updating bsd.rd (amd64, 6.4-beta RAMDISK_CD
>>>> #281) and wasn't able to successfully boot it.  I had to rely on the
>>>> video capture ability of VirtualBox to even notice there was a panic
>>>> (typed out below) before it rebooted to the "BIOS" splash screen.
>>>
>>> ...

>>>> Also attached is the dmesg from a prior working snapshot.  I haven't
>>>> tried updating since this prior snapshot, so I don't have further
>>>> insight into when the issue first appeared.
>>>
>>>
>>> Thank you for the complete and clear report!
>>>
>>> I have a diff in the amd64 snapshots to use the CPU's PCID support in many
>>> cases and this VirtualBox setup found a bug in it.  I've generated a new
>>> diff that should fix this, so a future snap should fix this, though when
>>> that'll happen depends on the snap builder's schedule.
>>>
>>
>> Not sure if the fix made it into RAMDISK_CD #282, but this panic is
>> slightly different.  I haven't tried reproducing to see if the panic
>> message differs between boots.
>>
>> 
>> root on rd0a swap on rd0b dump on rd0b
>> uvm_fault(0xff011f73ac60, 0x208, 0, 1) -> e
>> fatal page fault in supervisor mode
>> trap type 6 code 0 rip 8135510b cs 8 rflags 10246 cr2 208 cpl
>> 0 rsp 800022026c90
>> gsbase 0x81870ff0 kgsbase 0x0
>> panic: trap type 6, code=0, pc=8135510b
>> syncing disk... done
>>
>> dump to dev 17,1 not possible
>> rebooting...
>> 
>>
>> Hope this helps.
>>
> 
> FWIW, the vbox capture feature is pretty buggy - it doesn't create the
> file when it says it is recording, and it frequently crashes.  It is
> possible the panic above is from #281 instead, because I deleted the
> video before realizing this.
> 
> Below is definitely from #282.
> 
> 
> Welcome to the OpenBSD/amd64 6.4 installation program.
> fatal protection fault in supervisor mode
> trap type 4 code 0 rip 810f4244 cs 8 rflags 10286 cr2 6c1fed
> cpl a rsp 800022098800
> gsbase 0x81870ff0 kgsbase 0x0
> panic: trap type 4, code 0, pc=0x 810f4244
> syncing disks... done
> 
> dump to dev 17,1 not possible
> rebooting...
> 
> 
> Hope this is actually useful and not another stupid VirtualBox bug.

I see an almost identical panic on real hardware too; the only difference
is the value after 'rsp':

Welcome to the OpenBSD/amd64 6.4 installation program.
fatal protection fault in supervisor mode
trap type 4 code 0 rip 810f4244 cs 8 rflags 10286 cr2 6c1fed cpl a rsp 8000220ba9e0
gsbase 0x81870ff0 kgsbase 0x0
panic: trap type 4, code 0, pc=810f4244
syncing disks... done

dump to dev 17,1 not possible
rebooting...

Below is first the dmesg from the working snapshot, and then one from
booting bsd.rd. Note the ACPI error about not being able to load tables;
that's not there on the working snap. That might be the culprit, at least
in my case?

thanks,

.jh

dmesg from the working snapshot:

OpenBSD 6.3-current (GENERIC.MP) #180: Fri Aug  3 20:53:10 MDT 2018
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 16838430720 (16058MB)
avail mem = 16318918656 (15562MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec820 (29 entries)
bios0: vendor American Megatrends Inc. version "P2.10" date 05/12/2015
bios0: ASRock Z97 Extreme4
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP APIC FPDT SSDT SSDT SSDT MCFG HPET SSDT SSDT AAFT UEFI
acpi0: wakeup devices PEGP(S4) PEG0(S4) PEGP(S4) PEG1(S4) PEGP(S4) PEG2(S4) 
PS2K(S4) UAR1(S4) USB1(S3) PXSX(S4) RP01(S4) PXSX(S4) PXSX(S4) PXSX(S4) 
RP04(S4) PXSX(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Pentium(R) CPU G3258 @ 3.20GHz, 3199.54 MHz
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,DEADLINE,XSAVE,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,ERMS,INVPCID,IBRS,IBPB,STIBP,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Pentium(R) CPU G3258 @ 3.20GHz, 3199.08 MHz
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PA

Re: request for test: mfii

2016-10-28 Thread Johan Huldtgren
On 10/25/16 23:50, YASUOKA Masahiko wrote:
> On Wed, 26 Oct 2016 10:26:19 +1100
> Jonathan Gray  wrote:
>> On Tue, Oct 25, 2016 at 05:29:55PM +0900, YASUOKA Masahiko wrote:
>>> I'm working on making mfii(4) bio(4) capable.
>>>
>>> If you have a machine which has mfii(4), I'd like you to test the diff
>>> following.  (It might be risky for production machines for this
>>> moment.)
>>>
>>> After the diff applied, bioctl(8) against the disk (eg. sd0) starts
>>> working and also "sysctl hw.sensors.mfii0" will appear.
>>>
>>> Especially if you can configure a hotspare, testing it is very
>>> helpful for me since I can't use a hotspare on my test machine.
> (snip)
>>> +   case BIOC_SATEST:
>>> +   cmd = MR_DCMD_SPEAKER_TEST;
>>> +   break;
>>> +   default:
>>> +   return (EINVAL);
>>> +   }
>>> +
>>> +   ccb = scsi_io_get(&sc->sc_iopool, 0);
>>> +   rv = mfii_mgmt(sc, ccb, MR_DCMD_PD_SET_STATE, NULL,
>>> +   &spkr, sizeof(spkr), flags | SCSI_NOSLEEP);
>>
>> Should this be cmd rather than MR_DCMD_PD_SET_STATE?
>> The cmd values from the switch statement are not used.
> 
> Oops.  Yes, that's right.
> 
>>> +int
>>> +mfii_ioctl_blink(struct mfii_softc *sc, struct bioc_blink *bb)
> (snip)
>>> +   case BIOC_SBBLINK:
>>> +   case BIOC_SBALARM:
>>> +   cmd = MR_DCMD_PD_BLINK;
>>> +   break;
>>> +   default:
>>> +   rv = EINVAL;
>>> +   goto done;
>>> +   }
>>> +
>>> +   ccb = scsi_io_get(&sc->sc_iopool, 0);
>>> +   rv = mfii_mgmt(sc, ccb, cmd, NULL, NULL, 0, SCSI_NOSLEEP);
>>> +   scsi_io_put(&sc->sc_iopool, ccb);
> 
> Passing the mbox to mfii_mgmt() was missing.
> 
>>> + done:
>>> +   free(list, M_TEMP, sizeof(*list));
>>> +
>>> +   return (ENOTTY);
>>> +}
>>
>> Shouldn't this be return (rv) to return the EINVAL values?
>> With rv set to 0 before the 'done' to return 0 when there is no error?
> 
> Yes, that's also right.  Thanks.
> 
> Let me update the diff.
> 
> Index: sys/dev/pci/mfii.c
> ===
> RCS file: /cvs/src/sys/dev/pci/mfii.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 mfii.c
> --- sys/dev/pci/mfii.c24 Oct 2016 05:27:52 -  1.28
> +++ sys/dev/pci/mfii.c26 Oct 2016 03:46:15 -
> @@ -22,9 +22,11 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  
> @@ -212,6 +214,13 @@ struct mfii_iop {
>   u_int8_t sge_flag_eol;
>  };
>  
> +struct mfii_cfg {
> + struct mfi_conf *cfg;
> + struct mfi_array*cfg_array;
> + struct mfi_ld_cfg   *cfg_ld;
> + struct mfi_hotspare *cfg_hs;
> +};
> +
>  struct mfii_softc {
>   struct device   sc_dev;
>   const struct mfii_iop   *sc_iop;
> @@ -250,11 +259,15 @@ struct mfii_softc {
>   struct scsi_iopool  sc_iopool;
>  
>   struct mfi_ctrl_infosc_info;
> +
> + struct ksensor  *sc_sensors;
> + struct ksensordev   sc_sensordev;
>  };
>  
>  int  mfii_match(struct device *, void *, void *);
>  void mfii_attach(struct device *, struct device *, void *);
>  int  mfii_detach(struct device *, int);
> +int  mfii_scsi_ioctl(struct scsi_link *, u_long, caddr_t, int);
>  
>  struct cfattach mfii_ca = {
>   sizeof(struct mfii_softc),
> @@ -277,7 +290,7 @@ struct scsi_adapter mfii_switch = {
>   scsi_minphys,
>   NULL, /* probe */
>   NULL, /* unprobe */
> - NULL  /* ioctl */
> + mfii_scsi_ioctl
>  };
>  
>  void mfii_pd_scsi_cmd(struct scsi_xfer *);
> @@ -334,7 +347,26 @@ int  mfii_scsi_cmd_cdb(struct mfii_soft
>   struct scsi_xfer *);
>  int  mfii_pd_scsi_cmd_cdb(struct mfii_softc *,
>   struct scsi_xfer *);
> -
> +int  mfii_scsi_ioctl_cache(struct scsi_link *, u_int,
> + struct dk_cache *);
> +#if NBIO > 0
> +int  mfii_ioctl(struct device *, u_long, caddr_t);
> +int  mfii_fill_cfg(struct mfii_softc *, struct mfii_cfg *);
> +int  mfii_ioctl_inq(struct mfii_softc *, struct bioc_inq *);
> +int  mfii_ioctl_vol(struct mfii_softc *, struct bioc_vol *);
> +int  mfii_ioctl_disk(struct mfii_softc *,
> + struct bioc_disk *);
> +int  mfii_ioctl_alarm(struct mfii_softc *,
> + struct bioc_alarm *);
> +int  mfii_ioctl_blink(struct mfii_softc *,
> + struct bioc_blink *);
> +int  mfii_ioctl_setstate(struct mfii_softc *,
> + struct bioc_setstate *);
> +int  mfii_ioctl_patrol(struct mfii_softc *,
> + struct bioc_patrol *);
> +int  mfii_create_sensors(struct mfii_softc *);
> +void mfii_refresh_sensors(void *);
> +#endif
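
The two review points quoted above (pass the cmd picked by the switch
instead of a fixed opcode, and return rv instead of an unconditional
ENOTTY) boil down to a common single-exit pattern. Below is a minimal,
self-contained sketch of that pattern; it is not the mfii(4) code. All
names (demo_op, demo_cmd, demo_mgmt, demo_ioctl, demo_last_cmd) are
hypothetical and only stand in for the driver's opcodes and management
call.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical opcodes and firmware commands, for illustration only. */
enum demo_op { DEMO_OP_ENABLE, DEMO_OP_SILENCE, DEMO_OP_TEST };
enum demo_cmd {
	DEMO_CMD_NONE, DEMO_CMD_ENABLE, DEMO_CMD_SILENCE, DEMO_CMD_TEST
};

/* Remembers the last command "sent" so a caller can inspect it. */
static enum demo_cmd demo_last_cmd = DEMO_CMD_NONE;

/* Stand-in for a management call; it always succeeds in this sketch. */
static int
demo_mgmt(enum demo_cmd cmd)
{
	assert(cmd != DEMO_CMD_NONE);
	demo_last_cmd = cmd;
	return (0);
}

int
demo_ioctl(int op)
{
	enum demo_cmd cmd;
	char *scratch;
	int rv;

	/* Resource that must be released on every path. */
	scratch = malloc(16);
	if (scratch == NULL)
		return (ENOMEM);

	switch (op) {
	case DEMO_OP_ENABLE:
		cmd = DEMO_CMD_ENABLE;
		break;
	case DEMO_OP_SILENCE:
		cmd = DEMO_CMD_SILENCE;
		break;
	case DEMO_OP_TEST:
		cmd = DEMO_CMD_TEST;
		break;
	default:
		rv = EINVAL;
		goto done;
	}

	/* Use the command selected above, not a hard-coded one. */
	rv = demo_mgmt(cmd);

done:
	free(scratch);
	/* Report the real result: 0 on success, the errno value otherwise. */
	return (rv);
}
```

With this shape, demo_ioctl(DEMO_OP_TEST) sends DEMO_CMD_TEST and returns
0, while an unknown opcode returns EINVAL after the scratch buffer has
been freed.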

Re: carp backup becomes no carrier

2015-04-28 Thread Johan Huldtgren

hello,


I doubt the problem lies in your setup, I found the master->active
problem, diff below should correct that, can you tell me if it helps?

Index: netinet/ip_carp.c
===
RCS file: /cvs/src/sys/netinet/ip_carp.c,v
retrieving revision 1.253
diff -u -p -r1.253 ip_carp.c
--- netinet/ip_carp.c   22 Apr 2015 06:44:17 -  1.253
+++ netinet/ip_carp.c   28 Apr 2015 09:31:07 -
@@ -750,6 +750,7 @@ carp_clone_create(ifc, unit)
if_attach(ifp);
ether_ifattach(ifp);
ifp->if_type = IFT_CARP;
+   ifp->if_sadl->sdl_type = IFT_CARP;
ifp->if_output = carp_output;

/* Hook carp_addr_updated to cope with address and route changes. */


ok, with this patch status is once again "master" and "backup". Overall
behavior also seems better; I'm not seeing periods when I can't reach the
carp interfaces after a failover. I believe this fixes my issues. I've been
testing all morning with both reboot failovers and increasing the demote
counters, and all seems well now.
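
A guess at why this one-line change matters, as a sketch rather than
anything checked against the ifconfig or kernel sources: if the status
string is picked by looking up the pair (interface type, link state), then
an interface whose recorded type is Ethernet would be shown as "active" or
"no carrier", and one whose type is carp as "master" or "backup". That
matches what I saw before and after the patch. The table and names below
(demo_iftype, demo_linkstate, demo_table, demo_status) are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel constants differ. */
enum demo_iftype { DEMO_IFT_ETHER, DEMO_IFT_CARP };
enum demo_linkstate { DEMO_LINK_DOWN, DEMO_LINK_UP };

struct demo_desc {
	enum demo_iftype type;
	enum demo_linkstate state;
	const char *text;
};

/* One description per (type, state) pair. */
static const struct demo_desc demo_table[] = {
	{ DEMO_IFT_ETHER, DEMO_LINK_DOWN, "no carrier" },
	{ DEMO_IFT_ETHER, DEMO_LINK_UP,   "active" },
	{ DEMO_IFT_CARP,  DEMO_LINK_DOWN, "backup" },
	{ DEMO_IFT_CARP,  DEMO_LINK_UP,   "master" },
};

/* Return the status text for an interface of the given type and state. */
const char *
demo_status(enum demo_iftype type, enum demo_linkstate state)
{
	size_t i;

	assert(state == DEMO_LINK_DOWN || state == DEMO_LINK_UP);

	for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
		if (demo_table[i].type == type &&
		    demo_table[i].state == state)
			return (demo_table[i].text);
	}
	return ("unknown");
}
```

In this model the same link state yields different text depending only on
the type: a carp interface mislabelled as Ethernet reports "no carrier"
where it should say "backup".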

thanks,

.jh




Re: carp backup becomes no carrier

2015-04-27 Thread Johan Huldtgren

hello,


If you try 1.250 and 1.253 and tell me if you can reproduce the problem
that would be really helpful.  In case you see something weird, could
you include the routing table "netstat -rnf inet" in your report?  If
you can also play with tcpdump on the various pseudo-interfaces and see
if something is wrong that would be great.


I'm not sure what is going on, it seems likely that the problem lies
in my setup or something I've done, I'll explain the behavior I'm
seeing and hopefully you can use it to rule out that the error lies
in ip_carp.c at least.

As I stated originally I have two carp nodes, one is still running
the April 12th snap, the other is running the April 23rd snap but with
a recompiled kernel containing ip_carp.c r=1.249, 1.250, or 1.253. The
host running the April 12th snap is normally the backup, it has an
advskew of 100 set in /etc/hostname.carp*

If the April 23rd host is master and a failover occurs, I can not
reach the carp interfaces on the April 12th node for several minutes,
unless the April 23rd host is shutdown / rebooted. If the failover
goes the other way from the April 12th host to the April 23rd host,
this does not happen, here I never lose a ping. This behavior is seen
regardless of which version of ip_carp.c I'm using.

Further, when a reboot is initiated on the April 23rd host and the
April 12th host becomes the master, it stays the master even when
the April 23rd host returns and takes over (resulting in two masters).
At this point 'ifconfig -g carp [-]carpdemote 120' is only sometimes
successful in forcing one of the nodes into backup mode. With the April
23rd host running version 1.253 I never managed this and had to
reboot the April 12th host to make one of them end up the backup.
I am assuming this has to do with this error:

pfsync: failed to receive bulk update

I will note though that my initial report regarding carp interfaces
showing up as "no carrier" happens on all kernels with ip_carp.c
after r=1.249. While I missed it originally, it seems that the
status of a carp interface that is master has also changed and is now
reported as "active" as opposed to "master"; it's almost like they
are inheriting their status from what one would expect the parent
interface to have.

Below is ifconfig and netstat output for each version both as master
and backup. tcpdump on the carp interfaces didn't reveal anything to
me that looked out of the ordinary, but if there is something specific
you'd like to look for or just include output from them when things
aren't working I can do that.

thanks,

.jh


###
ip_carp.c r=1.249
###

$ ifconfig
lo0: flags=8049 mtu 32768
priority: 0
groups: lo
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7
inet6 ::1 prefixlen 128
inet 127.0.0.1 netmask 0xff00
vr0: flags=8b43 mtu 1500
lladdr 00:00:24:c8:da:54
priority: 0
groups: egress
media: Ethernet autoselect (100baseTX full-duplex)
status: active
inet 172.16.0.3 netmask 0xff00 broadcast 172.16.0.255
vr1: flags=8b43 mtu 1500
lladdr 00:00:24:c8:da:55
priority: 0
media: Ethernet autoselect (100baseTX full-duplex)
status: active
vr2: flags=8b43 mtu 1500
lladdr 00:00:24:c8:da:56
priority: 0
media: Ethernet autoselect (100baseTX full-duplex)
status: active
vr3: flags=8b43 mtu 1500
lladdr 00:00:24:c8:da:57
priority: 0
media: Ethernet autoselect (100baseTX full-duplex)
status: active
ipw0: flags=8802 mtu 1500
lladdr 00:04:23:83:7a:a1
priority: 4
groups: wlan
media: IEEE802.11 autoselect (autoselect mode 11b)
status: no network
ieee80211: nwid "" 100dBm
enc0: flags=0<>
priority: 0
groups: enc
status: active
vlan20: flags=8943 mtu 1500
lladdr 00:00:24:c8:da:55
priority: 0
vlan: 20 parent interface: vr1
groups: vlan
status: active
inet 192.168.100.3 netmask 0xff00 broadcast 192.168.100.255
vlan30: flags=8943 mtu 1500
lladdr 00:00:24:c8:da:56
priority: 0
vlan: 30 parent interface: vr2
groups: vlan
status: active
inet 192.168.0.3 netmask 0xff00 broadcast 192.168.0.255
vlan666: flags=8943 mtu 1500
lladdr 00:00:24:c8:da:57
priority: 0
vlan: 666 parent interface: vr3
groups: vlan
status: active
inet 10.66.66.3 netmask 0xff00 broadcast 10.66.66.255
tun0: flags=8051 mtu 1500
priority: 0
groups: tun
status: active
inet 10.6.6.1 --> 10.6.6.2 netmask 0x
pfsync0: flags=41 mtu 1500
priority: 0
pfsync: syncdev: vlan666 syncpeer: 10.66.66.2 maxupd: 128 defer: off
groups: carp pfsync
pflog0: flags=141 mtu 33192
priority: 0
groups: pflog
carp0: flags=8843 mtu 1500
lladdr 00:00:5e:00:01:01
priority: 0

Re: carp backup becomes no carrier

2015-04-24 Thread Johan Huldtgren

hello,

a few hours after I sent the previous e-mail the backup
(April 23rd snap) took over and became the master, at
that point I could not reach the carp interfaces anymore.
Reverting roles so the host running the April 12th snap
became the master would mostly fix the problems although
occasionally things would seem to get confused and traffic
(esp to vlan666, which my laptop isn't on but has access
to) would cease. Shutting down the node running the April
23rd snap would generally clear this up, but I'm not sure
if this is a red herring and there is some caching going
on somewhere which is clouding my troubleshooting efforts.

Regardless, I stood up an i386 vm, downloaded -current but
grabbed ip_carp.c r1.249, and built a new kernel. Copied it
over to the firewall which had the April 23rd snap and now
everything is working as it was before. Traffic is flowing
as expected regardless of which host is master and which is
backup. It's only been a few hours, but so far so good.

thanks,

.jh


On 2015-04-24 13:15, Johan Huldtgren wrote:

hello,

I noticed some carp weirdness and sthen@ thought it might be worth
bringing to light.  Quick background, I run two carp nodes, one
(current master) is running the April 12th snapshot, the other is
running the April 23rd snapshot. When the node running the April 23rd
snap is the backup, ifconfig reports the status of all its carp
interfaces as "no carrier", whereas before (as far as I can
remember, and on the April 12th snap at least) it would report
"backup". Once the backup becomes the master, the status changes to
"master".

I don't notice anything not working, however this behavior is perhaps
not expected.

dmesgs and ifconfig output for each host below, let me know if you
need anything further.

thanks,

.jh

April 12th snapshot host:

$ dmesg
syncing disks... done
OpenBSD 5.7-current (GENERIC) #772: Sun Apr 12 17:38:03 MDT 2015
dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD"
586-class) 500 MHz
cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX,MMXX,3DNOW2,3DNOW
real mem  = 536363008 (511MB)
avail mem = 515301376 (491MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: date 20/71/05, BIOS32 rev. 0 @ 0xfac40
pcibios0 at bios0: rev 2.0 @ 0xf/0x1
pcibios0: pcibios_get_intr_routing - function not supported
pcibios0: PCI IRQ Routing information unavailable.
pcibios0: PCI bus #0 is the last bus
bios0: ROM list: 0xc8000/0xa800
cpu0 at mainbus0: (uniprocessor)
mtrr: K6-family MTRR support (2 registers)
amdmsr0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (no bios)
0:20:0: io address conflict 0x6100/0x100
0:20:0: io address conflict 0x6200/0x200
pchb0 at pci0 dev 1 function 0 "AMD Geode LX" rev 0x31
glxsb0 at pci0 dev 1 function 2 "AMD Geode LX Crypto" rev 0x00: RNG AES
vr0 at pci0 dev 6 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 11,
address 00:00:24:c9:58:4c
ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
vr1 at pci0 dev 7 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 5,
address 00:00:24:c9:58:4d
ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
vr2 at pci0 dev 8 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 9,
address 00:00:24:c9:58:4e
ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
vr3 at pci0 dev 9 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 12,
address 00:00:24:c9:58:4f
ukphy3 at vr3 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
ral0 at pci0 dev 17 function 0 "Ralink RT2561S" rev 0x00: irq 15,
address 00:12:0e:61:7f:b0
ral0: MAC/BBP RT2561C, RF RT5225
glxpcib0 at pci0 dev 20 function 0 "AMD CS5536 ISA" rev 0x03: rev 3,
32-bit 3579545Hz timer, watchdog, gpio, i2c
gpio0 at glxpcib0: 32 pins
iic0 at glxpcib0
pciide0 at pci0 dev 20 function 2 "AMD CS5536 IDE" rev 0x01: DMA,
channel 0 wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 0: 
wd0: 4-sector PIO, LBA, 7815MB, 16007040 sectors
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
pciide0: channel 1 ignored (disabled)
ohci0 at pci0 dev 21 function 0 "AMD CS5536 USB" rev 0x02: irq 7,
version 1.0, legacy support
ehci0 at pci0 dev 21 function 1 "AMD CS5536 USB" rev 0x02: irq 7
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 "AMD EHCI root hub" rev 2.00/1.00 addr 1
isa0 at glxpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5
pckbc0: unable to establish interrupt for aux slot
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keybo

carp backup becomes no carrier

2015-04-24 Thread Johan Huldtgren

hello,

I noticed some carp weirdness and sthen@ thought it might be worth
bringing to light.  Quick background, I run two carp nodes, one
(current master) is running the April 12th snapshot, the other is
running the April 23rd snapshot. When the node running the April 23rd
snap is the backup, ifconfig reports the status of all its carp
interfaces as "no carrier", whereas before (as far as I can
remember, and on the April 12th snap at least) it would report
"backup". Once the backup becomes the master, the status changes to
"master".

I don't notice anything not working, however this behavior is perhaps
not expected.

dmesgs and ifconfig output for each host below, let me know if you
need anything further.

thanks,

.jh

April 12th snapshot host:

$ dmesg
syncing disks... done
OpenBSD 5.7-current (GENERIC) #772: Sun Apr 12 17:38:03 MDT 2015
dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) 500 MHz
cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX,MMXX,3DNOW2,3DNOW
real mem  = 536363008 (511MB)
avail mem = 515301376 (491MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: date 20/71/05, BIOS32 rev. 0 @ 0xfac40
pcibios0 at bios0: rev 2.0 @ 0xf/0x1
pcibios0: pcibios_get_intr_routing - function not supported
pcibios0: PCI IRQ Routing information unavailable.
pcibios0: PCI bus #0 is the last bus
bios0: ROM list: 0xc8000/0xa800
cpu0 at mainbus0: (uniprocessor)
mtrr: K6-family MTRR support (2 registers)
amdmsr0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (no bios)
0:20:0: io address conflict 0x6100/0x100
0:20:0: io address conflict 0x6200/0x200
pchb0 at pci0 dev 1 function 0 "AMD Geode LX" rev 0x31
glxsb0 at pci0 dev 1 function 2 "AMD Geode LX Crypto" rev 0x00: RNG AES
vr0 at pci0 dev 6 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 11, address 00:00:24:c9:58:4c
ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr1 at pci0 dev 7 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 5, address 00:00:24:c9:58:4d
ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr2 at pci0 dev 8 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 9, address 00:00:24:c9:58:4e
ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
vr3 at pci0 dev 9 function 0 "VIA VT6105M RhineIII" rev 0x96: irq 12, address 00:00:24:c9:58:4f
ukphy3 at vr3 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
ral0 at pci0 dev 17 function 0 "Ralink RT2561S" rev 0x00: irq 15, address 00:12:0e:61:7f:b0
ral0: MAC/BBP RT2561C, RF RT5225
glxpcib0 at pci0 dev 20 function 0 "AMD CS5536 ISA" rev 0x03: rev 3, 32-bit 3579545Hz timer, watchdog, gpio, i2c
gpio0 at glxpcib0: 32 pins
iic0 at glxpcib0
pciide0 at pci0 dev 20 function 2 "AMD CS5536 IDE" rev 0x01: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 0: 
wd0: 4-sector PIO, LBA, 7815MB, 16007040 sectors
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
pciide0: channel 1 ignored (disabled)
ohci0 at pci0 dev 21 function 0 "AMD CS5536 USB" rev 0x02: irq 7, version 1.0, legacy support
ehci0 at pci0 dev 21 function 1 "AMD CS5536 USB" rev 0x02: irq 7
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 "AMD EHCI root hub" rev 2.00/1.00 addr 1
isa0 at glxpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5
pckbc0: unable to establish interrupt for aux slot
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
nsclpcsio0 at isa0 port 0x2e/2: NSC PC87366 rev 10: GPIO VLM TMS
gpio1 at nsclpcsio0: 29 pins
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
usb1 at ohci0: USB revision 1.0
uhub1 at usb1 "AMD OHCI root hub" rev 1.00/1.00 addr 1
vscsi0 at root
scsibus1 at vscsi0: 256 targets
softraid0 at root
scsibus2 at softraid0: 256 targets
root on wd0a (ba730608caf94ae4.a) swap on wd0b dump on wd0b
carp0: state transition: BACKUP -> MASTER
carp1: state transition: BACKUP -> MASTER
carp2: state transition: BACKUP -> MASTER
carp3: state transition: BACKUP -> MASTER
carp0: state transition: MASTER -> BACKUP
carp1: state transition: MASTER -> BACKUP
carp2: state transition: MASTER -> BACKUP
carp3: state transition: MASTER -> BACKUP
carp3: state transition: BACKUP -> MASTER
carp2: state transition: BACKUP -> MASTER
carp1: state transition: BACKUP -> MASTER
carp0: state transition: BACKUP -> MASTER

$ ifconfig
lo0: flags=8049 mtu 32768
priority: 0
groups: lo
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x6
inet6 ::1 prefixlen 128
inet 127.0.0.1 netmask 0xff00
vr0: flags=8b43 mtu 1500
lladdr 00:00:24:c9:58:4c
priority: 0
groups: egres