Re: sparc64 -CURRENT in LDOM: ERROR: Last Trap: Fast Data Access Protection

2017-05-26 Thread Ted Unangst
Ax0n wrote:
> FWIW, the kernels running in my -stable guests are considerably larger than
> 8MB, and not much smaller than the -CURRENT kernels.

So it's actually the size of the code in the kernel, not the file size.

From your boot message:

Booting /virtual-devices@100/channel-devices@200/disk@0:a/bsd
8381472@0x100+7136@0x17fe420+196864@0x180+3997440@0x1830100

8381472 + 7136 (padding) = 8388608
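
The sum above can be checked with a small C sketch. The macro and function names are hypothetical and are not taken from the kernel or boot loader sources. Reading the first two fields of the boot line as "text" and "padding" follows the arithmetic in this reply.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes copied from the boot message quoted above. */
#define KERNEL_TEXT_BYTES    8381472u
#define KERNEL_PADDING_BYTES 7136u

/* Hypothetical name for the 8MB limit discussed in this thread. */
#define KERNEL_LIMIT_BYTES   (8u * 1024u * 1024u)

/* Bytes left below the limit; 0 means the segment fills it exactly. */
uint32_t
headroom(uint32_t text, uint32_t padding)
{
	uint32_t used = text + padding;

	/* Larger values are exactly the case the thread says goes wrong. */
	assert(used <= KERNEL_LIMIT_BYTES);
	return KERNEL_LIMIT_BYTES - used;
}
```

Calling headroom(KERNEL_TEXT_BYTES, KERNEL_PADDING_BYTES) returns 0, which matches the observation that text plus padding lands exactly on 8388608 bytes.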



Re: sparc64 -CURRENT in LDOM: ERROR: Last Trap: Fast Data Access Protection

2017-05-26 Thread Ax0n
FWIW, the kernels running in my -stable guests are considerably larger than
8MB, and not much smaller than the -CURRENT kernels.

-- a running LDOM guest --
-bash-4.4$ doas cu -l ttyV0
Connected to /dev/ttyV0 (speed 9600)

OpenBSD/sparc64 (puffyone.ldom.openbsd.local) (console)
login: axon
Password:
Last login: Fri May 26 00:46:47 on console
OpenBSD 6.1 (GENERIC.MP) #58: Sat Apr  1 17:10:24 MDT 2017

Welcome to OpenBSD: The proactively secure Unix-like operating system.
[...]
You have new mail.
$ uname -a
OpenBSD puffyone.ldom.openbsd.local 6.1 GENERIC.MP#58 sparc64
$ ls -la /bsd*
-rw-r--r--  1 root  wheel  9487408 Dec 31  1999 /bsd
-rw-r--r--  1 root  wheel  2739432 Dec 31  1999 /bsd.rd
-rw-r--r--  1 root  wheel  9440853 Dec 31  1999 /bsd.sp

-- the -CURRENT image (bsd.rd's been copied to bsd for testing) --
-bash-4.4$ doas vnconfig /dev/vnd0c /home/axon/vm/vdisk5
-bash-4.4$ doas mount /dev/vnd0a /mnt
-bash-4.4$ ls -al /mnt/bsd*
-rw-r--r--  1 root  wheel  2749459 May 26 22:02 /mnt/bsd
-rw-r--r--  1 root  wheel  9531028 May 26 22:02 /mnt/bsd.bak
-rw-r--r--  1 root  wheel  2749459 May 24 18:28 /mnt/bsd.rd
-rw-r--r--  1 root  wheel  9480748 May 24 18:28 /mnt/bsd.sp



On Sat, May 27, 2017 at 12:06 AM, Ted Unangst wrote:

> Ax0n wrote:
> > Is this limit specifically for LDOM guests? I have a Sun Blade 1500 I could
> > compile a custom -CURRENT kernel with, if that might help. Though I'm not
> > sure I want to do that with every snapshot I try.
>
> Not specifically, but the limit can vary by hardware. If you want to run a
> snapshot now, a custom kernel with a few devices removed will help. We'll have
> to make a similar long term fix anyway.


fq codel panic: ifq_is_serialized and MP interrupt

2017-05-26 Thread Sebastien Marie
Hi,

I often experience the following panic:

panic: kernel diagnostic assertion "ifq_is_serialized(ifq)" failed: 
../sys/net/ifq.c, line 394

while running with GENERIC.MP patched with mikeb@ diff:

 if_start(struct ifnet *ifp)
 {
 	KASSERT(ifp->if_qstart == if_qstart_compat);
-	if_qstart_compat(&ifp->if_snd);
+	ifq_start(&ifp->if_snd);
 }


I am reporting it in a separate thread, as I am unsure whether the problem
is directly related: the ddb backtrace shows a network interrupt arriving
inside ifq_serialize().

ddb{0}> trace
db_enter(x,x,x,x,x) at db_enter+0x7
panic(x,x,x,x,18a) at panic+0x71
__assert(x,x,18a,x,bbd5) at __assert+0x2e
ifq_mfreeml(x,x,x,2bbb,x) at ifq_mfreeml+0x6a
fqcodel_deq_begin(x,x,x,x,x) at fqcodel_deq_begin+0x186
ifq_deq_begin(x,x,f0,0,x) at ifq_deq_begin+0x37
ifq_dequeue(x,x,x,x,x) at ifq_dequeue+0x17
bce_start(x,20,100,x,200282,bbd5,0) at bce_start+0x11f
bce_intr(x,x,x,2bbb,x) at bce_intr+0xc3
Xintr_ioapic3() at Xintr_ioapic3+0x66
--- interrupt ---
ifq_serialize(x,x,2,x,x) at ifq_serialize+0x1
ether_output(x,x,x,x,0) at ether_output+0x1d2
ip_output(x,0,x,800,0) at ip_output+0x821
tcp_output(x,x,x,x,0) at tcp_output+0x81d
tcp_usrreq(x,8,0,0,0,0) at tcp_usrreq+0x633
soreceive(x,0,x,0,0) at soreceive+0x2da
soo_read(x,x,x,x,0) at soo_read+0x43
dofilereadv(x,3,x,x,1) at dofilereadv+0x1c5
sys_read(x,x,x,0,x) at sys_read+0x8f
syscall() at syscall+0x250
--- syscall (number 2081103872) ---
0x6:
ddb{0}> 



The panic seems to occur when a new TCP connection arrives (an incoming
ssh session) while the host is already doing some network activity (here
it was updating packages with pkg_add).


the host was running with:
queue fq on bce0 flows 1024 default

# ifconfig

lo0: flags=8049 mtu 32768
index 4 priority 0 llprio 3
groups: lo
inet6 ::1 prefixlen 128
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4
inet 127.0.0.1 netmask 0xff00
wpi0: flags=8802 mtu 1500
lladdr 00:13:02:2e:8b:46
index 1 priority 4 llprio 3
groups: wlan
media: IEEE802.11 autoselect
status: no network
ieee80211: nwid ""
bce0: flags=208a43 
mtu 1500
lladdr 00:15:c5:0b:8b:7a
index 2 priority 0 llprio 3
groups: egress
media: Ethernet autoselect (100baseTX full-duplex)
status: active
inet 192.168.92.11 netmask 0xff00 broadcast 192.168.92.255
inet6 fe80::215:c5ff:fe0b:8b7a%bce0 prefixlen 64 scopeid 0x2
inet6 2001:41d0:fe39:c05c:215:c5ff:fe0b:8b7a prefixlen 64 autoconf 
pltime 604784 vltime 2591984
inet6 2001:41d0:fe39:c05c:5057:c993:3ee2:599a prefixlen 64 autoconf 
autoconfprivacy pltime 85934 vltime 604710
enc0: flags=0<>
index 3 priority 0 llprio 3
groups: enc
status: active
pppoe0: flags=8810 mtu 1492
index 5 priority 0 llprio 3
dev:  state: initial
sid: 0x0 PADI retries: 0 PADR retries: 0
groups: pppoe
pflog0: flags=141 mtu 33172
index 6 priority 0 llprio 3
groups: pflog


(Note: pppoe0 is here for testing. It has only been created and left in
the down state.)


# dmesg
OpenBSD 6.1-current (GENERIC.MP) #0: Thu May 25 14:00:16 CEST 2017
semarie@bert.local:/home/openbsd/src/sys/arch/i386/compile/GENERIC.MP
cpu0: Genuine Intel(R) CPU T2400 @ 1.83GHz ("GenuineIntel" 686-class) 1.83 GHz
cpu0: 
FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,NXE,SSE3,MWAIT,VMX,EST,TM2,xTPR,PDCM,PERF,SENSOR
real mem  = 2137354240 (2038MB)
avail mem = 2083602432 (1987MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: date 06/13/07, BIOS32 rev. 0 @ 0xffa10, SMBIOS rev. 2.4 @ 
0xf7980 (44 entries)
bios0: vendor Dell Inc. version "A17" date 06/13/2007
bios0: Dell Inc. MM061
acpi0 at bios0: rev 0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP HPET APIC MCFG SLIC BOOT SSDT
acpi0: wakeup devices LID_(S3) PBTN(S4) MBTN(S5) PCI0(S3) USB0(S0) USB1(S0) 
USB2(S0) USB3(S0) EHCI(S0) AZAL(S3) PCIE(S4) RP01(S4) RP02(S3) RP03(S3) 
RP04(S3) RP05(S3) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 14318179 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 166MHz
cpu0: mwait min=64, max=64, C-substates=0.2.2.2.2, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Genuine Intel(R) CPU T2400 @ 1.83GHz ("GenuineIntel" 686-class) 1.83 GHz
cpu1: 
FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,NXE,SSE3,MWAIT,VMX,EST,TM2,xTPR,PDCM,PERF,SENSOR
ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins
acpimcfg0 at acpi0 addr 0xf000, bus 0-63
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus -1 (AGP_)
acpiprt2 at acpi0: bus 3 (PCIE)
acpiprt3 at acpi0: bus 11 (RP01)
acpiprt4 at acpi0: bus -1 (RP02)
acpiprt5 at acpi0: bus -1 (RP03)
acpi

Re: sparc64 -CURRENT in LDOM: ERROR: Last Trap: Fast Data Access Protection

2017-05-26 Thread Ted Unangst
Ax0n wrote:
> Is this limit specifically for LDOM guests? I have a Sun Blade 1500 I could
> compile a custom -CURRENT kernel with, if that might help. Though I'm not
> sure I want to do that with every snapshot I try.

Not specifically, but the limit can vary by hardware. If you want to run a
snapshot now, a custom kernel with a few devices removed will help. We'll have
to make a similar long term fix anyway.



Re: Kernel panic on 6.1: init dies under load

2017-05-26 Thread Dan Cross
Thanks for this latest patch; it seems to help. At least, I was able to put
a fairly significant amount of load on the machine without a panic. I'll
try and load it up more and see where we get, but so far this is positive.

On Wed, May 24, 2017 at 7:37 PM, Mike Belopuhov wrote:

> On Wed, May 24, 2017 at 12:27 -0400, Dan Cross wrote:
> > Thanks for the patch; I just got a few minutes today and I applied it,
> > rebuilt and installed the kernel and rebooted. Sadly, I get a similar
> > panic. Attached is a screenshot of console output. Note that, 'boot sync'
> > from ddb hangs forever.
> >
> > - Dan C.
>
> That's OK. I've discovered more problems related to 64k transfers.
> The reason why we didn't notice anything bad when aborting sleep
> was because sleep has a small memory footprint, but if you dump
> core of a larger (> 64k) program, you'd notice the issue because
> core dump routine like some other places in the kernel assumes
> that 64k transfers always work.
>
> I've attempted to attack this problem from a different angle:
> ensure that xbf(4) can handle 64k transfers.  Solutions to this
> problem are notoriously messy and complicated and so far this
> one is no exception. Today I got to the point where the system
> boots multiuser but couldn't test further. I've noticed however
> that "boot dump" from ddb still crashes so I know it's not 100%
> right just yet, but since I won't get around doing anything
> about this until early next week, I'd appreciate a quick test
> if possible.
>
> I'm not attaching the diff since it's rather large:
>
> http://gir.theapt.org/~mike/xbf.diff
>
> Cheers,
> Mike
>


Re: sparc64 -CURRENT in LDOM: ERROR: Last Trap: Fast Data Access Protection

2017-05-26 Thread Ax0n
Is this limit specifically for LDOM guests? I have a Sun Blade 1500 I could
compile a custom -CURRENT kernel with, if that might help. Though I'm not
sure I want to do that with every snapshot I try.

*musing* I wonder if that's why NetBSD 7.1 is also crashing on boot.

On Fri, May 26, 2017 at 6:31 PM, Ted Unangst wrote:

> Ax0n wrote:
> > I have a SunFire T2000 that I've chopped up into LDOMs. The primary domain
> > and six of the LDOMs are running 6.1-STABLE just fine. I pulled down the
> > May 22 snapshot, and it installs (with a strange error, see bottom of
> > post), but the LDOM crashes upon boot. I just tried again with the May 24th
> > snapshot, and I'm getting the same error. This seems to dump me into
> > OpenBoot, not ddb. I can provide a shell on the primary domain, and serial
> > console (over ssh) access to a developer if needed. I am not subscribed to
> > bugs@, so please copy me off-list.
>
> There's a hardware/software limit that currently restricts the kernel to 8MB.
> Larger than that and bad things happen. Hopefully someone will soon find a way
> to reduce the size of the kernel.


Re: sparc64 -CURRENT in LDOM: ERROR: Last Trap: Fast Data Access Protection

2017-05-26 Thread Ted Unangst
Ax0n wrote:
> I have a SunFire T2000 that I've chopped up into LDOMs. The primary domain
> and six of the LDOMs are running 6.1-STABLE just fine. I pulled down the
> May 22 snapshot, and it installs (with a strange error, see bottom of
> post), but the LDOM crashes upon boot. I just tried again with the May 24th
> snapshot, and I'm getting the same error. This seems to dump me into
> OpenBoot, not ddb. I can provide a shell on the primary domain, and serial
> console (over ssh) access to a developer if needed. I am not subscribed to
> bugs@, so please copy me off-list.

There's a hardware/software limit that currently restricts the kernel to 8MB.
Larger than that and bad things happen. Hopefully someone will soon find a way
to reduce the size of the kernel.



Re: Backlight brightness not working on Acer 5733Z Series Notebook

2017-05-26 Thread Ax0n
A quick follow-up that will hopefully make a fix a bit easier:

On the advice of jcs@, I first tried
https://github.com/jcs/intel_backlight_fbsd which was able to adjust my
backlight fine once I rebooted with machdep.allowaperture=3

Next, I booted up with acpivout disabled (from boot -c) and after that,
xbacklight and wsconsctl can both properly adjust the display brightness.

This is in 6.1-STABLE.

On Fri, Nov 18, 2016 at 8:53 AM, Ax0n wrote:

> Anton reminded me about wsconsctl off-list.  "wsconsctl
> display.brightness" acts the same as xbacklight. Adjusting xbacklight
> brightness and/or messing with the brightness controls on the keyboard
> affects the value reported by wsconsctl display.brightness, but none of
> these have any impact on the backlight brightness.
>
>
>> According to your dmesg, acpivout(4) is attached. Have you tried
>> changing the brightness using wsconsctl(1)?
>>
>
>


Re: ldapd(8) assertion fails on amd64 Dell PowerEdge R710

2017-05-26 Thread Allan Streib
"Todd C. Miller" writes:

> I can explain that. The page size is being set based on the file
> system block size.

Yes, I just discovered exactly this.

I was looking at the btree.c code and saw:

	if (fstat(fd, &sb) == 0)
		psize = sb.st_blksize;
	else
		psize = PAGESIZE;

On my desktop, from dumpfs(8):

bsize   16384   shift   14      mask    0xffffc000

And on the server:

bsize   65536   shift   16      mask    0xffff0000


> Either indx_t needs to be changed to uint32_t or an upper bound
> needs to be placed on psize, perhaps 0x7fff.
>
> I'm not familiar enough with that code to say which is better.

I naively tried changing indx_t to uint32_t and got:

May 26 10:44:03.382 [27298] opening namespace dc=example,dc=org
btree_read_header:908: header has invalid magic

Currently, BT_MAGIC is #defined as 0xB3DBB3DB but I don't know what
comprises that value.

I think my short term workaround is going to be a smaller partition
mounted on /var/db/ldap.

  
Allan



Re: ldapd(8) assertion fails on amd64 Dell PowerEdge R710

2017-05-26 Thread Todd C. Miller
On Fri, 26 May 2017 10:52:04 -0400, Allan Streib wrote:

> Note the "page size" is different. On the Dell R710 the message says
> "page size 65536" which is one higher than 0xffff, which seems like a
> red flag? The "upper" and "lower" fields look to be of type indx_t which
> is defined as a uint16_t, but in the bt_head struct, psize is a
> uint32_t. So the line
> 
>mp->page->upper = bt->head.psize;
> 
> Is going to result in mp->page->upper being zero, if bt->head.psize is 65536.
> 
> I don't understand why the R710 has a different behavior than my desktop
> machine, but that's what I'm seeing.

I can explain that.  The page size is being set based on the file
system block size.  On your desktop this is 16384 which you can
verify by running the dumpfs command on the filesystem.  You'll see
something like this:

magic   11954 (FFS1)    time    Fri May 26 06:54:56 2017
id      [ 57ffcc69 89ea6f31 ]
cylgrp  dynamic inodes  4.4BSD  fslevel 3
ncg     6       ncyl    6       size    526112  blocks  516263
bsize   16384   shift   14      mask    0xffffc000
fsize   2048    shift   11      mask    0xfffff800
frag    8       shift   3       fsbtodb 2
...

However, the R710 probably has a larger file system with bigger
blocks.  If it is FFS2 it will look something like this:

magic   19540119 (FFS2) time    Fri May 26 09:03:51 2017
superblock location     65536   id      [ 53f23555 9ac85182 ]
ncg     561     size    234374284       blocks  232529698
bsize   65536   shift   16      mask    0xffff0000
fsize   8192    shift   13      mask    0xffffe000
frag    8       shift   3       fsbtodb 4
...

Either indx_t needs to be changed to uint32_t or an upper bound
needs to be placed on psize, perhaps 0x7fff.

I'm not familiar enough with that code to say which is better.
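
For illustration only, the "upper bound on psize" option might look like the C sketch below. This is not the ldapd code and not a committed fix. MAX_PSIZE, DEFAULT_PSIZE, clamp_psize() and choose_psize() are hypothetical names, and the cap of 32768 is an assumption chosen because it is the largest power of two that still fits in a uint16_t.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical cap: largest power of two that fits in a uint16_t. */
#define MAX_PSIZE     32768u
/* Hypothetical fallback, standing in for PAGESIZE in btree.c. */
#define DEFAULT_PSIZE 4096u

/* Limit a file system block size to a value a 16-bit index can hold. */
uint32_t
clamp_psize(uint32_t blksize)
{
	uint32_t psize = blksize > MAX_PSIZE ? MAX_PSIZE : blksize;

	assert(psize <= UINT16_MAX);
	return psize;
}

/* Pick a page size from the block size of fd, as the fstat() logic does. */
uint32_t
choose_psize(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) == 0 && sb.st_blksize > 0)
		return clamp_psize((uint32_t)sb.st_blksize);
	return DEFAULT_PSIZE;
}
```

With this sketch a 16384-byte block size is passed through unchanged, while a 65536-byte block size is reduced to 32768. Whether capping is preferable to widening indx_t is left open, as above.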

 - todd



Re: ldapd(8) assertion fails on amd64 Dell PowerEdge R710

2017-05-26 Thread Allan Streib
I've been trying to debug this a bit. 20 years since I did any C
programming to any great degree.

I enabled and added some debugging to btree.c:

Index: btree.c
===
RCS file: /cvs/src/usr.sbin/ldapd/btree.c,v
retrieving revision 1.37
diff -u -p -u -r1.37 btree.c
--- btree.c 2 Dec 2016 05:52:01 -   1.37
+++ btree.c 26 May 2017 13:56:06 -
@@ -36,7 +36,7 @@
 
 #include "btree.h"
 
-/* #define DEBUG */
+#define DEBUG
 
 #ifdef DEBUG
 # define DPRINTF(...)  do { fprintf(stderr, "%s:%d: ", __func__, __LINE__); \
@@ -1855,6 +1855,9 @@ btree_new_page(struct btree *bt, uint32_
 	mp->page->flags = flags;
 	mp->page->lower = PAGEHDRSZ;
 	mp->page->upper = bt->head.psize;
+
+	DPRINTF("new mpage %u, page upper %u, page lower %u",
+	    mp->pgno, mp->page->upper, mp->page->lower);
 
 	if (IS_BRANCH(mp))
 		bt->meta.branch_pages++;


Running ldapd with this extra code I get these messages just before the
assertion failure:

.
.
btree_search_page:1470: tree is empty
btree_txn_put:2948: allocating new root leaf page
btree_new_page:1847: allocating new mpage 1, page size 65536
btree_new_page:1860: new mpage 1, page upper 0, page lower 12
btree_txn_put:2962: there are 0 keys, should insert new key at index 0
assertion "p->upper >= p->lower" failed: file 
"/usr/src/usr.sbin/ldapd/btree.c", line 1952, function "btree_add_node"

Note the debug statement at line 1847 has page size (bt->head.psize) as
65536, while at line 1860 the value of mp->page->upper is 0, but it
should have just been assigned the value from bt->head.psize. I'm not
seeing anything that should have changed bt->head.psize between those
two lines.

If I run this on my local desktop I get the following.

.
.
btree_search_page:1470: tree is empty
btree_txn_put:2948: allocating new root leaf page
btree_new_page:1847: allocating new mpage 1, page size 16384
btree_new_page:1860: new mpage 1, page upper 16384, page lower 12
btree_txn_put:2962: there are 0 keys, should insert new key at index 0
btree_add_node:1957: add node [dc=example,dc=org] to leaf page 1 at index 0, 
key size 17

Note the "page size" is different. On the Dell R710 the message says
"page size 65536" which is one higher than 0xffff, which seems like a
red flag? The "upper" and "lower" fields look to be of type indx_t which
is defined as a uint16_t, but in the bt_head struct, psize is a
uint32_t. So the line

   mp->page->upper = bt->head.psize;

Is going to result in mp->page->upper being zero, if bt->head.psize is 65536.
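
The truncation can be reproduced outside ldapd with a minimal C sketch. The typedef mirrors the indx_t described above; store_upper(), page_is_sane() and check_desktop_case() are hypothetical helpers written for this illustration, not functions from btree.c.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 16-bit typedef described above. */
typedef uint16_t indx_t;

/*
 * Mimics "mp->page->upper = bt->head.psize": the 32-bit page size is
 * converted to 16 bits, which keeps only the value modulo 65536.
 */
indx_t
store_upper(uint32_t psize)
{
	return (indx_t)psize;
}

/* Same condition as the failing assertion, "p->upper >= p->lower". */
int
page_is_sane(indx_t upper, indx_t lower)
{
	return upper >= lower;
}

/* 16384 fits in 16 bits, so the desktop case passes the check. */
void
check_desktop_case(void)
{
	assert(page_is_sane(store_upper(16384), 12));
}
```

Here store_upper(65536) yields 0, so page_is_sane(0, 12) is false, matching the "page upper 0, page lower 12" debug line from the R710.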

I don't understand why the R710 has a different behavior than my desktop
machine, but that's what I'm seeing.

Allan