OpenBSD 6.1-stable lock up

2017-08-31 Thread Maxim Bourmistrov
Having a dual-node setup of 6.0 in prod, I decided to move forward with one of 
them and upgrade to 6.1-stable. I ended up with the benchmark tool "locking" the 
6.1 node.

Nodes are Xeon E5-2642v3 3.4GHz x12, 16G RAM, 64G DOM modules as hdd,
4x X540T (ix) - 2x on-board and 2x PCI-card.

All 4x X540T are connected to 2x Cisco Nexus 3000-series, creating an LACP 
trunk (1x on-board + 1x PCI).
trunk0 - external (VLAN), 1x NIC connected to switch1 and 1x NIC connected to 
switch2 (ix0 + ix3)
trunk1 - internal (VLAN), 1x NIC connected to switch1 and 1x NIC connected to 
switch2 (ix1 + ix2)
As I have 2x Nexus 3000, VPC is configured and sitting on top of LACP trunk on 
their end.

Each obsd node has several carp interfaces configured on top of trunk0.
Only one carp interface on trunk1 - carp1.

Each switch acts as a default gw (VRRP configured) for any existing VLAN, 
except the one towards trunk1.
The default gateway for those switches is the IP on carp1.
Those switches run OSPF, as do the obsd nodes.

obsd nodes are the front line, facing the Internet. (2x uplink goes into 2x 
Nexus and then traffic is passed to 2x obsd.)
Running relayd with SSL-offload and plain HTTP.
Besides relayd, there are ospfd, ntpd, snmpd, and bgpd (for distributed 
blacklisting among other global nodes).

The problem:
While doing a bench with wrk 
from my laptop (OS X, 1Gbps max. pipe) against the environment (HTTPS),
relayd had problems handling the traffic.

shell# ./wrk -t16 -c1500 -d90s --latency 

wrk hammering apache 2.4 (behind those nodes), serving a txt file, with avg 
7k-10k req/s as the output:

wrk -t16 -c1500 -d90s --latency https://ping.txt
Running 2m test @ https:///ping.txt
  16 threads and 1500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   131.17ms   70.91ms   1.97s    91.70%
    Req/Sec   651.06    135.80     1.09k    84.95%
  Latency Distribution
 50%  131.90ms
 75%  144.63ms
 90%  159.63ms
 99%  230.92ms
  927039 requests in 1.50m, 190.12MB read
  Socket errors: connect 0, read 0, write 0, timeout 1330
Requests/sec:  10290.54
Transfer/sec:  2.11MB

wrk hammering apache 2.4, mod_proxy_balancer, with NodeJS nodes behind apache:

wrk -t16 -c1500 -d90s --latency https:///nodejs
Running 2m test @ https:///nodejs
  16 threads and 1500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   445.91ms  518.66ms   2.00s    83.49%
    Req/Sec    56.80     26.89   180.00     68.48%
  Latency Distribution
 50%  217.57ms
 75%  374.15ms
  80673 requests in 1.50m, 1.12GB read
  Socket errors: connect 0, read 5534, write 0, timeout 18099
Transfer/sec: 12.72MB 
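
As a rough cross-check of the two runs above, the totals can be turned into rates by hand. This is a minimal Python sketch that assumes the full 90 s requested with -d90s (wrk measures a slightly longer duration, which is why it reports 10290.54 req/s for the first run rather than the roughly 10300 computed here). The `summarize` helper and its field names are illustrative only and are not part of wrk.

```python
import re

def summarize(wrk_output, duration_s):
    """Derive request rate and timeout share from wrk's summary lines."""
    requests = int(re.search(r"(\d+) requests in", wrk_output).group(1))
    m = re.search(r"timeout (\d+)", wrk_output)
    timeouts = int(m.group(1)) if m else 0
    return {
        "requests": requests,
        "req_per_s": requests / duration_s,       # assumes the nominal duration
        "timeouts": timeouts,
        "timeouts_per_1000_req": 1000 * timeouts / requests,
    }

# Summary lines copied from the two runs above.
PING_RUN = """\
  927039 requests in 1.50m, 190.12MB read
  Socket errors: connect 0, read 0, write 0, timeout 1330
"""
NODEJS_RUN = """\
  80673 requests in 1.50m, 1.12GB read
  Socket errors: connect 0, read 5534, write 0, timeout 18099
"""

for name, text in (("ping.txt", PING_RUN), ("nodejs", NODEJS_RUN)):
    s = summarize(text, duration_s=90)
    print(f"{name}: {s['req_per_s']:.0f} req/s, "
          f"{s['timeouts_per_1000_req']:.1f} timeouts per 1000 requests")
```

Under that assumption the nodejs run works out to roughly 900 req/s, with timeouts more than two orders of magnitude more frequent per completed request than in the ping.txt run.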

'top' showed no interrupt load at all, but rather heavy system time and 
some user time.
20-30% - user
80-90% - system
relayd (12 forks as the number of cores) - 99% usage.

I basically killed both machines running 6.0, thus my decision to upgrade to 
6.1. However, during the tests against 6.0, my ssh session never got terminated 
("kicked out"), even with this high load (0% CPU idle).
6.1 showed different symptoms: ssh session termination, login via the web-based 
IPMI GUI hanging after the login step, and
ping not responding (from the switches and from node1, which is still on 6.0).
After a while, with the bench aborted, 6.1 eventually let me in via ssh (the 
terminal via IPMI was still hanging).

snmpd was running (remember) and was being polled by another system that draws 
graphs.
Those graphs show a high rate of output error packets on the trunks, not on the 
NICs (ix) themselves.
Also, syslog, with 'log all' enabled for relayd, showed a lot of 'buffer timeout' 
messages, and ospfd yelling about 'no buffer space available'.

I had to modify relayd.conf to spawn only 8 preforks instead of 12 and set

kern.maxclusters=24576 #12288
kern.maxfiles=65536 #32768

in order to survive the bench (i.e. keep the ssh session alive).
Values commented out are from the 6.0 setup.
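
For reference, both limits were exactly doubled relative to the 6.0 values. The following is a small Python sketch, assuming the `name=new #old` line format used above; `parse_tuned` is a hypothetical helper name, not an existing tool.

```python
import re

def parse_tuned(lines):
    """Map sysctl name -> (new_value, old_value) for 'name=new #old' lines."""
    out = {}
    for line in lines:
        m = re.match(r"\s*([\w.]+)=(\d+)\s*#\s*(\d+)", line)
        if m:
            out[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return out

# The two tuned lines from the message above.
TUNED = ["kern.maxclusters=24576 #12288", "kern.maxfiles=65536 #32768"]

for name, (new, old) in parse_tuned(TUNED).items():
    print(f"{name}: {old} -> {new} ({new / old:.1f}x)")
```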

I’m looking for any advice here, which hopefully will lead to a stable and 
performant setup.
Configuration follows.

--- sysctl.conf (obsd 6.0)
net.inet.ipcomp.enable=1    # 1=Enable the IPCOMP protocol
net.inet.etherip.allow=1    # 1=Enable the Ethernet-over-IP protocol
net.inet.tcp.ecn=1          # 1=Enable the TCP ECN extension
net.inet.carp.preempt=1     # 1=Enable carp(4) preemption
net.inet.carp.log=3         # log level of carp(4) info, default 2
ddb.panic=0                 # 0=Do not drop into ddb on a kernel panic
ddb.console=1               # 1=Permit entry of ddb from the console



Re: OpenBSD 6.1-stable lock up

2017-08-31 Thread Philipp Buehler


On 2017-09-01 00:33, Maxim Bourmistrov wrote:

0/232/64 mbuf 2048 byte clusters in use (current/peak/max)
423/2865/120 mbuf 2112 byte clusters in use (current/peak/max)
0/160/64 mbuf 4096 byte clusters in use (current/peak/max)
0/200/64 mbuf 8192 byte clusters in use (current/peak/max)

I've seen this before - including a kind of "lock up".
How does one reach a peak/current way over the maximum - and 2112 byte 
IIRC, there was activity in this area changing allocation and 
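
A quick way to spot this condition in netstat -m output is to compare the three numbers on each cluster line. The following is a minimal Python sketch, assuming the `current/peak/max` line format quoted above; the `over_limit` helper is hypothetical and only illustrates the comparison.

```python
import re

# Matches e.g. "423/2865/120 mbuf 2112 byte clusters in use (current/peak/max)"
LINE = re.compile(r"(\d+)/(\d+)/(\d+) mbuf (\d+) byte clusters in use")

def over_limit(netstat_m_output):
    """Return (size, current, peak, max) for pools whose current or peak exceeds max."""
    flagged = []
    for line in netstat_m_output.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        current, peak, limit, size = map(int, m.groups())
        if max(current, peak) > limit:
            flagged.append((size, current, peak, limit))
    return flagged

# The four lines quoted in this message.
SAMPLE = """\
0/232/64 mbuf 2048 byte clusters in use (current/peak/max)
423/2865/120 mbuf 2112 byte clusters in use (current/peak/max)
0/160/64 mbuf 4096 byte clusters in use (current/peak/max)
0/200/64 mbuf 8192 byte clusters in use (current/peak/max)
"""

for size, current, peak, limit in over_limit(SAMPLE):
    print(f"{size}-byte clusters: current={current} peak={peak} max={limit}")
```

All four pools in the quoted output are flagged, and the 2112-byte pool is the only one whose current usage is also above its maximum.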


Re: OpenBSD 6.1-stable lock up

2017-09-02 Thread Florian Ermisch
On 1 September 2017 06:38:49 CEST, Philipp Buehler wrote:
>On 2017-09-01 00:33, Maxim Bourmistrov wrote:
>> 0/232/64 mbuf 2048 byte clusters in use (current/peak/max)
>> 423/2865/120 mbuf 2112 byte clusters in use (current/peak/max)
>> 0/160/64 mbuf 4096 byte clusters in use (current/peak/max)
>> 0/200/64 mbuf 8192 byte clusters in use (current/peak/max)
>I've seen this before - including a kind of "lock up".
>How does one reach a peak/current way over the maximum - and 2112 byte 
>IIRC, there was activity in this area changing allocation and 

Hm, could this be the same performance
regression as VLANs saw?

The post and the one on tech@ don't 
mention the version but as it was a
discussion between OpenBSD devs I
guess it was what became 6.1 a few
months later.
I think I've heard or read something about
improvements in this area (on BSDnow or
undeadly) so maybe you could try a 6.2-

Regards, Florian