Re: VNET related kernel panic on jail startup with epairs on 11-STABLE

2018-08-03 Thread Bjoern A. Zeeb

On 3 Aug 2018, at 20:42, Oliver Pinter wrote:

> On 8/3/18, Bjoern A. Zeeb  wrote:
>> On 3 Aug 2018, at 18:48, Oliver Pinter wrote:
>>
>>> Hi all!
>>>
>>> One of our users observed a VNET related kernel panic with epairs in
>>> a jail. Seems like some of the
>>
>> Well would be great for a start to (a) email virtualisation@ as well,
>> (b) include a panic message, backtrace or other related information to
>> deduce anything about the possible bug, (c) and not to conflate it with
>> another totally unrelated MFC request.
>>
>> So what makes you think it’s related to tcp fast open?
>
> Every required detail is in HardenedBSD's github issue, but I copy the
> kernel panic here:

Ah, sorry, my bad; the issue said ZFS in the subject and I thought it
referred to something else.


Thanks! Looking at the backtrace, it seems it is happening on teardown
and not on startup, but indeed in the fast open code, and PR 216613
indeed fixed this in head, good :)  Hope Patrick will do the MFC for
you.


/bz
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: VNET related kernel panic on jail startup with epairs on 11-STABLE

2018-08-03 Thread Oliver Pinter
On 8/3/18, Bjoern A. Zeeb  wrote:
> On 3 Aug 2018, at 18:48, Oliver Pinter wrote:
>
>> Hi all!
>>
>> One of our users observed a VNET related kernel panic with epairs in
>> a jail. Seems like some of the
>
> Well would be great for a start to (a) email virtualisation@ as well,
> (b) include a panic message, backtrace or other related information to
> deduce anything about the possible bug, (c) and not to conflate it with
> another totally unrelated MFC request.
>
> So what makes you think it’s related to tcp fast open?

Every required detail is in HardenedBSD's github issue, but I copy the
kernel panic here:

Aug  2 17:52:00 test2 syslogd: kernel boot file is /boot/kernel/kernel
Aug  2 17:52:00 test2 kernel: [205] epair3314a: promiscuous mode enabled
Aug  2 17:52:00 test2 kernel: [205] panic: lock 0xfe00078e8fd8 is
not initialized
Aug  2 17:52:00 test2 kernel: [205] cpuid = 0
Aug  2 17:52:00 test2 kernel: [205] __HardenedBSD_version = 1100056
__FreeBSD_version = 1102501
Aug  2 17:52:00 test2 kernel: [205] version = FreeBSD 11.2-STABLE-HBSD
#0 : Thu Aug  2 02:27:22 CEST 2018
Aug  2 17:52:00 test2 kernel: [205] root@hb67:/λ/obj/λ/src/11/sys/VerKnowSys
Aug  2 17:52:00 test2 kernel: [205] KDB: stack backtrace:
Aug  2 17:52:00 test2 kernel: [205] db_trace_self_wrapper() at
db_trace_self_wrapper+0x2b/frame 0xfe011fbed750
Aug  2 17:52:00 test2 kernel: [205] vpanic() at vpanic+0x17c/frame
0xfe011fbed7b0
Aug  2 17:52:00 test2 kernel: [205] doadump() at doadump/frame
0xfe011fbed830
Aug  2 17:52:00 test2 kernel: [205] lock_destroy() at
lock_destroy+0x32/frame 0xfe011fbed850
Aug  2 17:52:00 test2 kernel: [205] rm_destroy() at
rm_destroy+0x33/frame 0xfe011fbed870
Aug  2 17:52:00 test2 kernel: [205] tcp_fastopen_destroy() at
tcp_fastopen_destroy+0x44/frame 0xfe011fbed890
Aug  2 17:52:00 test2 kernel: [205] tcp_destroy() at
tcp_destroy+0x10e/frame 0xfe011fbed8c0
Aug  2 17:52:00 test2 kernel: [205] vnet_destroy() at
vnet_destroy+0x12c/frame 0xfe011fbed8f0
Aug  2 17:52:00 test2 kernel: [205] prison_deref() at
prison_deref+0x29d/frame 0xfe011fbed930
Aug  2 17:52:00 test2 kernel: [205] sys_jail_remove() at
sys_jail_remove+0x28a/frame 0xfe011fbed980
Aug  2 17:52:00 test2 kernel: [205] amd64_syscall() at
amd64_syscall+0x6ae/frame 0xfe011fbedab0
Aug  2 17:52:00 test2 kernel: [205] fast_syscall_common() at
fast_syscall_common+0x101/frame 0xfe011fbedab0
Aug  2 17:52:00 test2 kernel: [205] --- syscall (508, FreeBSD ELF64,
sys_jail_remove), rip = 0x507a3545dba, rsp = 0x68533a501528, rbp =
0x68533a5015a0 ---
Aug  2 17:52:00 test2 kernel: [205] KDB: enter: panic

>
>
> /bz
>


Re: VNET related kernel panic on jail startup with epairs on 11-STABLE

2018-08-03 Thread Bjoern A. Zeeb

On 3 Aug 2018, at 18:48, Oliver Pinter wrote:

> Hi all!
>
> One of our users observed a VNET related kernel panic with epairs in
> a jail. Seems like some of the

Well would be great for a start to (a) email virtualisation@ as well,
(b) include a panic message, backtrace or other related information to
deduce anything about the possible bug, (c) and not to conflate it with
another totally unrelated MFC request.

So what makes you think it’s related to tcp fast open?


/bz


Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64

2018-08-03 Thread Mark Martinec

More attempts at tracking this down. The suggested dtrace command
usually aborts with:

  Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file
  /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
  line 3330.

but with some luck, soon after each machine reboot, I can leave the dtrace
running for about 10 or 20 seconds (max) before terminating it with a ^C,
and succeed in collecting the report.  If I miss the opportunity to leave
dtrace running just long enough to collect useful info, but not long
enough for it to hit the assertion check, then any further attempt
to run the dtrace script hits the assertion fault immediately.
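(One way to make that timing window less fragile might be to bound the run
with timeout(1), delivering SIGINT on expiry so dtrace flushes its
aggregations exactly as a manual ^C would. This is only a sketch, assuming
the base system's timeout utility; the dtrace command itself needs root,
so the mechanism is demonstrated here against a harmless stand-in:)

```shell
# Sketch: stop the dtrace run after 15 seconds via SIGINT instead of a
# hand-timed ^C, e.g.:
#
#   timeout -s INT 15 dtrace -n 'dtmalloc::solaris:malloc { ... }'
#
# Demonstrating the mechanism with sleep(1): timeout exits with status
# 124 when the deadline fires before the child finishes.
timeout -s INT 2 sleep 30; echo "exit: $?"
```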

Btw, (just in case) I have recompiled the kernel from source
(base/release/11.2.0) with debugging symbols, although the behaviour
has not changed:

  FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238:
  Fri Aug 3 17:29:42 CEST 2018
  m...@xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64



Anyway, after several attempts I was able to collect useful dtrace
output from the suggested dtrace script:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
  count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}'

while running "zpool list" repeatedly in another terminal screen:

  # (while true; do zpool list -Hp >/dev/null; vmstat -m | fgrep solaris; \
      sleep 0.2; done) | awk '{print $2-a; a=$2}'
454303
570
570
570
570
570
570
570
570
570
570
570
570
570
570
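(For reference, the awk stage above is a one-variable delta filter: `a`
holds the previous sample, so each printed line is the growth since the
last iteration, and the first line is the absolute counter because `a`
starts out empty/0. A standalone sketch with made-up numbers:)

```shell
# Feed a column of cumulative counters through the same awk delta filter.
# The real pipeline uses $2 because `vmstat -m` prints the malloc type in
# column 1 and the InUse count in column 2; this input has one column.
printf '1000\n1570\n2140\n2710\n' | awk '{print $1 - a; a = $1}'
# prints 1000, then 570 three times
```

Each steady "570" in the listing above thus corresponds to the
allocations left in use by one iteration of the loop.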

Two samples of the collected dtrace output (after about 15 seconds)
are at:

  https://www.ijs.si/usr/mark/tmp/dtrace1.out.bz2
  https://www.ijs.si/usr/mark/tmp/dtrace2.out.bz2

(the dtrace2.out is probably cleaner; I made sure no other service
 was running except my sshd and syslog)

Not really sure what I'm looking at, but a couple of large entries
stand out:

$ awk '/^ .*[0-9]+ .*[0-9]$/' dtrace2.out | sort -k1n | tail -5
   114688  138
   114688  138
   114688  138
   114688  138
   114688  138

Thanks in advance for looking into it,
  Mark




2018-08-01 09:12, myself wrote:

On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote:

I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE
and the situation has not improved. Also turned off all services.
ZFS is still leaking memory about 30 MB per hour, until the host
runs out of memory and swap space and crashes, unless I reboot it
first every four days.

Any advice before I try to get rid of that faulted disk with a pool
(or downgrade to 10.3, which was stable) ?


2018-08-01 00:09, Mark Johnston wrote:

If you're able to use dtrace, it would be useful to try tracking
allocations with the solaris tag:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
  count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = 
count();}'


Try letting that run for one minute, then kill it and paste the 
output.

Ideally the host will be as close to idle as possible while still
demonstrating the leak.


Good and bad news:

The suggested dtrace command bails out:

# dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] =
count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}'
dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes
Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file
/usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c,
line 3330.
Abort trap

But I did get one step further, localizing the culprit.

I realized that the "solaris" malloc count goes up in sync with
the 'telegraf' monitoring service polls; telegraf has a ZFS plugin
that monitors the zfs pool and ARC. This plugin runs 'zpool list -Hp'
periodically.

So after stopping telegraf (and other remaining services),
the 'vmstat -m' shows that InUse count for "solaris" goes up by 552
every time that I run "zpool list -Hp" :

# (while true; do zpool list -Hp >/dev/null; vmstat -m | \
fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}'
6664427
541
552
552
552
552
552
552
552
552



VNET related kernel panic on jail startup with epairs on 11-STABLE

2018-08-03 Thread Oliver Pinter
Hi all!

One of our users observed a VNET related kernel panic with epairs in
a jail. It seems that some of the vnet related patches have not been
backported to 11-STABLE (they are marked as MFC candidates):
SVN r33 and r313168; these are for the panic fix.

https://github.com/HardenedBSD/hardenedBSD/issues/325

https://github.com/HardenedBSD/hardenedBSD/commit/acbbc549618ac96dd2dd461429558f6cf135e31a
https://github.com/HardenedBSD/hardenedBSD/commit/6c91473476ff712b71b6a9b25afa162fa15a5d23

The other nice-to-have commit would be r333885, which fixes
ctfconvert related build errors.

https://github.com/HardenedBSD/hardenedBSD/commit/3895dd38ecf4dc422d3e1656844a051e0aa5d06c

Thanks,
Oliver