Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-09 Thread Peter
On Wed, Dec 09, 2020 at 02:00:37PM +1100, Dewayne Geraghty wrote:

! On a jail with config:
! exec.start = "/bin/sh -x /etc/rc";
! exec.stop = "/bin/sh /etc/rc.shutdown";
! exec.clean;
! 
! test_prod  { jid=7; persist; ip4.addr =
! "10.0.7.96,10.0.5.96,127.0.5.96"; devfs_ruleset = "6";
! host.hostuuid=---0001-0302; host.hostid=000302; }
! 
! I successfully performed
! for i in `seq 10`; do jail -vc test_prod; sleep 3; jail -vr test_prod; done

But, this is not a VIMAGE jail, is it?
Old-style jails are unaffected by this issue. Only VIMAGE jails, using
epair or netgraph, might be affected. (In that case, you would not
have an "ip4.addr" configured, and rather a "vnet.interface".)

! I think the normal use of jail.conf is to NOT explicitly use a jid in
! the definition, which may be why this may not have been picked up?
! (Maybe a clue).

This is an interesting point. When you stop a jail, it may stay for
a more or less long time in a "dying" state (visible with "jls -d"),
keeping the jid occupied. During that time, the jail cannot be
restarted with that same jid.
Some time ago, I read people complaining about this, and the advice was to
just not define the jid in the definition, so that the jail can be
restarted immediately (and will probably grab another jid).

I did not find a solid explanation for what is happening in that
"dying" state (and why it takes a longer or shorter time), much less
an approach to fix it. I found some theories circulating on the net,
but these don't really add up. So I would need to look into the source
myself - and I did postpone that indefinitely. ;)

But what I found out, with the VIMAGE jails (those that can carry
their own network interfaces), when you make a slight mistake with
managing and handling the interfaces, then the jail will stay in the
dying state forever. If you don't make a mistake, then it will finally
die within some time.
So I decided to keep the jid, so that nothing misconfigured can linger
unnoticed. (The tradeoff is obviously that one might have to wait
before restarting.)
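
To make the tradeoff above concrete, here is a minimal sketch in POSIX sh
of waiting for a stopped jail to leave the "dying" state before it is
restarted under the same jid. The function name wait_jail_gone and its
timeout argument are made up for illustration. The sketch assumes that
"jls -d -j NAME" exits non-zero once the jail is completely gone; that
should be checked against jls(8) on the release in use.

```shell
#!/bin/sh
# wait_jail_gone NAME [TIMEOUT_SECONDS]   (hypothetical helper)
# Poll until no jail called NAME is left, dying ones included.
# Returns 0 when the jail is gone, 1 when the timeout expires first.
wait_jail_gone() {
    _name=$1
    _left=${2:-60}
    # "jls -d" also lists dying jails; "-j NAME" restricts to one jail.
    # Assumption: jls exits non-zero when that jail no longer exists.
    while jls -d -j "$_name" jid >/dev/null 2>&1; do
        [ "$_left" -gt 0 ] || return 1
        sleep 1
        _left=$((_left - 1))
    done
    return 0
}
```

With the jail from the quoted config this could be used as
"jail -r test_prod && wait_jail_gone test_prod 120 && jail -c test_prod".
A jail that is stuck in the dying state for good then shows up as a
timeout instead of going unnoticed.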

cheerio,
PMc

P.S. 41 Celsius is fantastic! I envy You! :)
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-09 Thread Kristof Provost

Peter,

I’m not interested in discussing software development methodology
here.

Please drop me from this thread. Let me know if/when you have a test
case I can work from.


Regards,
Kristof



Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-09 Thread Peter
On Tue, Dec 08, 2020 at 07:51:07PM -0600, Kyle Evans wrote:
 
! You seem to have misinterpreted this; he doesn't want to narrow it
! down to one bug, he wants simple steps that he can follow to reproduce

Maybe I did misinterpret, but then I don't really understand it.
I would suppose, when testing a proposed fix, the fact that it
does break under the exact same conditions as before, is all the
information needed at that point. Put in simple words: that it does
not work.

! any failure, preferably steps that can actually be followed by just
! about anyone and don't require immense amounts of setup time or
! additional hardware.

Engineering does not normally work that way. 

I'll try to explain: when a bug is first encountered, it is necessary
to isolate it insofar that somebody who is knowledgeable of the code,
can actually reproduce it, in order to have a look at it and analyze
what causes the mis-happening.

If then a remedy is devised, and that does not work as expected, then
the flaw is in the analysis, and we just start over from there.

In fact, I would have expected somebody who is trying to fix such
kind of bug, to already have testing tools available and tell me
exactly which kind of data I might retrieve from the dumps.

The open question now is: am I the only one seeing these failures?
Might they be attributed to a faulty configuration or maybe hardware
issues or whatever?
We cannot know this, we can only watch out what happens at other
sites. And that is why I sent out all these backtraces - because they
appear weird and might be difficult to associate with this issue.

I don't think there is much more we can do at this point, unless we
were willing to actually look into the details.


Am I discouraging? Indeed, I think, engineering is discouraging by
its very nature, and that's the fun of it: to overcome odds and
finally maybe make things better. And when we start to forget about
that, bad things begin to happen (anybody remember Apollo 13?). 

But talking about discouragement: I usually try to track down
defects I encounter, and, if possible, do a viable root-cause
analysis. I tended to be very willing to share the outcomes and, if
a solution arises, by all means make that get back into the code base;
but I found that even ready made patches for easy matters would
linger forever in the sendbug system without anybody caring, or, in
more complex cases where I would need some feedback from the original
writer, if only to clarify the purpose of some defaults or verify
that an approach is viable, that communication is very difficult to
establish. And that is what I would call discouraging, and I for
my part have accepted to just leave the developers in their ivory
tower and tend to my own business.


cheerio,
PMc


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Kristof Provost

On 9 Dec 2020, at 2:31, Peter wrote:

> On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:
>
> ! > Sorry for the bad news.
> ! >
> ! You appear to be triggering two or three different bugs there.
>
> That is possible. Then there are two or three different bugs in the
> production code.
>
> In any case, my current workaround, i.e. delaying in the exec.poststop
>
> > exec.poststop = "
> >    sleep 6 ;
> >    /usr/sbin/ngctl shutdown ${ifname1l}: ;
> >    ";
>
> helps for it all and makes the system behave solid. This is true
> with and without Your patch.
>
> ! Can you reduce your netgraph use case to a small test case that can trigger
> ! the problem?
>
> I'm sorry, I fear I don't get Your point.
> Assumed there are actually two or three bugs here, You are asking me
> to reduce config so that it will trigger only one of them? Is that
> correct?

No, we need a simple case to reproduce these problems. It’s fine if
that test case triggers multiple issues.

> Then let me put this different: assuming this is the OS for the life
> support system of the manned Jupiter mission. Then, which one of the
> bugs do You want to get fixed, and which would You prefer to keep and
> make Your oxygen supply cut off?
>
> https://www.youtube.com/watch?v=BEo2g-w545A

Happily we’re not in space.

> ! I’m not likely to be able to do anything unless I can reproduce
> ! the problem(s).
>
> I understand that.
> From Your former mail I get the impression that you prefer to rely
> on tests. I consider this a bad habit[1] and prefer logical thinking.

> (Background: It is not that I would be unwilling to create clean and
> precisely reproducible scenarios. But one of my problems is that
> currently I only have two machines available: the graphical one where
> I'm just typing, and the backend server with the jails that does
> practically everything.

These issues should trigger just fine in VMs. There’s no need for
hardware pain.

Regards,
Kristof


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Kyle Evans
On Tue, Dec 8, 2020 at 7:45 PM Peter wrote:
>
>
> On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:
> > Can you reduce your netgraph use case to a small test case that can trigger
> > the problem?
>
> I'm sorry, I fear I don't get Your point.
> Assumed there are actually two or three bugs here, You are asking me
> to reduce config so that it will trigger only one of them? Is that
> correct?
>
> Then let me put this different: assuming this is the OS for the life
> support system of the manned Jupiter mission. Then, which one of the
> bugs do You want to get fixed, and which would You prefer to keep and
> make Your oxygen supply cut off?
>
> https://www.youtube.com/watch?v=BEo2g-w545A

You seem to have misinterpreted this; he doesn't want to narrow it
down to one bug, he wants simple steps that he can follow to reproduce
any failure, preferably steps that can actually be followed by just
about anyone and don't require immense amounts of setup time or
additional hardware.

Unfortunately, your tone following the misunderstanding was pretty discouraging.

Thanks,

Kyle Evans


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Peter

On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:

! > Sorry for the bad news.
! > 
! You appear to be triggering two or three different bugs there.

That is possible. Then there are two or three different bugs in the
production code.

In any case, my current workaround, i.e. delaying in the exec.poststop

> exec.poststop = "
>    sleep 6 ;
>    /usr/sbin/ngctl shutdown ${ifname1l}: ;
>    ";

helps for it all and makes the system behave solid. This is true
with and without Your patch.

! Can you reduce your netgraph use case to a small test case that can trigger
! the problem?

I'm sorry, I fear I don't get Your point.
Assumed there are actually two or three bugs here, You are asking me
to reduce config so that it will trigger only one of them? Is that
correct?

Then let me put this different: assuming this is the OS for the life
support system of the manned Jupiter mission. Then, which one of the
bugs do You want to get fixed, and which would You prefer to keep and
make Your oxygen supply cut off?

https://www.youtube.com/watch?v=BEo2g-w545A

! I’m not likely to be able to do anything unless I can reproduce
! the problem(s).

I understand that.
From Your former mail I get the impression that you prefer to rely
on tests. I consider this a bad habit[1] and prefer logical thinking.

So let's try that:
We know that there is a problem with taking down an interface from a
VIMAGE jail, in the way it is done by "jail -r". We know this problem
can be reliably worked around by delaying the interface takedown for a
short time.

Now with Your patch, we do not get the typical crash at interface
takedown. Instead, all of a sudden, there are strange crashes from
various other places. And, interestingly, we get these also when
STARTING a jail.

I think this is not an additional problem; it is instead valuable
information (albeit not the kind You might like to get).

Furthermore, we get these new crashes always invoked by "ifconfig",
and they seem to have in common that somebody tries to obtain
information about some interface configuration and receives bogus
data. I might conclude, just from gut feeling without looking into
details, that either
 - Your patch manages to garble some internal interface data,
   instead of what it is intended to do, or
 - the original problem manages to garble internal interface data
   (leading to the usual crash), and Your patch does not manage to
   solve this, but only protects from the immediate consequence.

It might also be worth considering that, while the problem may be
easier to reproduce with epair, this effect may or may not be a
netgraph-specific one[2].

Now let's keep in mind that a successful test means EXACTLY NOTHING.
By which other means can we confirm that Your patch fully achieves
what it is intended for? (E.g. something like dumping and verifying
the respective internal tables in-vivo)

(Background: It is not that I would be unwilling to create clean and
precisely reproducible scenarios. But one of my problems is that
currently I only have two machines available: the graphical one where
I'm just typing, and the backend server with the jails that does
practically everything.
Therefore, experimenting on any of them creates considerable pain.
I'm working on that issue, trying to get a real server board for the
backend so as to get the current one free for testing - but what I
would like to use, e.g. ASUS Z10PE+cores+regECC, is not something one
would easily find at yard sales - and seldom for an acceptable price.)


cheerio,
PMc

[1] Rationale: a failing test tells us that either the test or the
application has a bug (50/50 chance). A succeeding test tells us
that 1 equals 1, which we knew already before.
In fact, tests tell us *nothing at all* about the state of our
code, and specifically, 'successful' outcomes do NOT mean that
things are all correct.
The only true usefulness of tests is to protect against
re-introducing a fault that was already fixed before,
i.e. regressions.

[2] My netgraph configuration consists of bringing up some bridges
and then attaching the jails to them.

Here is the bridge starter (only respective component,
there are more of these populated, but probably not influencing
the issue):

#! /bin/sh

# PROVIDE: netgraphs
# REQUIRE: netwait
# BEFORE: NETWORKING

. /etc/rc.subr

name="netgraphs"
start_cmd="${name}_start"
stop_cmd="${name}_stop"

load_rc_config $name

netgraphs_graphs="svc"

netgraphs_svc_if1_name="nge_svc_1u"
netgraphs_svc_if1_mac="00:1d:92:01:02:01"
netgraphs_svc_if1_addr="***.***.***.***/29"

netgraphs_svc_start()
{
local _ifname
if ngctl info svcswitch: > /dev/null 2>&1; then
netgraphs_svc_stop
fi

echo "Creating SVC Switch"
ngctl -f - < /dev/null 2>&1; then
$_cmd
else
echo "netgraphs-start: object $i not found" >&2
fi
done
}

Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Peter
Here is the next funny crashdump - I obtained this one twice
and also the sysctl_rtsock() again.

I can reproduce this by just starting and stopping a most simple jail
that does only
exec.start = "/bin/sleep 4 &";
(And as usual, when I let it time out, nothing bad happens.)
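
The start/stop cycling described above can be written down as a small
POSIX sh helper, so that others can run the same sequence. This is only
a sketch: cycle_jail is a hypothetical name, the count and pause
defaults are arbitrary, and it assumes the jail is defined in
/etc/jail.conf so that "jail -c NAME" and "jail -r NAME" work the same
way as in the "jail -vc" / "jail -vr" loop posted in this thread.

```shell
#!/bin/sh
# cycle_jail NAME [COUNT] [PAUSE_SECONDS]   (hypothetical helper)
# Start and stop one jail COUNT times, pausing between start and stop.
# Stops at the first failing jail(8) call and returns 1.
cycle_jail() {
    _name=$1
    _count=${2:-10}
    _pause=${3:-3}
    _i=1
    while [ "$_i" -le "$_count" ]; do
        echo "cycle $_i/$_count: starting $_name"
        jail -c "$_name" || return 1
        sleep "$_pause"
        echo "cycle $_i/$_count: stopping $_name"
        jail -r "$_name" || return 1
        _i=$((_i + 1))
    done
}
```

For example, "cycle_jail rail 20 1" as root on a test machine or VM. A
panic ends the loop by taking the host down, so the last "cycle" line
on the console shows how many rounds it took.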


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 02
instruction pointer     = 0x20:0x80a2ac45
stack pointer           = 0x28:0xfe0047cf2890
frame pointer           = 0x28:0xfe0047cf2890
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13557 (ifconfig)
trap number             = 9
panic: general protection fault
cpuid = 1
time = 1607469295
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe0047cf25a0
vpanic() at vpanic+0x17b/frame 0xfe0047cf25f0
panic() at panic+0x43/frame 0xfe0047cf2650
trap_fatal() at trap_fatal+0x391/frame 0xfe0047cf26b0
trap() at trap+0x67/frame 0xfe0047cf27c0
calltrap() at calltrap+0x8/frame 0xfe0047cf27c0
--- trap 0x9, rip = 0x80a2ac45, rsp = 0xfe0047cf2890, rbp = 0xfe0047cf2890 ---
strncmp() at strncmp+0x15/frame 0xfe0047cf2890
ifunit_ref() at ifunit_ref+0x59/frame 0xfe0047cf28d0
ifioctl() at ifioctl+0x427/frame 0xfe0047cf2990
kern_ioctl() at kern_ioctl+0x275/frame 0xfe0047cf29f0
sys_ioctl() at sys_ioctl+0x101/frame 0xfe0047cf2ac0
amd64_syscall() at amd64_syscall+0x380/frame 0xfe0047cf2bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe0047cf2bf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800475b2a, rsp = 0x7fffe3b8, rbp = 0x7fffe450 ---
Uptime: 8m54s
Dumping 880 out of 3959 MB:


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Kristof Provost

On 8 Dec 2020, at 19:49, Peter wrote:


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Peter
On Tue, Dec 08, 2020 at 04:50:00PM +0100, Kristof Provost wrote:
! Yeah, the bug is not exclusive to epair but that’s where it’s most easily
! seen.

Ack.

! Try 
http://people.freebsd.org/~kp/0001-if-Fix-panic-when-destroying-vnet-and-epair-simultan.patch

Great, thanks a lot.

Now I have bad news: when playing yoyo with the next-best three
application  jails (with all their installed stuff) it took about
ten up and down's then I got this one:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x10
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0x80aad73c
stack pointer           = 0x28:0xfe003f80e810
frame pointer           = 0x28:0xfe003f80e810
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 15486 (ifconfig)
trap number             = 12
panic: page fault
cpuid = 1
time = 1607450838
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe003f80e4d0
vpanic() at vpanic+0x17b/frame 0xfe003f80e520
panic() at panic+0x43/frame 0xfe003f80e580
trap_fatal() at trap_fatal+0x391/frame 0xfe003f80e5e0
trap_pfault() at trap_pfault+0x4f/frame 0xfe003f80e630
trap() at trap+0x4cf/frame 0xfe003f80e740
calltrap() at calltrap+0x8/frame 0xfe003f80e740
--- trap 0xc, rip = 0x80aad73c, rsp = 0xfe003f80e810, rbp = 0xfe003f80e810 ---
ng_eiface_mediastatus() at ng_eiface_mediastatus+0xc/frame 0xfe003f80e810
ifmedia_ioctl() at ifmedia_ioctl+0x174/frame 0xfe003f80e850
ifhwioctl() at ifhwioctl+0x639/frame 0xfe003f80e8d0
ifioctl() at ifioctl+0x448/frame 0xfe003f80e990
kern_ioctl() at kern_ioctl+0x275/frame 0xfe003f80e9f0
sys_ioctl() at sys_ioctl+0x101/frame 0xfe003f80eac0
amd64_syscall() at amd64_syscall+0x380/frame 0xfe003f80ebf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe003f80ebf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800475b2a, rsp = 0x7fffe358, rbp = 0x7fffe450 ---
Uptime: 9m51s
Dumping 899 out of 3959 MB:

I decided to give it a second try, and this is what I did:

root@edge:/var/crash # jls
   JID  IP Address  Hostname      Path
     1  1***        gate.***.org  /j/gate
     3  1***        raix.***.org  /j/raix
     4              oper.***.org  /j/oper
     5              admn.***.org  /j/admn
     6              data.***.org  /j/data
     7              conn.***.org  /j/conn
     8              kerb.***.org  /j/kerb
     9              tele.***.org  /j/tele
    10              rail.***.org  /j/rail
root@edge:/var/crash # service jail stop rail
Stopping jails: rail.
root@edge:/var/crash # service jail stop tele
Stopping jails: tele.
root@edge:/var/crash # service jail stop kerb
Stopping jails: kerb.
root@edge:/var/crash # jls
   JID  IP Address  Hostname      Path
     1  1***        gate.***.org  /j/gate
     3  1***        raix.***.org  /j/raix
     4              oper.***.org  /j/oper
     5              admn.***.org  /j/admn
     6              data.***.org  /j/data
     7              conn.***.org  /j/conn
root@edge:/var/crash # jls -d
   JID  IP Address  Hostname      Path
     1  1***        gate.***.org  /j/gate
     3  1***        raix.***.org  /j/raix
     4              oper.***.org  /j/oper
     5              admn.***.org  /j/admn
     6              data.***.org  /j/data
     7              conn.***.org  /j/conn
     9              tele.***.org  /j/tele
    10              rail.***.org  /j/rail
root@edge:/var/crash # service jail start kerb
Starting jails:Fssh_packet_write_wait: Connection to 1*** port 22: Broken pipe

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x0
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x0
stack pointer           = 0x28:0xfe00540ea658
frame pointer           = 0x28:0xfe00540ea670
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13420 (ifconfig)
trap number             = 12
panic: page fault
cpuid = 1
time = 1607451910
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_s

Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-08 Thread Kristof Provost

On 8 Dec 2020, at 0:34, Peter wrote:

> Hi Kristof,
>   it's great to read You!
>
> On Mon, Dec 07, 2020 at 09:11:32PM +0100, Kristof Provost wrote:
>
> ! That smells a lot like the epair/vnet issues in bugs 238870, 234985, 244703,
> ! 250870.
>
> epair? No. It is purely Netgraph here.

Yeah, the bug is not exclusive to epair but that’s where it’s most
easily seen.

> ! I pushed a fix for that in CURRENT in r368237. It’s scheduled to go into
> ! stable/12 sometime next week, but it’d be good to know that it fixes your
> ! problem too before I merge it.
> ! In other words: can you test a recent CURRENT? It’s likely fixed there, and
> ! if it’s not I may be able to fix it quickly.
>
> Oh my Gods. No offense meant, but this is not really a good time
> for that. This is the most horrible upgrade I experienced in 25 years
> of FreeBSD (and it was prepared, 12.2 did run fine on the other machine).
>
> I have an issue with mem config
> https://forums.freebsd.org/threads/fun-with-upgrading-sysctl-unknown-oid-vm-pageout_wakeup_thresh.77955/
> I have an issue with a damaged filesystem, for no apparent reason
> https://forums.freebsd.org/threads/no-longer-fun-with-upgrading-file-offline.77959/
>
> Then I have this issue here, which is now happily worked around
> https://forums.freebsd.org/threads/panic-12-2-does-not-work-with-jails.77962/post-486365
>
> and when I then dare to have a look at my applications, they look like
> sheer horror, segfaults all over, and I don't even know where to begin
> with these.
>
> Other option: can you make this fix so that I can patch it into 12.2
> source and just redeploy?

Try
http://people.freebsd.org/~kp/0001-if-Fix-panic-when-destroying-vnet-and-epair-simultan.patch

That’s currently running the regression tests that used to provoke the
panic nearly instantly, and no panics so far.


Best regards.
Kristof


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-07 Thread Peter

Hi Kristof,
  it's great to read You!
  
On Mon, Dec 07, 2020 at 09:11:32PM +0100, Kristof Provost wrote:

! That smells a lot like the epair/vnet issues in bugs 238870, 234985, 244703,
! 250870.

epair? No. It is purely Netgraph here.

! I pushed a fix for that in CURRENT in r368237. It’s scheduled to go into
! stable/12 sometime next week, but it’d be good to know that it fixes your
! problem too before I merge it.
! In other words: can you test a recent CURRENT? It’s likely fixed there, and
! if it’s not I may be able to fix it quickly.


Oh my Gods. No offense meant, but this is not really a good time
for that. This is the most horrible upgrade I experienced in 25 years
of FreeBSD (and it was prepared, 12.2 did run fine on the other machine).

I have an issue with mem config
https://forums.freebsd.org/threads/fun-with-upgrading-sysctl-unknown-oid-vm-pageout_wakeup_thresh.77955/
I have an issue with a damaged filesystem, for no apparent reason
https://forums.freebsd.org/threads/no-longer-fun-with-upgrading-file-offline.77959/

Then I have this issue here, which is now happily worked around
https://forums.freebsd.org/threads/panic-12-2-does-not-work-with-jails.77962/post-486365

and when I then dare to have a look at my applications, they look like
sheer horror, segfaults all over, and I don't even know where to begin
with these.


Other option: can you make this fix so that I can patch it into 12.2
source and just redeploy?

I tried to apply the changes from r368237 into my 12.2 source, that
seemed to be quite obvious, but it doesn't work; jails fail to remove
entirely:

# service jail stop rail
Stopping jails: rail.
# jexec rail
jexec: jail "rail" not found

-> it works once.

# service jail start rail
Starting jails: rail.
# service jail stop rail
Stopping jails: rail.
# jexec rail
root@rail:/ # ps ax
ps: empty file: Invalid argument

-> And here it doesn't work anymore, and leaves a skull of a jail
   one cannot get rid of.


Cheerio,
PMc


Re: Panic: 12.2 fails to use VIMAGE jails

2020-12-07 Thread Kristof Provost

That smells a lot like the epair/vnet issues in bugs 238870, 234985, 
244703, 250870.
I pushed a fix for that in CURRENT in r368237. It’s scheduled to go 
into stable/12 sometime next week, but it’d be good to know that it 
fixes your problem too before I merge it.
In other words: can you test a recent CURRENT? It’s likely fixed 
there, and if it’s not I may be able to fix it quickly.


Best regards,
Kristof


Panic: 12.2 fails to use VIMAGE jails

2020-12-07 Thread Peter


After a clean upgrade (from source) from 11.4 to 12.2-p1 my jails
no longer work correctly.

Old-fashioned jails seem to work, but most are VIMAGE+NETGRAPH style,
and do not work properly.
All did work flawlessly for nearly a year with Rel.11.

If I start 2-3 jails, and then stop them again, there is always a
panic.
Also reproducible with GENERIC kernel.

Can this be fixed, or do I need to revert to 11.4?

The backtrace looks like this:

#4 0x810bbadf at trap_pfault+0x4f
#5 0x810bb23f at trap+0x4cf
#6 0x810933f8 at calltrap+0x8
#7 0x80cdd555 at _if_delgroup_locked+0x465
#8 0x80cdbfbe at if_detach_internal+0x24e
#9 0x80ce305c at if_vmove+0x3c
#10 0x80ce3010 at vnet_if_return+0x50
#11 0x80d0e696 at vnet_destroy+0x136
#12 0x80ba781d at prison_deref+0x27d
#13 0x80c3e38a at taskqueue_run_locked+0x14a
#14 0x80c3f799 at taskqueue_thread_loop+0xb9
#15 0x80b9fd52 at fork_exit+0x82
#16 0x8109442e at fork_trampoline+0xe

This is my typical jail config, designed and tested with Rel.11:

rail {
        jid = 10;
        devfs_ruleset = 11;
        host.hostname = "xxx.xxx.xxx.org";
        vnet = "new";
        sysvshm;
        $ifname1l = nge_${name}_1l;
        $ifname1l_mac = 00:1d:92:01:01:0a;
        vnet.interface = "$ifname1l";
        exec.prestart = "
            echo -e \"mkpeer eiface crhook ether\nname .:crhook $ifname1l\" \
                | /usr/sbin/ngctl -f -
            /usr/sbin/ngctl connect ${ifname1l}: svcswitch: ether link2
            ifname=`/usr/sbin/ngctl msg ${ifname1l}: getifname | \
                awk '$1 == \"Args:\" { print substr($2, 2, length($2)-2)}'`
            /sbin/ifconfig \$ifname name $ifname1l
            /sbin/ifconfig $ifname1l link $ifname1l_mac
            ";
        exec.poststart = "
            /usr/sbin/jexec $name /sbin/sysctl kern.securelevel=3 ;
            ";
        exec.poststop = "/usr/sbin/ngctl shutdown ${ifname1l}:";
}
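
The awk expression in exec.prestart pulls the system interface name out
of the reply to "ngctl msg ... getifname". A minimal sketch of just that
parsing step follows, in POSIX sh. The function names parse_getifname
and sample_reply are hypothetical, and the two-line sample reply is an
assumption about what ngctl prints (a header line, then "Args:" followed
by the quoted name); the real output should be checked on the target
system.

```shell
#!/bin/sh
# parse_getifname: read an "ngctl msg NODE: getifname" reply on stdin
# and print the interface name without its surrounding double quotes.
parse_getifname() {
    awk '$1 == "Args:" { print substr($2, 2, length($2) - 2) }'
}

# Assumed sample reply; node id and interface name are made up.
sample_reply() {
    printf "Rec'd response \"getifname\" (1) from \"[5]:\":\n"
    printf 'Args:\t"ngeth0"\n'
}

# Only the "Args:" line matches; its second field is "ngeth0" in quotes,
# and substr() drops the first and last character.
sample_reply | parse_getifname    # prints: ngeth0
```

Inside jail.conf the same awk program needs its double quotes escaped,
as in the config above, because the whole command sits in a quoted
string.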