Re: Panic: 12.2 fails to use VIMAGE jails
On Wed, Dec 09, 2020 at 02:00:37PM +1100, Dewayne Geraghty wrote:
! On a jail with config:
! exec.start = "/bin/sh -x /etc/rc";
! exec.stop = "/bin/sh /etc/rc.shutdown";
! exec.clean;
!
! test_prod { jid=7; persist; ip4.addr =
! "10.0.7.96,10.0.5.96,127.0.5.96"; devfs_ruleset = "6";
! host.hostuuid=---0001-0302; host.hostid=000302; }
!
! I successfully performed
! for i in `seq 10`; do jail -vc test_prod; sleep 3; jail -vr test_prod; done

But this is not a VIMAGE jail, is it? Old-style jails are unaffected by this issue. Only VIMAGE jails, using epair or netgraph, might be affected. (In that case, you would not have an "ip4.addr" configured, but rather a "vnet.interface".)

! I think the normal use of jail.conf is to NOT explicitly use a jid in
! the definition, which may be why this may not have been picked up?
! (Maybe a clue).

This is an interesting point. When you stop a jail, it may remain in a "dying" state (visible with "jls -d") for a shorter or longer time, keeping the jid occupied. During that time, the jail cannot be restarted with the same jid.

Some time ago I read people complaining about this, and the advice was simply not to define the jid, so that the jail can be restarted immediately (and will probably grab another jid).

I did not find a solid explanation for what happens in that "dying" state (and why it takes a variable amount of time), let alone an approach to fix it. I found some theories circulating on the net, but they don't really add up. So I would need to look into the source myself - and I have postponed that indefinitely. ;)

But what I did find out is this: with VIMAGE jails (those that can carry their own network interfaces), if you make a slight mistake in managing and handling the interfaces, the jail will stay in the dying state forever. If you don't make a mistake, it will eventually die after some time. So I decided to keep the jid, so that nothing misconfigured is allowed to linger unnoticed.
(The tradeoff, obviously, is that one might have to wait before restarting.)

cheerio,
PMc

P.S. 41 Celsius is fantastic! I envy you! :)

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
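A minimal sketch of the "keep a fixed jid, but wait until the dying jail has really gone" approach described above, in plain sh. It assumes that "jls -d -j JID" exits non-zero once no jail, alive or dying, holds that jid. The function name "wait_for_jail_gone", the 60-second default timeout and the 1-second poll interval are illustrative choices, not part of jail(8).

```shell
#!/bin/sh
# Sketch: block until a jid is no longer listed by "jls -d" (which also
# shows dying jails), so that "jail -c" can reuse the same fixed jid.
# Assumption: "jls -d -j JID" exits non-zero once no jail has that jid.
# Function name, default timeout and poll interval are illustrative.

wait_for_jail_gone() {
    local _jid=$1 _timeout=${2:-60} _waited=0
    # Poll once per second while the jid is still present (alive or dying)
    while jls -d -j "$_jid" > /dev/null 2>&1; do
        if [ "$_waited" -ge "$_timeout" ]; then
            echo "jail $_jid still present (dying?) after ${_timeout}s" >&2
            return 1
        fi
        sleep 1
        _waited=$((_waited + 1))
    done
    return 0
}
```

With the test_prod jail quoted above (jid=7), a restart could look like "jail -r test_prod && wait_for_jail_gone 7 120 && jail -c test_prod". A timeout is the "stays dying forever" case, which is worth investigating rather than retrying blindly.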
Re: Panic: 12.2 fails to use VIMAGE jails
Peter,

I’m not interested in discussing software development methodology here. Please drop me from this thread.

Let me know if/when you have a test case I can work from.

Regards,
Kristof

On 9 Dec 2020, at 11:54, Peter wrote:
> [...]
> Engineering does not normally work that way.
> [...]
Re: Panic: 12.2 fails to use VIMAGE jails
On Tue, Dec 08, 2020 at 07:51:07PM -0600, Kyle Evans wrote:
! You seem to have misinterpreted this; he doesn't want to narrow it
! down to one bug, he wants simple steps that he can follow to reproduce

Maybe I did misinterpret, but then I don't really understand it. I would suppose that, when testing a proposed fix, the fact that it breaks under exactly the same conditions as before is all the information needed at that point. Put simply: it does not work.

! any failure, preferably steps that can actually be followed by just
! about anyone and don't require immense amounts of setup time or
! additional hardware.

Engineering does not normally work that way. I'll try to explain: when a bug is first encountered, it is necessary to isolate it to the point where somebody who knows the code can actually reproduce it, in order to look at it and analyze what causes the misbehavior. If a remedy is then devised and does not work as expected, the flaw is in the analysis, and we simply start over from there.

In fact, I would have expected somebody who is trying to fix this kind of bug to already have testing tools available, and to tell me exactly which kind of data I should retrieve from the dumps.

The open question now is: am I the only one seeing these failures? Might they be attributed to a faulty configuration, or maybe hardware issues, or whatever? We cannot know this; we can only watch what happens at other sites. And that is why I sent out all these backtraces - because they look weird and might be difficult to associate with this issue.

I don't think there is much more we can do at this point, unless we are willing to actually look into the details.

Am I discouraging? Indeed, I think engineering is discouraging by its very nature, and that's the fun of it: to overcome the odds and finally, maybe, make things better. And when we start to forget that, bad things begin to happen (anybody remember Apollo 13?).

But talking about discouragement: I usually try to track down defects I encounter and, if possible, do a viable root-cause analysis. I tended to be very willing to share the outcomes and, if a solution arises, by all means get it back into the code base. But I found that even ready-made patches for easy matters would linger forever in the sendbug system without anybody caring. Or, in more complex cases where I needed some feedback from the original author, if only to clarify the purpose of some defaults or to verify that an approach is viable, that communication was very difficult to establish. That is what I would call discouraging, and I for my part have accepted leaving the developers in their ivory tower and tending to my own business.

cheerio,
PMc
Re: Panic: 12.2 fails to use VIMAGE jails
On 9 Dec 2020, at 2:31, Peter wrote:
> On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:
> ! > Sorry for the bad news.
> ! >
> ! You appear to be triggering two or three different bugs there.
>
> That is possible. Then there are two or three different bugs in the
> production code.
>
> In any case, my current workaround, i.e. delaying in the exec.poststop
>
> exec.poststop = "
>     sleep 6 ;
>     /usr/sbin/ngctl shutdown ${ifname1l}: ;
>     ";
>
> helps for all of it and makes the system behave solidly. This is true
> with and without your patch.
>
> ! Can you reduce your netgraph use case to a small test case that can trigger
> ! the problem?
>
> I'm sorry, I fear I don't get your point.
> Assuming there are actually two or three bugs here, you are asking me
> to reduce the config so that it triggers only one of them? Is that
> correct?

No, we need a simple case to reproduce these problems. It’s fine if that test case triggers multiple issues.

> Then let me put it differently: assume this is the OS for the life
> support system of the manned Jupiter mission. Which one of the bugs do
> you want to get fixed, and which would you prefer to keep, and have
> your oxygen supply cut off?
> https://www.youtube.com/watch?v=BEo2g-w545A

Happily we’re not in space.

> ! I’m not likely to be able to do anything unless I can reproduce
> ! the problem(s).
>
> I understand that. From your former mail I get the impression that you
> prefer to rely on tests. I consider this a bad habit[1] and prefer
> logical thinking.
> (Background: It is not that I would be unwilling to create clean and
> precisely reproducible scenarios. But one of my problems currently is
> that I only have two machines available: the graphical one where I'm
> typing right now, and the backend server with the jails that does
> practically everything.

These issues should trigger just fine in VMs. There’s no need for hardware pain.

Regards,
Kristof
Re: Panic: 12.2 fails to use VIMAGE jails
On Tue, Dec 8, 2020 at 7:45 PM Peter wrote:
>
> On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:
> > Can you reduce your netgraph use case to a small test case that can trigger
> > the problem?
>
> I'm sorry, I fear I don't get your point.
> Assuming there are actually two or three bugs here, you are asking me
> to reduce the config so that it triggers only one of them? Is that
> correct?
>
> Then let me put it differently: assume this is the OS for the life
> support system of the manned Jupiter mission. Which one of the bugs do
> you want to get fixed, and which would you prefer to keep, and have
> your oxygen supply cut off?
>
> https://www.youtube.com/watch?v=BEo2g-w545A

You seem to have misinterpreted this; he doesn't want to narrow it down to one bug, he wants simple steps that he can follow to reproduce any failure, preferably steps that can actually be followed by just about anyone and don't require immense amounts of setup time or additional hardware.

Unfortunately, your tone following the misunderstanding was pretty discouraging.

Thanks,

Kyle Evans
Re: Panic: 12.2 fails to use VIMAGE jails
On Tue, Dec 08, 2020 at 08:02:47PM +0100, Kristof Provost wrote:
! > Sorry for the bad news.
! >
! You appear to be triggering two or three different bugs there.

That is possible. Then there are two or three different bugs in the production code.

In any case, my current workaround, i.e. delaying in the exec.poststop

> exec.poststop = "
>     sleep 6 ;
>     /usr/sbin/ngctl shutdown ${ifname1l}: ;
>     ";

helps for all of it and makes the system behave solidly. This is true with and without your patch.

! Can you reduce your netgraph use case to a small test case that can trigger
! the problem?

I'm sorry, I fear I don't get your point. Assuming there are actually two or three bugs here, you are asking me to reduce the config so that it triggers only one of them? Is that correct?

Then let me put it differently: assume this is the OS for the life support system of the manned Jupiter mission. Which one of the bugs do you want to get fixed, and which would you prefer to keep, and have your oxygen supply cut off?
https://www.youtube.com/watch?v=BEo2g-w545A

! I’m not likely to be able to do anything unless I can reproduce
! the problem(s).

I understand that. From your former mail I get the impression that you prefer to rely on tests. I consider this a bad habit[1] and prefer logical thinking. So let's try that:

We know that there is a problem with taking an interface down from a VIMAGE, in the way it is done by "jail -r". We know this problem can be reliably worked around by delaying the interface takedown for a short time.

Now, with your patch, we do not get the typical crash at interface takedown. Instead, all of a sudden, there are strange crashes from various other places. And, interestingly, we get these also when STARTING a jail. I think this is not an additional problem; it is rather valuable information (albeit not the kind you might like to get).

Furthermore, these new crashes are always invoked by "ifconfig", and they seem to have in common that something tries to obtain information about some interface configuration and receives garbage. I might conclude, purely from gut feeling and without looking into details, that either
- your patch manages to garble some internal interface data, instead of doing what it is intended to do, or
- the original problem manages to garble internal interface data (leading to the usual crash), and your patch does not solve this, but only protects against the immediate consequence.

It might also be worth considering that, while the problem may be easier to reproduce with epair, this effect may or may not be netgraph-specific[2].

Now let's keep in mind that a successful test means EXACTLY NOTHING. By which other means can we confirm that your patch fully achieves what it is intended for? (E.g. something like dumping and verifying the respective internal tables in vivo.)

(Background: It is not that I would be unwilling to create clean and precisely reproducible scenarios. But one of my problems currently is that I only have two machines available: the graphical one where I'm typing right now, and the backend server with the jails that does practically everything. Therefore, experimenting on either of them creates considerable pain. I'm working on that issue, trying to get a real server board for the backend so as to free the current one for testing - but what I would like to use, e.g. ASUS Z10PE+cores+regECC, is not something one easily finds at yard sales - and seldom for an acceptable price.)

cheerio,
PMc

[1] Rationale: a failing test tells us that either the test or the application has a bug (50/50 chance). A succeeding test tells us that 1 equals 1, which we knew already. In fact, tests tell us *nothing at all* about the state of our code, and specifically, 'successful' outcomes do NOT mean that things are all correct.
The only real usefulness of tests is to protect against re-introducing a fault that was already fixed before, i.e. regressions.

[2] My netgraph configuration consists of bringing up some bridges and then attaching the jails to them. Here is the bridge starter (only the relevant component; more of these are populated, but they probably don't influence the issue):

#! /bin/sh
# PROVIDE: netgraphs
# REQUIRE: netwait
# BEFORE: NETWORKING

. /etc/rc.subr

name="netgraphs"
start_cmd="${name}_start"
stop_cmd="${name}_stop"
load_rc_config $name

netgraphs_graphs="svc"
netgraphs_svc_if1_name="nge_svc_1u"
netgraphs_svc_if1_mac="00:1d:92:01:02:01"
netgraphs_svc_if1_addr="***.***.***.***/29"

netgraphs_svc_start()
{
    local _ifname

    if ngctl info svcswitch: > /dev/null 2>&1; then
        netgraphs_svc_stop
    fi

    echo "Creating SVC Switch"
    ngctl -f - < /dev/null 2>&1; then
        $_cmd
    else
        echo "netgraphs-start: object $i not found" >&2
    fi
    done
}
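The exec.poststop workaround quoted in this message relies on a fixed "sleep 6". A sketch of a polling alternative: wait until the netgraph eiface is visible in the host network stack again, then shut the node down. This rests on an unverified assumption, namely that after "jail -r" the kernel moves vnet.interface interfaces back to the host (vnet_if_return appears in the original backtrace) under the same name, and that shutting the node down only after that avoids the teardown race. The function name and the 30-second default are hypothetical.

```shell
#!/bin/sh
# Hypothetical exec.poststop helper: instead of a fixed "sleep 6", wait
# until the eiface has returned to the host vnet, then shut the node down.
# Assumption (unverified): the interface reappears in the host under the
# same name once the jail's vnet is torn down.
# ngctl is called via PATH (it lives in /usr/sbin).

ng_shutdown_after_return() {
    local _if=$1 _timeout=${2:-30} _waited=0
    # "ifconfig IF" fails as long as the interface is not in the host vnet
    until ifconfig "$_if" > /dev/null 2>&1; do
        if [ "$_waited" -ge "$_timeout" ]; then
            echo "$_if did not return to host within ${_timeout}s" >&2
            return 1
        fi
        sleep 1
        _waited=$((_waited + 1))
    done
    ngctl shutdown "${_if}:"
}
```

It could be called from exec.poststop in place of "sleep 6 ; /usr/sbin/ngctl shutdown ${ifname1l}: ;", e.g. through a small wrapper script that sources it. Whether this actually closes the race is untested; a timeout at least makes a stuck interface visible instead of silently continuing.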
Re: Panic: 12.2 fails to use VIMAGE jails
Here is the next funny crash dump - I obtained this one twice, and also the sysctl_rtsock() one again. I can reproduce this by just starting and stopping a very simple jail that does only

exec.start = "/bin/sleep 4 &";

(And as usual, when I let it time out, nothing bad happens.)

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 02
instruction pointer = 0x20:0x80a2ac45
stack pointer = 0x28:0xfe0047cf2890
frame pointer = 0x28:0xfe0047cf2890
code segment = base 0x0, limit 0xf, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 13557 (ifconfig)
trap number = 9
panic: general protection fault
cpuid = 1
time = 1607469295
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe0047cf25a0
vpanic() at vpanic+0x17b/frame 0xfe0047cf25f0
panic() at panic+0x43/frame 0xfe0047cf2650
trap_fatal() at trap_fatal+0x391/frame 0xfe0047cf26b0
trap() at trap+0x67/frame 0xfe0047cf27c0
calltrap() at calltrap+0x8/frame 0xfe0047cf27c0
--- trap 0x9, rip = 0x80a2ac45, rsp = 0xfe0047cf2890, rbp = 0xfe0047cf2890 ---
strncmp() at strncmp+0x15/frame 0xfe0047cf2890
ifunit_ref() at ifunit_ref+0x59/frame 0xfe0047cf28d0
ifioctl() at ifioctl+0x427/frame 0xfe0047cf2990
kern_ioctl() at kern_ioctl+0x275/frame 0xfe0047cf29f0
sys_ioctl() at sys_ioctl+0x101/frame 0xfe0047cf2ac0
amd64_syscall() at amd64_syscall+0x380/frame 0xfe0047cf2bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe0047cf2bf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800475b2a, rsp = 0x7fffe3b8, rbp = 0x7fffe450 ---
Uptime: 8m54s
Dumping 880 out of 3959 MB:
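Kristof asks for a reproduction that anyone can run, e.g. in a VM. A sketch of what such a start/stop loop might look like, using epair (where, per the thread, the bug is most easily seen) and a throwaway VNET jail holding no processes. The epair is destroyed right after "jail -r", deliberately overlapping vnet teardown with epair destruction, which is the situation named in the title of the patch linked in this thread. The function name "yoyo_vnet_jail", the use of path=/ and the iteration and delay parameters are assumptions for illustration; run it only on a disposable test system.

```shell
#!/bin/sh
# Hypothetical reproduction loop: create a VNET jail holding one end of an
# epair, remove it, destroy the epair, repeat. Only for disposable systems.

yoyo_vnet_jail() {
    local _name=$1 _count=$2 _delay=$3 _i=0 _a _b
    while [ "$_i" -lt "$_count" ]; do
        _a=$(ifconfig epair create) || return 1   # prints e.g. "epair0a"
        _b="${_a%a}b"                             # peer end, e.g. "epair0b"
        jail -c name="$_name" path=/ vnet=new persist vnet.interface="$_b" ||
            return 1
        sleep "$_delay"
        jail -r "$_name" || return 1
        # Destroying one end of an epair also destroys its peer
        ifconfig "$_a" destroy
        _i=$((_i + 1))
    done
}
```

For example, "yoyo_vnet_jail vtest 10 3" mirrors the ten-iteration loop quoted earlier in the thread, but with a VIMAGE jail instead of an IP-based one.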
Re: Panic: 12.2 fails to use VIMAGE jails
On 8 Dec 2020, at 19:49, Peter wrote: On Tue, Dec 08, 2020 at 04:50:00PM +0100, Kristof Provost wrote: ! Yeah, the bug is not exclusive to epair but that’s where it’s most easily ! seen. Ack. ! Try http://people.freebsd.org/~kp/0001-if-Fix-panic-when-destroying-vnet-and-epair-simultan.patch Great, thanks a lot. Now I have bad news: when playing yoyo with the next-best three application jails (with all their installed stuff) it took about ten up and down's then I got this one: Fatal trap 12: page fault while in kernel mode cpuid = 1; apic id = 02 fault virtual address = 0x10 fault code = supervisor read data, page not present instruction pointer = 0x20:0x80aad73c stack pointer = 0x28:0xfe003f80e810 frame pointer = 0x28:0xfe003f80e810 code segment= base 0x0, limit 0xf, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags= interrupt enabled, resume, IOPL = 0 current process = 15486 (ifconfig) trap number = 12 panic: page fault cpuid = 1 time = 1607450838 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe003f80e4d0 vpanic() at vpanic+0x17b/frame 0xfe003f80e520 panic() at panic+0x43/frame 0xfe003f80e580 trap_fatal() at trap_fatal+0x391/frame 0xfe003f80e5e0 trap_pfault() at trap_pfault+0x4f/frame 0xfe003f80e630 trap() at trap+0x4cf/frame 0xfe003f80e740 calltrap() at calltrap+0x8/frame 0xfe003f80e740 --- trap 0xc, rip = 0x80aad73c, rsp = 0xfe003f80e810, rbp = 0xfe003f80e810 --- ng_eiface_mediastatus() at ng_eiface_mediastatus+0xc/frame 0xfe003f80e810 ifmedia_ioctl() at ifmedia_ioctl+0x174/frame 0xfe003f80e850 ifhwioctl() at ifhwioctl+0x639/frame 0xfe003f80e8d0 ifioctl() at ifioctl+0x448/frame 0xfe003f80e990 kern_ioctl() at kern_ioctl+0x275/frame 0xfe003f80e9f0 sys_ioctl() at sys_ioctl+0x101/frame 0xfe003f80eac0 amd64_syscall() at amd64_syscall+0x380/frame 0xfe003f80ebf0 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe003f80ebf0 --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800475b2a, rsp = 0x7fffe358, 
rbp = 0x7fffe450 --- Uptime: 9m51s Dumping 899 out of 3959 MB: I decided to give it a second try, and this is what I did: root@edge:/var/crash # jls JID IP Address Hostname Path 1 1***gate.***.org /j/gate 3 1***raix.***.org /j/raix 4 oper.***.org /j/oper 5 admn.***.org /j/admn 6 data.***.org /j/data 7 conn.***.org /j/conn 8 kerb.***.org /j/kerb 9 tele.***.org /j/tele 10 rail.***.org /j/rail root@edge:/var/crash # service jail stop rail Stopping jails: rail. root@edge:/var/crash # service jail stop tele Stopping jails: tele. root@edge:/var/crash # service jail stop kerb Stopping jails: kerb. root@edge:/var/crash # jls JID IP Address Hostname Path 1 1***gate.***.org /j/gate 3 1***raix.***.org /j/raix 4 oper.***.org /j/oper 5 admn.***.org /j/admn 6 data.***.org /j/data 7 conn.***.org /j/conn root@edge:/var/crash # jls -d JID IP Address Hostname Path 1 1***gate.***.org /j/gate 3 1***raix.***.org /j/raix 4 oper.***.org /j/oper 5 admn.***.org /j/admn 6 data.***.org /j/data 7 conn.***.org /j/conn 9 tele.***.org /j/tele 10 rail.***.org /j/rail root@edge:/var/crash # service jail start kerb Starting jails:Fssh_packet_write_wait: Connection to 1*** port 22: Broken pipe Fatal trap 12: page fault while in kernel mode cpuid = 1; apic id = 02 fault virtual address = 0x0 fault code = supervisor read instruction, page not present instruction pointer = 0x20:0x0 stack pointer = 0x28:0xfe00540ea658 frame pointer = 0x28:0xfe00540ea670 code segment= base 0x0, limit 0xf, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags= interrupt enabled, resume, IOPL = 0 current process = 13420 (ifconfig) trap number = 12 panic: page fault cpuid = 1 time = 1607451910 KDB: st
Re: Panic: 12.2 fails to use VIMAGE jails
On Tue, Dec 08, 2020 at 04:50:00PM +0100, Kristof Provost wrote:
! Yeah, the bug is not exclusive to epair but that’s where it’s most easily
! seen.

Ack.

! Try http://people.freebsd.org/~kp/0001-if-Fix-panic-when-destroying-vnet-and-epair-simultan.patch

Great, thanks a lot.

Now I have bad news: when playing yo-yo with the next-best three application jails (with all their installed stuff), it took about ten ups and downs, and then I got this one:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0x80aad73c
stack pointer = 0x28:0xfe003f80e810
frame pointer = 0x28:0xfe003f80e810
code segment = base 0x0, limit 0xf, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 15486 (ifconfig)
trap number = 12
panic: page fault
cpuid = 1
time = 1607450838
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe003f80e4d0
vpanic() at vpanic+0x17b/frame 0xfe003f80e520
panic() at panic+0x43/frame 0xfe003f80e580
trap_fatal() at trap_fatal+0x391/frame 0xfe003f80e5e0
trap_pfault() at trap_pfault+0x4f/frame 0xfe003f80e630
trap() at trap+0x4cf/frame 0xfe003f80e740
calltrap() at calltrap+0x8/frame 0xfe003f80e740
--- trap 0xc, rip = 0x80aad73c, rsp = 0xfe003f80e810, rbp = 0xfe003f80e810 ---
ng_eiface_mediastatus() at ng_eiface_mediastatus+0xc/frame 0xfe003f80e810
ifmedia_ioctl() at ifmedia_ioctl+0x174/frame 0xfe003f80e850
ifhwioctl() at ifhwioctl+0x639/frame 0xfe003f80e8d0
ifioctl() at ifioctl+0x448/frame 0xfe003f80e990
kern_ioctl() at kern_ioctl+0x275/frame 0xfe003f80e9f0
sys_ioctl() at sys_ioctl+0x101/frame 0xfe003f80eac0
amd64_syscall() at amd64_syscall+0x380/frame 0xfe003f80ebf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe003f80ebf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800475b2a, rsp = 0x7fffe358, rbp = 0x7fffe450 ---
Uptime: 9m51s
Dumping 899 out of 3959 MB:

I decided to give it a second try, and this is what I did:

root@edge:/var/crash # jls
   JID  IP Address      Hostname                      Path
     1  1***gate.***.org                              /j/gate
     3  1***raix.***.org                              /j/raix
     4                  oper.***.org                  /j/oper
     5                  admn.***.org                  /j/admn
     6                  data.***.org                  /j/data
     7                  conn.***.org                  /j/conn
     8                  kerb.***.org                  /j/kerb
     9                  tele.***.org                  /j/tele
    10                  rail.***.org                  /j/rail
root@edge:/var/crash # service jail stop rail
Stopping jails: rail.
root@edge:/var/crash # service jail stop tele
Stopping jails: tele.
root@edge:/var/crash # service jail stop kerb
Stopping jails: kerb.
root@edge:/var/crash # jls
   JID  IP Address      Hostname                      Path
     1  1***gate.***.org                              /j/gate
     3  1***raix.***.org                              /j/raix
     4                  oper.***.org                  /j/oper
     5                  admn.***.org                  /j/admn
     6                  data.***.org                  /j/data
     7                  conn.***.org                  /j/conn
root@edge:/var/crash # jls -d
   JID  IP Address      Hostname                      Path
     1  1***gate.***.org                              /j/gate
     3  1***raix.***.org                              /j/raix
     4                  oper.***.org                  /j/oper
     5                  admn.***.org                  /j/admn
     6                  data.***.org                  /j/data
     7                  conn.***.org                  /j/conn
     9                  tele.***.org                  /j/tele
    10                  rail.***.org                  /j/rail
root@edge:/var/crash # service jail start kerb
Starting jails:
Fssh_packet_write_wait: Connection to 1*** port 22: Broken pipe

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x0
fault code = supervisor read instruction, page not present
instruction pointer = 0x20:0x0
stack pointer = 0x28:0xfe00540ea658
frame pointer = 0x28:0xfe00540ea670
code segment = base 0x0, limit 0xf, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 13420 (ifconfig)
trap number = 12
panic: page fault
cpuid = 1
time = 1607451910
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_s
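In the transcript above, "jls -d" still lists tele and rail after they were stopped, while plain "jls" does not. A small sketch that prints exactly those jids, i.e. jails that were removed but are still dying, using only the "jid" parameter of jls(8). The function name "dying_jails" is hypothetical.

```shell
#!/bin/sh
# Sketch: print jids shown by "jls -d" (alive + dying) but not by "jls"
# (alive only), i.e. jails that are still in the dying state.

dying_jails() {
    local _all _alive
    _all=$(jls -d jid | sort -n)
    _alive=$(jls jid | sort -n)
    # Emit every jid from the full list that is missing from the alive list
    printf '%s\n' "$_all" | while read -r _jid; do
        [ -n "$_jid" ] || continue
        printf '%s\n' "$_alive" | grep -qx "$_jid" || echo "$_jid"
    done
}
```

Running it before "service jail start" would have shown 9 and 10 at that point; with a fixed jid in jail.conf, a jail whose jid is still listed cannot be started yet.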
Re: Panic: 12.2 fails to use VIMAGE jails
On 8 Dec 2020, at 0:34, Peter wrote:
> Hi Kristof, it's great to read you!
>
> On Mon, Dec 07, 2020 at 09:11:32PM +0100, Kristof Provost wrote:
> ! That smells a lot like the epair/vnet issues in bugs 238870, 234985, 244703,
> ! 250870.
>
> epair? No. It is purely Netgraph here.

Yeah, the bug is not exclusive to epair but that’s where it’s most easily seen.

> ! I pushed a fix for that in CURRENT in r368237. It’s scheduled to go into
> ! stable/12 sometime next week, but it’d be good to know that it fixes your
> ! problem too before I merge it.
> ! In other words: can you test a recent CURRENT? It’s likely fixed there, and
> ! if it’s not I may be able to fix it quickly.
>
> Oh my Gods. No offense meant, but this is not really a good time for
> that. This is the most horrible upgrade I have experienced in 25 years
> of FreeBSD (and it was prepared; 12.2 did run fine on the other machine).
> I have an issue with the memory config
> https://forums.freebsd.org/threads/fun-with-upgrading-sysctl-unknown-oid-vm-pageout_wakeup_thresh.77955/
> I have an issue with a damaged filesystem, for no apparent reason
> https://forums.freebsd.org/threads/no-longer-fun-with-upgrading-file-offline.77959/
> Then I have this issue here, which is now thankfully worked around
> https://forums.freebsd.org/threads/panic-12-2-does-not-work-with-jails.77962/post-486365
> and when I then dare to have a look at my applications, they look like
> sheer horror, segfaults all over, and I don't even know where to begin
> with these.
>
> Other option: can you make this fix so that I can patch it into the
> 12.2 source and just redeploy?

Try http://people.freebsd.org/~kp/0001-if-Fix-panic-when-destroying-vnet-and-epair-simultan.patch

That’s currently running the regression tests that used to provoke the panic nearly instantly, and no panics so far.

Best regards,
Kristof
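Peter asks for a fix he can apply to his 12.2 source instead of running CURRENT. A sketch of applying such a patch to a 12.2 source tree and rebuilding only the kernel. Assumptions: the sources live in /usr/src, the patch is in "git format-patch" style (hence -p1; an svn-style diff would need -p0), and the kernel config name is passed in (GENERIC by default). The function name and the /tmp path are hypothetical.

```shell
#!/bin/sh
# Sketch: fetch the test patch, apply it to a 12.2 source tree and rebuild
# only the kernel. Assumes /usr/src and a git-style patch (-p1).

PATCH_URL=http://people.freebsd.org/~kp/0001-if-Fix-panic-when-destroying-vnet-and-epair-simultan.patch

apply_and_build() {
    local _src=${1:-/usr/src} _kconf=${2:-GENERIC} _p=/tmp/vnet-fix.patch
    fetch -o "$_p" "$PATCH_URL" || return 1
    cd "$_src" || return 1
    # Check first that every hunk applies; abort without touching the tree
    patch --dry-run -p1 < "$_p" || return 1
    patch -p1 < "$_p" || return 1
    make -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF="$_kconf" || return 1
    # Installs to /boot/kernel (old one kept as /boot/kernel.old); reboot after
    make installkernel KERNCONF="$_kconf"
}
```

For example, "apply_and_build /usr/src GENERIC" followed by a reboot. Doing the dry run first keeps a partially applied patch out of the source tree if the patch does not match 12.2.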
Re: Panic: 12.2 fails to use VIMAGE jails
Hi Kristof, it's great to read you!

On Mon, Dec 07, 2020 at 09:11:32PM +0100, Kristof Provost wrote:
! That smells a lot like the epair/vnet issues in bugs 238870, 234985, 244703,
! 250870.

epair? No. It is purely Netgraph here.

! I pushed a fix for that in CURRENT in r368237. It’s scheduled to go into
! stable/12 sometime next week, but it’d be good to know that it fixes your
! problem too before I merge it.
! In other words: can you test a recent CURRENT? It’s likely fixed there, and
! if it’s not I may be able to fix it quickly.

Oh my Gods. No offense meant, but this is not really a good time for that. This is the most horrible upgrade I have experienced in 25 years of FreeBSD (and it was prepared; 12.2 did run fine on the other machine).

I have an issue with the memory config
https://forums.freebsd.org/threads/fun-with-upgrading-sysctl-unknown-oid-vm-pageout_wakeup_thresh.77955/
I have an issue with a damaged filesystem, for no apparent reason
https://forums.freebsd.org/threads/no-longer-fun-with-upgrading-file-offline.77959/
Then I have this issue here, which is now thankfully worked around
https://forums.freebsd.org/threads/panic-12-2-does-not-work-with-jails.77962/post-486365
and when I then dare to have a look at my applications, they look like sheer horror, segfaults all over, and I don't even know where to begin with these.

Other option: can you make this fix so that I can patch it into the 12.2 source and just redeploy?

I tried to apply the changes from r368237 to my 12.2 source; that seemed quite straightforward, but it doesn't work - jails fail to be removed entirely:

# service jail stop rail
Stopping jails: rail.
# jexec rail
jexec: jail "rail" not found

-> It works once.

# service jail start rail
Starting jails: rail.
# service jail stop rail
Stopping jails: rail.
# jexec rail
root@rail:/ # ps ax
ps: empty file: Invalid argument

-> And here it doesn't work anymore, and it leaves behind the skeleton of a jail that one cannot get rid of.

Cheerio,
PMc
Re: Panic: 12.2 fails to use VIMAGE jails
On 7 Dec 2020, at 13:54, Peter wrote:
> After a clean upgrade (from source) from 11.4 to 12.2-p1, my jails no
> longer work correctly. Old-fashioned jails seem to work, but most of
> mine are VIMAGE+NETGRAPH style, and they do not work properly. All of
> them worked flawlessly for nearly a year with Rel. 11.
>
> If I start 2-3 jails and then stop them again, there is always a panic.
> This is also reproducible with the GENERIC kernel.
>
> Can this be fixed, or do I need to revert to 11.4?
>
> The backtrace looks like this:
>
> #4 0x810bbadf at trap_pfault+0x4f
> #5 0x810bb23f at trap+0x4cf
> #6 0x810933f8 at calltrap+0x8
> #7 0x80cdd555 at _if_delgroup_locked+0x465
> #8 0x80cdbfbe at if_detach_internal+0x24e
> #9 0x80ce305c at if_vmove+0x3c
> #10 0x80ce3010 at vnet_if_return+0x50
> #11 0x80d0e696 at vnet_destroy+0x136
> #12 0x80ba781d at prison_deref+0x27d
> #13 0x80c3e38a at taskqueue_run_locked+0x14a
> #14 0x80c3f799 at taskqueue_thread_loop+0xb9
> #15 0x80b9fd52 at fork_exit+0x82
> #16 0x8109442e at fork_trampoline+0xe
>
> This is my typical jail config, designed and tested with Rel. 11:

That smells a lot like the epair/vnet issues in bugs 238870, 234985, 244703, 250870.

I pushed a fix for that in CURRENT in r368237. It’s scheduled to go into stable/12 sometime next week, but it’d be good to know that it fixes your problem too before I merge it.

In other words: can you test a recent CURRENT? It’s likely fixed there, and if it’s not I may be able to fix it quickly.

Best regards,
Kristof
Panic: 12.2 fails to use VIMAGE jails
After a clean upgrade (from source) from 11.4 to 12.2-p1, my jails no longer work correctly. Old-fashioned jails seem to work, but most of mine are VIMAGE+NETGRAPH style, and they do not work properly. All of them worked flawlessly for nearly a year with Rel. 11.

If I start 2-3 jails and then stop them again, there is always a panic. This is also reproducible with the GENERIC kernel.

Can this be fixed, or do I need to revert to 11.4?

The backtrace looks like this:

#4 0x810bbadf at trap_pfault+0x4f
#5 0x810bb23f at trap+0x4cf
#6 0x810933f8 at calltrap+0x8
#7 0x80cdd555 at _if_delgroup_locked+0x465
#8 0x80cdbfbe at if_detach_internal+0x24e
#9 0x80ce305c at if_vmove+0x3c
#10 0x80ce3010 at vnet_if_return+0x50
#11 0x80d0e696 at vnet_destroy+0x136
#12 0x80ba781d at prison_deref+0x27d
#13 0x80c3e38a at taskqueue_run_locked+0x14a
#14 0x80c3f799 at taskqueue_thread_loop+0xb9
#15 0x80b9fd52 at fork_exit+0x82
#16 0x8109442e at fork_trampoline+0xe

This is my typical jail config, designed and tested with Rel. 11:

rail {
    jid = 10;
    devfs_ruleset = 11;
    host.hostname = "xxx.xxx.xxx.org";
    vnet = "new";
    sysvshm;
    $ifname1l = nge_${name}_1l;
    $ifname1l_mac = 00:1d:92:01:01:0a;
    vnet.interface = "$ifname1l";
    exec.prestart = "
        echo -e \"mkpeer eiface crhook ether\nname .:crhook $ifname1l\" \
            | /usr/sbin/ngctl -f -
        /usr/sbin/ngctl connect ${ifname1l}: svcswitch: ether link2
        ifname=`/usr/sbin/ngctl msg ${ifname1l}: getifname | \
            awk '$1 == \"Args:\" { print substr($2, 2, length($2)-2)}'`
        /sbin/ifconfig \$ifname name $ifname1l
        /sbin/ifconfig $ifname1l link $ifname1l_mac
    ";
    exec.poststart = "
        /usr/sbin/jexec $name /sbin/sysctl kern.securelevel=3 ;
    ";
    exec.poststop = "/usr/sbin/ngctl shutdown ${ifname1l}:";
}
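The exec.prestart above extracts the kernel interface name from the reply of "ngctl msg ${ifname1l}: getifname" with an inline awk program. A sketch that moves that parsing into a shell function, so it can be checked without netgraph or a running jail. It assumes the reply contains a line of the form Args:<TAB>"ngeth0", which is what the awk program in the config expects; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: the getifname parsing from exec.prestart as a testable function.
# Assumption: ngctl's reply contains a line like:  Args:<TAB>"ngeth0"

ng_ifname_from_reply() {
    # Print the second field of the "Args:" line without its double quotes
    awk '$1 == "Args:" { print substr($2, 2, length($2) - 2) }'
}

# Hypothetical wrapper combining it with the real ngctl call
ng_getifname() {
    ngctl msg "${1}:" getifname | ng_ifname_from_reply
}
```

Inside exec.prestart this would become "ifname=$(ng_getifname $ifname1l)" followed by the same "/sbin/ifconfig $ifname name $ifname1l" rename. An empty result then signals that the reply format differs from the assumed one, rather than passing an empty name to ifconfig.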