Re: RLIMIT_CPU doesn't work reliably on mostly idle systems
On Tue, Aug 29, 2023 at 07:15:14PM +0200, Claudio Jeker wrote: > On Tue, Aug 29, 2023 at 01:01:10AM +, Eric Wong wrote: > > >Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems > > >Category: system > > >Environment: > > System : OpenBSD 7.3 > > Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT > > 2023 > > > > dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP > > > > Architecture: OpenBSD.octeon > > Machine : octeon > > >Description: > > > > RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an > > otherwise idle system (aside from the test process using up CPU). > > It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU > > from rlim_cur=1 won't even fire). > > > > I can reproduce this on a private amd64 VM and also on gcc231 > > on GCC compiler farm <https://cfarm.tetaneutral.net/>. > > I can't reproduce this on a busy system like gcc220 on cfarm, > > however. > > Thanks for the report. There is indeed an issue in how the CPU time is > accounted on an idle system. The below diff is a possible fix. > > In roundrobin() force a resched and therefor mi_switch() when > SPCF_SHOULDYIELD is set. On an idle CPU mi_switch() will just do all > accounting bits but skip the expensive cpu_switchto() since the proc > remains the same. This is an elegant solution. I'm a little reluctant to add mi_switch() overhead to a thread if *nothing* else wants to run, but the roundrobin() quantum is massive at the moment, so it's not all that much. This change also fixes a couple other things. - Runaway user threads (like a synthetic spinloop) can no longer keep the system from suspending. If such threads are forced into mi_switch() after two ticks, sched_chooseproc() will then see SPCF_SHOULDHALT, etc. - Sibling theads in a process using ITIMER_VIRTUAL or ITIMER_PROF are now forced to enable the itimer_update() interrupt after two ticks. - Sibling threads in a process running profil(2) are now forced to enable the profclock() interrupt after two ticks. For sake of discussion, an alternative approach is below. Basically, tweak rucheck() to account for any un-accumulated ONPROC time. It seems to work. I would lean toward your change, though. Fixing those other things is a nice bonus. Index: kern_resource.c === RCS file: /cvs/src/sys/kern/kern_resource.c,v retrieving revision 1.78 diff -u -p -r1.78 kern_resource.c --- kern_resource.c 29 Aug 2023 16:19:34 - 1.78 +++ kern_resource.c 29 Aug 2023 22:21:06 - @@ -531,21 +531,32 @@ ruadd(struct rusage *ru, struct rusage * void rucheck(void *arg) { + struct timespec elapsed, now, total; struct rlimit rlim; struct process *pr = arg; time_t runtime; + struct proc *p; int s; KERNEL_ASSERT_LOCKED(); SCHED_LOCK(s); - runtime = pr->ps_tu.tu_runtime.tv_sec; + nanouptime(); + total = pr->ps_tu.tu_runtime; + TAILQ_FOREACH(p, >ps_threads, p_thr_link) { + if (p->p_stat == SONPROC) { + timespecsub(, >p_cpu->ci_schedstate.spc_runtime, + ); + timespecadd(, , ); + } + } SCHED_UNLOCK(s); mtx_enter(>ps_mtx); rlim = pr->ps_limit->pl_rlimit[RLIMIT_CPU]; mtx_leave(>ps_mtx); + runtime = total.tv_sec; if ((rlim_t)runtime >= rlim.rlim_cur) { if ((rlim_t)runtime >= rlim.rlim_max) { prsignal(pr, SIGKILL);
Re: RLIMIT_CPU doesn't work reliably on mostly idle systems
> Date: Tue, 29 Aug 2023 19:15:14 +0200 > From: Claudio Jeker > > On Tue, Aug 29, 2023 at 01:01:10AM +, Eric Wong wrote: > > >Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems > > >Category: system > > >Environment: > > System : OpenBSD 7.3 > > Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT > > 2023 > > > > dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP > > > > Architecture: OpenBSD.octeon > > Machine : octeon > > >Description: > > > > RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an > > otherwise idle system (aside from the test process using up CPU). > > It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU > > from rlim_cur=1 won't even fire). > > > > I can reproduce this on a private amd64 VM and also on gcc231 > > on GCC compiler farm <https://cfarm.tetaneutral.net/>. > > I can't reproduce this on a busy system like gcc220 on cfarm, > > however. > > Thanks for the report. There is indeed an issue in how the CPU time is > accounted on an idle system. The below diff is a possible fix. > > In roundrobin() force a resched and therefor mi_switch() when > SPCF_SHOULDYIELD is set. On an idle CPU mi_switch() will just do all > accounting bits but skip the expensive cpu_switchto() since the proc > remains the same. A bit of a hack, but probably better than trying to account for spc_runtime at the point where we check the limit. Also this will call smr_idle() sooner, which may help speed up smr? ok kettenis@ > Index: kern/sched_bsd.c > === > RCS file: /cvs/src/sys/kern/sched_bsd.c,v > retrieving revision 1.84 > diff -u -p -r1.84 sched_bsd.c > --- kern/sched_bsd.c 29 Aug 2023 16:19:34 - 1.84 > +++ kern/sched_bsd.c 29 Aug 2023 16:20:03 - > @@ -106,7 +106,7 @@ roundrobin(struct clockintr *cl, void *c > } > } > > - if (spc->spc_nrun) > + if (spc->spc_nrun || spc->spc_schedflags & SPCF_SHOULDYIELD) > need_resched(ci); > } > > >
Re: RLIMIT_CPU doesn't work reliably on mostly idle systems
On Tue, Aug 29, 2023 at 01:01:10AM +, Eric Wong wrote: > >Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems > >Category: system > >Environment: > System : OpenBSD 7.3 > Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT > 2023 > > dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP > > Architecture: OpenBSD.octeon > Machine : octeon > >Description: > > RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an > otherwise idle system (aside from the test process using up CPU). > It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU > from rlim_cur=1 won't even fire). > > I can reproduce this on a private amd64 VM and also on gcc231 > on GCC compiler farm <https://cfarm.tetaneutral.net/>. > I can't reproduce this on a busy system like gcc220 on cfarm, > however. Thanks for the report. There is indeed an issue in how the CPU time is accounted on an idle system. The below diff is a possible fix. In roundrobin() force a resched and therefor mi_switch() when SPCF_SHOULDYIELD is set. On an idle CPU mi_switch() will just do all accounting bits but skip the expensive cpu_switchto() since the proc remains the same. -- :wq Claudio Index: kern/sched_bsd.c === RCS file: /cvs/src/sys/kern/sched_bsd.c,v retrieving revision 1.84 diff -u -p -r1.84 sched_bsd.c --- kern/sched_bsd.c29 Aug 2023 16:19:34 - 1.84 +++ kern/sched_bsd.c29 Aug 2023 16:20:03 - @@ -106,7 +106,7 @@ roundrobin(struct clockintr *cl, void *c } } - if (spc->spc_nrun) + if (spc->spc_nrun || spc->spc_schedflags & SPCF_SHOULDYIELD) need_resched(ci); }
Re: RLIMIT_CPU doesn't work reliably on mostly idle systems
> On Aug 28, 2023, at 20:04, Eric Wong wrote: > > >> >> Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems >> Category: system >> Environment: >System : OpenBSD 7.3 >Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 2023 > > dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP > >Architecture: OpenBSD.octeon >Machine : octeon >> Description: > > RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an > otherwise idle system (aside from the test process using up CPU). > It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU > from rlim_cur=1 won't even fire). > > I can reproduce this on a private amd64 VM and also on gcc231 > on GCC compiler farm <https://cfarm.tetaneutral.net/>. > I can't reproduce this on a busy system like gcc220 on cfarm, > however. > >> How-To-Repeat: > > Following is a standalone C program which demonstrates the problem on > a mostly idle system: > /* > * Most reliably reproduced with compiler optimizations disabled: > * > * cc -o rlimit_cpu -ggdb3 -Wall rlimit_cpu.c > * > * Neither SIGXCPU (from rlim_cur) nor SIGKILL (from rlim_max) > * with RLIMIT_CPU set seem to fire reliably with few syscalls being made. > * On an otherwise idle system, it can take many more seconds (20-50s) > * than expected when rlim_max=9 (SIGXCPU doesn't even happen). > * Best case is 2 seconds for SIGXCPU when rlim_cur=1 on a busy system > * which is understandable due to kernel accounting delays. > * > * I rely on RLIMIT_CPU to protect systems from pathological userspace > * code (diff generation, regexps, etc) > * > * Testing on cfarm <https://cfarm.tetaneutral.net/> machines, > * the issue is visible on a mostly-idle 4-core gcc231 mips64 > * but doesn't seem to happen on the busy 12-core gcc220 machine > * (only 2 seconds for XCPU w/ rlim_cur=1) > */ > #include > #include > #include > #include > > static void sigxcpu(int sig) > { >write(1, "SIGXCPU\n", 8); >_exit(1); > } > > static volatile size_t nr; // volatile to disable optimizations > int main(void) > { >struct rlimit rlim = { .rlim_cur = 1, .rlim_max = 9 }; >int rc; > >signal(SIGXCPU, sigxcpu); >rc = setrlimit(RLIMIT_CPU, ); >assert(rc == 0); > >/* > * adding some time, times, and write syscalls improve likelyhood of > * of rlimit signals firing in a timely manner. writes to /dev/null > * seems less-likely to trigger than writes to the terminal or > * regular file. > */ >for (;; nr++) { >} > >return 0; > } > >> Fix: >Making more syscalls can workaround the problem, but that's not > an option when dealing with userspace-heavy code like pathological regexps. The CPU time limit is checked from a periodic timeout. CPU time totals accumulate in mi_switch(). The problem is that on a mostly idle system, a user thread that is hogging the CPU may take a very long time to switch out, and the and all that CPU time doesn't accumulate until the switch, and so the signal will arrive later than requested. System calls have points where a thread can switch out, which is why the delay is exaggerated on synthetic workloads like the busy-loop shown above. One possible solution is to check usage times for threads that are still ONPROC during the rusage timeout. Another approach is to be more aggressive about forcing threads to switch out, even when nothing else wants to run. Coincidentally, we are discussing p_rtime on tech@ right now, which is tangentially related to this issue.
RLIMIT_CPU doesn't work reliably on mostly idle systems
>Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems >Category: system >Environment: System : OpenBSD 7.3 Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 2023 dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP Architecture: OpenBSD.octeon Machine : octeon >Description: RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an otherwise idle system (aside from the test process using up CPU). It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU from rlim_cur=1 won't even fire). I can reproduce this on a private amd64 VM and also on gcc231 on GCC compiler farm <https://cfarm.tetaneutral.net/>. I can't reproduce this on a busy system like gcc220 on cfarm, however. >How-To-Repeat: Following is a standalone C program which demonstrates the problem on a mostly idle system: /* * Most reliably reproduced with compiler optimizations disabled: * * cc -o rlimit_cpu -ggdb3 -Wall rlimit_cpu.c * * Neither SIGXCPU (from rlim_cur) nor SIGKILL (from rlim_max) * with RLIMIT_CPU set seem to fire reliably with few syscalls being made. * On an otherwise idle system, it can take many more seconds (20-50s) * than expected when rlim_max=9 (SIGXCPU doesn't even happen). * Best case is 2 seconds for SIGXCPU when rlim_cur=1 on a busy system * which is understandable due to kernel accounting delays. * * I rely on RLIMIT_CPU to protect systems from pathological userspace * code (diff generation, regexps, etc) * * Testing on cfarm <https://cfarm.tetaneutral.net/> machines, * the issue is visible on a mostly-idle 4-core gcc231 mips64 * but doesn't seem to happen on the busy 12-core gcc220 machine * (only 2 seconds for XCPU w/ rlim_cur=1) */ #include #include #include #include static void sigxcpu(int sig) { write(1, "SIGXCPU\n", 8); _exit(1); } static volatile size_t nr; // volatile to disable optimizations int main(void) { struct rlimit rlim = { .rlim_cur = 1, .rlim_max = 9 }; int rc; signal(SIGXCPU, sigxcpu); rc = setrlimit(RLIMIT_CPU, ); assert(rc == 0); /* * adding some time, times, and write syscalls improve likelyhood of * of rlimit signals firing in a timely manner. writes to /dev/null * seems less-likely to trigger than writes to the terminal or * regular file. */ for (;; nr++) { } return 0; } >Fix: Making more syscalls can workaround the problem, but that's not an option when dealing with userspace-heavy code like pathological regexps. dmesg: OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 2023 dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP real mem = 1073741824 (1024MB) avail mem = 1035304960 (987MB) random: good seed from bootblocks mainbus0 at root: board 20300 rev 0.15, model cavium,ubnt_e300 cpu0 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0 cpu0: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way cpu1 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0 cpu1: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way cpu2 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0 cpu2: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way cpu3 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0 cpu3: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way clock0 at mainbus0: int 5 octcrypto0 at mainbus0 iobus0 at mainbus0 simplebus0 at iobus0: "soc" "bootbus" at simplebus0 not configured octciu0 at simplebus0 octcib0 at simplebus0: max-bits 23 octcib1 at simplebus0: max-bits 12 octcib2 at simplebus0: max-bits 6 octcib3 at simplebus0: max-bits 15 octcib4 at simplebus0: max-bits 4 octcib5 at simplebus0: max-bits 11 octcib6 at simplebus0: max-bits 11 octgpio0 at simplebus0: 20 pins, xbit 16 octsmi0 at simplebus0 octsmi1 at simplebus0 octpip0 at simplebus0 octgmx0 at octpip0 interface 0 cnmac0 at octgmx0: port 0 SGMII, address e0:63:da:c0:62:27 ukphy0 at cnmac0 phy 4: Generic IEEE 802.3u media interface, rev. 2: OUI 0x0001c1, model 0x000c cnmac1 at octgmx0: port 1 SGMII, address e0:63:da:c0:62:28 ukphy1 at cnmac1 phy 5: Generic IEEE 802.3u media interface, rev. 2: OUI 0x0001c1, model 0x000c cnmac2 at octgmx0: port 2 SGMII, address e0:63:da:c0:62:29 ukphy2 at cnmac2 phy 6: Generic IEEE 802.3u media interface, rev. 2: OUI 0x0001c1, model 0x000c cnmac3 at octgmx0: port 3 SGMII, address e0:63:da:c0:62:2a ukphy3 at cnmac3 phy 7: Generic IEEE 802.3u media interface, rev. 2: OUI 0x0001c1, model 0x000c octsctl0 at simplebus0: disabled octxctl0 at simplebus0: DWC3 rev 0x250a xhci0 at octxctl0, xHCI 1.0 usb0 at xhci0: USB revision 3.0 uhub0 at usb0 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 addr