Re: RLIMIT_CPU doesn't work reliably on mostly idle systems

2023-08-29 Thread Scott Cheloha
On Tue, Aug 29, 2023 at 07:15:14PM +0200, Claudio Jeker wrote:
> On Tue, Aug 29, 2023 at 01:01:10AM +, Eric Wong wrote:
> > >Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems
> > >Category: system
> > >Environment:
> > System  : OpenBSD 7.3
> > Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 
> > 2023
> >  
> > dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.octeon
> > Machine : octeon
> > >Description:
> > 
> > RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an
> > otherwise idle system (aside from the test process using up CPU).
> > It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU
> > from rlim_cur=1 won't even fire).
> > 
> > I can reproduce this on a private amd64 VM and also on gcc231
> > on GCC compiler farm <https://cfarm.tetaneutral.net/>.
> > I can't reproduce this on a busy system like gcc220 on cfarm,
> > however.
> 
> Thanks for the report. There is indeed an issue in how the CPU time is
> accounted on an idle system. The below diff is a possible fix.
> 
> In roundrobin() force a resched and therefor mi_switch() when
> SPCF_SHOULDYIELD is set. On an idle CPU mi_switch() will just do all
> accounting bits but skip the expensive cpu_switchto() since the proc
> remains the same.

This is an elegant solution.  I'm a little reluctant to add
mi_switch() overhead to a thread if *nothing* else wants to run, but
the roundrobin() quantum is massive at the moment, so it's not all
that much.  This change also fixes a couple other things.

- Runaway user threads (like a synthetic spinloop) can no longer
  keep the system from suspending.  If such threads are forced into
  mi_switch() after two ticks, sched_chooseproc() will then see
  SPCF_SHOULDHALT, etc.

- Sibling theads in a process using ITIMER_VIRTUAL or ITIMER_PROF
  are now forced to enable the itimer_update() interrupt after
  two ticks.

- Sibling threads in a process running profil(2) are now forced
  to enable the profclock() interrupt after two ticks.

For sake of discussion, an alternative approach is below.  Basically,
tweak rucheck() to account for any un-accumulated ONPROC time.  It
seems to work.

I would lean toward your change, though.  Fixing those other things is
a nice bonus.

Index: kern_resource.c
===
RCS file: /cvs/src/sys/kern/kern_resource.c,v
retrieving revision 1.78
diff -u -p -r1.78 kern_resource.c
--- kern_resource.c 29 Aug 2023 16:19:34 -  1.78
+++ kern_resource.c 29 Aug 2023 22:21:06 -
@@ -531,21 +531,32 @@ ruadd(struct rusage *ru, struct rusage *
 void
 rucheck(void *arg)
 {
+   struct timespec elapsed, now, total;
struct rlimit rlim;
struct process *pr = arg;
time_t runtime;
+   struct proc *p;
int s;
 
KERNEL_ASSERT_LOCKED();
 
SCHED_LOCK(s);
-   runtime = pr->ps_tu.tu_runtime.tv_sec;
+   nanouptime();
+   total = pr->ps_tu.tu_runtime;
+   TAILQ_FOREACH(p, >ps_threads, p_thr_link) {
+   if (p->p_stat == SONPROC) {
+   timespecsub(, >p_cpu->ci_schedstate.spc_runtime,
+   );
+   timespecadd(, , );
+   }
+   }
SCHED_UNLOCK(s);
 
mtx_enter(>ps_mtx);
rlim = pr->ps_limit->pl_rlimit[RLIMIT_CPU];
mtx_leave(>ps_mtx);
 
+   runtime = total.tv_sec;
if ((rlim_t)runtime >= rlim.rlim_cur) {
if ((rlim_t)runtime >= rlim.rlim_max) {
prsignal(pr, SIGKILL);



Re: RLIMIT_CPU doesn't work reliably on mostly idle systems

2023-08-29 Thread Mark Kettenis
> Date: Tue, 29 Aug 2023 19:15:14 +0200
> From: Claudio Jeker 
> 
> On Tue, Aug 29, 2023 at 01:01:10AM +, Eric Wong wrote:
> > >Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems
> > >Category: system
> > >Environment:
> > System  : OpenBSD 7.3
> > Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 
> > 2023
> >  
> > dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.octeon
> > Machine : octeon
> > >Description:
> > 
> > RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an
> > otherwise idle system (aside from the test process using up CPU).
> > It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU
> > from rlim_cur=1 won't even fire).
> > 
> > I can reproduce this on a private amd64 VM and also on gcc231
> > on GCC compiler farm <https://cfarm.tetaneutral.net/>.
> > I can't reproduce this on a busy system like gcc220 on cfarm,
> > however.
> 
> Thanks for the report. There is indeed an issue in how the CPU time is
> accounted on an idle system. The below diff is a possible fix.
> 
> In roundrobin() force a resched and therefor mi_switch() when
> SPCF_SHOULDYIELD is set. On an idle CPU mi_switch() will just do all
> accounting bits but skip the expensive cpu_switchto() since the proc
> remains the same.

A bit of a hack, but probably better than trying to account for
spc_runtime at the point where we check the limit.

Also this will call smr_idle() sooner, which may help speed up smr?

ok kettenis@

> Index: kern/sched_bsd.c
> ===
> RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.84
> diff -u -p -r1.84 sched_bsd.c
> --- kern/sched_bsd.c  29 Aug 2023 16:19:34 -  1.84
> +++ kern/sched_bsd.c  29 Aug 2023 16:20:03 -
> @@ -106,7 +106,7 @@ roundrobin(struct clockintr *cl, void *c
>   }
>   }
>  
> - if (spc->spc_nrun)
> + if (spc->spc_nrun || spc->spc_schedflags & SPCF_SHOULDYIELD)
>   need_resched(ci);
>  }
>  
> 
> 



Re: RLIMIT_CPU doesn't work reliably on mostly idle systems

2023-08-29 Thread Claudio Jeker
On Tue, Aug 29, 2023 at 01:01:10AM +, Eric Wong wrote:
> >Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems
> >Category: system
> >Environment:
>   System  : OpenBSD 7.3
>   Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 
> 2023
>
> dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.octeon
>   Machine : octeon
> >Description:
> 
> RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an
> otherwise idle system (aside from the test process using up CPU).
> It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU
> from rlim_cur=1 won't even fire).
> 
> I can reproduce this on a private amd64 VM and also on gcc231
> on GCC compiler farm <https://cfarm.tetaneutral.net/>.
> I can't reproduce this on a busy system like gcc220 on cfarm,
> however.

Thanks for the report. There is indeed an issue in how the CPU time is
accounted on an idle system. The below diff is a possible fix.

In roundrobin() force a resched and therefor mi_switch() when
SPCF_SHOULDYIELD is set. On an idle CPU mi_switch() will just do all
accounting bits but skip the expensive cpu_switchto() since the proc
remains the same.

-- 
:wq Claudio

Index: kern/sched_bsd.c
===
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.84
diff -u -p -r1.84 sched_bsd.c
--- kern/sched_bsd.c29 Aug 2023 16:19:34 -  1.84
+++ kern/sched_bsd.c29 Aug 2023 16:20:03 -
@@ -106,7 +106,7 @@ roundrobin(struct clockintr *cl, void *c
}
}
 
-   if (spc->spc_nrun)
+   if (spc->spc_nrun || spc->spc_schedflags & SPCF_SHOULDYIELD)
need_resched(ci);
 }
 



Re: RLIMIT_CPU doesn't work reliably on mostly idle systems

2023-08-28 Thread Scott Cheloha
> On Aug 28, 2023, at 20:04, Eric Wong  wrote:
> 
> 
>> 
>> Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems
>> Category: system
>> Environment:
>System  : OpenBSD 7.3
>Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 2023
> 
> dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP
> 
>Architecture: OpenBSD.octeon
>Machine : octeon
>> Description:
> 
> RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an
> otherwise idle system (aside from the test process using up CPU).
> It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU
> from rlim_cur=1 won't even fire).
> 
> I can reproduce this on a private amd64 VM and also on gcc231
> on GCC compiler farm <https://cfarm.tetaneutral.net/>.
> I can't reproduce this on a busy system like gcc220 on cfarm,
> however.
> 
>> How-To-Repeat:
> 
> Following is a standalone C program which demonstrates the problem on
> a mostly idle system:
> /*
> * Most reliably reproduced with compiler optimizations disabled:
> *
> *   cc -o rlimit_cpu -ggdb3 -Wall rlimit_cpu.c
> *
> * Neither SIGXCPU (from rlim_cur) nor SIGKILL (from rlim_max)
> * with RLIMIT_CPU set seem to fire reliably with few syscalls being made.
> * On an otherwise idle system, it can take many more seconds (20-50s)
> * than expected when rlim_max=9 (SIGXCPU doesn't even happen).
> * Best case is 2 seconds for SIGXCPU when rlim_cur=1 on a busy system
> * which is understandable due to kernel accounting delays.
> *
> * I rely on RLIMIT_CPU to protect systems from pathological userspace
> * code (diff generation, regexps, etc)
> *
> * Testing on cfarm <https://cfarm.tetaneutral.net/> machines,
> * the issue is visible on a mostly-idle 4-core gcc231 mips64
> * but doesn't seem to happen on the busy 12-core gcc220 machine
> * (only 2 seconds for XCPU w/ rlim_cur=1)
> */
> #include 
> #include 
> #include 
> #include 
> 
> static void sigxcpu(int sig)
> {
>write(1, "SIGXCPU\n", 8);
>_exit(1);
> }
> 
> static volatile size_t nr; // volatile to disable optimizations
> int main(void)
> {
>struct rlimit rlim = { .rlim_cur = 1, .rlim_max = 9 };
>int rc;
> 
>signal(SIGXCPU, sigxcpu);
>rc = setrlimit(RLIMIT_CPU, );
>assert(rc == 0);
> 
>/*
> * adding some time, times, and write syscalls improve likelyhood of
> * of rlimit signals firing in a timely manner.  writes to /dev/null
> * seems less-likely to trigger than writes to the terminal or
> * regular file.
> */
>for (;; nr++) {
>}
> 
>return 0;
> }
> 
>> Fix:
>Making more syscalls can workaround the problem, but that's not
> an option when dealing with userspace-heavy code like pathological regexps.

The CPU time limit is checked from a
periodic timeout.

CPU time totals accumulate in mi_switch().

The problem is that on a mostly idle system,
a user thread that is hogging the CPU may
take a very long time to switch out, and the
and all that CPU time doesn't accumulate
until the switch, and so the signal will
arrive later than requested.

System calls have points where a thread can
switch out, which is why the delay is
exaggerated on synthetic workloads like the
busy-loop shown above.

One possible solution is to check usage
times for threads that are still ONPROC
during the rusage timeout.

Another approach is to be more aggressive
about forcing threads to switch out, even
when nothing else wants to run.

Coincidentally, we are discussing p_rtime on
tech@ right now, which is tangentially related
to this issue.



RLIMIT_CPU doesn't work reliably on mostly idle systems

2023-08-28 Thread Eric Wong
>Synopsis: RLIMIT_CPU doesn't work reliably on mostly idle systems
>Category: system
>Environment:
System  : OpenBSD 7.3
Details : OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 
2023
 
dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP

Architecture: OpenBSD.octeon
Machine : octeon
>Description:

RLIMIT_CPU doesn't work reliably when few/no syscalls are made on an
otherwise idle system (aside from the test process using up CPU).
It can take 20-50s to fire SIGKILL with rlim_max=9 (and the SIGXCPU
from rlim_cur=1 won't even fire).

I can reproduce this on a private amd64 VM and also on gcc231
on GCC compiler farm <https://cfarm.tetaneutral.net/>.
I can't reproduce this on a busy system like gcc220 on cfarm,
however.

>How-To-Repeat:

Following is a standalone C program which demonstrates the problem on
a mostly idle system:
/*
 * Most reliably reproduced with compiler optimizations disabled:
 *
 *   cc -o rlimit_cpu -ggdb3 -Wall rlimit_cpu.c
 *
 * Neither SIGXCPU (from rlim_cur) nor SIGKILL (from rlim_max)
 * with RLIMIT_CPU set seem to fire reliably with few syscalls being made.
 * On an otherwise idle system, it can take many more seconds (20-50s)
 * than expected when rlim_max=9 (SIGXCPU doesn't even happen).
 * Best case is 2 seconds for SIGXCPU when rlim_cur=1 on a busy system
 * which is understandable due to kernel accounting delays.
 *
 * I rely on RLIMIT_CPU to protect systems from pathological userspace
 * code (diff generation, regexps, etc)
 *
 * Testing on cfarm <https://cfarm.tetaneutral.net/> machines,
 * the issue is visible on a mostly-idle 4-core gcc231 mips64
 * but doesn't seem to happen on the busy 12-core gcc220 machine
 * (only 2 seconds for XCPU w/ rlim_cur=1)
 */
#include 
#include 
#include 
#include 

static void sigxcpu(int sig)
{
write(1, "SIGXCPU\n", 8);
_exit(1);
}

static volatile size_t nr; // volatile to disable optimizations
int main(void)
{
struct rlimit rlim = { .rlim_cur = 1, .rlim_max = 9 };
int rc;

signal(SIGXCPU, sigxcpu);
rc = setrlimit(RLIMIT_CPU, );
assert(rc == 0);

/*
 * adding some time, times, and write syscalls improve likelyhood of
 * of rlimit signals firing in a timely manner.  writes to /dev/null
 * seems less-likely to trigger than writes to the terminal or
 * regular file.
 */
for (;; nr++) {
}

return 0;
}

>Fix:
Making more syscalls can workaround the problem, but that's not
an option when dealing with userspace-heavy code like pathological regexps.


dmesg:
OpenBSD 7.3 (GENERIC.MP) #1242: Sat Mar 25 18:04:31 MDT 2023
dera...@octeon.openbsd.org:/usr/src/sys/arch/octeon/compile/GENERIC.MP
real mem = 1073741824 (1024MB)
avail mem = 1035304960 (987MB)
random: good seed from bootblocks
mainbus0 at root: board 20300 rev 0.15, model cavium,ubnt_e300
cpu0 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0
cpu0: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way
cpu1 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0
cpu1: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way
cpu2 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0
cpu2: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way
cpu3 at mainbus0: CN70xx/CN71xx CPU rev 0.2 1000 MHz, CN70xx/CN71xx FPU rev 0.0
cpu3: cache L1-I 78KB 39 way D 32KB 32 way, L2 1024KB 8 way
clock0 at mainbus0: int 5
octcrypto0 at mainbus0
iobus0 at mainbus0
simplebus0 at iobus0: "soc"
"bootbus" at simplebus0 not configured
octciu0 at simplebus0
octcib0 at simplebus0: max-bits 23
octcib1 at simplebus0: max-bits 12
octcib2 at simplebus0: max-bits 6
octcib3 at simplebus0: max-bits 15
octcib4 at simplebus0: max-bits 4
octcib5 at simplebus0: max-bits 11
octcib6 at simplebus0: max-bits 11
octgpio0 at simplebus0: 20 pins, xbit 16
octsmi0 at simplebus0
octsmi1 at simplebus0
octpip0 at simplebus0
octgmx0 at octpip0 interface 0
cnmac0 at octgmx0: port 0 SGMII, address e0:63:da:c0:62:27
ukphy0 at cnmac0 phy 4: Generic IEEE 802.3u media interface, rev. 2: OUI 
0x0001c1, model 0x000c
cnmac1 at octgmx0: port 1 SGMII, address e0:63:da:c0:62:28
ukphy1 at cnmac1 phy 5: Generic IEEE 802.3u media interface, rev. 2: OUI 
0x0001c1, model 0x000c
cnmac2 at octgmx0: port 2 SGMII, address e0:63:da:c0:62:29
ukphy2 at cnmac2 phy 6: Generic IEEE 802.3u media interface, rev. 2: OUI 
0x0001c1, model 0x000c
cnmac3 at octgmx0: port 3 SGMII, address e0:63:da:c0:62:2a
ukphy3 at cnmac3 phy 7: Generic IEEE 802.3u media interface, rev. 2: OUI 
0x0001c1, model 0x000c
octsctl0 at simplebus0: disabled
octxctl0 at simplebus0: DWC3 rev 0x250a
xhci0 at octxctl0, xHCI 1.0
usb0 at xhci0: USB revision 3.0
uhub0 at usb0 configuration 1 interface 0 "Generic xHCI root hub" rev 3.00/1.00 
addr