Re: Scheduler with a single runqueue

2016-12-11 Thread Martin Pieuchot
On 11/12/16(Sun) 12:54, Stuart Henderson wrote:
> On 2016/12/10 17:56, Bryan Vyhmeister wrote:
> > On Sat, Dec 10, 2016 at 10:47:31PM +, Stuart Henderson wrote:
> > > In case anyone is interested, here's a version of this diff against
> > > -current. It helps a lot for me. I'm not watching HD video while doing
> > > "make -j4", just things like trying to move the pointer around the screen
> > > and type into a terminal while a map is loading in a browser.
> > 
> > Thank you for taking the time to update the diff. I should probably try
> > it. I have noticed that if I am doing a lot of NFS I/O or rsync
> > transfers to an NFS share that the pointer gets jumpy.
> 
> In my case more like "machine freezes every few minutes for 30-40 seconds"
> than just jumpy. Loads of IPIs, cpu0 high cpu in interrupt, other cores
> 100% cpu in sys. It seems more apparent with chromium, but it's not a
> whole lot better with firefox either.

The problem is in librthread.  Any multi-threaded program might benefit
from this diff.  However, that's just a bandage.  The real fix is to stop
using sched_yield(2) when there's some contention in userland.
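(For readers not familiar with the pattern being criticised: a hedged
sketch of a spin-then-yield lock follows.  This is NOT librthread's actual
code; the names yield_lock and YIELD_SPINS are invented.  Under contention
every waiter ends up burning its timeslice in sched_yield(2), which is
exactly the scheduler load described in this thread.)

```c
/*
 * Illustrative spin-then-yield lock.  Invented names; not librthread.
 * Waiters spin briefly, then call sched_yield() in a loop -- so under
 * contention the kernel keeps rescheduling threads that cannot make
 * progress, instead of letting them sleep on a wait channel.
 */
#include <sched.h>
#include <stdatomic.h>

#define YIELD_SPINS 128

typedef struct {
	atomic_flag f;
} yield_lock;

static void
yield_lock_init(yield_lock *l)
{
	atomic_flag_clear(&l->f);
}

static void
yield_lock_acquire(yield_lock *l)
{
	int spins = 0;

	while (atomic_flag_test_and_set_explicit(&l->f,
	    memory_order_acquire)) {
		if (++spins >= YIELD_SPINS) {
			/*
			 * This is the call the "real fix" would remove:
			 * sleeping in the kernel instead of yielding.
			 */
			sched_yield();
			spins = 0;
		}
	}
}

static void
yield_lock_release(yield_lock *l)
{
	atomic_flag_clear_explicit(&l->f, memory_order_release);
}
```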



Re: Scheduler with a single runqueue

2016-12-11 Thread Stuart Henderson
On 2016/12/10 17:56, Bryan Vyhmeister wrote:
> On Sat, Dec 10, 2016 at 10:47:31PM +, Stuart Henderson wrote:
> > In case anyone is interested, here's a version of this diff against
> > -current. It helps a lot for me. I'm not watching HD video while doing
> > "make -j4", just things like trying to move the pointer around the screen
> > and type into a terminal while a map is loading in a browser.
> 
> Thank you for taking the time to update the diff. I should probably try
> it. I have noticed that if I am doing a lot of NFS I/O or rsync
> transfers to an NFS share that the pointer gets jumpy.

In my case more like "machine freezes every few minutes for 30-40 seconds"
than just jumpy. Loads of IPIs, cpu0 high cpu in interrupt, other cores
100% cpu in sys. It seems more apparent with chromium, but it's not a
whole lot better with firefox either.

> The machine I am
> seeing this on is a Supermicro X10SAE with Xeon E3 1275 v3, 32GB of
> > memory, and a Samsung 950 Pro NVMe SSD. I typically have Firefox
> > and Iridium also running with a number of tabs. It is using Intel
> graphics and running a 4K display which might be part of it. It seems to
> be better for me with efifb(4) and wsfb(4) on my ThinkPad X1 Carbon 4th
> Gen (which has the 2560x1440 display). I also tried the modesetting
> > driver on my desktop which seemed to be ever so slightly improved over
> the intel(4) driver but that is hard to quantify exactly.

Ah yes, I should have included a dmesg from this machine. I haven't tried
the modesetting driver, maybe I will sometime.

OpenBSD 6.0-current (GENERIC.MP) #0: Sat Dec 10 21:55:34 GMT 2016
st...@symphytum.spacehopper.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 8477306880 (8084MB)
avail mem = 8215822336 (7835MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xec400 (90 entries)
bios0: vendor Dell Inc. version "A04" date 07/20/2014
bios0: Dell Inc. PowerEdge T20
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP APIC FPDT SLIC LPIT SSDT SSDT SSDT HPET SSDT MCFG SSDT 
ASF! DMAR
acpi0: wakeup devices UAR1(S3) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) 
RP03(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) GLAN(S4) EHC1(S3) EHC2(S3) 
XHC_(S4) HDEF(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.62 MHz
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.15 MHz
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.15 MHz
cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 6 (application processor)
cpu3: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.15 MHz
cpu3: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 0, core 3, package 0
ioapic0 at mainbus0: apid 8 pa 0xfec0, version 20, 24 pins
acpihpet0 at acpi0: 14318179 Hz
acpimcfg0 at acpi0 addr 0xf800, bus 0-63
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus 1 (RP01)
acpiprt2 at acpi0: bus 2 (RP02)
acpiprt3 at 

Re: Scheduler with a single runqueue

2016-12-10 Thread Bryan Vyhmeister
On Sat, Dec 10, 2016 at 10:47:31PM +, Stuart Henderson wrote:
> In case anyone is interested, here's a version of this diff against
> -current. It helps a lot for me. I'm not watching HD video while doing
> "make -j4", just things like trying to move the pointer around the screen
> and type into a terminal while a map is loading in a browser.

Thank you for taking the time to update the diff. I should probably try
it. I have noticed that if I am doing a lot of NFS I/O or rsync
transfers to an NFS share that the pointer gets jumpy. The machine I am
seeing this on is a Supermicro X10SAE with Xeon E3 1275 v3, 32GB of
memory, and a Samsung 950 Pro NVMe SSD. I typically have Firefox
and Iridium also running with a number of tabs. It is using Intel
graphics and running a 4K display which might be part of it. It seems to
be better for me with efifb(4) and wsfb(4) on my ThinkPad X1 Carbon 4th
Gen (which has the 2560x1440 display). I also tried the modesetting
driver on my desktop which seemed to be ever so slightly improved over
the intel(4) driver but that is hard to quantify exactly.

Bryan



Re: Scheduler with a single runqueue

2016-12-10 Thread Stuart Henderson
On 2016/07/06 21:14, Martin Pieuchot wrote:
> Please, don't try this diff blindly it won't make your machine faster.
> 
> In the past months I've been looking more closely at our scheduler.
> At p2k16 I've shown to a handful of developers that when running a
> browser on my x220 with HT enabled, a typical desktop usage, the per-
> CPU runqueues were never balanced.  You often have no job on a CPU
> and multiple on the others.
> 
> Currently when a CPU doesn't have any job on its runqueue it tries
> to "steal" a job from another CPU's runqueue.  If I look at the stats
> on my machine running a lot of threaded apps (GNOME3, Thunderbird,
> Firefox, Chrome), here's what I get:
> 
> # pstat -d ld sched_stolen sched_choose sched_wasidle
>   sched_stolen: 1665846
>   sched_choose: 3195615
>   sched_wasidle: 1309253
> 
> For 3.2M jobs dispatched, 1.6M got stolen.  That's 50% of the jobs on
> my machine and this ratio is stable for my usage.
> 
> On my test machine, an Atom with HT, I got the following numbers:
> 
> - after boot:
>   sched_stolen: 570
>   sched_choose: 10450
>   sched_wasidle: 8936
> 
> - after playing a video on youtube w/ firefox:
>   sched_stolen: 2153754
>   sched_choose: 10261682
>   sched_wasidle: 1525801
> 
> - after playing a video on youtube w/ chromium (after reboot):
>   sched_stolen: 31
>   sched_choose: 6470258
>   sched_wasidle: 934772
> 
> What's interesting here is that threaded apps (like firefox) seem to
> trigger more "stealing".  It would be interesting to see if/how this
> is related to the yield-busy-wait triggered by librthread's thrsleep()
> usage explained some months ago.
> 
> What's also interesting is that the number of stolen jobs seems to
> be higher when the number of CPUs is higher. Elementary, my dear Watson?
> I observed that for the same workload, playing a HD video in firefox
> while compiling a kernel with make -j4, I see 50% of jobs stolen
> with 4 CPUs and 20% with 2 CPUs.  Sadly I don't have a bigger machine
> to test.  How bad can it be?
> 
> So I looked at how this situation could be improved.  My goal was to
> be able to compile a kernel while watching a video in my browser without
> having my audio stutter.  I started by removing the "stealing" logic but
> the situation didn't improve.  Then I tried to play with the calculation
> of the cost and failed.  Then I decided to completely remove the per-CPU
> runqueues and came up with the diff below...
> 
> There are too many things that I still don't understand, so I'm not asking
> for an ok, but I'd appreciate it if people could test this diff and report back.
> My goal is currently to get a better understanding of our scheduler to
> hopefully improve it.
> 
> By using a single runqueue I prioritise latency over throughput.  That
> means your performance might degrade, but at least I can watch my HD
> video while doing a "make -j4".
> 
> As a bonus, the diff below also greatly reduces the number of IPIs on my
> systems.

In case anyone is interested, here's a version of this diff against
-current. It helps a lot for me. I'm not watching HD video while doing
"make -j4", just things like trying to move the pointer around the screen
and type into a terminal while a map is loading in a browser.
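(A toy model of the single-runqueue idea, for readers skimming the diff:
every CPU dequeues from one shared, priority-ordered list, so a newly
woken job is picked up by whichever CPU idles first, with no stealing.
This sketch is mine, not code from the diff; all names are invented and
locking is omitted.)

```c
/*
 * Toy single runqueue.  Invented names; no locking; the real kernel
 * uses 32 bucketed TAILQs rather than one sorted list.
 */
#include <stddef.h>

struct toy_proc {
	int p_priority;			/* lower value runs first */
	struct toy_proc *p_next;
};

static struct toy_proc *global_runq;	/* the single shared queue */

/* Insert in priority order. */
static void
toy_setrunqueue(struct toy_proc *p)
{
	struct toy_proc **q = &global_runq;

	while (*q != NULL && (*q)->p_priority <= p->p_priority)
		q = &(*q)->p_next;
	p->p_next = *q;
	*q = p;
}

/* Any CPU calls this to pick the best runnable job, or NULL if idle. */
static struct toy_proc *
toy_chooseproc(void)
{
	struct toy_proc *p = global_runq;

	if (p != NULL)
		global_runq = p->p_next;
	return p;
}
```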


Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.41
diff -u -p -r1.41 sched.h
--- sys/sched.h 17 Mar 2016 13:18:47 -  1.41
+++ sys/sched.h 10 Dec 2016 22:24:15 -
@@ -89,9 +89,10 @@
 
 #define	SCHED_NQS	32	/* 32 run queues. */
 
+#ifdef _KERNEL
+
 /*
  * Per-CPU scheduler state.
- * XXX - expose to userland for now.
  */
 struct schedstate_percpu {
struct timespec spc_runtime;/* time curproc started running */
@@ -102,23 +103,16 @@ struct schedstate_percpu {
int spc_rrticks;/* ticks until roundrobin() */
int spc_pscnt;  /* prof/stat counter */
int spc_psdiv;  /* prof/stat divisor */ 
+   unsigned int spc_npeg;  /* nb. of pegged threads on runqueue */
struct proc *spc_idleproc;  /* idle proc for this cpu */
 
-   u_int spc_nrun; /* procs on the run queues */
fixpt_t spc_ldavg;  /* shortest load avg. for this cpu */
 
-   TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
-   volatile uint32_t spc_whichqs;
-
-#ifdef notyet
-   struct proc *spc_reaper;/* dead proc reaper */
-#endif
LIST_HEAD(,proc) spc_deadproc;
 
volatile int spc_barrier;   /* for sched_barrier() */
 };
 
-#ifdef _KERNEL
 
 /* spc_flags */
 #define SPCF_SEENRR 0x0001  /* process has seen roundrobin() */
@@ -141,14 +135,13 @@ void roundrobin(struct cpu_info *);
 void scheduler_start(void);
 void userret(struct proc *p);
 
+void sched_init(void);
 void sched_init_cpu(struct cpu_info 
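(For context on the SCHED_NQS / spc_whichqs machinery the diff touches:
the classic trick is a 32-queue array with an occupancy bitmask, where
ffs() finds the lowest-numbered, i.e. highest-priority, non-empty queue
in one step.  The sketch below is illustrative only -- invented helper
names, and plain counters standing in for the per-queue TAILQs.)

```c
/*
 * Sketch of the bitmask-of-runqueues lookup.  Bit i of whichqs is set
 * iff queue i is non-empty; ffs() returns the lowest set bit (1-based).
 * Priorities map to queues by prio >> 2 (128 priorities / 32 queues).
 */
#include <strings.h>	/* ffs() */
#include <stdint.h>

#define NQS 32

static uint32_t whichqs;	/* occupancy bitmask */
static int qlen[NQS];		/* stand-in for the per-queue TAILQs */

static void
enqueue(int prio)
{
	int q = prio >> 2;

	qlen[q]++;
	whichqs |= 1U << q;
}

/* Return the highest-priority occupied queue, or -1 if all empty. */
static int
dequeue(void)
{
	int q;

	if (whichqs == 0)
		return -1;
	q = ffs(whichqs) - 1;
	if (--qlen[q] == 0)
		whichqs &= ~(1U << q);
	return q;
}
```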

Re: Scheduler with a single runqueue

2016-07-08 Thread Matej Nanut
On 6 July 2016 at 21:14, Martin Pieuchot  wrote:
> By using a single runqueue I prioritise latency over throughput.  That
> means your performance might degrade, but at least I can watch my HD
> video while doing a "make -j4".

I've been running your patch since you've posted it and haven't had
any problems so far. I do get less audio stutter, which used to happen
quite often when closing a tab in Chromium. Now it's hard to
reproduce.

My computer is an Asus laptop with the i7-2670QM and 8 GB RAM.



Re: Scheduler with a single runqueue

2016-07-08 Thread Ray Lai
On Wed, 6 Jul 2016 21:14:05 +0200
Martin Pieuchot  wrote:
> By using a single runqueue I prioritise latency over throughput.  That
> means your performance might degrade, but at least I can watch my HD
> video while doing a "make -j4".

When I run borgbackup, audio (and my mouse) still stutters. Is disk IO 
something that this diff should help with? Anything I can do to help diagnose 
this?

X200 with 4G ram.