Re: Scheduler with a single runqueue
On 11/12/16(Sun) 12:54, Stuart Henderson wrote:
> On 2016/12/10 17:56, Bryan Vyhmeister wrote:
> > On Sat, Dec 10, 2016 at 10:47:31PM +, Stuart Henderson wrote:
> > > In case anyone is interested, here's a version of this diff against
> > > -current. It helps a lot for me. I'm not watching HD video while doing
> > > "make -j4", just things like trying to move the pointer around the screen
> > > and type into a terminal while a map is loading in a browser.
> >
> > Thank you for taking the time to update the diff. I should probably try
> > it. I have noticed that if I am doing a lot of NFS I/O or rsync
> > transfers to an NFS share that the pointer gets jumpy.
>
> In my case more like "machine freezes every few minutes for 30-40 seconds"
> than just jumpy. Loads of IPIs, cpu0 high cpu in interrupt, other cores
> 100% cpu in sys. It seems more apparent with chromium, but it's not a
> whole lot better with firefox either.

The problem is in librthread. Any multi-threaded program might benefit
from this diff. However that's just a bandage. The real fix is to stop
using sched_yield(2) when there's some contention in userland.
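To make the sched_yield(2) point concrete, here is a minimal sketch in C that contrasts a lock whose waiters spin and yield with one whose waiters ask the threading library to block them. All names (yield_lock, sleep_lock, demo_run) are hypothetical and invented for illustration; this is not librthread's code, and whether a condition-variable wait really sleeps in the kernel depends on how the threading library implements it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical names throughout; none of this is taken from librthread. */

/* Variant A: on contention, spin and call sched_yield(2). */
struct yield_lock {
	atomic_flag held;
};

static void
yield_lock_acquire(struct yield_lock *l)
{
	/* The waiter never blocks: it stays runnable and is rescheduled
	 * again and again until the flag is finally clear. */
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		sched_yield();
}

static void
yield_lock_release(struct yield_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Variant B: on contention, wait on a condition variable until signalled. */
struct sleep_lock {
	pthread_mutex_t mtx;
	pthread_cond_t cv;
	int held;
};

static void
sleep_lock_init(struct sleep_lock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->cv, NULL);
	l->held = 0;
}

static void
sleep_lock_acquire(struct sleep_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	while (l->held)
		pthread_cond_wait(&l->cv, &l->mtx);
	l->held = 1;
	pthread_mutex_unlock(&l->mtx);
}

static void
sleep_lock_release(struct sleep_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	l->held = 0;
	pthread_cond_signal(&l->cv);	/* wake one waiter, if any */
	pthread_mutex_unlock(&l->mtx);
}

/* Small harness: threads increment a shared counter under one of the locks. */
#define DEMO_MAX_THREADS 16

struct demo {
	struct yield_lock yl;
	struct sleep_lock sl;
	int use_sleep;
	long iterations;
	long counter;
};

static void *
demo_worker(void *arg)
{
	struct demo *d = arg;

	for (long i = 0; i < d->iterations; i++) {
		if (d->use_sleep)
			sleep_lock_acquire(&d->sl);
		else
			yield_lock_acquire(&d->yl);
		d->counter++;
		if (d->use_sleep)
			sleep_lock_release(&d->sl);
		else
			yield_lock_release(&d->yl);
	}
	return NULL;
}

/* Returns the final counter; nthreads * iterations if the lock is correct. */
static long
demo_run(int use_sleep, int nthreads, long iterations)
{
	pthread_t tids[DEMO_MAX_THREADS];
	struct demo d = { .yl = { ATOMIC_FLAG_INIT }, .use_sleep = use_sleep,
	    .iterations = iterations, .counter = 0 };
	int started = 0;

	if (nthreads > DEMO_MAX_THREADS)
		nthreads = DEMO_MAX_THREADS;
	sleep_lock_init(&d.sl);
	while (started < nthreads &&
	    pthread_create(&tids[started], NULL, demo_worker, &d) == 0)
		started++;
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return d.counter;
}
```

Calling demo_run(0, 4, 100000) and demo_run(1, 4, 100000) from a main() should give the same counter value either way. The difference is in what the waiters do meanwhile: with the yielding variant every contended thread stays on a runqueue and keeps entering the scheduler, which is the kind of load described in this thread.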
Re: Scheduler with a single runqueue
On 2016/12/10 17:56, Bryan Vyhmeister wrote:
> On Sat, Dec 10, 2016 at 10:47:31PM +, Stuart Henderson wrote:
> > In case anyone is interested, here's a version of this diff against
> > -current. It helps a lot for me. I'm not watching HD video while doing
> > "make -j4", just things like trying to move the pointer around the screen
> > and type into a terminal while a map is loading in a browser.
>
> Thank you for taking the time to update the diff. I should probably try
> it. I have noticed that if I am doing a lot of NFS I/O or rsync
> transfers to an NFS share that the pointer gets jumpy.

In my case more like "machine freezes every few minutes for 30-40 seconds"
than just jumpy. Loads of IPIs, cpu0 high cpu in interrupt, other cores
100% cpu in sys. It seems more apparent with chromium, but it's not a
whole lot better with firefox either.

> The machine I am
> seeing this on is a Supermicro X10SAE with Xeon E3 1275 v3, 32GB of
> memory, and a Samsung 950 Pro NVMe SSD. I typically also have Firefox
> and Iridium running with a number of tabs. It is using Intel
> graphics and running a 4K display which might be part of it. It seems to
> be better for me with efifb(4) and wsfb(4) on my ThinkPad X1 Carbon 4th
> Gen (which has the 2560x1440 display). I also tried the modesetting
> driver on my desktop which seemed to be ever so slightly improved over
> the intel(4) driver but that is hard to quantify exactly.

Ah yes, I should have included a dmesg from this machine. I haven't tried
the modesetting driver, maybe I will sometime..

OpenBSD 6.0-current (GENERIC.MP) #0: Sat Dec 10 21:55:34 GMT 2016
    st...@symphytum.spacehopper.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 8477306880 (8084MB)
avail mem = 8215822336 (7835MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xec400 (90 entries)
bios0: vendor Dell Inc. version "A04" date 07/20/2014
bios0: Dell Inc. PowerEdge T20
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP APIC FPDT SLIC LPIT SSDT SSDT SSDT HPET SSDT MCFG SSDT ASF! DMAR
acpi0: wakeup devices UAR1(S3) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) GLAN(S4) EHC1(S3) EHC2(S3) XHC_(S4) HDEF(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.62 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.15 MHz
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.15 MHz
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 6 (application processor)
cpu3: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz, 3392.15 MHz
cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,SENSOR,ARAT
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 0, core 3, package 0
ioapic0 at mainbus0: apid 8 pa 0xfec0, version 20, 24 pins
acpihpet0 at acpi0: 14318179 Hz
acpimcfg0 at acpi0 addr 0xf800, bus 0-63
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus 1 (RP01)
acpiprt2 at acpi0: bus 2 (RP02)
acpiprt3 at
Re: Scheduler with a single runqueue
On Sat, Dec 10, 2016 at 10:47:31PM +, Stuart Henderson wrote:
> In case anyone is interested, here's a version of this diff against
> -current. It helps a lot for me. I'm not watching HD video while doing
> "make -j4", just things like trying to move the pointer around the screen
> and type into a terminal while a map is loading in a browser.

Thank you for taking the time to update the diff. I should probably try
it. I have noticed that if I am doing a lot of NFS I/O or rsync
transfers to an NFS share that the pointer gets jumpy. The machine I am
seeing this on is a Supermicro X10SAE with Xeon E3 1275 v3, 32GB of
memory, and a Samsung 950 Pro NVMe SSD. I typically also have Firefox
and Iridium running with a number of tabs. It is using Intel
graphics and running a 4K display which might be part of it. It seems to
be better for me with efifb(4) and wsfb(4) on my ThinkPad X1 Carbon 4th
Gen (which has the 2560x1440 display). I also tried the modesetting
driver on my desktop which seemed to be ever so slightly improved over
the intel(4) driver but that is hard to quantify exactly.

Bryan
Re: Scheduler with a single runqueue
On 2016/07/06 21:14, Martin Pieuchot wrote:
> Please, don't try this diff blindly; it won't make your machine faster.
>
> In the past months I've been looking more closely at our scheduler.
> At p2k16 I've shown to a handful of developers that when running a
> browser on my x220 with HT enabled, a typical desktop usage, the
> per-CPU runqueues were never balanced. You often have no job on a CPU
> and multiple on the others.
>
> Currently when a CPU doesn't have any job on its runqueue it tries
> to "steal" a job from another CPU's runqueue. If I look at the stats
> on my machine running a lot of threaded apps (GNOME3, Thunderbird,
> Firefox, Chrome), here's what I get:
>
> # pstat -d ld sched_stolen sched_choose sched_wasidle
> sched_stolen: 1665846
> sched_choose: 3195615
> sched_wasidle: 1309253
>
> For 3.2M jobs dispatched, 1.67M got stolen. That's about 50% of the
> jobs on my machine and this ratio is stable for my usage.
>
> On my test machine, an Atom with HT, I got the following numbers:
>
> - after boot:
> sched_stolen: 570
> sched_choose: 10450
> sched_wasidle: 8936
>
> - after playing a video on youtube w/ firefox:
> sched_stolen: 2153754
> sched_choose: 10261682
> sched_wasidle: 1525801
>
> - after playing a video on youtube w/ chromium (after reboot):
> sched_stolen: 31
> sched_choose: 6470258
> sched_wasidle: 934772
>
> What's interesting here is that threaded apps (like firefox) seem to
> trigger more "stealing". It would be interesting to see if/how this
> is related to the yield-busy-wait triggered by librthread's thrsleep()
> usage explained some months ago.
>
> What's also interesting is that the number of stolen jobs seems to
> be higher if your number of CPUs is higher. Elementary, My Dear Watson?
> I observed that for the same workload, playing a HD video in firefox
> while compiling a kernel with make -j4, I have 50% stolen jobs
> with 4 CPUs and 20% with 2 CPUs. Sadly I don't have a bigger machine
> to test. How bad can it be?
>
> So I looked at how this situation could be improved. My goal was to
> be able to compile a kernel while watching a video in my browser without
> having my audio stutter. I started by removing the "stealing" logic but
> the situation didn't improve. Then I tried to play with the calculation
> of the cost and failed. Then I decided to remove completely the per-CPU
> runqueues and came up with the diff below...
>
> There's too many things that I still don't understand so I'm not asking
> for ok, but I'd appreciate if people could test this diff and report back.
> My goal is currently to get a better understanding of our scheduler to
> hopefully improve it.
>
> By using a single runqueue I prioritise latency over throughput. That
> means your performance might degrade, but at least I can watch my HD
> video while doing a "make -j4".
>
> As a bonus, the diff below also greatly reduces the number of IPIs on my
> systems.

In case anyone is interested, here's a version of this diff against
-current. It helps a lot for me. I'm not watching HD video while doing
"make -j4", just things like trying to move the pointer around the screen
and type into a terminal while a map is loading in a browser.

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.41
diff -u -p -r1.41 sched.h
--- sys/sched.h 17 Mar 2016 13:18:47 - 1.41
+++ sys/sched.h 10 Dec 2016 22:24:15 -
@@ -89,9 +89,10 @@
 #define SCHED_NQS 32 /* 32 run queues. */
+#ifdef _KERNEL
+
 /*
  * Per-CPU scheduler state.
- * XXX - expose to userland for now.
  */
 struct schedstate_percpu {
 	struct timespec spc_runtime; /* time curproc started running */
@@ -102,23 +103,16 @@ struct schedstate_percpu {
 	int spc_rrticks; /* ticks until roundrobin() */
 	int spc_pscnt; /* prof/stat counter */
 	int spc_psdiv; /* prof/stat divisor */
+	unsigned int spc_npeg; /* nb. of pegged threads on runqueue */
 	struct proc *spc_idleproc; /* idle proc for this cpu */
-	u_int spc_nrun; /* procs on the run queues */
 	fixpt_t spc_ldavg; /* shortest load avg. for this cpu */
-	TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
-	volatile uint32_t spc_whichqs;
-
-#ifdef notyet
-	struct proc *spc_reaper; /* dead proc reaper */
-#endif
 	LIST_HEAD(,proc) spc_deadproc;
 	volatile int spc_barrier; /* for sched_barrier() */
 };
-#ifdef _KERNEL
 /* spc_flags */
 #define SPCF_SEENRR 0x0001 /* process has seen roundrobin() */
@@ -141,14 +135,13 @@ void roundrobin(struct cpu_info *);
 void scheduler_start(void);
 void userret(struct proc *p);
+void sched_init(void);
 void sched_init_cpu(struct cpu_info
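
The steal ratios discussed in Martin's message can be recomputed from the counters he quotes. The sketch below assumes the ratio of interest is sched_stolen divided by sched_choose; the struct, the labels and the function names are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

/* One set of scheduler counters as printed by pstat in the message above. */
struct sched_sample {
	const char *label;	/* hypothetical description of the workload */
	unsigned long stolen;	/* sched_stolen */
	unsigned long chosen;	/* sched_choose */
};

/* Fraction of dispatched jobs that were stolen from another CPU's
 * runqueue (assumed here to be sched_stolen / sched_choose). */
static double
steal_ratio(unsigned long stolen, unsigned long chosen)
{
	if (chosen == 0)
		return 0.0;
	return (double)stolen / (double)chosen;
}

/* Counter values copied from the quoted message. */
static const struct sched_sample samples[] = {
	{ "desktop, threaded apps",    1665846UL,  3195615UL },
	{ "Atom, after boot",              570UL,    10450UL },
	{ "Atom, youtube w/ firefox",  2153754UL, 10261682UL },
	{ "Atom, youtube w/ chromium",      31UL,  6470258UL },
};

/* Print one line per sample with the steal ratio as a percentage. */
static void
print_steal_ratios(void)
{
	size_t n = sizeof(samples) / sizeof(samples[0]);

	for (size_t i = 0; i < n; i++)
		printf("%-28s %6.2f%%\n", samples[i].label,
		    100.0 * steal_ratio(samples[i].stolen, samples[i].chosen));
}
```

Calling print_steal_ratios() from a main() shows roughly 52% for the desktop sample and roughly 21% for the Atom under firefox, while the chromium run stays close to zero.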
Re: Scheduler with a single runqueue
On 6 July 2016 at 21:14, Martin Pieuchot wrote:
> By using a single runqueue I prioritise latency over throughput. That
> means your performance might degrade, but at least I can watch my HD
> video while doing a "make -j4".

I've been running your patch since you posted it and haven't had any
problems so far. I do get less audio stutter, which used to happen quite
often when closing a tab in Chromium. Now it's hard to reproduce.

My computer is an Asus laptop with the i7-2670QM and 8 GB RAM.
Re: Scheduler with a single runqueue
On Wed, 6 Jul 2016 21:14:05 +0200 Martin Pieuchot wrote:
> By using a single runqueue I prioritise latency over throughput. That
> means your performance might degrade, but at least I can watch my HD
> video while doing a "make -j4".

When I run borgbackup, audio (and my mouse) still stutters. Is disk IO
something that this diff should help with? Anything I can do to help
diagnose this?

X200 with 4G RAM.