Re: Unusual threading behavior on single processes
Thank you, your information was very helpful. I compiled and ran malloc_duel
and it's working as intended. I wasn't aware of the -H flag for top, and I can
see programs are threading as you say, though the bottleneck behind my poor
performance is still a mystery.

I took some screen captures so you can see what I'm seeing:

Xonotic:
https://0x0.st/iBCD.png

OpenMW:
https://0x0.st/iBC5.png

Terraria: (fnaify if curious, thanks thfr :>)
https://0x0.st/iMrr.png

In the case of OpenMW, the bottleneck actually seems pretty obvious from what
top -H reports. I don't really know what to say about the other examples.

I would break out a profiling tool at this stage, but the results of testing
with top -H have left me with no idea where the bottleneck is (except OpenMW,
where it might actually be CPU); digging through systat hasn't really given me
any revelations either. :/

If anyone has a hunch about where I should check, or if you need me to test
different software, I'd be more than happy to.

Regards,
Stefmorino

On Sat, Mar 28, 2020, at 09:00:21AM +, Otto Moerbeek wrote:
> On Fri, Mar 27, 2020 at 09:03:40PM +, Stefmorino wrote:
>> I have a question about a performance quirk on OpenBSD, but I'm not really
>> sure how to address it, or what the root cause even is; that being how
>> multithreaded applications (libpthread?) behave (notably, games).
>>
>> I have tested many applications, the behavior is the same in all of them, but
>> I'll talk about OpenMW (an open-source game engine for Morrowind) since I
>> have the most useful information about how this program is threaded. By
>> default, OpenMW uses 4 threads (cited here:
>> https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
>> one for main/generic processing, one for graphics, one for audio, and one for
>> preloading terrain. You can see this if you look at the thread usage under
>> top while running the game; however, this is exactly where my question comes
>> into play. Instead of each thread processing the game independently with its
>> own limits, each thread is "capped" to the total limit of one thread (i.e.
>> instead of openmw's process using 100% of 4 threads, or 400% cpu in top, the
>> process uses 25% across 4 threads, or 100% cpu in top). I tested this using
>> GENERIC instead of GENERIC.MP as well, and get identical performance on the
>> one thread; it's almost like pthreads is acting as a placeholder of sorts and
>> not actually improving performance where it should.
>>
>> Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
>> implemented?
>
> Hard to tell, no idea what that game engine does. But this is not a
> general problem, e.g. the malloc_duel regress test
> (/usr/src/regress/lib/libpthread/malloc_duel). I see > 100% as well
> with other multi-threaded programs.
>
> 32013 otto 600 6020K 1552K onproc/3 - 1:07 228.81% malloc_due
>
> Wild guess: it could be that your program actually does not do real
> threading, but userland threading. Check with top -H if it really
> creates threads. You should see multiple threads having the same PID.
> Or all threads are using a resource that cannot be shared.
>
> -Otto
>>
>> I'd be happy to do any additional testing, I have a fresh -current source
>> tree ready
>>
>> dmesg
>> OpenBSD 6.6-current (GENERIC.MP) #75: Tue Mar 24 12:56:37 MDT 2020
>>     dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>> real mem = 16603250688 (15834MB)
>> avail mem = 16087437312 (15342MB)
>> mpath0 at root
>> scsibus0 at mpath0: 256 targets
>> mainbus0 at root
>> bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x986ec000 (62 entries)
>> bios0: vendor LENOVO version "R0UET76W (1.56 )" date 11/05/2019
>> bios0: LENOVO 20KVCTO1WW
>> acpi0 at bios0: ACPI 5.0
>> acpi0: sleep states S0 S3 S4 S5
>> acpi0: tables DSDT FACP SSDT SSDT CRAT CDIT UEFI MSDM BATB HPET APIC MCFG SBST WSMT IVRS FPDT SSDT SSDT SSDT UEFI SSDT
>> acpi0: wakeup devices GPP0(S3) GPP1(S3) GPP2(S3) GPP3(S3) GPP4(S3) GPP5(S3) GPP6(S3) GP17(S3) XHC0(S3) XHC1(S3) GP18(S3) LID_(S3) SLPB(S3)
>> acpitimer0 at acpi0: 3579545 Hz, 32 bits
>> acpihpet0 at acpi0: 14318180 Hz
>> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
>> cpu0 at mainbus0: apid 0 (boot processor)
>> cpu0: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.61 MHz, 17-11-00
>> cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
>> cpu0: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-w
RE: Unusual threading behavior on single processes
Haai,

Just to make a more-or-less general point (or two)...

"Otto Moerbeek" wrote:
> On Fri, Mar 27, 2020 at 09:03:40PM +, Stefmorino wrote:
>
>> I have tested many applications, the behavior is the same in all of them, but
>> I'll talk about OpenMW (an open-source game engine for Morrowind) since I
>> have the most useful information about how this program is threaded. By
>> default, OpenMW uses 4 threads (cited here:
>> https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
>> one for main/generic processing, one for graphics, one for audio, and one for
>> preloading terrain.
>>[snip]
>>
>> Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
>> implemented?
>[snip]
>
> Wild guess: it could be that your program actually does not do real
> threading, but userland threading.

"Fibering", in other words.

> Check with top -H if it really
> creates threads. You should see multiple threads having the same PID.
> Or all threads are using a resource that cannot be shared.

Likely the latter.

It's always funny, isn't it... A coder thinks "hey, I want multi-threading
'cause it's 1337, I'll just neatly run these subsystems within separate
threads and I'm done!". The fact that such is frequently a naive
proposition should be clear to the more clueful reader.

Games tend to be heavy on global state, and are more likely to benefit
from a multi-process model w/ carefully thought-out boundaries, than
from a shared-everything thread model.

While that need not be the case here, mestrongly suspects it is.

Take heed, and measure. Always measure.

Take care,

--zeurkous.

> -Otto

--
Friggin' Machines!
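
For a rough feel of the "resource that cannot be shared" case mentioned above, here is a minimal back-of-the-envelope sketch in Python. It is not taken from OpenMW or from OpenBSD; the function names and the locked_fraction parameter are made up for illustration. It assumes that threads waiting for the shared resource sleep rather than spin (consistent with the reported spin of 0), and it only estimates an upper bound on the process-wide CPU% that top could show.

```python
def cpu_ceiling_percent(threads: int, cpus: int, locked_fraction: float) -> float:
    """Rough upper bound on the total CPU% of one multi-threaded process.

    locked_fraction is a hypothetical parameter: the share of each thread's
    CPU time spent holding one exclusive resource (for example a global lock).
    The model assumes waiting threads sleep instead of spinning.
    """
    if threads < 1 or cpus < 1:
        raise ValueError("threads and cpus must be at least 1")
    if not 0.0 <= locked_fraction <= 1.0:
        raise ValueError("locked_fraction must be between 0 and 1")

    # With nothing shared, the limit is how many threads can be on a CPU at once.
    parallel_limit = float(min(threads, cpus))
    if locked_fraction == 0.0:
        return 100.0 * parallel_limit

    # The resource can be held for at most one second per second of wall time,
    # so the total CPU time T must satisfy locked_fraction * T <= 1.
    return 100.0 * min(parallel_limit, 1.0 / locked_fraction)


def per_thread_percent(threads: int, cpus: int, locked_fraction: float) -> float:
    """Per-thread ceiling, assuming the load is spread evenly over all threads."""
    return cpu_ceiling_percent(threads, cpus, locked_fraction) / threads


if __name__ == "__main__":
    # 4 threads on 4 CPUs; both counts are assumptions for the example.
    for fraction in (0.0, 0.5, 1.0):
        total = cpu_ceiling_percent(4, 4, fraction)
        each = per_thread_percent(4, 4, fraction)
        print(f"locked {fraction:.0%}: total <= {total:.0f}%, per thread <= {each:.0f}%")
```

If every bit of work happens while holding the shared resource (locked_fraction = 1.0), the model caps four threads at 100% in total, or 25% each, which is the pattern described in the original report. That does not prove a lock is the cause; it only shows the numbers are consistent with one, so profiling is still needed.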
Re: Unusual threading behavior on single processes
On Fri, Mar 27, 2020 at 09:03:40PM +, Stefmorino wrote:
> I have a question about a performance quirk on OpenBSD, but I'm not really sure
> how to address it, or what the root cause even is; that being how
> multithreaded applications (libpthread?) behave (notably, games).
>
> I have tested many applications, the behavior is the same in all of them, but
> I'll talk about OpenMW (an open-source game engine for Morrowind) since I have
> the most useful information about how this program is threaded. By default,
> OpenMW uses 4 threads (cited here:
> https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
> one for main/generic processing, one for graphics, one for audio, and one for
> preloading terrain. You can see this if you look at the thread usage under top
> while running the game; however, this is exactly where my question comes into
> play. Instead of each thread processing the game independently with its own
> limits, each thread is "capped" to the total limit of one thread (i.e. instead
> of openmw's process using 100% of 4 threads, or 400% cpu in top, the
> process uses 25% across 4 threads, or 100% cpu in top). I tested this using
> GENERIC instead of GENERIC.MP as well, and get identical performance on the
> one thread; it's almost like pthreads is acting as a placeholder of sorts and
> not actually improving performance where it should.
>
> Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
> implemented?

Hard to tell, no idea what that game engine does. But this is not a
general problem, e.g. the malloc_duel regress test
(/usr/src/regress/lib/libpthread/malloc_duel). I see > 100% as well
with other multi-threaded programs.

32013 otto 600 6020K 1552K onproc/3 - 1:07 228.81% malloc_due

Wild guess: it could be that your program actually does not do real
threading, but userland threading. Check with top -H if it really
creates threads. You should see multiple threads having the same PID.
Or all threads are using a resource that cannot be shared.

	-Otto
>
> I'd be happy to do any additional testing, I have a fresh -current source tree
> ready
>
> dmesg
> OpenBSD 6.6-current (GENERIC.MP) #75: Tue Mar 24 12:56:37 MDT 2020
>     dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 16603250688 (15834MB)
> avail mem = 16087437312 (15342MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x986ec000 (62 entries)
> bios0: vendor LENOVO version "R0UET76W (1.56 )" date 11/05/2019
> bios0: LENOVO 20KVCTO1WW
> acpi0 at bios0: ACPI 5.0
> acpi0: sleep states S0 S3 S4 S5
> acpi0: tables DSDT FACP SSDT SSDT CRAT CDIT UEFI MSDM BATB HPET APIC MCFG SBST WSMT IVRS FPDT SSDT SSDT SSDT UEFI SSDT
> acpi0: wakeup devices GPP0(S3) GPP1(S3) GPP2(S3) GPP3(S3) GPP4(S3) GPP5(S3) GPP6(S3) GP17(S3) XHC0(S3) XHC1(S3) GP18(S3) LID_(S3) SLPB(S3)
> acpitimer0 at acpi0: 3579545 Hz, 32 bits
> acpihpet0 at acpi0: 14318180 Hz
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.61 MHz, 17-11-00
> cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu0: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
> cpu0: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
> cpu0: DTLB 64 4KB entries fully associative, 64 4MB entries fully associative
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
> cpu0: apic clock running at 24MHz
> cpu0: mwait min=64, max=64, C-substates=1.1, IBE
> cpu1 at mainbus0: apid 1 (application processor)
> cpu1: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.23 MHz, 17-11-00
> cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu1: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
> cpu1: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
> cpu1: DTLB 64 4KB entries fully associative, 64 4MB entries fully associative
> cpu1: smt 1, core
Unusual threading behavior on single processes
I have a question about a performance quirk on OpenBSD, but I'm not really sure
how to address it, or what the root cause even is; that being how multithreaded
applications (libpthread?) behave (notably, games).

I have tested many applications, the behavior is the same in all of them, but
I'll talk about OpenMW (an open-source game engine for Morrowind) since I have
the most useful information about how this program is threaded. By default,
OpenMW uses 4 threads (cited here:
https://openmw.readthedocs.io/en/stable/reference/modding/settings/cells.html),
one for main/generic processing, one for graphics, one for audio, and one for
preloading terrain. You can see this if you look at the thread usage under top
while running the game; however, this is exactly where my question comes into
play. Instead of each thread processing the game independently with its own
limits, each thread is "capped" to the total limit of one thread (i.e. instead
of openmw's process using 100% of 4 threads, or 400% cpu in top, the process
uses 25% across 4 threads, or 100% cpu in top). I tested this using GENERIC
instead of GENERIC.MP as well, and get identical performance on the one thread;
it's almost like pthreads is acting as a placeholder of sorts and not actually
improving performance where it should.

Is it a lock (spin is at 0)? A placeholder? A limitation of how Ryzen SMP is
implemented?

I'd be happy to do any additional testing, I have a fresh -current source tree
ready

dmesg
OpenBSD 6.6-current (GENERIC.MP) #75: Tue Mar 24 12:56:37 MDT 2020
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 16603250688 (15834MB)
avail mem = 16087437312 (15342MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x986ec000 (62 entries)
bios0: vendor LENOVO version "R0UET76W (1.56 )" date 11/05/2019
bios0: LENOVO 20KVCTO1WW
acpi0 at bios0: ACPI 5.0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT SSDT CRAT CDIT UEFI MSDM BATB HPET APIC MCFG SBST WSMT IVRS FPDT SSDT SSDT SSDT UEFI SSDT
acpi0: wakeup devices GPP0(S3) GPP1(S3) GPP2(S3) GPP3(S3) GPP4(S3) GPP5(S3) GPP6(S3) GP17(S3) XHC0(S3) XHC1(S3) GP18(S3) LID_(S3) SLPB(S3)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpihpet0 at acpi0: 14318180 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.61 MHz, 17-11-00
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
cpu0: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu0: DTLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=1.1, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.23 MHz, 17-11-00
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu1: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
cpu1: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu1: DTLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu1: smt 1, core 0, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx, 1996.23 MHz, 17-11-00
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA,IBPB,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu2: 64KB 64b/line 4-way I-cache, 32KB 64b/line 8-way D-cache, 512KB 64b/line 8-way L2 cache, 4MB 64b/line 16-way L3 cache
cpu2: ITLB 64 4KB entries fully associative, 64 4MB entries fully associative
cpu2: DTLB 6