Re: Extreme slowdown due to malloc?
Mark Geisert writes:
> The page fault numbers are comparable to what you've shown for Cygwin
> on your system.  The long pause after zstd prints "Constructing
> partial suffix array" is because zstd is cpu-bound in qsort() for a
> long time.  No paging during that time.

Then there's probably another performance bug in qsort, since that step
finishes much faster on Linux too.

> Then when the statistics start being printed out, that's when the
> paging insanity starts.

OK, so you see the same thing.  Good.

> What I discovered is that zstd is repeatedly asking malloc() for large
> memory blocks, presumably to mmap files in, then free()ing them.

Oh, so it's not recycling those… maybe that's not very hard to change.

> Any malloc request 256K or larger is fulfilled by mmap() rather than
> enlarging the heap for it.  But crucially, there is no mechanism for
> our malloc to hang on to freed mmap()ed pages for future use.  If you
> free an mmap()ed block, it is unmap()ed immediately.  So for zstd's
> usage pattern you get an incredible number of page faults to satisfy
> the mmap()s and Windows seems to take a non-trivial bit of time for
> each mmap().

It probably puts backing store behind it even when there's enough
memory.  Come to think of it, I should try to move the page file to the
NVMe drive (off the SSD it's on currently) and see if that changes
things.

Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Factory and User Sound Singles for Waldorf Q+, Q and microQ:
http://Synth.Stromeko.net/Downloads.html#WaldorfSounds
Re: Extreme slowdown due to malloc?
Hi Achim,

Thank you very much for the detailed instructions and also the
comparison data Linux vs Cygwin for all those testcases.

Achim Gratz wrote:
> ASSI writes:
>>> I have a Cygwin malloc speedup patch that *might* help the m-t part.
>>> I'll prepare and submit that to cygwin-patches shortly.
>>
>> Well, if you want to test it with the new ZStandard, give it a spin…
>> I'll check how far I can strip that test down so you can use the
>> Cygwin source tree for testing.

I've now done this.  And I don't see any improvement.  Reasons below...

> OK, it's actually pretty simple, do this inside a checkout of
> newlib-cygwin:
>
> $ find newlib winsup texinfo -type f > flist
> $ zstd --train-cover --ultra -22 -T0 -vv --filelist=flist -o dict-cover
>
> On Linux, it reads in all the files in about two seconds, while it
> takes quite a while longer on Cygwin.  But the real bummer is that
> constructing the partial suffix arrays (which is single-threaded) will
> seemingly take forever, while it's done much faster on Linux.  You can
> pare down the number of files like this:
>
> $ shuf -n 320 flist > slist

I've settled on '-n 1600' for testing.  I'm running these Cygwin tests
on a 2C/4T i3-something with 8GB memory and an SSD used for filesystem
and page file.  Not a dog but clearly not a dire-wolf either.

The page fault numbers are comparable to what you've shown for Cygwin
on your system.  The long pause after zstd prints "Constructing partial
suffix array" is because zstd is cpu-bound in qsort() for a long time.
No paging during that time.  Then when the statistics start being
printed out, that's when the paging insanity starts.

What I discovered is that zstd is repeatedly asking malloc() for large
memory blocks, presumably to mmap files in, then free()ing them.  Any
malloc request 256K or larger is fulfilled by mmap() rather than
enlarging the heap for it.  But crucially, there is no mechanism for
our malloc to hang on to freed mmap()ed pages for future use.  If you
free an mmap()ed block, it is unmap()ed immediately.
So for zstd's usage pattern you get an incredible number of page faults
to satisfy the mmap()s and Windows seems to take a non-trivial bit of
time for each mmap().

I will be looking at our malloc implementation to see if tuning
something can fix this behavior.  Adding code is the last resort.

Thanks again for the great testcase.

..mark
Re: Extreme slowdown due to malloc?
ASSI writes:
>> I have a Cygwin malloc speedup patch that *might* help the m-t part.
>> I'll prepare and submit that to cygwin-patches shortly.
>
> Well, if you want to test it with the new ZStandard, give it a spin…
> I'll check how far I can strip that test down so you can use the Cygwin
> source tree for testing.

OK, it's actually pretty simple, do this inside a checkout of
newlib-cygwin:

$ find newlib winsup texinfo -type f > flist
$ zstd --train-cover --ultra -22 -T0 -vv --filelist=flist -o dict-cover

On Linux, it reads in all the files in about two seconds, while it takes
quite a while longer on Cygwin.  But the real bummer is that
constructing the partial suffix arrays (which is single-threaded) will
seemingly take forever, while it's done much faster on Linux.  You can
pare down the number of files like this:

$ shuf -n 320 flist > slist

and then use that shorter file list instead.  I get this:

*** Linux E3-1225v3 4C/4T 3.2/3.6GHz
|------+----------+-------+----------+---------+--------+--------+------------|
|    n |     user |   sys |    total |    wall |   util | serial | pagefaults |
|------+----------+-------+----------+---------+--------+--------+------------|
|  100 |  116.092 | 0.187 |  116.279 |   30.82 | 377.3% |   2.0% |          0 |
|  200 |  145.481 | 0.135 |  145.616 |   38.65 | 376.8% |   2.1% |          0 |
|  400 |  288.341 | 0.414 |  288.755 |   77.84 | 371.0% |   2.6% |          0 |
|  800 |  517.288 | 0.623 |  517.911 |  138.93 | 372.8% |   2.4% |          0 |
| 1600 | 1229.348 | 1.752 | 1231.100 |  333.37 | 369.3% |   2.8% |          0 |
| 3200 | 2508.250 | 3.632 | 2511.882 |  678.96 | 370.0% |   2.7% |          0 |
| 6400 | 4380.693 | 5.352 | 4386.045 | 1176.43 | 372.8% |   2.4% |          0 |
|------+----------+-------+----------+---------+--------+--------+------------|

*** Cygwin E3-1276v3 4C/8T 3.6/4.0GHz
|------+----------+--------+----------+---------+--------+--------+------------|
|    n |     user |    sys |    total |    wall |   util | serial | pagefaults |
|------+----------+--------+----------+---------+--------+--------+------------|
|  100 |  141.906 |  0.796 |  142.702 |   20.53 | 695.1% |   2.2% |     327860 |
|  200 |  198.140 |  1.328 |  199.468 |   29.39 | 678.7% |   2.6% |     452870 |
|  400 |  425.749 |  2.328 |  428.077 |   66.03 | 648.3% |   3.3% |     752357 |
|  800 |  822.250 |  3.499 |  825.749 |  150.42 | 549.0% |   6.5% |    1277198 |
| 1600 | 1773.578 |  8.483 | 1782.061 |  383.42 | 464.8% |  10.3% |    3011298 |
| 3200 | 4322.281 | 15.890 | 4338.171 | 1292.92 | 335.5% |  19.8% |    5746903 |
| 6400 | 8499.750 | 29.437 | 8529.187 | 3275.66 | 260.4% |  29.6% |   10543919 |
|------+----------+--------+----------+---------+--------+--------+------------|

So even at smaller numbers of files (where the serial portion of the
code does not yet dominate) you can see that the faster machine already
expends more cycles.  Looking at the differences, there is a strong
indication that those pagefaults account for the main portion of the
extra time.  The last column is the time per pagefault in µs, assuming
that all of the extra time was spent there.  This is obviously not quite
correct, as that number should be roughly constant if the assumption
held, but it's close enough to uphold the original hypothesis.

*** Linux vs. Cygwin
|------+-------------+--------------+--------------+--------------+------------+------|
|    n | Linux total | Linux scaled | Cygwin total | Cygwin-Linux | pagefaults | t/pf |
|------+-------------+--------------+--------------+--------------+------------+------|
|  100 |     116.279 |      104.651 |      142.702 |       38.051 |     327860 | 116. |
|  200 |     145.616 |      131.054 |      199.468 |       68.414 |     452870 | 151. |
|  400 |     288.755 |      259.880 |      428.077 |      168.197 |     752357 | 224. |
|  800 |     517.911 |      466.120 |      825.749 |      359.629 |    1277198 | 282. |
| 1600 |    1231.100 |     1107.990 |     1782.061 |      674.071 |    3011298 | 224. |
| 3200 |    2511.882 |     2260.694 |     4338.171 |     2077.477 |    5746903 | 361. |
| 6400 |    4386.045 |     3947.441 |     8529.187 |     4581.746 |   10543919 | 435. |
|------+-------------+--------------+--------------+--------------+------------+------|

Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

SD adaptation for Waldorf rackAttack V1.04R1:
http://Synth.Stromeko.net/Downloads.html#WaldorfSDada
Re: Extreme slowdown due to malloc?
Mark Geisert writes:
>> I've been experimenting a bit with ZStandard dictionaries.  The
>> dictionary builder is probably not the most optimized piece of software
>
> Is this what leads you to suspect malloc?  Really heavy use of malloc?

That piece of code does a lot of smallish allocations when it builds up
the suffix array.  That is also where I suspect the high number of page
faults comes from, as I have no other explanation.

>> The obvious difference is that I/O takes a lot longer on Cygwin
>> (roughly a minute for reading all the data) and that I have an insane
>> amount of page faults on Windows (as reported by time) vs. none on
>> Linux.
>
> How much RAM does the Windows machine have?  Do you have a paging
> file?  Is it fixed size or "let Windows manage"?  How big is it?

This machine is fully expanded to 32GiB, so it won't run out of physical
memory for any of the tests shown.  The full dictionary build that I'd
wanted to run on this machine would come in at something like 10 in that
table; it takes around 6 hours and consumes 15GiB on my Linux box (4T),
so if I could run it I might see a bit of memory pressure.  The pagefile
is set up as fixed size (I think 64GiB), but is unused.

>> While doing that I also noticed that top shows the program taking 100%
>> CPU in the multithreaded portion of the program, while it should show
>> close to 800% at that time.  I'm not sure if that information just
>> isn't available on Windows or if procps-ng needs to look someplace
>> else for that to be shown as expected.
>
> No offense, but are you sure it's actually running multi-threaded on
> Windows?

Yes, once it finally gets to the statistics part of the dictionary
build it uses all eight threads no problem.  That's why it doesn't take
twice as long (wall time) at the 400 mark compared to the 200.

> I have a Cygwin malloc speedup patch that *might* help the m-t part.
> I'll prepare and submit that to cygwin-patches shortly.
Well, if you want to test it with the new ZStandard, give it a spin…
I'll check how far I can strip that test down so you can use the Cygwin
source tree for testing.

Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Waldorf MIDI Implementation & additional documentation:
http://Synth.Stromeko.net/Downloads.html#WaldorfDocs
Re: Extreme slowdown due to malloc?
Hi Achim,

Achim Gratz wrote:
> I've been experimenting a bit with ZStandard dictionaries.  The
> dictionary builder is probably not the most optimized piece of software

Is this what leads you to suspect malloc?  Really heavy use of malloc?

> and if you feed it large amounts of data it needs quite a lot of
> cycles.  So I thought I'd run some of this on Cygwin since that machine
> is faster and has more threads than my Linux box.  Unfortunately that
> plan shattered due to the extreme slowness of the first
> (single-threaded) part of the dictionary builder that sets up the
> partial suffix array.
>
> |------+----------------+----------------|
> |      | E3-1225v3      | E3-1276v3      |
> |      | 4C/4T          | 4C/8T          |
> |      | 3.2/3.6GHz     | 3.6/4.0GHz     |
> |------+----------------+----------------|
> |  100 | 00:14 /   55s  | 00:23 /  126s  |
> |  200 | 00:39 /  145s  | 01:10 /  241s  |
> |  400 | 01:12 /  266s  | 01:25 /  322s  |
> |  800 | 02:06 /  466s  | 11:12 / 1245s  |
> | 1600 | 03:57 /  872s  | > 2hr          |
> | 3200 | 08:03 / 1756s  | n/a            |
> | 6400 | 16:17 / 3581s  | n/a            |
> |------+----------------+----------------|
>
> The obvious difference is that I/O takes a lot longer on Cygwin
> (roughly a minute for reading all the data) and that I have an insane
> amount of page faults on Windows (as reported by time) vs. none on
> Linux.

How much RAM does the Windows machine have?  Do you have a paging file?
Is it fixed size or "let Windows manage"?  How big is it?

> While doing that I also noticed that top shows the program taking 100%
> CPU in the multithreaded portion of the program, while it should show
> close to 800% at that time.  I'm not sure if that information just
> isn't available on Windows or if procps-ng needs to look someplace
> else for that to be shown as expected.

No offense, but are you sure it's actually running multi-threaded on
Windows?

I have a Cygwin malloc speedup patch that *might* help the m-t part.
I'll prepare and submit that to cygwin-patches shortly.

Cheers,

..mark