This Stack Overflow question indicates that there are two options (http://stackoverflow.com/questions/153559/what-are-some-good-profilers-for-native-c-on-windows):
https://software.intel.com/sites/default/files/managed/cd/92/Intel-VTune-AmplifierXE-2015-Product-Brief-072914.pdf ($900)
http://www.glowcode.com/summary.htm ($500)

On Wed Dec 03 2014 at 9:11:28 AM Stefan Karpinski <ste...@karpinski.org> wrote:

> This seems nuts. There have to be good profilers on Windows – how do those
> work?
>
> On Tue, Dec 2, 2014 at 11:55 PM, Jameson Nash <vtjn...@gmail.com> wrote:
>
>> (I forgot to mention, that, to be fair, the windows machine that was used
>> to run this test was an underpowered dual-core hyperthreaded atom
>> processor, whereas the linux and mac machines were pretty comparable Xeon
>> and sandybridge machines, respectively. I only gave windows a factor of 2
>> advantage in the above computation in my accounting for this gap)
>>
>> On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <tim.h...@gmail.com> wrote:
>>
>>> Wow, those are pathetically-slow backtraces. Since most of us don't have
>>> machines with 500 cores, I don't see anything we can do.
>>>
>>> --Tim
>>>
>>> On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote:
>>> > you could copy the whole stack (typically only a few 100kb, max of 8MB),
>>> > then do the stack walk offline. if you could change the stack pages to
>>> > copy-on-write, it may even not be too expensive.
>>> >
>>> > but this is the real problem:
>>> >
>>> > ```
>>> > |__/ | x86_64-linux-gnu
>>> >
>>> > julia> @time for i=1:10^4 backtrace() end
>>> > elapsed time: 2.789268693 seconds (3200320016 bytes allocated, 89.29% gc time)
>>> > ```
>>> >
>>> > ```
>>> > |__/ | x86_64-apple-darwin14.0.0
>>> >
>>> > julia> @time for i=1:10^4 backtrace() end
>>> > elapsed time: 2.586410216 seconds (6400480000 bytes allocated, 89.96% gc time)
>>> > ```
>>> >
>>> > ```
>>> > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for i=1:10^3 backtrace() end "
>>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
>>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
>>> > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a
>>> > elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51% gc time)
>>> > ```
>>> >
>>> > ```
>>> > |__/ | i686-w64-mingw32
>>> >
>>> > julia> @time for i=1:10^4 backtrace() end
>>> > elapsed time: 69.243275608 seconds (3200320800 bytes allocated, 13.16% gc time)
>>> > ```
>>> >
>>> > And yes, those gc fractions are verifiably correct. With gc_disable(), they
>>> > execute in 1/10 of the time. So, that pretty much means you must take 1/100
>>> > of the samples if you want to preserve roughly the same slow down. On
>>> > linux, I find the slowdown to be in the range of 2-5x, and consider that to
>>> > be pretty reasonable, especially for what you're getting. If you took the
>>> > same number of samples on windows, it would cause a 200-500x slowdown (give
>>> > or take a few percent). If you wanted to offload this work to other cores
>>> > to get the same level of accuracy and no more slowdown than linux, you
>>> > would need a machine with 200-500 processors (give or take 2-5)!
>>> >
>>> > (I think I did those conversions correctly.
>>> > However, since I just did them
>>> > for the purposes of this email, sans calculator, and as I was typing, let
>>> > me know if I made more than a factor of 2 error somewhere, or just have fun
>>> > reading https://what-if.xkcd.com/84/ instead)
>>> >
>>> > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <tim.h...@gmail.com> wrote:
>>> > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote:
>>> > > > You can't profile a moving target. The thread must be frozen first to
>>> > > > ensure the stack trace doesn't change while attempting to record it
>>> > >
>>> > > Got it. I assume there's no good way to "make a copy and then discard if
>>> > > the copy is bad"?
>>> > >
>>> > > --Tim
>>> > >
>>> > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <tim.h...@gmail.com> wrote:
>>> > > > > If the work of walking the stack is done in the thread, why does it
>>> > > > > cause any slowdown of the main process?
>>> > > > >
>>> > > > > But of course the time it takes to complete the backtrace sets an
>>> > > > > upper limit on how frequently you can take a snapshot. In that case,
>>> > > > > though, couldn't you just have the thread always collecting backtraces?
>>> > > > >
>>> > > > > --Tim
>>> > > > >
>>> > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash wrote:
>>> > > > > > That's essentially what we do now. (Minus the busy wait part). The
>>> > > > > > overhead is too high to run it any more frequently -- it already
>>> > > > > > causes a significant performance penalty on the system, even at the
>>> > > > > > much lower sample rate than linux. However, I suspect the truncated
>>> > > > > > backtraces on win32 were exaggerating the effect somewhat -- that
>>> > > > > > should not be as much of an issue now.
>>> > > > > >
>>> > > > > > Sure, windows lets you snoop on (and modify) the address space of
>>> > > > > > any process, you just need to find the right handle.
>>> > > > > >
>>> > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <tim.h...@gmail.com> wrote:
>>> > > > > > > On Windows, is there any chance that one could set up a separate
>>> > > > > > > thread for profiling and use busy-wait to do the timing? (I don't
>>> > > > > > > even know whether one thread can snoop on the execution state of
>>> > > > > > > another thread.)
>>> > > > > > >
>>> > > > > > > --Tim
>>> > > > > > >
>>> > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson Nash wrote:
>>> > > > > > > > Although, over thanksgiving, I pushed a number of fixes which
>>> > > > > > > > should improve the quality of backtraces on win32 (and make
>>> > > > > > > > sys.dll usable there)
>>> > > > > > > >
>>> > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash <vtjn...@gmail.com> wrote:
>>> > > > > > > > > Correct. Windows imposes a much higher overhead on just about
>>> > > > > > > > > every aspect of doing profiling. Unfortunately, there isn't
>>> > > > > > > > > much we can do about this, other than to complain to Microsoft.
>>> > > > > > > > > (It doesn't have signals, so we must emulate them with a
>>> > > > > > > > > separate thread. The accuracy of windows timers is somewhat
>>> > > > > > > > > questionable.
>>> > > > > > > > > And the stack walk library (for recording the backtrace) is
>>> > > > > > > > > apparently just badly written and therefore insanely slow and
>>> > > > > > > > > memory hungry.)
>>> > > > > > > > >
>>> > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy <tim.h...@gmail.com> wrote:
>>> > > > > > > > >> I think it's just that Windows is bad at scheduling tasks with
>>> > > > > > > > >> short-latency, high-precision timing, but I am not the right
>>> > > > > > > > >> person to answer such questions.
>>> > > > > > > > >>
>>> > > > > > > > >> --Tim
>>> > > > > > > > >>
>>> > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter Simon wrote:
>>> > > > > > > > >> > I have also experienced the inaccurate profile timings on
>>> > > > > > > > >> > Windows. Is the reason for the bad profiler performance on
>>> > > > > > > > >> > Windows understood? Are there plans for improvement?
>>> > > > > > > > >> >
>>> > > > > > > > >> > Thanks,
>>> > > > > > > > >> > --Peter
>>> > > > > > > > >> >
>>> > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8, Tim Holy wrote:
>>> > > > > > > > >> > > By default, the profiler takes one sample per millisecond.
>>> > > > > > > > >> > > In practice, the timing is quite precise on Linux, seemingly
>>> > > > > > > > >> > > within a factor of twoish on OSX, and nowhere close on
>>> > > > > > > > >> > > Windows. So at least on Linux you can simply read samples as
>>> > > > > > > > >> > > milliseconds.
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > If you want to visualize the relative contributions of each
>>> > > > > > > > >> > > statement, I highly recommend ProfileView. If you use
>>> > > > > > > > >> > > LightTable, it's already built-in via the profile() command.
>>> > > > > > > > >> > > The combination of ProfileView and @profile is, in my
>>> > > > > > > > >> > > (extremely biased) opinion, quite powerful compared to tools
>>> > > > > > > > >> > > I used previously in other programming environments.
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > Finally, there's IProfile.jl, which works via a completely
>>> > > > > > > > >> > > different mechanism but does report raw timings (with some
>>> > > > > > > > >> > > pretty big caveats).
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > Best,
>>> > > > > > > > >> > > --Tim
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM Christoph Ortner wrote:
>>> > > > > > > > >> > > > How do you get timings from the Julia profiler, or even
>>> > > > > > > > >> > > > better, %-es? I guess one can convert from the numbers one
>>> > > > > > > > >> > > > gets, but it is a bit painful?
>>> > > > > > > > >> > > >
>>> > > > > > > > >> > > > Christoph
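
On the original question of turning sample counts into timings and percentages, here is a minimal sketch. It assumes each sample stands for one sampling interval (1 ms by default, per Tim's note), which the thread says is only trustworthy on Linux. The function `samples_to_timings`, the `counts` dictionary and its numbers are hypothetical illustrations, not part of the profiler's API, and the counts are treated as flat (non-nested) totals.

```julia
# Convert raw profiler sample counts into approximate times and percentages.
# Assumption: every sample represents `delay` seconds of run time
# (1 ms by default); on OSX and Windows the real interval may differ a lot.
function samples_to_timings(counts::Dict{String,Int}; delay::Float64 = 0.001)
    total = sum(values(counts))            # total number of samples
    rows = [(name = k, ms = 1000 * delay * v, pct = 100 * v / total)
            for (k, v) in counts]
    sort!(rows, by = r -> r.pct, rev = true)   # largest share first
    return rows
end

# Hypothetical sample counts, not taken from any real profile run.
counts = Dict("assemble" => 600, "solve" => 300, "io" => 100)

for r in samples_to_timings(counts)
    println(rpad(r.name, 10), r.ms, " ms  ", r.pct, " %")
end
```

The percentages do not depend on `delay`, so they stay meaningful even where the sampling interval is unreliable, as long as the samples are spread reasonably evenly over the run.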