Can somebody on a Windows system report back with the output of `Profile.init()`?
--Tim On Wednesday, December 03, 2014 04:38:01 PM Jameson Nash wrote: > Yes, probably (I thought we already had). Someone would need to do some > comparison work though first. > > On Wed, Dec 3, 2014 at 10:57 AM Tim Holy <tim.h...@gmail.com> wrote: > > That's a pretty serious bummer. I can't believe anybody puts up with this. > > > > Should we change the default initialization > > https://github.com/JuliaLang/julia/blob/b1c99af9bdeef22e0999b283885977 > > 57541e2cc7/base/profile.jl#L44 > > so that, on Windows, it fires every 100ms or so? And add a note to the > > Profile > > docs? > > > > --Tim > > > > On Wednesday, December 03, 2014 03:37:59 PM Jameson Nash wrote: > > > The suggestion apparently is to use "Event Tracing for Windows" (aka > > > ptrace/dtrace). Yes, that is faster (on linux too...), but misses the > > > > point > > > > > entirely of profiling user code. > > > > > > the other offerings typically wrap StalkWalk64 (as we do), and complain > > > about how absurdly slow it is > > > > > > we used to use RtlCaptureStackBackTrace, but it often failed to give > > > > useful > > > > > backtraces. I think it depends too heavily of walking the EBP pointer > > > > chain > > > > > (which typically doesn't exist on x86_64). As it happens, the remaining > > > suggestions fall into the category of "well, obviously you should just > > > write your own EBP (32-bit base pointer register) pointer chain walk > > > algorithm. here, I'll even write part of it for you" ... which would be > > > very helpful, if RBP (64-bit base pointer register) was used to make > > > > stack > > > > > frame chains (hint: it isn't). and these days, the EBP isn't used to > > > make > > > stack pointer chains on x86 either. > > > > > > llvm3.5 contains the ability to interpret COFF files, so you could > > > presumably write your own stack-walk algorithm. i don't recommend it, > > > however. you might have to call out to StalkWalk anyways to access the > > > microsoft symbol server (yes, off their network servers) to complete the > > > stalk walk symbol lookup correctly > > > > > > On Wed Dec 03 2014 at 10:04:19 AM Tim Holy <tim.h...@gmail.com> wrote: > > > > Some potentially-interesting links (of which I understand very > > > > little): > > > > http://stackoverflow.com/questions/860602/recommended-open-> > > > > > source-profilers#comment2363112_1137133 > > > > > > http://stackoverflow.com/questions/8406175/optimizing-stack-> > > > > > walking-performance > > > > > > I can tell from this comment: > > > > https://github.com/JuliaLang/julia/issues/2597#issuecomment-15159868 > > > > that you already know about this (and its negatives): > > > > http://www.lenholgate.com/blog/2008/09/alternative-call-stac > > > > k-capturing.html > > > > > > > > --Tim > > > > > > > > On Wednesday, December 03, 2014 02:25:22 PM Jameson Nash wrote: > > > > > this stack overflow question indicates that there are two options ( > > > > > http://stackoverflow.com/questions/153559/what-are-some-> > > > > > > > > > good-profilers-for-n > > > > > > > > > ative-c-on-windows ) > > > > > > > > > > https://software.intel.com/sites/default/files/managed/cd/ > > > > > > > > 92/Intel-VTune-Amp > > > > > > > > > lifierXE-2015-Product-Brief-072914.pdf ($900) > > > > > http://www.glowcode.com/summary.htm ($500) > > > > > > > > > > > > > > > On Wed Dec 03 2014 at 9:11:28 AM Stefan Karpinski < > > > > ste...@karpinski.org> > > > > > > > wrote: > > > > > > This seems nuts. There have to be good profilers on Windows – how > > > > do > > > > > > those > > > > > > > > > > work? > > > > > > > > > > > > On Tue, Dec 2, 2014 at 11:55 PM, Jameson Nash <vtjn...@gmail.com> > > > > > > > > wrote: > > > > > >> (I forgot to mention, that, to be fair, the windows machine that > > > > was > > > > > > used > > > > > > > > > >> to run this test was an underpowered dual-core hyperthreaded atom > > > > > >> processor, whereas the linux and mac machines were pretty > > > > comparable > > > > > > Xeon > > > > > > > > > >> and sandybridge machines, respectively. I only gave windows a > > > > factor > > > > > > of 2 > > > > > > > > > >> advantage in the above computation in my accounting for this gap) > > > > > >> > > > > > >> On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <tim.h...@gmail.com> > > > > > > > > wrote: > > > > > >>> Wow, those are pathetically-slow backtraces. Since most of us > > > > don't > > > > > > have > > > > > > > > > >>> machines with 500 cores, I don't see anything we can do. > > > > > >>> > > > > > >>> --Tim > > > > > >>> > > > > > >>> On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote: > > > > > >>> > you could copy the whole stack (typically only a few 100kb, > > > > max of > > > > > > > >>> 8MB), > > > > > >>> > > > > > >>> > then do the stack walk offline. if you could change the stack > > > > > > > > pages to > > > > > > > > > >>> > copy-on-write, it may even not be too expensive. > > > > > >>> > > > > > > >>> > but this is the real problem: > > > > > >>> > > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > |__/ | x86_64-linux-gnu > > > > > >>> > > > > > > >>> > julia> @time for i=1:10^4 backtrace() end > > > > > >>> > elapsed time: 2.789268693 seconds (3200320016 bytes allocated, > > > > > > > > 89.29% > > > > > > > > > >>> gc > > > > > >>> > > > > > >>> > time) > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > |__/ | x86_64-apple-darwin14.0.0 > > > > > >>> > > > > > > >>> > julia> @time for i=1:10^4 backtrace() end > > > > > >>> > elapsed time: 2.586410216 seconds (6400480000 bytes allocated, > > > > > > > > 89.96% > > > > > > > > > >>> gc > > > > > >>> > > > > > >>> > time) > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > ``` > > > > > >>> > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time > > > > for > > > > > > > >>> i=1:10^3 > > > > > >>> > > > > > >>> > backtrace() end " > > > > > >>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol > > > > > > > > <0x00000000> > > > > > > > > > >>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol > > > > > > > > <0x00000000> > > > > > > > > > >>> > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a > > > > > >>> > elapsed time: 22.6314386 seconds (320032016 bytes allocated, > > > > 1.51% > > > > > > gc > > > > > > > > > >>> time) > > > > > >>> > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > |__/ | i686-w64-mingw32 > > > > > >>> > > > > > > >>> > julia> @time for i=1:10^4 backtrace() end > > > > > >>> > elapsed time: 69.243275608 seconds (3200320800 bytes > > > > > >>> > allocated, > > > > > > > > 13.16% > > > > > > > > > >>> gc > > > > > >>> > > > > > >>> > time) > > > > > >>> > ``` > > > > > >>> > > > > > > >>> > And yes, those gc fractions are verifiably correct. With > > > > > > > > gc_disable(), > > > > > > > > > >>> they > > > > > >>> > > > > > >>> > execute in 1/10 of the time. So, that pretty much means you > > > > must > > > > > > take > > > > > > > > > >>> 1/100 > > > > > >>> > > > > > >>> > of the samples if you want to preserve roughly the same slow > > > > down. > > > > > > On > > > > > > > > > >>> > linux, I find the slowdown to be in the range of 2-5x, and > > > > > >>> > consider > > > > > >>> > > > > > >>> that to > > > > > >>> > > > > > >>> > be pretty reasonable, especially for what you're getting. If > > > > you > > > > > > took > > > > > > > > > >>> the > > > > > >>> > > > > > >>> > same number of samples on windows, it would cause a 200-500x > > > > > > > > slowdown > > > > > > > > > >>> (give > > > > > >>> > > > > > >>> > or take a few percent). If you wanted to offload this work to > > > > > >>> > other > > > > > >>> > > > > > >>> cores > > > > > >>> > > > > > >>> > to get the same level of accuracy and no more slowdown than > > > > linux, > > > > > > you > > > > > > > > > >>> > would need a machine with 200-500 processors (give or take > > > > 2-5)! > > > > > > > >>> > (I think I did those conversions correctly. However, since I > > > > just > > > > > > did > > > > > > > > > >>> them > > > > > >>> > > > > > >>> > for the purposes of this email, sans calculator, and as I was > > > > > > > > typing, > > > > > > > > > >>> let > > > > > >>> > > > > > >>> > me know if I made more than a factor of 2 error somewhere, or > > > > just > > > > > > > >>> have fun > > > > > >>> > > > > > >>> > reading https://what-if.xkcd.com/84/ instead) > > > > > >>> > > > > > > >>> > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <tim.h...@gmail.com> > > > > > > > > wrote: > > > > > >>> > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote: > > > > > >>> > > > You can't profile a moving target. The thread must be > > > > frozen > > > > > > first > > > > > > > > > >>> to > > > > > >>> > > > > > >>> > > > ensure the stack trace doesn't change while attempting to > > > > > > > > record > > > > > > > > > >>> > > > it > > > > > >>> > > > > > > > >>> > > Got it. I assume there's no good way to "make a copy and > > > > > >>> > > then > > > > > >>> > > > > > >>> discard if > > > > > >>> > > > > > >>> > > the > > > > > >>> > > copy is bad"? > > > > > >>> > > > > > > > >>> > > --Tim > > > > > >>> > > > > > > > >>> > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy < > > > > tim.h...@gmail.com> > > > > > > > >>> wrote: > > > > > >>> > > > > If the work of walking the stack is done in the thread, > > > > why > > > > > > does > > > > > > > > > >>> it > > > > > >>> > > > > > >>> > > cause > > > > > >>> > > > > > > > >>> > > > > any > > > > > >>> > > > > slowdown of the main process? > > > > > >>> > > > > > > > > > >>> > > > > But of course the time it takes to complete the > > > > > >>> > > > > backtrace > > > > > > > > sets > > > > > > > > > >>> > > > > an > > > > > >>> > > > > upper > > > > > >>> > > > > limit > > > > > >>> > > > > on how frequently you can take a snapshot. In that case, > > > > > > > > though, > > > > > > > > > >>> > > couldn't > > > > > >>> > > > > > > > >>> > > > > you > > > > > >>> > > > > just have the thread always collecting backtraces? > > > > > >>> > > > > > > > > > >>> > > > > --Tim > > > > > >>> > > > > > > > > > >>> > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash > > > > wrote: > > > > > >>> > > > > > That's essentially what we do now. (Minus the busy > > > > > >>> > > > > > wait > > > > > > > > part). > > > > > > > > > >>> The > > > > > >>> > > > > > >>> > > > > overhead > > > > > >>> > > > > > > > > > >>> > > > > > is too high to run it any more frequently -- it > > > > > >>> > > > > > already > > > > > > > > causes > > > > > > > > > >>> a > > > > > >>> > > > > > >>> > > > > > significant performance penalty on the system, even at > > > > the > > > > > > > >>> > > > > > much > > > > > >>> > > > > > lower > > > > > >>> > > > > > sample rate than linux. However, I suspect the > > > > truncated > > > > > > > >>> backtraces > > > > > >>> > > > > > >>> > > on > > > > > >>> > > > > > > > >>> > > > > > win32 were exaggerating the effect somewhat -- that > > > > should > > > > > > not > > > > > > > > > >>> be as > > > > > >>> > > > > > >>> > > > > > much > > > > > >>> > > > > > of an issue now. > > > > > >>> > > > > > > > > > > >>> > > > > > Sure, windows lets you snoop on (and modify) the > > > > address > > > > > > space > > > > > > > > > >>> of > > > > > >>> > > > > > >>> > > > > > any > > > > > >>> > > > > > process, you just need to find the right handle. > > > > > >>> > > > > > > > > > > >>> > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy < > > > > > > > > tim.h...@gmail.com> > > > > > > > > > >>> wrote: > > > > > >>> > > > > > > On Windows, is there any chance that one could set > > > > up a > > > > > > > >>> separate > > > > > >>> > > > > > >>> > > > > > > thread > > > > > >>> > > > > > > for > > > > > >>> > > > > > > profiling and use busy-wait to do the timing? (I > > > > don't > > > > > > even > > > > > > > > > >>> know > > > > > >>> > > > > > >>> > > > > whether > > > > > >>> > > > > > > > > > >>> > > > > > > one > > > > > >>> > > > > > > thread can snoop on the execution state of another > > > > > > > > thread.) > > > > > > > > > >>> > > > > > > --Tim > > > > > >>> > > > > > > > > > > > >>> > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson > > > > Nash > > > > > > wrote: > > > > > >>> > > > > > > > Although, over thanksgiving, I pushed a number of > > > > > >>> > > > > > > > fixes > > > > > >>> > > > > > >>> which > > > > > >>> > > > > > >>> > > should > > > > > >>> > > > > > > > >>> > > > > > > > improve the quality of backtraces on win32 (and > > > > make > > > > > > > >>> sys.dll > > > > > >>> > > > > > >>> > > usable > > > > > >>> > > > > > > > >>> > > > > > > there) > > > > > >>> > > > > > > > > > > > >>> > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash < > > > > > >>> > > > > > >>> vtjn...@gmail.com> > > > > > >>> > > > > > >>> > > > > wrote: > > > > > >>> > > > > > > > > Correct. Windows imposes a much higher overhead > > > > on > > > > > > just > > > > > > > > > >>> about > > > > > >>> > > > > > >>> > > > > > > > > every > > > > > >>> > > > > > > > > > > > >>> > > > > > > aspect > > > > > >>> > > > > > > > > > > > >>> > > > > > > > > of doing profiling. Unfortunately, there isn't > > > > much > > > > > > we > > > > > > > > > >>> can do > > > > > >>> > > > > > >>> > > > > > > > > about > > > > > >>> > > > > > > > > > > > >>> > > > > > > this, > > > > > >>> > > > > > > > > > > > >>> > > > > > > > > other then to complain to Microsoft. (It doesn't > > > > > >>> > > > > > > > > have > > > > > >>> > > > > > >>> signals, > > > > > >>> > > > > > >>> > > so > > > > > >>> > > > > > > > >>> > > > > we > > > > > >>> > > > > > > > > > >>> > > > > > > must > > > > > >>> > > > > > > > > > > > >>> > > > > > > > > emulate them with a separate thread. The > > > > accuracy of > > > > > > > >>> windows > > > > > >>> > > > > > >>> > > > > timers is > > > > > >>> > > > > > > > > > >>> > > > > > > > > somewhat questionable. And the stack walk > > > > > >>> > > > > > > > > library > > > > > > > > (for > > > > > > > > > >>> > > recording > > > > > >>> > > > > > > > >>> > > > > the > > > > > >>> > > > > > > > > > >>> > > > > > > > > backtrace) is apparently just badly written and > > > > > >>> > > > > > > > > therefore > > > > > >>> > > > > > > > >>> > > insanely > > > > > >>> > > > > > > > >>> > > > > > > > > slow > > > > > >>> > > > > > > > > and > > > > > >>> > > > > > > > > memory hungry.) > > > > > >>> > > > > > > > > > > > > > >>> > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy < > > > > > >>> > > > > > >>> tim.h...@gmail.com> > > > > > >>> > > > > > >>> > > > > wrote: > > > > > >>> > > > > > > > >> I think it's just that Windows is bad at > > > > scheduling > > > > > > > >>> tasks > > > > > >>> > > > > > >>> > > > > > > > >> with > > > > > >>> > > > > > > > >> short-latency, > > > > > >>> > > > > > > > >> high-precision timing, but I am not the right > > > > > > > > person to > > > > > > > > > >>> > > > > > > > >> answer > > > > > >>> > > > > > > > > > >>> > > > > such > > > > > >>> > > > > > > > > > >>> > > > > > > > >> questions. > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> --Tim > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter > > > > > > > > Simon > > > > > > > > > >>> wrote: > > > > > >>> > > > > > > > >> > I have also experienced the inaccurate > > > > > >>> > > > > > > > >> > profile > > > > > >>> > > > > > >>> timings on > > > > > >>> > > > > > >>> > > > > Windows. > > > > > >>> > > > > > > > > > >>> > > > > > > Is > > > > > >>> > > > > > > > > > > > >>> > > > > > > > >> the > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > reason for the bad profiler performance on > > > > > >>> > > > > > > > >> > Windows > > > > > >>> > > > > > > > >>> > > understood? > > > > > >>> > > > > > > > >>> > > > > Are > > > > > >>> > > > > > > > > > >>> > > > > > > > >> there > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > plans for improvement? > > > > > >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > Thanks, > > > > > >>> > > > > > > > >> > --Peter > > > > > >>> > > > > > > > >> > > > > > > >>> > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM > > > > > >>> > > > > > > > >> > UTC-8, > > > > > >>> > > > > > > > >> > Tim > > > > > >>> > > > > > >>> Holy > > > > > >>> > > > > > >>> > > wrote: > > > > > >>> > > > > > > > >> > > By default, the profiler takes one sample > > > > per > > > > > > > >>> > > millisecond. In > > > > > >>> > > > > > > > >>> > > > > > > > >> practice, > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > the > > > > > >>> > > > > > > > >> > > timing is quite precise on Linux, seemingly > > > > > > > > within > > > > > > > > > >>> > > > > > > > >> > > a > > > > > >>> > > > > > > > >>> > > factor > > > > > >>> > > > > > > > >>> > > > > > > > >> > > of > > > > > >>> > > > > > > > > > > > >>> > > > > > > twoish > > > > > >>> > > > > > > > > > > > >>> > > > > > > > >> on > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > OSX, > > > > > >>> > > > > > > > >> > > and nowhere close on Windows. So at least > > > > > >>> > > > > > > > >> > > on > > > > > > > > Linux > > > > > > > > > >>> you > > > > > >>> > > > > > >>> > > > > > > > >> > > can > > > > > >>> > > > > > > > > > >>> > > > > simply > > > > > >>> > > > > > > > > > >>> > > > > > > > >> > > read > > > > > >>> > > > > > > > >> > > samples > > > > > >>> > > > > > > > >> > > as milliseconds. > > > > > >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > If you want to visualize the relative > > > > > > > > contributions > > > > > > > > > >>> of > > > > > >>> > > > > > >>> > > each > > > > > >>> > > > > > > > >>> > > > > > > > >> statement, I > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > highly recommend ProfileView. If you use > > > > > >>> > > > > > >>> LightTable, it's > > > > > >>> > > > > > >>> > > > > already > > > > > >>> > > > > > > > > > >>> > > > > > > > >> built-in > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > via > > > > > >>> > > > > > > > >> > > the profile() command. The combination of > > > > > >>> > > > > > >>> ProfileView and > > > > > >>> > > > > > >>> > > > > > > > >> > > @profile > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> is, in > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > my > > > > > >>> > > > > > > > >> > > (extremely biased) opinion, quite powerful > > > > > > > > compared > > > > > > > > > >>> to > > > > > >>> > > > > > >>> > > tools > > > > > >>> > > > > > > > >>> > > > > > > > >> > > I > > > > > >>> > > > > > > > > > > > >>> > > > > > > used > > > > > >>> > > > > > > > > > > > >>> > > > > > > > >> > > previously > > > > > >>> > > > > > > > >> > > in other programming environments. > > > > > >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > Finally, there's IProfile.jl, which works > > > > via a > > > > > > > >>> > > > > > > > >> > > completely > > > > > >>> > > > > > > > > > > > >>> > > > > > > different > > > > > >>> > > > > > > > > > > > >>> > > > > > > > >> > > mechanism > > > > > >>> > > > > > > > >> > > but does report raw timings (with some > > > > pretty > > > > > > big > > > > > > > > > >>> > > caveats). > > > > > >>> > > > > > > > >>> > > > > > > > >> > > Best, > > > > > >>> > > > > > > > >> > > --Tim > > > > > >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM > > > > > > > > Christoph > > > > > > > > > >>> Ortner > > > > > >>> > > > > > >>> > > > > wrote: > > > > > >>> > > > > > > > >> > > > How do you get timings from the Julia > > > > > > > > profiler, > > > > > > > > > >>> or even > > > > > >>> > > > > > >>> > > > > better, > > > > > >>> > > > > > > > > > >>> > > > > > > > >> %-es? I > > > > > >>> > > > > > > > >> > > > > > >>> > > > > > > > >> > > > guess one can convert from the numbers > > > > > >>> > > > > > > > >> > > > one > > > > > > > > gets, > > > > > > > > > >>> but it > > > > > >>> > > > > > >>> > > is > > > > > >>> > > > > > > > >>> > > > > > > > >> > > > a > > > > > >>> > > > > > > > >> > > > bit > > > > > >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > painful? > > > > > >>> > > > > > > > >> > > > > > > > >>> > > > > > > > >> > > > Christoph