this stack overflow question indicates that there are two options (
http://stackoverflow.com/questions/153559/what-are-some-good-profilers-for-native-c-on-windows
)

https://software.intel.com/sites/default/files/managed/cd/92/Intel-VTune-AmplifierXE-2015-Product-Brief-072914.pdf
($900)
http://www.glowcode.com/summary.htm ($500)


On Wed Dec 03 2014 at 9:11:28 AM Stefan Karpinski <ste...@karpinski.org>
wrote:

> This seems nuts. There have to be good profilers on Windows – how do those
> work?
>
> On Tue, Dec 2, 2014 at 11:55 PM, Jameson Nash <vtjn...@gmail.com> wrote:
>
>> (I forgot to mention, that, to be fair, the windows machine that was used
>> to run this test was an underpowered dual-core hyperthreaded atom
>> processor, whereas the linux and mac machines were pretty comparable Xeon
>> and sandybridge machines, respectively. I only gave windows a factor of 2
>> advantage in the above computation in my accounting for this gap)
>>
>> On Tue Dec 02 2014 at 10:50:20 PM Tim Holy <tim.h...@gmail.com> wrote:
>>
>>> Wow, those are pathetically-slow backtraces. Since most of us don't have
>>> machines with 500 cores, I don't see anything we can do.
>>>
>>> --Tim
>>>
>>> On Wednesday, December 03, 2014 03:14:02 AM Jameson Nash wrote:
>>> > you could copy the whole stack (typically only a few 100kb, max of
>>> 8MB),
>>> > then do the stack walk offline. if you could change the stack pages to
>>> > copy-on-write, it may even not be too expensive.
>>> >
>>> > but this is the real problem:
>>> >
>>> > ```
>>> >
>>> > |__/                   |  x86_64-linux-gnu
>>> >
>>> > julia> @time for i=1:10^4 backtrace() end
>>> > elapsed time: 2.789268693 seconds (3200320016 bytes allocated, 89.29%
>>> gc
>>> > time)
>>> > ```
>>> >
>>> > ```
>>> >
>>> > |__/                   |  x86_64-apple-darwin14.0.0
>>> >
>>> > julia> @time for i=1:10^4 backtrace() end
>>> > elapsed time: 2.586410216 seconds (6400480000 bytes allocated, 89.96%
>>> gc
>>> > time)
>>> > ```
>>> >
>>> > ```
>>> > jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for
>>> i=1:10^3
>>> > backtrace() end "
>>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
>>> > fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
>>> > err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a
>>> > elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51% gc
>>> time)
>>> > ```
>>> >
>>> > ```
>>> >
>>> > |__/                   |  i686-w64-mingw32
>>> >
>>> > julia> @time for i=1:10^4 backtrace() end
>>> > elapsed time: 69.243275608 seconds (3200320800 bytes allocated, 13.16%
>>> gc
>>> > time)
>>> > ```
>>> >
>>> > And yes, those gc fractions are verifiably correct. With gc_disable(),
>>> they
>>> > execute in 1/10 of the time. So, that pretty much means you must take
>>> 1/100
>>> > of the samples if you want to preserve roughly the same slow down. On
>>> > linux, I find the slowdown to be in the range of 2-5x, and consider
>>> that to
>>> > be pretty reasonable, especially for what you're getting. If you took
>>> the
>>> > same number of samples on windows, it would cause a 200-500x slowdown
>>> (give
>>> > or take a few percent). If you wanted to offload this work to other
>>> cores
>>> > to get the same level of accuracy and no more slowdown than linux, you
>>> > would need a machine with 200-500 processors (give or take 2-5)!
>>> >
>>> > (I think I did those conversions correctly. However, since I just did
>>> them
>>> > for the purposes of this email, sans calculator, and as I was typing,
>>> let
>>> > me know if I made more than a factor of 2 error somewhere, or just
>>> have fun
>>> > reading https://what-if.xkcd.com/84/ instead)
>>> >
>>> > On Tue Dec 02 2014 at 6:23:07 PM Tim Holy <tim.h...@gmail.com> wrote:
>>> > > On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote:
>>> > > > You can't profile a moving target. The thread must be frozen first
>>> to
>>> > > > ensure the stack trace doesn't change while attempting to record it
>>> > >
>>> > > Got it. I assume there's no good way to "make a copy and then
>>> discard if
>>> > > the
>>> > > copy is bad"?
>>> > >
>>> > > --Tim
>>> > >
>>> > > > On Tue, Dec 2, 2014 at 5:12 PM Tim Holy <tim.h...@gmail.com>
>>> wrote:
>>> > > > > If the work of walking the stack is done in the thread, why does
>>> it
>>> > >
>>> > > cause
>>> > >
>>> > > > > any
>>> > > > > slowdown of the main process?
>>> > > > >
>>> > > > > But of course the time it takes to complete the backtrace sets an
>>> > > > > upper
>>> > > > > limit
>>> > > > > on how frequently you can take a snapshot. In that case, though,
>>> > >
>>> > > couldn't
>>> > >
>>> > > > > you
>>> > > > > just have the thread always collecting backtraces?
>>> > > > >
>>> > > > > --Tim
>>> > > > >
>>> > > > > On Tuesday, December 02, 2014 09:54:17 PM Jameson Nash wrote:
>>> > > > > > That's essentially what we do now. (Minus the busy wait part).
>>> The
>>> > > > >
>>> > > > > overhead
>>> > > > >
>>> > > > > > is too high to run it any more frequently -- it already causes
>>> a
>>> > > > > > significant performance penalty on the system, even at the much
>>> > > > > > lower
>>> > > > > > sample rate than linux. However, I suspect the truncated
>>> backtraces
>>> > >
>>> > > on
>>> > >
>>> > > > > > win32 were exaggerating the effect somewhat -- that should not
>>> be as
>>> > > > > > much
>>> > > > > > of an issue now.
>>> > > > > >
>>> > > > > > Sure, windows lets you snoop on (and modify) the address space
>>> of
>>> > > > > > any
>>> > > > > > process, you just need to find the right handle.
>>> > > > > >
>>> > > > > > On Tue, Dec 2, 2014 at 2:18 PM Tim Holy <tim.h...@gmail.com>
>>> wrote:
>>> > > > > > > On Windows, is there any chance that one could set up a
>>> separate
>>> > > > > > > thread
>>> > > > > > > for
>>> > > > > > > profiling and use busy-wait to do the timing? (I don't even
>>> know
>>> > > > >
>>> > > > > whether
>>> > > > >
>>> > > > > > > one
>>> > > > > > > thread can snoop on the execution state of another thread.)
>>> > > > > > >
>>> > > > > > > --Tim
>>> > > > > > >
>>> > > > > > > On Tuesday, December 02, 2014 06:22:39 PM Jameson Nash wrote:
>>> > > > > > > > Although, over thanksgiving, I pushed a number of fixes
>>> which
>>> > >
>>> > > should
>>> > >
>>> > > > > > > > improve the quality of backtraces on win32 (and make
>>> sys.dll
>>> > >
>>> > > usable
>>> > >
>>> > > > > > > there)
>>> > > > > > >
>>> > > > > > > > On Tue, Dec 2, 2014 at 1:20 PM Jameson Nash <
>>> vtjn...@gmail.com>
>>> > > > >
>>> > > > > wrote:
>>> > > > > > > > > Correct. Windows imposes a much higher overhead on just
>>> about
>>> > > > > > > > > every
>>> > > > > > >
>>> > > > > > > aspect
>>> > > > > > >
>>> > > > > > > > > of doing profiling. Unfortunately, there isn't much we
>>> can do
>>> > > > > > > > > about
>>> > > > > > >
>>> > > > > > > this,
>>> > > > > > >
>>> > > > > > > > > other then to complain to Microsoft. (It doesn't have
>>> signals,
>>> > >
>>> > > so
>>> > >
>>> > > > > we
>>> > > > >
>>> > > > > > > must
>>> > > > > > >
>>> > > > > > > > > emulate them with a separate thread. The accuracy of
>>> windows
>>> > > > >
>>> > > > > timers is
>>> > > > >
>>> > > > > > > > > somewhat questionable. And the stack walk library (for
>>> > >
>>> > > recording
>>> > >
>>> > > > > the
>>> > > > >
>>> > > > > > > > > backtrace) is apparently just badly written and therefore
>>> > >
>>> > > insanely
>>> > >
>>> > > > > > > > > slow
>>> > > > > > > > > and
>>> > > > > > > > > memory hungry.)
>>> > > > > > > > >
>>> > > > > > > > > On Tue, Dec 2, 2014 at 12:59 PM Tim Holy <
>>> tim.h...@gmail.com>
>>> > > > >
>>> > > > > wrote:
>>> > > > > > > > >> I think it's just that Windows is bad at scheduling
>>> tasks
>>> > > > > > > > >> with
>>> > > > > > > > >> short-latency,
>>> > > > > > > > >> high-precision timing, but I am not the right person to
>>> > > > > > > > >> answer
>>> > > > >
>>> > > > > such
>>> > > > >
>>> > > > > > > > >> questions.
>>> > > > > > > > >>
>>> > > > > > > > >> --Tim
>>> > > > > > > > >>
>>> > > > > > > > >> On Tuesday, December 02, 2014 09:57:28 AM Peter Simon
>>> wrote:
>>> > > > > > > > >> > I have also experienced the inaccurate profile
>>> timings on
>>> > > > >
>>> > > > > Windows.
>>> > > > >
>>> > > > > > > Is
>>> > > > > > >
>>> > > > > > > > >> the
>>> > > > > > > > >>
>>> > > > > > > > >> > reason for the bad profiler performance on Windows
>>> > >
>>> > > understood?
>>> > >
>>> > > > > Are
>>> > > > >
>>> > > > > > > > >> there
>>> > > > > > > > >>
>>> > > > > > > > >> > plans for improvement?
>>> > > > > > > > >> >
>>> > > > > > > > >> > Thanks,
>>> > > > > > > > >> > --Peter
>>> > > > > > > > >> >
>>> > > > > > > > >> > On Tuesday, December 2, 2014 3:57:16 AM UTC-8, Tim
>>> Holy
>>> > >
>>> > > wrote:
>>> > > > > > > > >> > > By default, the profiler takes one sample per
>>> > >
>>> > > millisecond. In
>>> > >
>>> > > > > > > > >> practice,
>>> > > > > > > > >>
>>> > > > > > > > >> > > the
>>> > > > > > > > >> > > timing is quite precise on Linux, seemingly within a
>>> > >
>>> > > factor
>>> > >
>>> > > > > > > > >> > > of
>>> > > > > > >
>>> > > > > > > twoish
>>> > > > > > >
>>> > > > > > > > >> on
>>> > > > > > > > >>
>>> > > > > > > > >> > > OSX,
>>> > > > > > > > >> > > and nowhere close on Windows. So at least on Linux
>>> you
>>> > > > > > > > >> > > can
>>> > > > >
>>> > > > > simply
>>> > > > >
>>> > > > > > > > >> > > read
>>> > > > > > > > >> > > samples
>>> > > > > > > > >> > > as milliseconds.
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > If you want to visualize the relative contributions
>>> of
>>> > >
>>> > > each
>>> > >
>>> > > > > > > > >> statement, I
>>> > > > > > > > >>
>>> > > > > > > > >> > > highly recommend ProfileView. If you use
>>> LightTable, it's
>>> > > > >
>>> > > > > already
>>> > > > >
>>> > > > > > > > >> built-in
>>> > > > > > > > >>
>>> > > > > > > > >> > > via
>>> > > > > > > > >> > > the profile() command. The combination of
>>> ProfileView and
>>> > > > > > > > >> > > @profile
>>> > > > > > > > >>
>>> > > > > > > > >> is, in
>>> > > > > > > > >>
>>> > > > > > > > >> > > my
>>> > > > > > > > >> > > (extremely biased) opinion, quite powerful compared
>>> to
>>> > >
>>> > > tools
>>> > >
>>> > > > > > > > >> > > I
>>> > > > > > >
>>> > > > > > > used
>>> > > > > > >
>>> > > > > > > > >> > > previously
>>> > > > > > > > >> > > in other programming environments.
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > Finally, there's IProfile.jl, which works via a
>>> > > > > > > > >> > > completely
>>> > > > > > >
>>> > > > > > > different
>>> > > > > > >
>>> > > > > > > > >> > > mechanism
>>> > > > > > > > >> > > but does report raw timings (with some pretty big
>>> > >
>>> > > caveats).
>>> > >
>>> > > > > > > > >> > > Best,
>>> > > > > > > > >> > > --Tim
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > On Monday, December 01, 2014 10:13:16 PM Christoph
>>> Ortner
>>> > > > >
>>> > > > > wrote:
>>> > > > > > > > >> > > > How do you get timings from the Julia profiler,
>>> or even
>>> > > > >
>>> > > > > better,
>>> > > > >
>>> > > > > > > > >> %-es? I
>>> > > > > > > > >>
>>> > > > > > > > >> > > > guess one can convert from the numbers one gets,
>>> but it
>>> > >
>>> > > is
>>> > >
>>> > > > > > > > >> > > > a
>>> > > > > > > > >> > > > bit
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > painful?
>>> > > > > > > > >> > >
>>> > > > > > > > >> > > > Christoph
>>>
>>>
>

Reply via email to