Hi Ben

A few things I noticed about the CPU improvement change in Node 0.10.36 (the 
nanosleep()-based YieldCPU() change, sketched below):
* It does not appear to have been contributed upstream to V8.
* It only changes POSIX platforms (Windows still uses Sleep(0), i.e. a yield).
* It hasn't been fully carried forward to 0.12: the change to YieldCPU() on 
POSIX was carried forward, but it is no longer called in 
ProfilerEventsProcessor::Run() due to V8 changes. I plan to raise a pull 
request to address this, based on 
https://github.com/tunniclm/node/commit/b2c2902f217a7dbae15d4a82d27d984c982415e2.
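
For reference, the shape of that change is roughly the following (a minimal 
sketch of the nanosleep()-based YieldCPU(), not the actual patch; see the 
commit above for the real thing):

  #include <time.h>

  static void YieldCPU(void) {
    /* Unlike sched_yield(), which gives up the CPU only when another
     * process is runnable on the same core, a 1 ns nanosleep() request
     * is rounded up to the scheduler's granularity, so the thread
     * genuinely sleeps instead of spinning. */
    struct timespec ts = { 0, 1 };  /* 1 nanosecond */
    nanosleep(&ts, NULL);
  }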

Regards
Michael 

On Friday, 9 January 2015 11:26:29 UTC, Ben Noordhuis wrote:
>
> Hi Michael, 
>
> On Fri, Jan 9, 2015 at 12:14 AM, <m.j.tun...@gmail.com> wrote: 
> > Hi Ben, thanks for your reply. I was a little worried because my initial 
> > web searches did not find obvious reports of this issue by others (too 
> > much noise, perhaps). 
> > 
> >> 
> >> For V8 3.14 / node.js v0.10, I fixed most of the overhead by means of 
> >> PR [0], what I think you call a poor man's hack in your email.  Not 
> >> that I disagree, but it's remarkably effective. :-) 
> > 
> > 
> > Oops, I didn't mean to imply it is _necessarily_ an incorrect or bad 
> > solution. 
> > I was really thinking in the context of my own implementation, which was 
> > "tactical" and intended purely to give me confidence I was on the right 
> > track while debugging. 
> > My cautious thinking was that the downside would be a potential delay in 
> > the processing of the next batch of work (samples or code events), leading 
> > to an increase in the queue length. 
> > 
> > If that worry is unfounded, it could be quite a pragmatic solution, and 
> > it sounds like you were able to get an acceptable reduction in CPU 
> > utilisation with a very small sleep period. 
> > It would be interesting to compare the behaviour of the 1 ns sleep 
> > implementation vs a semaphore implementation. 
>
> I did play around with semaphores (well, condition variables) but as 
> CPU usage was not markedly different under full load, I didn't pursue 
> it.  Calling nanosleep() achieved the same thing and was just a 
> one-liner.  The smaller the patch, the easier it usually is to get 
> buy-in. 
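>
> For what it's worth, the condition-variable version looked roughly like 
> this (a sketch with hypothetical names, not the actual experiment): the 
> processor thread blocks until the producer side signals that work was 
> enqueued.
>
>   #include <pthread.h>
>
>   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>   static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
>   static int pending = 0;
>
>   static void wait_for_events(void) {  /* processor thread */
>     pthread_mutex_lock(&lock);
>     while (pending == 0)
>       pthread_cond_wait(&nonempty, &lock);
>     pending = 0;
>     pthread_mutex_unlock(&lock);
>   }
>
>   static void enqueue_event(void) {  /* producer side */
>     pthread_mutex_lock(&lock);
>     pending = 1;
>     pthread_cond_signal(&nonempty);
>     pthread_mutex_unlock(&lock);
>   }
>
> Next to that, the nanosleep() one-liner is a much easier sell.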
>
> I didn't really care about CPU usage below full load because that is 
> a) not a common use case for us, and b) effectively never the case while 
> the event loop wakes up on every SIGPROF signal (although [0] should 
> alleviate that when it lands). 
>
> [0] https://github.com/joyent/node/pull/8791 
>
> >> sched_yield() only gives up a time slice when there is another process 
> >> scheduled on the same CPU.  I changed that to a nanosleep() call with 
> >> a 1 ns timeout that forcibly puts the thread to sleep, with the 
> >> timeout getting rounded up to operating system granularity (50 us on 
> >> most Linux systems; it's even coarser on OS X). 
> > 
> > 
> > I also found a sleep quite effective at reducing the CPU usage, but I 
> > haven't checked the effects in any particular depth (at least, not yet). 
> > I tried 1, 10 and 100 microsecond sleeps. 1 and 10 seemed to give a 
> > similar reduction in utilisation, which makes sense if the minimum 
> > scheduler granularity was >10 microseconds on my box. A 100 microsecond 
> > sleep appeared to give a better utilisation reduction. 
> > To give ballpark figures, the profiler processing thread clocked ~100% 
> > normally, ~20% with a 1 or 10 us sleep, and ~15% with a 100 us sleep. 
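> > 
> > A rough way to sanity-check the granularity theory (a hypothetical 
> > micro-benchmark, not what I actually ran) is to time how long each 
> > requested sleep really takes:
> > 
> >   #include <stdio.h>
> >   #include <time.h>
> > 
> >   int main(void) {
> >     long requests[] = { 1000, 10000, 100000 };  /* 1, 10, 100 us in ns */
> >     for (int i = 0; i < 3; i++) {
> >       struct timespec req = { 0, requests[i] }, t0, t1;
> >       clock_gettime(CLOCK_MONOTONIC, &t0);
> >       for (int n = 0; n < 1000; n++)
> >         nanosleep(&req, NULL);
> >       clock_gettime(CLOCK_MONOTONIC, &t1);
> >       double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
> >                   (t1.tv_nsec - t0.tv_nsec) / 1e3;
> >       printf("%6ld ns requested -> ~%.1f us actual per sleep\n",
> >              requests[i], us / 1000);
> >     }
> >     return 0;
> >   }
> > 
> > If the 1 us and 10 us requests come out at the same actual duration, 
> > that would explain why they gave the same utilisation reduction.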
> > 
> > ---- 
> > 
> > I ran a test on a recent V8 version (a late December cut of master from 
> > v8-git-mirror) while capturing a perf profile, and it looks like the 
> > samples are mostly spent reading the current time: 
> > -  97.41%  [kernel]             [k] acpi_pm_read 
> >    - acpi_pm_read 
> >       - 99.98% ktime_get_ts 
> >          - posix_ktime_get_ts 
> >          - sys_clock_gettime 
> >          - system_call_fastpath 
> >          - 0x7fffc5bfe7c2 
> >          - __clock_gettime 
> >             - 95.34% v8::base::ElapsedTimer::Now() 
> >                - 97.86% v8::base::ElapsedTimer::Elapsed() const 
> >                     v8::base::ElapsedTimer::HasExpired(v8::base::TimeDelta) const 
> >                     v8::internal::ProfilerEventsProcessor::Run() 
> >                     v8::base::Thread::NotifyStartedAndRun() 
> >                     v8::base::ThreadEntry(void*) 
> >                     start_thread 
> >                + 2.14% v8::base::ElapsedTimer::Start() 
> > 
> > So it looks like the thread won't yield time slices as it did in 3.14. 
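> > 
> > The pattern is easy to reproduce in isolation: a tight poll on a 
> > monotonic clock (a standalone toy program, not the V8 code) pins a 
> > core and shows the same clock_gettime()-dominated profile under perf:
> > 
> >   #include <time.h>
> > 
> >   static long long now_ns(void) {
> >     struct timespec ts;
> >     clock_gettime(CLOCK_MONOTONIC, &ts);
> >     return ts.tv_sec * 1000000000LL + ts.tv_nsec;
> >   }
> > 
> >   int main(void) {
> >     long long deadline = now_ns() + 5 * 1000000000LL;  /* spin for 5 s */
> >     while (now_ns() < deadline)
> >       ;  /* busy-wait: every iteration reads the clock, never sleeps */
> >     return 0;
> >   }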
>
> Are you testing that on a VM?  With most virtualization software, 
> querying the system for the current time is extremely expensive, often 
> on the order of 100s of microseconds, and quickly starts to dominate 
> any program that calls clock_gettime() or gettimeofday() frequently. 
>
> That is part of the motivation for #8791; every wake-up from a signal 
> forces the event loop to query for the current time, to find out how 
> much time it spent sleeping in the epoll_wait() / kevent() / etc. 
> system call. 
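>
> Incidentally, the acpi_pm_read frames in your profile fit that theory: 
> acpi_pm is a slow-to-read clocksource and a common fallback under 
> virtualization. A quick way to check on Linux (just reading sysfs; 
> nothing node- or V8-specific):
>
>   #include <stdio.h>
>
>   int main(void) {
>     char buf[64];
>     FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
>                     "current_clocksource", "r");
>     if (f != NULL && fgets(buf, sizeof(buf), f) != NULL)
>       printf("current clocksource: %s", buf);  /* e.g. tsc or acpi_pm */
>     if (f != NULL)
>       fclose(f);
>     return 0;
>   }
>
> If that prints acpi_pm rather than tsc, every clock read is a slow 
> hardware access, which would inflate the numbers above.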
>
