On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> 
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > > Don't know the HW good enough, is there any power consumption, heat etc.
> > > > > difference between the two approaches?  I mean does the HW consume
> > > > > different amount of power if only one thread in a warp executes code
> > > > > and the other threads in the same warp just jump around it, vs. having
> > > > > all threads busy?
> > > 
> > > Having all threads busy will increase power consumption.  It's also bad if
> > > the other vectors are executing memory access instructions.  However, for
> > 
> > Then the uniform SIMT approach might not be that good an idea.
> 
> Why?  Remember that the tradeoff is copying registers (and in OpenACC, stacks
> too).  We don't know how the costs balance.  My intuition is that copying is
> worse compared to what I'm doing.
> 
> Anyhow, for good performance the offloaded code needs to be running in vector
> regions most of the time, where the concern doesn't apply.

But you never know whether people will actually use #pragma omp simd regions;
sometimes they will, sometimes they won't.  If uniform SIMT increases power
consumption outside those regions, it might not be desirable.
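
To make the two cases concrete, here is a minimal sketch in C of the same
offloaded reduction written without and with a simd construct.  The function
names sum_plain and sum_simd are hypothetical, chosen only for illustration;
nothing here is taken from the patch under discussion.  Without OpenMP enabled
the pragmas are ignored and both functions run as ordinary host code.

```c
/* Offloaded loop with no simd construct: on a SIMT target, this is the
   kind of code where only one lane per warp has useful work to do. */
long sum_plain(const int *a, int n)
{
    long acc = 0;
    #pragma omp target map(to: a[0:n]) map(tofrom: acc)
    {
        for (int i = 0; i < n; i++)
            acc += a[i];
    }
    return acc;
}

/* Same loop inside a simd region: iterations may be spread across the
   lanes of a warp, so all lanes are busy with real work. */
long sum_simd(const int *a, int n)
{
    long acc = 0;
    #pragma omp target map(to: a[0:n]) map(tofrom: acc)
    {
        #pragma omp simd reduction(+:acc)
        for (int i = 0; i < n; i++)
            acc += a[i];
    }
    return acc;
}
```

Both functions return the same value; they differ only in how much of the
region is eligible for vector execution, which is where the power question
comes in.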

If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, we could use the soft-stack
for those that can be shared by multiple PTX threads (in different warps, or
different threads of the same warp).  Then we shouldn't need to copy any
stack, just broadcast the scalar vars.
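
As a rough illustration of the distinction such a pass would have to draw,
consider the sketch below.  The names store_square and sum_of_squares are
hypothetical, and the classification in the comments is only my assumption
about how the variables might be sorted, not the output of any existing pass.

```c
/* Hypothetical helper: writes through a pointer to a caller's local. */
static void store_square(int *slot, int i)
{
    *slot = i * i;
}

int sum_of_squares(void)
{
    int squares[8]; /* addressable and written by many threads in the
                       parallel region: would need to live in the
                       soft-stack (assumed classification) */
    int bias = 1;   /* scalar that is only read in the region: could
                       simply be broadcast (assumed classification) */
    int tmp;        /* addressable, but only touched by the encountering
                       thread: need not be shared (assumed) */

    store_square(&tmp, 3);

    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        store_square(&squares[i], i + bias);

    int sum = tmp;
    for (int i = 0; i < 8; i++)
        sum += squares[i];
    return sum;
}
```

Here only squares has its address used by other threads, so only it would
need shared storage; tmp could stay in thread-private storage even though its
address is taken.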

        Jakub
