On 12/02/15 05:40, Jakub Jelinek wrote:
Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches?  I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just jump around it, vs. having all threads busy?

Having all threads busy will increase power consumption. It's also bad if the other vectors are executing memory access instructions. However, for small blocks it is probably a win over the jump-around approach. One future optimization of the neutering algorithm is to add such predication for small blocks and keep branching for the larger blocks.
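
To make the trade-off concrete, here is a rough sketch in plain C that simulates one warp as a loop over lanes. It is not the nvptx backend's code: the warp size of 32, the names `run_branch`, `run_predicated` and `use_predication`, and the `SMALL_BLOCK_LIMIT` threshold are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32          /* assumption: 32 lanes per warp */
#define SMALL_BLOCK_LIMIT 4   /* hypothetical threshold, in instructions */

/* Branching form: neutered lanes jump around the block and do no work.
   Returns the number of lanes that actually executed the block. */
int run_branch(int out[WARP_SIZE], int value)
{
    int busy = 0;

    assert(out != NULL);
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (lane != 0)
            continue;         /* jump around the block */
        out[lane] = value + 1;
        busy++;
    }
    return busy;
}

/* Predicated form: every lane executes the block, but only lane 0
   commits the side effect.  Returns the number of busy lanes. */
int run_predicated(int out[WARP_SIZE], int value)
{
    int busy = 0;

    assert(out != NULL);
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        int result = value + 1;   /* all lanes compute */
        busy++;
        if (lane == 0)            /* predicate guards only the store */
            out[lane] = result;
    }
    return busy;
}

/* Hypothetical policy: predicate small blocks, branch around larger ones. */
int use_predication(size_t block_insns)
{
    return block_insns <= SMALL_BLOCK_LIMIT;
}
```

Both forms leave the same result in lane 0. The difference is the busy-lane count (1 versus 32 here), which stands in for the extra power the predicated form would draw.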

How exactly does OpenACC copy the stack?  At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

The stack frame of the current function is copied when entering a partitioned region. (There is no visibility of the caller's frame and such.) Again, an optimization would be to copy only the part of the stack that's used in the partitioned region.
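
As a rough model of that copy, the sketch below broadcasts one lane's frame to the others on region entry. It is plain C, not what the compiler emits: `struct frame`, its fields, `enter_partitioned` and the warp size of 32 are hypothetical names and values chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

#define WARP_SIZE 32   /* assumption: 32 lanes per warp */

/* Hypothetical frame of the function containing the partitioned region. */
struct frame {
    int n;
    int scale;
    int scratch;       /* not used inside the region, but copied anyway */
};

/* On region entry, copy the active lane's whole frame to every other
   lane.  Only this function's frame is copied; nothing from a caller. */
void enter_partitioned(struct frame frames[WARP_SIZE], int active_lane)
{
    assert(active_lane >= 0 && active_lane < WARP_SIZE);
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (lane != active_lane)
            memcpy(&frames[lane], &frames[active_lane], sizeof(struct frame));
}
```

The optimization mentioned above would amount to copying only `n` and `scale` in this model and leaving `scratch` alone.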

nathan
