On 12/11/2012 8:13 AM, Borislav Petkov wrote:
On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
On 12/11/2012 7:48 AM, Borislav Petkov wrote:
On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
Another test: parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving
and balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On a NHM EP box
          powersaving            balance               performance
x = 4     166.516 /88 68        170.515 /82 71        165.283 /103 58
x = 8     173.654 /61 94        177.693 /60 93        172.31  /76 76

(each cell appears to be: compress time in seconds / power-efficiency
score / average watts)
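(A minimal sketch of how the timing side of such a sweep might be
scripted; the loop and the GNU time invocation are illustrative, and
the watts figures above presumably come from a separate power meter,
not from anything shown here:

    for x in 4 8; do
        /usr/bin/time -f "p=$x: %e s elapsed" \
            pigz -k -c -p$x -r linux* > /dev/null
    done
)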

This looks funny: so "performance" is eating fewer watts than
"powersaving" and "balance" on NHM. Could it be that the average watts
measurements on NHM are not correct/precise...? On SNB they look as
expected, according to your scheme.

well... it's not always beneficial to group or to spread out;
which is best depends mostly on cache behavior.

Let me try to understand what this means: so "performance" above with
8 threads means that those threads are spread out across more than one
socket, no?

If so, this would mean that you have a smaller number of tasks on each
socket, thus the lower wattage.

The "powersaving" method OTOH fills up the one socket up to the brim,
thus the slightly higher consumption due to all threads being occupied.

Is that it?

not sure.

by and large, power efficiency is the same as performance efficiency,
with some twists.
or, to reword that more clearly: if you waste performance due to
something that becomes inefficient, you're wasting power as well.
now, you might have some hardware effects that can then save you
power... but those effects first need to overcome the waste from the
performance inefficiency... and that almost never happens.

for example, if you have two workloads that each barely fit inside the
last level cache... it's much more efficient to spread these over two
sockets, where each has its own full LLC to use.
if you'd group these together, both would thrash the cache all the
time and run inefficiently --> bad for power.
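As a concrete sketch (the CPU numbering and workload names here are
hypothetical; on a two-socket box socket 0 might own CPUs 0-3 and
socket 1 CPUs 4-7, which lscpu will confirm), the "spread" placement
can be forced by hand:

    # pin each cache-hungry workload to its own socket,
    # so each gets a full LLC to itself
    taskset -c 0-3 ./workload_a &
    taskset -c 4-7 ./workload_b &
    wait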

now, on the other hand, if you have two threads of a process that
share a bunch of data structures, and you'd spread these over two
sockets, you end up bouncing data between the two sockets a lot,
running inefficiently --> bad for power.
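And the "group" placement for the sharing case, under the same
hypothetical CPU numbering (the workload name is again made up):

    # keep the whole process, and thus both threads, on one socket,
    # so the shared data stays in a single LLC
    taskset -c 0-3 ./two_threads_sharing_data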


having said all this, if you have two tasks that don't have such cache
effects, the most efficient way of running things will be on two
hyperthreading halves of the same core... it's very hard to beat the
power efficiency of that. But this assumes the tasks don't compete
much for resources at the HT level, and achieve good scaling.
and this still has to compete with "race to halt": if you're done
quicker, you can put the memory in self-refresh quicker.
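For illustration, sibling hyperthreads can be found via the standard
sysfs topology files and the placement forced with taskset (the "0,8"
pairing is just an example output; ./two_compatible_tasks is a
hypothetical workload):

    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list  # e.g. 0,8
    taskset -c 0,8 ./two_compatible_tasks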

none of this stuff is easy for humans or computer programs to
determine ahead of time... or sometimes even afterwards.
heck, even for just performance it's really really hard already, never
mind adding power.

my personal gut feeling is that we should just optimize this scheduler
stuff for performance, and that we're going to be doing quite well on
power already if we achieve that.

