[PATCH] NUMA topology support for powernv

2011-10-28 Thread Dipankar Sarma

This patch adds support for NUMA topology on powernv platforms running
OPAL firmware. It checks the platform type at run time and sets the
affinity form accordingly, so that the NUMA topology is discovered
correctly.

Signed-off-by: Dipankar Sarma dipan...@in.ibm.com
---
 arch/powerpc/mm/numa.c |   24 +---
 1 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0bfb90c..58f292f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -315,7 +315,10 @@ static int __init find_min_common_depth(void)
struct device_node *root;
const char *vec5;
 
-	root = of_find_node_by_path("/rtas");
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		root = of_find_node_by_path("/ibm,opal");
+	else
+		root = of_find_node_by_path("/rtas");
 	if (!root)
 		root = of_find_node_by_path("/");
 
@@ -344,12 +347,19 @@ static int __init find_min_common_depth(void)
 
 #define VEC5_AFFINITY_BYTE 5
 #define VEC5_AFFINITY  0x80
-	chosen = of_find_node_by_path("/chosen");
-	if (chosen) {
-		vec5 = of_get_property(chosen, "ibm,architecture-vec-5", NULL);
-		if (vec5 && (vec5[VEC5_AFFINITY_BYTE] & VEC5_AFFINITY)) {
-			dbg("Using form 1 affinity\n");
-			form1_affinity = 1;
+
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		form1_affinity = 1;
+	else {
+		chosen = of_find_node_by_path("/chosen");
+		if (chosen) {
+			vec5 = of_get_property(chosen,
+					       "ibm,architecture-vec-5", NULL);
+			if (vec5 &&
+			    (vec5[VEC5_AFFINITY_BYTE] & VEC5_AFFINITY)) {
+				dbg("Using form 1 affinity\n");
+				form1_affinity = 1;
+			}
 		}
 	}
 



Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY

2010-03-08 Thread Dipankar Sarma
On Mon, Mar 08, 2010 at 12:20:06PM +0530, Vaidyanathan Srinivasan wrote:
 * Dipankar Sarma dipan...@in.ibm.com [2010-03-06 00:48:11]:
 
  Shouldn't we create this only for supported platforms ?
 
 Hi Dipankar,
 
 Yes we will need a check like
 firmware_has_feature(FW_FEATURE_BEST_ENERGY) to avoid sysfs files in
 unsupported platforms.  I will add that check in the next iteration.

Also, given that this module isn't likely to provide anything on
older platforms, it should get loaded only on newer platforms.
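
Something like the sketch below in the module init would do it - a rough
illustration only, assuming the FW_FEATURE_BEST_ENERGY flag mentioned
above is what eventually gets merged:

static int __init pseries_energy_init(void)
{
	/* Bail out early on unsupported platforms so the module
	 * registers nothing there; FW_FEATURE_BEST_ENERGY is the
	 * proposed (not yet existing) feature flag from this thread. */
	if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY))
		return -ENODEV;

	/* ... then create the sysfs files as in the patch ... */
	return 0;
}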

Thanks
Dipankar


Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY

2010-03-05 Thread Dipankar Sarma
On Wed, Mar 03, 2010 at 11:48:22PM +0530, Vaidyanathan Srinivasan wrote:
  static void __init cpu_init_thread_core_maps(int tpc)
 diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
 index c667f0f..b3dd108 100644
 --- a/arch/powerpc/platforms/pseries/Kconfig
 +++ b/arch/powerpc/platforms/pseries/Kconfig
 @@ -33,6 +33,16 @@ config PSERIES_MSI
 	depends on PCI_MSI && EEH
 default y
 
 +config PSERIES_ENERGY
 +	tristate "pseries energy management capabilities driver"
 +	depends on PPC_PSERIES
 +	default y
 +	help
 +	  Provides an interface to platform energy management capabilities
 +	  on supported PSERIES platforms.
 +	  Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
 +	  and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
 +
  config SCANLOG
 	tristate "Scanlog dump interface"
 	depends on RTAS_PROC && PPC_PSERIES

[...]

 +static int __init pseries_energy_init(void)
 +{
 +	int cpu, err;
 +	struct sys_device *cpu_sys_dev;
 +
 +	/* Create the sysfs files */
 +	err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
 +				&attr_cpu_activate_hint_list.attr);
 +	if (!err)
 +		err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
 +				&attr_cpu_deactivate_hint_list.attr);
 +
 +	for_each_possible_cpu(cpu) {
 +		cpu_sys_dev = get_cpu_sysdev(cpu);
 +		err = sysfs_create_file(&cpu_sys_dev->kobj,
 +				&attr_percpu_activate_hint.attr);
 +		if (err)
 +			break;
 +		err = sysfs_create_file(&cpu_sys_dev->kobj,
 +				&attr_percpu_deactivate_hint.attr);
 +		if (err)
 +			break;
 +	}
 +	return err;
 +
 +}

Shouldn't we create this only for supported platforms ?

Thanks
Dipankar


Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

2009-09-16 Thread Dipankar Sarma
On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
 On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
  This patchset contains the offline state driver implemented for
  pSeries. For pSeries, we define three available_hotplug_states. They are:
  
  online: The processor is online.
  
  offline: This is the default behaviour when the cpu is offlined
  
  inactive: This cedes the vCPU to the hypervisor with a cede latency
  
  Any feedback on the patchset will be immensely valuable.
 
 I still think it's a layering violation... it's the hypervisor manager
 that should be bothered in what state an off-lined cpu is in. 

The problem is that hypervisor managers cannot figure out what sort
of latency the guest OS can tolerate in a given situation. They wouldn't know
from what context the guest OS has ceded the vcpu. The hypervisor has to have
some sort of hint, which is what the guest OS provides.
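
For reference, the guest side of such a hint is roughly the sketch below.
H_CEDE is the real hcall; treat the helper and the hint plumbing (stashing
the hint in the lppaca before ceding) as assumptions of this sketch:

#include <asm/hvcall.h>		/* H_CEDE, plpar_hcall_norets() */
#include <asm/lppaca.h>		/* get_lppaca() */

/* Cede this vcpu to the hypervisor, telling it how quickly the
 * guest expects the cpu back. */
static void cede_offline_cpu(u8 latency_hint)
{
	get_lppaca()->cede_latency_hint = latency_hint;
	plpar_hcall_norets(H_CEDE);
}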

Thanks
Dipankar


Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

2009-09-16 Thread Dipankar Sarma
On Wed, Sep 16, 2009 at 05:32:51PM +0200, Peter Zijlstra wrote:
 On Wed, 2009-09-16 at 20:58 +0530, Dipankar Sarma wrote:
  On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
   On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
This patchset contains the offline state driver implemented for
pSeries. For pSeries, we define three available_hotplug_states. They 
are:

online: The processor is online.

offline: This is the default behaviour when the cpu is 
offlined

inactive: This cedes the vCPU to the hypervisor with a cede 
latency

Any feedback on the patchset will be immensely valuable.
   
   I still think it's a layering violation... it's the hypervisor manager
   that should be bothered in what state an off-lined cpu is in. 
  
  The problem is that hypervisor managers cannot figure out what sort
  of latency the guest OS can tolerate in a given situation. They wouldn't know
  from what context the guest OS has ceded the vcpu. The hypervisor has to have
  some sort of hint, which is what the guest OS provides.
 
 I'm missing something here, hot-unplug is a slow path and should not
 ever be latency critical..?

You aren't, I did :)

No, for this specific case, latency isn't an issue. The issue is:
how do we cede unused vcpus to the hypervisor for better energy management ?
Yes, it can be done by a hypervisor manager telling the kernel to
offline and make a bunch of vcpus inactive. It does have to choose between
offline (release the vcpu) and inactive (cede it, but guaranteed to come
back if needed).
The problem is that long ago we exported a lot of hotplug stuff to
userspace through the sysfs interface, and we cannot do something
inside the kernel without keeping the sysfs stuff consistent.
This seems like a sane way to do that without undoing all the
virtual cpu hotplug infrastructure in the different supporting archs.

Thanks
Dipankar


Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

2009-09-16 Thread Dipankar Sarma
On Wed, Sep 16, 2009 at 07:22:35PM +0200, Peter Zijlstra wrote:
 On Wed, 2009-09-16 at 22:33 +0530, Vaidyanathan Srinivasan wrote:
  * Peter Zijlstra a.p.zijls...@chello.nl [2009-09-16 18:35:16]:
  
   Now if you were to try and online the cpus in the guest, it'd fail
   because the cpus aren't backed anymore, and the hot-plug simply
   times-out and fails.
   
   And we're still good, right?
  
  The requirements differ here.  If we had offlined 2 vCPUs for the
  purpose of system reconfiguration, the expected behavior with offline
  interface will work right.  However the proposed cede interface is
  needed when we want them to temporarily go away but still come back
  when we do an online.  We want the online to always succeed since the
  backing physical resources are not relinquished.  The proposed
  interface facilitates offline without relinquishing the physical
  resources assigned to LPARs.
 
 Then make that the platform default and leave the lpar management to
 whatever pokes at the lpar?

That could have worked - however, lpar management already uses
the same sysfs interface to poke. The current semantics make the lpar
vcpu deconfig state the platform default, assuming that it will be used for
lpar management. The only clean way to do this without breaking lpar
management stuff is to add another state - inactive - and retain backward
compatibility.

Thanks
Dipankar


Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-17 Thread Dipankar Sarma
On Sun, Aug 16, 2009 at 11:53:22PM +0200, Peter Zijlstra wrote:
 On Mon, 2009-08-17 at 01:14 +0530, Balbir Singh wrote:
  Agreed, I've tried to come up with a little ASCII art to depict your
  scenarios graphically
  
  
  +------+  don't need (offline)   +------------+
  |  OS  +------------------------>| hypervisor |  Reuse CPU for
  +--+---+                         |            |  something else
     |                             +------------+  (visible to users
     | (needed,                                     as resource changed)
     |  but can cede)
     v
  +------------+
  | hypervisor |  Don't reuse CPU (CPU ceded);
  |            |  give back to OS when needed.
  +------------+  (Not visible to users as no
                   resource binding changed)
 
 I still don't get it... _why_ should this be exposed in the guest
 kernel? Why not let the hypervisor manage a guest's offline cpus in a
 way it sees fit?

For most parts, we do. The guest kernel doesn't manage the offline
CPU state. That is typically done by the hypervisor. However, the offline
operation as defined now always results in a VM resize on some hypervisor
systems (like pseries) - it would be convenient to have a non-resize
offline operation which lets the guest cede the cpu to the hypervisor
with the hint that the VM shouldn't be resized and the guest needs the
guarantee that it can get the cpu back at any time. The hypervisor can do
whatever it wants with the ceded CPU, including putting it in a low power
state, but not change the physical cpu shares of the VM. The pseries
hypervisor, for example, clearly distinguishes between the two - the
rtas-stop-self call to resize the VM vs. the H_CEDE hypercall with a hint.
What I am suggesting is that we allow this with an extension to existing
interfaces, because it makes sense to allow a sort of hibernation of the
cpus without changing any configuration of the VMs.
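
To make the distinction concrete, the two paths look roughly like the
sketch below - the "stop-self" RTAS token and the H_CEDE hcall are real,
the wrappers and the hint plumbing are only illustrative:

#include <asm/rtas.h>		/* rtas_token(), rtas_call() */
#include <asm/hvcall.h>		/* H_CEDE, plpar_hcall_norets() */

/* Release the vcpu back to the hypervisor - a visible VM resize. */
static void release_vcpu(void)
{
	rtas_call(rtas_token("stop-self"), 0, 1, NULL);
}

/* Cede the vcpu with a hint - no configuration change; the guest is
 * guaranteed to get the cpu back.  Assumes the hint lives in the
 * lppaca, as in the earlier cede sketch. */
static void cede_vcpu(u8 latency_hint)
{
	get_lppaca()->cede_latency_hint = latency_hint;
	plpar_hcall_norets(H_CEDE);
}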

Thanks
Dipankar


Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-17 Thread Dipankar Sarma
On Mon, Aug 17, 2009 at 09:15:57AM +0200, Peter Zijlstra wrote:
 On Mon, 2009-08-17 at 11:54 +0530, Dipankar Sarma wrote:
  For most parts, we do. The guest kernel doesn't manage the offline
  CPU state. That is typically done by the hypervisor. However, the offline
  operation as defined now always results in a VM resize on some hypervisor
  systems (like pseries) - it would be convenient to have a non-resize
  offline operation which lets the guest cede the cpu to the hypervisor
  with the hint that the VM shouldn't be resized and the guest needs the
  guarantee that it can get the cpu back at any time. The hypervisor can do
  whatever it wants with the ceded CPU, including putting it in a low power
  state, but not change the physical cpu shares of the VM. The pseries
  hypervisor, for example, clearly distinguishes between the two - the
  rtas-stop-self call to resize the VM vs. the H_CEDE hypercall with a hint.
  What I am suggesting is that we allow this with an extension to existing
  interfaces, because it makes sense to allow a sort of hibernation of the
  cpus without changing any configuration of the VMs.
 
 From my POV the thing you call cede is the only sane thing to do for a
 guest. Let the hypervisor management interface deal with resizing guests
 if and when that's needed.

That is more or less how it currently works - at least for the pseries
hypervisor. The current offline operation with the rtas-stop-self call I
mentioned earlier is initiated by the hypervisor management interfaces/tools
on pseries systems. This wakes up a guest system tool that echoes 1
to the offline file resulting in the configuration change.
The OS involvement is necessary to evacuate tasks/interrupts
from the released CPU. We don't really want to initiate this from guests.

 Thing is, you don't want a guest to be able to influence the amount of
 cpu shares attributed to it. You want that in explicit control of
 whomever manages the hypervisor.

Agreed. But given a fixed cpu share by the hypervisor management tools,
we would like to be able to cede cpus to the hypervisor while leaving the
hypervisor configuration intact. We don't have this at the moment, and we
want to just extend the current interface for it.

Thanks
Dipankar



Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-17 Thread Dipankar Sarma
On Mon, Aug 17, 2009 at 01:28:15PM +0530, Dipankar Sarma wrote:
 On Mon, Aug 17, 2009 at 09:15:57AM +0200, Peter Zijlstra wrote:
  On Mon, 2009-08-17 at 11:54 +0530, Dipankar Sarma wrote:
   For most parts, we do. The guest kernel doesn't manage the offline
   CPU state. That is typically done by the hypervisor. However, the offline
   operation as defined now always results in a VM resize on some hypervisor
   systems (like pseries) - it would be convenient to have a non-resize
   offline operation which lets the guest cede the cpu to the hypervisor
   with the hint that the VM shouldn't be resized and the guest needs the
   guarantee that it can get the cpu back at any time. The hypervisor can do
   whatever it wants with the ceded CPU, including putting it in a low power
   state, but not change the physical cpu shares of the VM. The pseries
   hypervisor, for example, clearly distinguishes between the two - the
   rtas-stop-self call to resize the VM vs. the H_CEDE hypercall with a hint.
   What I am suggesting is that we allow this with an extension to existing
   interfaces, because it makes sense to allow a sort of hibernation of the
   cpus without changing any configuration of the VMs.
  
  From my POV the thing you call cede is the only sane thing to do for a
  guest. Let the hypervisor management interface deal with resizing guests
  if and when that's needed.
 
 That is more or less how it currently works - at least for the pseries
 hypervisor. The current offline operation with the rtas-stop-self call I
 mentioned earlier is initiated by the hypervisor management interfaces/tools
 on pseries systems. This wakes up a guest system tool that echoes 1
 to the offline file resulting in the configuration change.

Should have said - echoes 0 to the online file. 

You don't necessarily need this in the guest Linux as long as there is
a way for hypervisor tools to internally move Linux tasks/interrupts
from a vcpu - an async event handled by the kernel, for example.
But I think it is too late for that - the interface has long been
exported.


 The OS involvement is necessary to evacuate tasks/interrupts
 from the released CPU. We don't really want to initiate this from guests.
 
  Thing is, you don't want a guest to be able to influence the amount of
  cpu shares attributed to it. You want that in explicit control of
  whomever manages the hypervisor.
 
 Agreed. But given a fixed cpu share by the hypervisor management tools,
 we would like to be able to cede cpus to the hypervisor while leaving the
 hypervisor configuration intact. We don't have this at the moment, and we
 want to just extend the current interface for it.
 
 Thanks
 Dipankar
 


Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-12 Thread Dipankar Sarma
On Wed, Aug 12, 2009 at 01:58:06PM +0200, Pavel Machek wrote:
 Hi!
 
  Maybe having (to pick a number) 3 possible offline states for all
  platforms - one for the halt equivalent, one for the deepest state the
  CPU can handle, and one for the deepest state the platform likes for
  C-states - may make sense. It will keep things simpler in terms of usage
  expectations and possibly reduce the opportunity for misuse?
 
 Maybe just going to the deepest offline state automatically is the
 easiest option?

In a native system, I think we should let the platform-specific code
export what makes sense. That may be just the lowest possible
state only, or maybe more than one.

In a virtualized system, we would want to do at least the following -

1. An offline configuration state where the hypervisor can
take the cpu back and allocate it to another VM.

2. A low-power state where the guest indicates it doesn't need the
CPU (and can be put in low power state) but doesn't want to give up 
its allocated cpu share. IOW, no visible configuration changes.

So, in any case we would probably want more than one state.
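
As a sketch, that distinction could be captured by something as simple as
the enum below (names invented, purely illustrative):

enum offline_state {
	OFFLINE_DEALLOC,	/* hypervisor may reassign the cpu */
	OFFLINE_CEDED,		/* low power, cpu share retained */
};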

 cpu hotplug/unplug should be rare-enough operation that the latencies
 do not really matter, right?

As of now, from the platform perspective, I don't think low-power
state latencies matter in this code path. The only thing that might
have any relevance is electrical power-off technology and whether
there may be any h/w specific issues restricting its use. I don't know
that there will be any, but it may not be a good idea to prevent
platforms from requiring the use of multiple offline states.

Thanks
Dipankar


Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-12 Thread Dipankar Sarma
On Wed, Aug 12, 2009 at 08:45:18PM -0400, Len Brown wrote:
 On Thu, 13 Aug 2009, Dipankar Sarma wrote:
  In a native system, I think we should let the platform-specific code
  export what makes sense. That may be just the lowest possible
  state only, or maybe more than one.
 
 For x86, it is 1 state.

Native x86, yes. For virtualized systems, that may not be the
case depending on how the hypervisor behaves.

  In a virtualized system, we would want to do at least the following -
 
 Are you talking about Linux as a para-virtualized guest here?
 
  1. An offline configuration state where the hypervisor can
  take the cpu back and allocate it to another VM.
 
 The hypervisor gets control no matter what idle state
 the guest enters, yes?

The hypervisor may get control, but what it does may depend
on what the guest OS wished/hinted for - a config change, or just
shutting down the unused cpu for a while if possible.

  2. A low-power state where the guest indicates it doesn't need the
  CPU (and can be put in low power state) but doesn't want to give up 
  its allocated cpu share. IOW, no visible configuration changes.
  
  So, in any case we would probably want more than one state.
 
 How are #1 and #2 different when the hypervisor
 gets control in all idle states?  I assert that
 they are the same, and thus 1 state will suffice.

It depends on the hypervisor implementation. On the pseries (powerpc)
hypervisor, for example, they are different. By offlining a vcpu
(and in turn shutting down a cpu), you will actually create a configuration
change in the VM that is visible to other systems management tools,
which may not be what the system administrator wanted. Ideally,
we would like to distinguish between these two states.

Hope that suffices as an example.

Thanks
Dipankar


Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.

2009-08-11 Thread Dipankar Sarma
On Mon, Aug 10, 2009 at 05:22:17PM -0700, Pallipadi, Venkatesh wrote:
 Also, I don't think using just the ACPI/BIOS supplied states in _CST is
 the right thing to do for offline. _CST is meant for C-states, and the BIOS
 may not include some C-state in _CST if the system manufacturer thinks that
 the latency is too high for the state to be used as a C-state. That
 limitation applies to C-states because the cpu is expected to come out of
 a C-state often to execute code, handle interrupts etc. But that
 restriction does not apply to offline/online, which is not as frequent
 as C-state entry and already has a big latency with the startup IPI and a
 whole baggage of CPU setup code. So, using the BIOS _CST info for the CPU
 offline state doesn't seem right.
 
 Maybe having (to pick a number) 3 possible offline states for all
 platforms - one for the halt equivalent, one for the deepest state the
 CPU can handle, and one for the deepest state the platform likes for
 C-states - may make sense. It will keep things simpler in terms of usage
 expectations and possibly reduce the opportunity for misuse?

Yes, I think we should let specific archs advertise a small set
of possible offline states and let the cpu state be set to only one of
those, keeping the platform implementation robust.

Here is a variant of the original proposal from Gautham -

/sys/devices/system/cpu/cpu<number>/available_states

For example, the available states for an Intel platform could be exported as
online dealloc C1 C6

online = fully up
dealloc = offline and de-allocated (as in virtualized environment)
C1 = C1 or C1E halt
C6 = C6 sleep

/sys/devices/system/cpu/cpu<number>/state

Writing any of the available states to this file triggers a transition to that
state, barring some transitions that are disallowed to keep things simple
(e.g. dealloc'd cpus support only the transition to online).

/sys/devices/system/cpu/cpu<number>/online

Backward compatibility - online = 0 changes state to C6 or dealloc depending
on the platform. online = 1 changes state to online.
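
In sysdev terms, the plumbing could look something like the sketch below.
The attribute names come from the proposal above; the state list and the
cpu_set_offline_state() helper are invented here purely for illustration:

#include <linux/kernel.h>	/* sprintf(), ARRAY_SIZE() */
#include <linux/sysdev.h>
#include <linux/string.h>	/* sysfs_streq() */

/* Hypothetical arch hook that performs the actual transition. */
extern int cpu_set_offline_state(unsigned int cpu, int state);

static const char * const cpu_states[] = { "online", "dealloc", "C1", "C6" };

static ssize_t show_available_states(struct sys_device *dev,
				     struct sysdev_attribute *attr, char *buf)
{
	/* A real implementation would filter on what the platform
	 * actually supports. */
	return sprintf(buf, "online dealloc C1 C6\n");
}

static ssize_t store_state(struct sys_device *dev,
			   struct sysdev_attribute *attr,
			   const char *buf, size_t count)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(cpu_states); i++) {
		if (sysfs_streq(buf, cpu_states[i])) {
			int err = cpu_set_offline_state(dev->id, i);
			return err ? err : count;
		}
	}
	return -EINVAL;
}

static SYSDEV_ATTR(available_states, 0444, show_available_states, NULL);
static SYSDEV_ATTR(state, 0644, NULL, store_state);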

Would this make sense ?

Thanks
Dipankar



