[PATCH] NUMA topology support for powernv
This patch adds support for NUMA topology on powernv platforms running OPAL firmware. It checks the platform type at run time and sets the affinity form correctly so that the NUMA topology can be discovered correctly.

Signed-off-by: Dipankar Sarma dipan...@in.ibm.com
---
 arch/powerpc/mm/numa.c | 24 +---
 1 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0bfb90c..58f292f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -315,7 +315,10 @@ static int __init find_min_common_depth(void)
 	struct device_node *root;
 	const char *vec5;
 
-	root = of_find_node_by_path("/rtas");
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		root = of_find_node_by_path("/ibm,opal");
+	else
+		root = of_find_node_by_path("/rtas");
 	if (!root)
 		root = of_find_node_by_path("/");
@@ -344,12 +347,19 @@ static int __init find_min_common_depth(void)
 #define VEC5_AFFINITY_BYTE	5
 #define VEC5_AFFINITY		0x80
 
-	chosen = of_find_node_by_path("/chosen");
-	if (chosen) {
-		vec5 = of_get_property(chosen, "ibm,architecture-vec-5", NULL);
-		if (vec5 && (vec5[VEC5_AFFINITY_BYTE] & VEC5_AFFINITY)) {
-			dbg("Using form 1 affinity\n");
-			form1_affinity = 1;
+	if (firmware_has_feature(FW_FEATURE_OPAL))
+		form1_affinity = 1;
+	else {
+		chosen = of_find_node_by_path("/chosen");
+		if (chosen) {
+			vec5 = of_get_property(chosen,
+					"ibm,architecture-vec-5", NULL);
+			if (vec5 && (vec5[VEC5_AFFINITY_BYTE] &
+						VEC5_AFFINITY)) {
+				dbg("Using form 1 affinity\n");
+				form1_affinity = 1;
+			}
 		}
 	}
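The node chosen here is the one that carries the ibm,associativity-reference-points property read by find_min_common_depth(). As a standalone sketch of the selection logic (the helper name numa_reference_root is hypothetical; firmware_has_feature(), FW_FEATURE_OPAL and the device-tree paths are taken from the patch above):

#include <linux/of.h>
#include <asm/firmware.h>

/*
 * Pick the device-tree node carrying the NUMA reference points:
 * /ibm,opal on OPAL (powernv) firmware, /rtas on PAPR (pseries)
 * firmware, falling back to the device-tree root.
 */
static struct device_node *numa_reference_root(void)
{
	struct device_node *root;

	if (firmware_has_feature(FW_FEATURE_OPAL))
		root = of_find_node_by_path("/ibm,opal");
	else
		root = of_find_node_by_path("/rtas");
	if (!root)
		root = of_find_node_by_path("/");
	return root;
}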
Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY
On Mon, Mar 08, 2010 at 12:20:06PM +0530, Vaidyanathan Srinivasan wrote:
> * Dipankar Sarma dipan...@in.ibm.com [2010-03-06 00:48:11]:
> > Shouldn't we create this only for supported platforms?
> 
> Hi Dipankar,
> 
> Yes, we will need a check like
> firmware_has_feature(FW_FEATURE_BEST_ENERGY) to avoid creating the
> sysfs files on unsupported platforms. I will add that check in the
> next iteration.

Also, given that this module isn't likely to provide anything on older
platforms, it should get loaded only on newer platforms.

Thanks
Dipankar
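The gating being discussed would look roughly like this (a sketch; FW_FEATURE_BEST_ENERGY is the feature bit proposed in the thread, and pseries_energy_init() is the init routine from the patch under review):

#include <linux/module.h>
#include <asm/firmware.h>

static int __init pseries_energy_init(void)
{
	/*
	 * Bail out early on platforms that do not implement the
	 * H_BEST_ENERGY hcall, so no sysfs files are created there
	 * and the module is effectively a no-op.
	 */
	if (!firmware_has_feature(FW_FEATURE_BEST_ENERGY))
		return -ENODEV;

	/* ... create the sysfs attributes as in the patch ... */
	return 0;
}
module_init(pseries_energy_init);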
Re: [RFC] powerpc: add support for new hcall H_BEST_ENERGY
On Wed, Mar 03, 2010 at 11:48:22PM +0530, Vaidyanathan Srinivasan wrote:
>  static void __init cpu_init_thread_core_maps(int tpc)
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index c667f0f..b3dd108 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -33,6 +33,16 @@ config PSERIES_MSI
>  	depends on PCI_MSI && EEH
>  	default y
>  
> +config PSERIES_ENERGY
> +	tristate "pseries energy management capabilities driver"
> +	depends on PPC_PSERIES
> +	default y
> +	help
> +	  Provides interface to platform energy management capabilities
> +	  on supported PSERIES platforms.
> +	  Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
> +	  and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
> +
>  config SCANLOG
>  	tristate "Scanlog dump interface"
>  	depends on RTAS_PROC && PPC_PSERIES
> 
> [...]
> 
> +static int __init pseries_energy_init(void)
> +{
> +	int cpu, err;
> +	struct sys_device *cpu_sys_dev;
> +
> +	/* Create the sysfs files */
> +	err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> +				&attr_cpu_activate_hint_list.attr);
> +	if (!err)
> +		err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> +				&attr_cpu_deactivate_hint_list.attr);
> +
> +	for_each_possible_cpu(cpu) {
> +		cpu_sys_dev = get_cpu_sysdev(cpu);
> +		err = sysfs_create_file(&cpu_sys_dev->kobj,
> +				&attr_percpu_activate_hint.attr);
> +		if (err)
> +			break;
> +		err = sysfs_create_file(&cpu_sys_dev->kobj,
> +				&attr_percpu_deactivate_hint.attr);
> +		if (err)
> +			break;
> +	}
> +	return err;
> +}

Shouldn't we create this only for supported platforms?

Thanks
Dipankar
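One detail worth flagging in the quoted init routine: if a per-cpu sysfs_create_file() fails midway, the files created for earlier cpus are left behind. A rollback pattern along these lines would keep the error path clean (a sketch against the patch's own attribute names, using the sysdev API of that era):

#include <linux/cpu.h>
#include <linux/sysdev.h>
#include <linux/sysfs.h>

/* Attributes as defined in the patch under review. */
extern struct sysdev_attribute attr_percpu_activate_hint;
extern struct sysdev_attribute attr_percpu_deactivate_hint;

static int __init create_percpu_energy_files(void)
{
	struct sys_device *cpu_sys_dev;
	int cpu, i, err;

	for_each_possible_cpu(cpu) {
		cpu_sys_dev = get_cpu_sysdev(cpu);
		err = sysfs_create_file(&cpu_sys_dev->kobj,
					&attr_percpu_activate_hint.attr);
		if (err)
			goto undo;
		err = sysfs_create_file(&cpu_sys_dev->kobj,
					&attr_percpu_deactivate_hint.attr);
		if (err) {
			sysfs_remove_file(&cpu_sys_dev->kobj,
					  &attr_percpu_activate_hint.attr);
			goto undo;
		}
	}
	return 0;
undo:
	/* Remove the files of every cpu set up before the failure. */
	for_each_possible_cpu(i) {
		if (i == cpu)
			break;
		cpu_sys_dev = get_cpu_sysdev(i);
		sysfs_remove_file(&cpu_sys_dev->kobj,
				  &attr_percpu_activate_hint.attr);
		sysfs_remove_file(&cpu_sys_dev->kobj,
				  &attr_percpu_deactivate_hint.attr);
	}
	return err;
}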
Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > This patchset contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states.
> > They are:
> > 
> > 	online:   The processor is online.
> > 	offline:  This is the default behaviour when the cpu is offlined.
> > 	inactive: This cedes the vCPU to the hypervisor with a cede latency.
> > 
> > Any feedback on the patchset will be immensely valuable.
> 
> I still think its a layering violation... its the hypervisor manager
> that should be bothered in what state an off-lined cpu is in.

The problem is that the hypervisor manager cannot figure out what sort
of latency the guest OS can tolerate in a given situation. It wouldn't
know from what context the guest OS has ceded the vcpu. It has to have
some sort of hint, which is what the guest OS provides.

Thanks
Dipankar
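The cede-with-hint mechanism under discussion looks roughly like this on pSeries (a sketch modelled on the patchset's extended-cede helper; cede_latency_hint is a byte in the lppaca shared with the hypervisor, and H_CEDE is the standard PAPR hypercall):

#include <asm/hvcall.h>
#include <asm/lppaca.h>

/*
 * Cede this vCPU to the hypervisor with a latency hint: 0 is the
 * ordinary idle cede; larger values tell the hypervisor the guest
 * can tolerate a longer wakeup latency, allowing a deeper
 * power-saving state for the underlying cpu.
 */
static long extended_cede_processor(unsigned long cede_latency_hint)
{
	long rc;
	u8 old_hint = get_lppaca()->cede_latency_hint;

	get_lppaca()->cede_latency_hint = cede_latency_hint;
	rc = plpar_hcall_norets(H_CEDE);
	get_lppaca()->cede_latency_hint = old_hint;

	return rc;
}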
Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
On Wed, Sep 16, 2009 at 05:32:51PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-16 at 20:58 +0530, Dipankar Sarma wrote:
> > On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
> > > I still think its a layering violation... its the hypervisor
> > > manager that should be bothered in what state an off-lined cpu
> > > is in.
> > 
> > The problem is that the hypervisor manager cannot figure out what
> > sort of latency the guest OS can tolerate in a given situation. It
> > wouldn't know from what context the guest OS has ceded the vcpu. It
> > has to have some sort of hint, which is what the guest OS provides.
> 
> I'm missing something here, hot-unplug is a slow path and should not
> ever be latency critical..?

You aren't, I did :) No, for this specific case, latency isn't an
issue. The issue is: how do we cede unused vcpus to the hypervisor for
better energy management? Yes, it can be done by a hypervisor manager
telling the kernel to offline and make a bunch of vcpus inactive. It
does have to choose offline (release the vcpu) vs. inactive (cede, but
guaranteed to come back if needed). The problem is that long ago we
exported a lot of hotplug stuff to userspace through the sysfs
interface, and we cannot do something inside the kernel without keeping
the sysfs stuff consistent. This seems like a sane way to do that
without undoing all the virtual cpu hotplug infrastructure in the
supporting archs.

Thanks
Dipankar
Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
On Wed, Sep 16, 2009 at 07:22:35PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-16 at 22:33 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra a.p.zijls...@chello.nl [2009-09-16 18:35:16]:
> > > Now if you were to try and online the cpus in the guest, it'd fail
> > > because the cpus aren't backed anymore, and the hot-plug simply
> > > times-out and fails. And we're still good, right?
> > 
> > The requirements differ here. If we had offlined 2 vCPUs for the
> > purpose of system reconfiguration, the expected behavior of the
> > offline interface will work right. However, the proposed cede
> > interface is needed when we want them to temporarily go away but
> > still come back when we do an online. We want the online to always
> > succeed since the backing physical resources are not relinquished.
> > The proposed interface facilitates offline without relinquishing
> > the physical resources assigned to LPARs.
> 
> Then make that the platform default and leave the lpar management to
> whatever pokes at the lpar?

That could have worked - however, lpar management already uses the same
sysfs interface to poke. The current semantics make the lpar vcpu
deconfig state the platform default, assuming that it will be used for
lpar management. The only clean way to do this without breaking lpar
management stuff is to add another state - inactive - and retain
backward compatibility.

Thanks
Dipankar
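In sysfs terms, retaining backward compatibility while adding the new state could be wired up roughly as follows (a sketch; cpu_cede() is a hypothetical stand-in for the patchset's cede path, while cpu_up()/cpu_down() are the existing hotplug entry points):

#include <linux/cpu.h>
#include <linux/string.h>
#include <linux/sysdev.h>

extern int cpu_cede(unsigned int cpu);	/* hypothetical */

static ssize_t store_hotplug_state(struct sys_device *dev,
				   struct sysdev_attribute *attr,
				   const char *buf, size_t count)
{
	unsigned int cpu = dev->id;
	int ret;

	if (sysfs_streq(buf, "online"))
		ret = cpu_up(cpu);	/* back to full use */
	else if (sysfs_streq(buf, "offline"))
		ret = cpu_down(cpu);	/* release vcpu: lpar deconfig */
	else if (sysfs_streq(buf, "inactive"))
		ret = cpu_cede(cpu);	/* cede, keep the cpu share */
	else
		ret = -EINVAL;

	return ret ? ret : count;
}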
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
On Sun, Aug 16, 2009 at 11:53:22PM +0200, Peter Zijlstra wrote:
> On Mon, 2009-08-17 at 01:14 +0530, Balbir Singh wrote:
> > Agreed, I've tried to come up with a little ASCII art to depict your
> > scenarios graphically:
> > 
> > [ASCII diagram: an OS box with two paths into the hypervisor -
> > "don't need (offline)": the hypervisor may reuse the CPU for
> > something else, visible to users as a resource change; and "needed,
> > but can cede": the hypervisor does not reuse the ceded CPU and
> > gives it back to the OS when needed, not visible to users as no
> > resource binding changed.]
> 
> I still don't get it... _why_ should this be exposed in the guest
> kernel? Why not let the hypervisor manage a guest's offline cpus in a
> way it sees fit?

For most parts, we do. The guest kernel doesn't manage the offline CPU
state. That is typically done by the hypervisor. However, an offline
operation as defined now always results in a VM resize on some
hypervisor systems (like pseries) - it would be convenient to have a
non-resize offline operation which lets the guest cede the cpu to the
hypervisor with the hint that the VM shouldn't be resized and that the
guest needs the guarantee to get the cpu back at any time. The
hypervisor can do whatever it wants with the ceded CPU, including
putting it in a low power state, but it may not change the physical
cpu shares of the VM. The pseries hypervisor, for example, clearly
distinguishes between the two - the rtas "stop-self" call to resize
the VM vs. the H_CEDE hypercall with a hint. What I am suggesting is
that we allow this with an extension to the existing interfaces,
because it makes sense to allow a sort of hibernation of the cpus
without changing any configuration of the VMs.

Thanks
Dipankar
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
On Mon, Aug 17, 2009 at 09:15:57AM +0200, Peter Zijlstra wrote:
> On Mon, 2009-08-17 at 11:54 +0530, Dipankar Sarma wrote:
> > For most parts, we do. The guest kernel doesn't manage the offline
> > CPU state. That is typically done by the hypervisor. [...]
> 
> From my POV the thing you call cede is the only sane thing to do for
> a guest. Let the hypervisor management interface deal with resizing
> guests if and when that's needed.

That is more or less how it currently works - at least for the pseries
hypervisor. The current offline operation with the rtas "stop-self"
call I mentioned earlier is initiated by the hypervisor management
interfaces/tools on pseries systems. This wakes up a guest system tool
that echoes 1 to the offline file, resulting in the configuration
change. The OS involvement is necessary to evacuate tasks/interrupts
from the released CPU. We don't really want to initiate this from
guests.

> Thing is, you don't want a guest to be able to influence the amount
> of cpu shares attributed to it. You want that in explicit control of
> whomever manages the hypervisor.

Agreed. But given a fixed cpu share by the hypervisor management tools,
we would like to be able to cede cpus to the hypervisor leaving the
hypervisor configuration intact. This, we don't have at the moment, and
we want to just extend the current interface for this.

Thanks
Dipankar
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
On Mon, Aug 17, 2009 at 01:28:15PM +0530, Dipankar Sarma wrote:
> That is more or less how it currently works - at least for the
> pseries hypervisor. The current offline operation with the rtas
> "stop-self" call I mentioned earlier is initiated by the hypervisor
> management interfaces/tools on pseries systems. This wakes up a guest
> system tool that echoes 1 to the offline file, resulting in the
> configuration change.

Should have said - echoes 0 to the online file. You don't necessarily
need this in guest Linux as long as there is a way for the hypervisor
tools to internally move Linux tasks/interrupts off a vcpu - an async
event handled by the kernel, for example. But I think it is too late
for that - the interface has long been exported.

> The OS involvement is necessary to evacuate tasks/interrupts from the
> released CPU. We don't really want to initiate this from guests.
> 
> > Thing is, you don't want a guest to be able to influence the amount
> > of cpu shares attributed to it. You want that in explicit control
> > of whomever manages the hypervisor.
> 
> Agreed. But given a fixed cpu share by the hypervisor management
> tools, we would like to be able to cede cpus to the hypervisor
> leaving the hypervisor configuration intact. This, we don't have at
> the moment, and we want to just extend the current interface for
> this.

Thanks
Dipankar
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
On Wed, Aug 12, 2009 at 01:58:06PM +0200, Pavel Machek wrote:
> Hi!
> 
> > May be having (to pick a number) 3 possible offline states for all
> > platforms - one for the halt equivalent, one for the deepest
> > possible that the CPU can handle, and one for the deepest possible
> > that the platform likes for C-states - may make sense. It will keep
> > things simpler in terms of usage expectations and possibly reduce
> > the misuse opportunity.
> 
> Maybe just going to the deepest offline state automatically is the
> easiest option?

In a native system, I think we should let the platform-specific code
export what makes sense. That may be just the lowest possible state
only. Or maybe more than one.

In a virtualized system, we would want to do at least the following -

1. An offline configuration state where the hypervisor can take the
cpu back and allocate it to another VM.

2. A low-power state where the guest indicates it doesn't need the CPU
(and it can be put in a low power state) but doesn't want to give up
its allocated cpu share. IOW, no visible configuration changes.

So, in any case, we would probably want more than one state; a sketch
of the distinction follows below.

> cpu hotplug/unplug should be rare-enough operation that the latencies
> do not really matter, right?

As of now, from the platform perspective, I don't think low-power state
latencies matter in this code path. The only thing that might have any
relevance is electrical power-off technology and whether there may be
any h/w specific issues restricting its use. I don't know that there
will be any, but it may not be a good idea to prevent platforms from
requiring the use of multiple offline states.

Thanks
Dipankar
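The two states can be made concrete with something as small as the following (hypothetical names, purely illustrative):

/* Per-cpu offline disposition for a virtualized guest. */
enum cpu_offline_state {
	CPU_OFFLINE_DEALLOC,	/* release the vcpu; the hypervisor may
				 * reassign it to another VM - a visible
				 * configuration change */
	CPU_OFFLINE_CEDE,	/* cede for power savings; the VM keeps
				 * its cpu share and can reclaim the
				 * vcpu at any time */
};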
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
On Wed, Aug 12, 2009 at 08:45:18PM -0400, Len Brown wrote:
> On Thu, 13 Aug 2009, Dipankar Sarma wrote:
> > In a native system, I think we should let the platform-specific
> > code export what makes sense. That may be just the lowest possible
> > state only. Or maybe more than one.
> 
> For x86, it is 1 state.

Native x86, yes. For virtualized systems, that may not be the case,
depending on how the hypervisor behaves.

> > In a virtualized system, we would want to do at least the
> > following -
> 
> Are you talking about Linux as a para-virtualized guest here?
> 
> > 1. An offline configuration state where the hypervisor can take the
> > cpu back and allocate it to another VM.
> 
> The hypervisor gets control no matter what idle state the guest
> enters, yes?

The hypervisor may get control, but what it does may depend on what the
guest OS wished/hinted for - a config change, or just shutting down the
unused cpu for a while if possible.

> > 2. A low-power state where the guest indicates it doesn't need the
> > CPU (and it can be put in a low power state) but doesn't want to
> > give up its allocated cpu share. IOW, no visible configuration
> > changes.
> > 
> > So, in any case, we would probably want more than one state.
> 
> How are #1 and #2 different when the hypervisor gets control in all
> idle states? I assert that they are the same, and thus 1 state will
> suffice.

It depends on the hypervisor implementation. On the pseries (powerpc)
hypervisor, for example, they are different. By offlining a vcpu (and
in turn shutting down a cpu), you actually create a configuration
change in the VM that is visible to other systems management tools,
which may not be what the system administrator wanted. Ideally, we
would like to distinguish between these two states. Hope that suffices
as an example.

Thanks
Dipankar
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
On Mon, Aug 10, 2009 at 05:22:17PM -0700, Pallipadi, Venkatesh wrote:
> Also, I don't think using just the ACPI/BIOS-supplied states in _CST
> is the right thing to do for offline. _CST is meant for C-states, and
> the BIOS may not include some C-state in _CST if the system
> manufacturer thinks that the latency is too high for the state to be
> used as a C-state. That limitation applies to C-states, as the cpu is
> expected to come out of a C-state often and execute code, handle
> interrupts, etc. But that restriction does not apply to
> offline/online, which is not as frequent as C-state entry and already
> has a big latency with the startup IPI and a whole baggage of CPU
> setup code. So, using the BIOS _CST info for the CPU offline state
> doesn't seem right.
> 
> May be having (to pick a number) 3 possible offline states for all
> platforms - one for the halt equivalent, one for the deepest possible
> that the CPU can handle, and one for the deepest possible that the
> platform likes for C-states - may make sense. It will keep things
> simpler in terms of usage expectations and possibly reduce the misuse
> opportunity.

Yes, I think we should let specific archs advertise a small set of
possible offline states and let the cpu state be set to one of those
only, keeping the platform implementation robust. Here is a variant of
the original proposal from Gautham -

/sys/devices/system/cpu/cpu<number>/available_states

For example, the available states for an Intel platform could be
exported as:

	online dealloc C1 C6

	online  = fully up
	dealloc = offline and de-allocated (as in a virtualized
	          environment)
	C1      = C1 or C1E halt
	C6      = C6 sleep

/sys/devices/system/cpu/cpu<number>/state

Writing any of the available states to this file triggers a transition
to that state, barring some transitions that are disallowed to keep
things simple (e.g. dealloc'd cpus support only the transition to
online).

/sys/devices/system/cpu/cpu<number>/online

Backward compatibility - writing 0 to online changes the state to C6
or dealloc depending on the platform; writing 1 changes the state to
online.

Would this make sense?

Thanks
Dipankar
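The available_states file in this proposal might be implemented along these lines (a sketch using the sysdev attribute API of the time; the state table is the hypothetical Intel example above, and real platforms would supply their own):

#include <linux/kernel.h>
#include <linux/sysdev.h>

static const char *const cpu_offline_states[] = {
	"online", "dealloc", "C1", "C6",
};

static ssize_t show_available_states(struct sys_device *dev,
				     struct sysdev_attribute *attr,
				     char *buf)
{
	ssize_t len = 0;
	int i;

	/* Emit the space-separated list of states this cpu supports. */
	for (i = 0; i < ARRAY_SIZE(cpu_offline_states); i++)
		len += sprintf(buf + len, "%s%s", i ? " " : "",
			       cpu_offline_states[i]);
	len += sprintf(buf + len, "\n");
	return len;
}
static SYSDEV_ATTR(available_states, 0444, show_available_states, NULL);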