[PATCH v2 1/2] cpu: Offline state Framework.
Provide an interface by which the system administrator can decide what state should the CPU go to when it is offlined. To query the hotplug states, on needs to perform a read on the sysfs tunable: /sys/devices/system/cpu/cpunumber/available_hotplug_states To query or set the current state for a particular CPU, one needs to use the sysfs interface: /sys/devices/system/cpu/cpunumber/current_state This patch implements the architecture independent bits of the cpu-offline-state framework. Architectures which want to expose the multiple offline-states to the userspace are expected to write a driver which can register with this framework. Such a driver should: - Implement the callbacks defined in the structure struct cpu_offline_driver which can be called into by this framework when the corresponding sysfs interfaces are read or written into. - Ensure that the following operation puts the CPU in the same state as it did in the absence of the driver. echo 0 /sys/devices/system/cpu/cpunumber/online This framework also serializes the writes to the current_state with respect to with the writes to the online sysfs tunable. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- drivers/base/cpu.c | 176 --- include/linux/cpu.h | 30 + 2 files changed, 197 insertions(+), 9 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index e62a4cc..73efc55 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -20,7 +20,161 @@ EXPORT_SYMBOL(cpu_sysdev_class); static DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices); +struct sys_device *get_cpu_sysdev(unsigned cpu) +{ + if (cpu nr_cpu_ids cpu_possible(cpu)) + return per_cpu(cpu_sys_devices, cpu); + else + return NULL; +} +EXPORT_SYMBOL_GPL(get_cpu_sysdev); + + #ifdef CONFIG_HOTPLUG_CPU + +struct cpu_offline_driver *cpu_offline_driver; +static DEFINE_MUTEX(cpu_offline_driver_lock); + +ssize_t show_available_states(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int cpu_num = cpu-sysdev.id; + ssize_t ret; + + mutex_lock(cpu_offline_driver_lock); + if (!cpu_offline_driver) { + ret = -EEXIST; + goto out_unlock; + } + + ret = cpu_offline_driver-read_available_states(cpu_num, buf); + +out_unlock: + mutex_unlock(cpu_offline_driver_lock); + + return ret; + +} + +ssize_t show_current_state(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int cpu_num = cpu-sysdev.id; + ssize_t ret = 0; + + mutex_lock(cpu_offline_driver_lock); + if (!cpu_offline_driver) { + ret = -EEXIST; + goto out_unlock; + } + + ret = cpu_offline_driver-read_current_state(cpu_num, buf); + +out_unlock: + mutex_unlock(cpu_offline_driver_lock); + + return ret; +} + +ssize_t store_current_state(struct sys_device *dev, + struct sysdev_attribute *attr, + const char *buf, size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int cpu_num = cpu-sysdev.id; + ssize_t ret = count; + + mutex_lock(cpu_offline_driver_lock); + if (!cpu_offline_driver) { + ret = -EEXIST; + goto out_unlock; + } + + ret = cpu_offline_driver-write_current_state(cpu_num, buf); + +out_unlock: + mutex_unlock(cpu_offline_driver_lock); + + if (ret = 0) + ret = count; + return ret; +} + +static SYSDEV_ATTR(available_hotplug_states, 0444, show_available_states, + NULL); +static SYSDEV_ATTR(current_state, 0644, show_current_state, + store_current_state); + +/* Should be called with cpu_offline_driver_lock held */ +void cpu_offline_driver_add_cpu(struct sys_device *cpu_sys_dev) +{ + if (!cpu_offline_driver || !cpu_sys_dev) + return; + + sysdev_create_file(cpu_sys_dev, attr_available_hotplug_states); + sysdev_create_file(cpu_sys_dev, attr_current_state); +} + +/* Should be called with cpu_offline_driver_lock held */ +void cpu_offline_driver_remove_cpu(struct sys_device *cpu_sys_dev) +{ + if (!cpu_offline_driver || !cpu_sys_dev) + return; + + sysdev_remove_file(cpu_sys_dev, attr_available_hotplug_states); + sysdev_remove_file(cpu_sys_dev, attr_current_state); + +} + +int register_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver) +{ + int ret = 0; + int cpu; + mutex_lock(cpu_offline_driver_lock); + + if (cpu_offline_driver != NULL) { + ret = -EEXIST
[PATCH v2 0/2] cpu: pseries: Offline state framework.
Hi, This is the version 2 of the patch series to provide a cpu-offline framework that enables the administrators choose the state the offline CPU must be put into when multiple such states are exposed by the underlying architecture. Version 1 of the Patch can be found here: http://lkml.org/lkml/2009/8/6/236 The patch-series exposes the following sysfs tunables to allow the system-adminstrator to choose the state of a CPU: To query the available hotplug states, one needs to read the sysfs tunable: /sys/devices/system/cpu/cpunumber/available_hotplug_states To query or set the current state, on needs to read/write the sysfs tunable: /sys/devices/system/cpu/cpunumber/current_states The patchset ensures that the writes to the current_state sysfs file are serialized against the writes to the online file. This patchset also contains the offline state driver implemented for pSeries. For pSeries, we define three available_hotplug_states. They are: online: The processor is online. deallocate: This is the the default behaviour when the cpu is offlined even in the absense of this driver. The CPU would call make an rtas_stop_self() call and hand over the CPU back to the resource pool, thereby effectively deallocating that vCPU from the LPAR. NOTE: This would result in a configuration change to the LPAR which is visible to the outside world. deactivate: This cedes the vCPU to the hypervisor which in turn can put the vCPU time to the best use. NOTE: This option DOES NOT result in a configuration change and the vCPU would be still entitled to the LPAR to which it earlier belong to. Awaiting your feedback. --- Gautham R Shenoy (2): cpu: Implement cpu-offline-state driver for pSeries. cpu: Offline state Framework. arch/powerpc/platforms/pseries/Makefile |2 arch/powerpc/platforms/pseries/hotplug-cpu.c| 76 +- arch/powerpc/platforms/pseries/offline_driver.c | 161 + arch/powerpc/platforms/pseries/offline_driver.h | 20 +++ arch/powerpc/platforms/pseries/smp.c| 17 ++ drivers/base/cpu.c | 176 ++- include/linux/cpu.h | 30 7 files changed, 465 insertions(+), 17 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_driver.c create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 2/2] cpu: Implement cpu-offline-state driver for pSeries.
This patch implements the callbacks to handle the reads/writes into the sysfs interfaces /sys/devices/system/cpu/cpunumber/available_hotplug_states and /sys/devices/system/cpu/cpunumber/current_state Currently, the patch defines two states which the processor can go to when it is offlined. They are - deallocate: This is the the default behaviour when the cpu is offlined even in the absense of this driver. The CPU would call make an rtas_stop_self() call and hand over the CPU back to the resource pool, thereby effectively deallocating that vCPU from the LPAR. This would result in a configuration change to the LPAR which is visible to the outside world. - deactivate: This cedes the vCPU to the hypervisor which in turn can put the vCPU time to the best use. This option does not result in a configuration change and the vCPU would be still entitled to the LPAR to which it earlier belong to. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/Makefile |2 arch/powerpc/platforms/pseries/hotplug-cpu.c| 76 ++- arch/powerpc/platforms/pseries/offline_driver.c | 161 +++ arch/powerpc/platforms/pseries/offline_driver.h | 20 +++ arch/powerpc/platforms/pseries/smp.c| 17 ++ 5 files changed, 268 insertions(+), 8 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_driver.c create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 790c0b8..3a569c7 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -17,7 +17,7 @@ obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o -obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o +obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o offline_driver.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index a20ead8..6880a1d 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -30,6 +30,7 @@ #include asm/pSeries_reconfig.h #include xics.h #include plpar_wrappers.h +#include offline_driver.h /* This version can't take the spinlock, because it never returns */ static struct rtas_args rtas_stop_self_args = { @@ -54,13 +55,62 @@ static void rtas_stop_self(void) panic(Alas, I survived.\n); } +static void cede_on_offline(void) +{ + unsigned int cpu = smp_processor_id(); + unsigned int hwcpu = hard_smp_processor_id(); + + get_lppaca()-idle = 1; + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 1; + + printk(KERN_INFO cpu %u (hwid %u) ceding for offline with hint %d\n, + cpu, hwcpu, cede_latency_hint); + while (get_preferred_offline_state(cpu) != CPU_STATE_ONLINE) { + cede_processor(); + printk(KERN_INFO cpu %u (hwid %u) returned from cede.\n, + cpu, hwcpu); + } + + printk(KERN_INFO cpu %u (hwid %u) got prodded to go online\n, + cpu, hwcpu); + + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 0; + get_lppaca()-idle = 0; + unregister_slb_shadow(hwcpu, __pa(get_slb_shadow())); + + /* +* NOTE: Calling start_secondary() here for now to start +* a new context. +* +* However, need to do it cleanly by resetting the stack +* pointer. +*/ + start_secondary(); +} + static void pseries_mach_cpu_die(void) { + unsigned int cpu = smp_processor_id(); + local_irq_disable(); idle_task_exit(); xics_teardown_cpu(); - unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow())); - rtas_stop_self(); + + if (get_preferred_offline_state(cpu) == CPU_DEALLOCATE) { + + set_cpu_current_state(cpu, CPU_DEALLOCATE); + unregister_slb_shadow(hard_smp_processor_id(), + __pa(get_slb_shadow())); + rtas_stop_self(); + goto out_bug; + } else if (get_preferred_offline_state(cpu) == CPU_DEACTIVATE) { + set_cpu_current_state(cpu, CPU_DEACTIVATE); + cede_on_offline(); + } + +out_bug: /* Should never get here... */ BUG(); for(;;); @@ -112,11 +162,23 @@ static void pseries_cpu_die(unsigned int cpu) int cpu_status; unsigned int pcpu = get_hard_smp_processor_id(cpu); - for (tries = 0; tries 25; tries++) { - cpu_status = query_cpu_stopped(pcpu); - if (cpu_status == 0 || cpu_status == -1) - break
[PATCH v3 0/3] cpu: pseries: Cpu offline states framework
Hi, RFC not for inclusion This is the version 3 of the patch series to provide a cpu-offline framework that enables the administrators choose the state of a CPU when it is offlined, when multiple such states are exposed by the underlying architecture. Changes from Version 2:(can be found here: http://lkml.org/lkml/2009/8/28/102) - Addressed Andrew Morton's review comments regarding names of global variables, handling of error conditions and documentation of the interfaces. - Implemented a patch to provide helper functions to set the cede latency specifier value in the VPA indicating latency expectation of the guest OS when the vcpu is ceded from a subsequent H_CEDE hypercall. Hypervisor may use this for better energy savings. - Renamed of the cpu-hotplug states. deallocate is renamed as offline and deactivate is renamed as inactive. The patch-series exposes the following sysfs tunables to allow the system-adminstrator to choose the state of a CPU: To query the available hotplug states, one needs to read the sysfs tunable: /sys/devices/system/cpu/cpunumber/available_hotplug_states To query or set the current state, on needs to read/write the sysfs tunable: /sys/devices/system/cpu/cpunumber/current_hotplug_state The patchset ensures that the writes to the current_hotplug_state sysfs file are serialized against the writes to the online file. This patchset contains the offline state driver implemented for pSeries. For pSeries, we define three available_hotplug_states. They are: online: The processor is online. offline: This is the the default behaviour when the cpu is offlined even in the absense of this driver. The CPU would call make an rtas_stop_self() call and hand over the CPU back to the resource pool, thereby effectively deallocating that vCPU from the LPAR. NOTE: This would result in a configuration change to the LPAR which is visible to the outside world. inactive: This cedes the vCPU to the hypervisor with a cede latency specifier value 2. NOTE: This option does not result in a configuration change and the vCPU would be still entitled to the LPAR to which it earlier belong to. Any feedback on the patchset will be immensely valuable. --- Arun R Bharadwaj (1): pSeries: cede latency specifier helper function. Gautham R Shenoy (2): cpu: Implement cpu-offline-state callbacks for pSeries. cpu: Offline state Framework. Documentation/cpu-hotplug.txt | 22 +++ arch/powerpc/include/asm/lppaca.h |9 + arch/powerpc/platforms/pseries/Makefile |2 arch/powerpc/platforms/pseries/hotplug-cpu.c| 88 ++- arch/powerpc/platforms/pseries/offline_driver.c | 148 +++ arch/powerpc/platforms/pseries/offline_driver.h | 20 +++ arch/powerpc/platforms/pseries/plpar_wrappers.h | 17 ++ arch/powerpc/platforms/pseries/smp.c| 17 ++ arch/powerpc/xmon/xmon.c|3 drivers/base/cpu.c | 181 ++- include/linux/cpu.h | 10 + 11 files changed, 498 insertions(+), 19 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_driver.c create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/3] pSeries: cede latency specifier helper function.
From: Arun R Bharadwaj a...@linux.vnet.ibm.com This patch provides helper functions to set the cede latency specifier value in the VPA indicating the latency expectation of the guest OS to inform the hypervisor's choice of the platform dependent energy saving mode chosen for the processor when unused during the subsequent H_CEDE hypercall. Signed-off-by: Arun R Bharadwaj a...@linux.vnet.ibm.com Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/include/asm/lppaca.h |9 - arch/powerpc/platforms/pseries/plpar_wrappers.h | 17 + arch/powerpc/xmon/xmon.c|3 ++- 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h index f78f65c..aaa0066 100644 --- a/arch/powerpc/include/asm/lppaca.h +++ b/arch/powerpc/include/asm/lppaca.h @@ -100,7 +100,14 @@ struct lppaca { // Used to pass parms from the OS to PLIC for SetAsrAndRfid u64 saved_gpr3; // Saved GPR3 x20-x27 u64 saved_gpr4; // Saved GPR4 x28-x2F - u64 saved_gpr5; // Saved GPR5 x30-x37 + union { + u64 saved_gpr5; // Saved GPR5 x30-x37 + struct { + u8 cede_latency_hint; // x30 + u8 reserved[7];// x31-x36 + } fields; + } gpr5_dword; + u8 dtl_enable_mask;// Dispatch Trace Log mask x38-x38 u8 donate_dedicated_cpu; // Donate dedicated CPU cycles x39-x39 diff --git a/arch/powerpc/platforms/pseries/plpar_wrappers.h b/arch/powerpc/platforms/pseries/plpar_wrappers.h index a24a6b2..1174d4b 100644 --- a/arch/powerpc/platforms/pseries/plpar_wrappers.h +++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h @@ -9,11 +9,28 @@ static inline long poll_pending(void) return plpar_hcall_norets(H_POLL_PENDING); } +static inline u8 get_cede_latency_hint(void) +{ + return get_lppaca()-gpr5_dword.fields.cede_latency_hint; +} + +static inline void set_cede_latency_hint(u8 latency_hint) +{ + get_lppaca()-gpr5_dword.fields.cede_latency_hint = latency_hint; +} + static inline long cede_processor(void) { return plpar_hcall_norets(H_CEDE); } +static inline long extended_cede_processor(u8 latency_hint) +{ + set_cede_latency_hint(latency_hint); + cede_processor(); + +} + static inline long vpa_call(unsigned long flags, unsigned long cpu, unsigned long vpa) { diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index e1f33a8..a2089cd 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -1613,7 +1613,8 @@ static void super_regs(void) ptrLpPaca-saved_srr0, ptrLpPaca-saved_srr1); printf(Saved Gpr3=%.16lx Saved Gpr4=%.16lx \n, ptrLpPaca-saved_gpr3, ptrLpPaca-saved_gpr4); - printf(Saved Gpr5=%.16lx \n, ptrLpPaca-saved_gpr5); + printf(Saved Gpr5=%.16lx \n, + ptrLpPaca-gpr5_dword.saved_gpr5); } #endif ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 2/3] cpu: Offline state Framework.
Provide an interface by which the system administrator can decide what state should the CPU go to when it is offlined. To query the hotplug states, on needs to perform a read on: /sys/devices/system/cpu/cpunumber/available_hotplug_states To query or set the current state for a particular CPU, one needs to use the sysfs interface /sys/devices/system/cpu/cpunumber/current_hotplug_state This patch implements the architecture independent bits of the cpu-offline-state framework. The architecture specific bits are expected to register the actual code which implements the callbacks when the above mentioned sysfs interfaces are read or written into. Thus the values provided by reading available_offline_states vary with the architecture. The patch provides serialization for writes to the current_hotplug_state with respect to with the writes to the online sysfs tunable. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- Documentation/cpu-hotplug.txt | 22 + drivers/base/cpu.c| 181 +++-- include/linux/cpu.h | 10 ++ 3 files changed, 204 insertions(+), 9 deletions(-) diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt index 9d620c1..dcec06d 100644 --- a/Documentation/cpu-hotplug.txt +++ b/Documentation/cpu-hotplug.txt @@ -115,6 +115,28 @@ Just remember the critical section cannot call any function that can sleep or schedule this process away. The preempt_disable() will work as long as stop_machine_run() is used to take a cpu down. +CPU-offline states +-- +On architectures which allow the more than one valid state when +the CPU goes offline, the system administrator can decide +the state the CPU needs to go to when it is offlined. + +If the architecture has implemented a cpu-offline driver exposing these +multiple offline states, the system administrator can use the following sysfs +interfaces to query the available hotplug states and also query and set the +current hotplug state for a given cpu: + +To query the hotplug states, on needs to perform a read on: +/sys/devices/system/cpu/cpunumber/available_hotplug_states + +To query or set the current state for a particular CPU, +one needs to use the sysfs interface + +/sys/devices/system/cpu/cpunumber/current_hotplug_state + +Writes to the online sysfs files are serialized against the writes to the +current_hotplug_state file. + CPU Hotplug - Frequently Asked Questions. Q: How to enable my kernel to support CPU hotplug? diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index e62a4cc..00c38be 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -20,7 +20,166 @@ EXPORT_SYMBOL(cpu_sysdev_class); static DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices); +struct sys_device *get_cpu_sysdev(unsigned cpu) +{ + if (cpu nr_cpu_ids cpu_possible(cpu)) + return per_cpu(cpu_sys_devices, cpu); + else + return NULL; +} +EXPORT_SYMBOL_GPL(get_cpu_sysdev); + + #ifdef CONFIG_HOTPLUG_CPU + +struct cpu_offline_driver *cpu_offline_driver; +static DEFINE_MUTEX(cpu_offline_driver_lock); + +ssize_t show_available_hotplug_states(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int cpu_num = cpu-sysdev.id; + ssize_t ret; + + mutex_lock(cpu_offline_driver_lock); + if (!cpu_offline_driver) { + ret = -EEXIST; + goto out_unlock; + } + + ret = cpu_offline_driver-read_available_states(cpu_num, buf); + +out_unlock: + mutex_unlock(cpu_offline_driver_lock); + + return ret; + +} + +ssize_t show_current_hotplug_state(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int cpu_num = cpu-sysdev.id; + ssize_t ret = 0; + + mutex_lock(cpu_offline_driver_lock); + if (!cpu_offline_driver) { + ret = -EEXIST; + goto out_unlock; + } + + ret = cpu_offline_driver-read_current_state(cpu_num, buf); + +out_unlock: + mutex_unlock(cpu_offline_driver_lock); + + return ret; +} + +ssize_t store_current_hotplug_state(struct sys_device *dev, + struct sysdev_attribute *attr, + const char *buf, size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int cpu_num = cpu-sysdev.id; + ssize_t ret = count; + + mutex_lock(cpu_offline_driver_lock); + if (!cpu_offline_driver) { + ret = -EEXIST; + goto out_unlock; + } + + ret = cpu_offline_driver-write_current_state(cpu_num, buf); + +out_unlock: + mutex_unlock(cpu_offline_driver_lock); + + if (ret = 0) + ret = count; + return ret; +} + +static SYSDEV_ATTR
[PATCH v3 3/3] cpu: Implement cpu-offline-state callbacks for pSeries.
This patch implements the callbacks to handle the reads/writes into the sysfs interfaces /sys/devices/system/cpu/cpunumber/available_hotplug_states and /sys/devices/system/cpu/cpunumber/current_hotplug_state Currently, the patch defines two states which the processor can go to when it is offlined. They are - offline: The current behaviour when the cpu is offlined. The CPU would call make an rtas_stop_self() call and hand over the CPU back to the resource pool, thereby effectively deallocating that vCPU from the LPAR. - inactive: This is expected to cede the processor to the hypervisor with a latency hint specifier value. Hypervisor may use this hint to provide better energy savings. In this state, the control of the vCPU will continue to be with the LPAR. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/Makefile |2 arch/powerpc/platforms/pseries/hotplug-cpu.c| 88 +- arch/powerpc/platforms/pseries/offline_driver.c | 148 +++ arch/powerpc/platforms/pseries/offline_driver.h | 20 +++ arch/powerpc/platforms/pseries/smp.c| 17 +++ 5 files changed, 267 insertions(+), 8 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_driver.c create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 790c0b8..3a569c7 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -17,7 +17,7 @@ obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_PCI) += pci.o pci_dlpar.o obj-$(CONFIG_PSERIES_MSI) += msi.o -obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o +obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o offline_driver.o obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index a20ead8..1e06bb1 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -30,6 +30,7 @@ #include asm/pSeries_reconfig.h #include xics.h #include plpar_wrappers.h +#include offline_driver.h /* This version can't take the spinlock, because it never returns */ static struct rtas_args rtas_stop_self_args = { @@ -54,13 +55,74 @@ static void rtas_stop_self(void) panic(Alas, I survived.\n); } +static void cede_on_offline(u8 cede_latency_hint) +{ + unsigned int cpu = smp_processor_id(); + unsigned int hwcpu = hard_smp_processor_id(); + u8 old_cede_latency_hint; + + old_cede_latency_hint = get_cede_latency_hint(); + get_lppaca()-idle = 1; + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 1; + + printk(KERN_INFO cpu %u (hwid %u) ceding for offline with hint %d\n, + cpu, hwcpu, cede_latency_hint); + while (get_preferred_offline_state(cpu) != CPU_STATE_ONLINE) { + extended_cede_processor(cede_latency_hint); + printk(KERN_INFO cpu %u (hwid %u) returned from cede.\n, + cpu, hwcpu); + printk(KERN_INFO + Decrementer value = %x Timebase value = %llx\n, + get_dec(), get_tb()); + } + + printk(KERN_INFO cpu %u (hwid %u) got prodded to go online\n, + cpu, hwcpu); + + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 0; + get_lppaca()-idle = 0; + + /* Reset the cede_latency specifier value */ + set_cede_latency_hint(old_cede_latency_hint); + + unregister_slb_shadow(hwcpu, __pa(get_slb_shadow())); + + /* +* NOTE: Calling start_secondary() here for now to start +* a new context. +* +* However, need to do it cleanly by resetting the stack +* pointer. +*/ + start_secondary(); +} + static void pseries_mach_cpu_die(void) { + unsigned int cpu = smp_processor_id(); + u8 cede_latency_hint = 0; + local_irq_disable(); idle_task_exit(); xics_teardown_cpu(); - unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow())); - rtas_stop_self(); + + if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) { + + set_cpu_current_state(cpu, CPU_STATE_OFFLINE); + unregister_slb_shadow(hard_smp_processor_id(), + __pa(get_slb_shadow())); + rtas_stop_self(); + goto out_bug; + } else if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { + set_cpu_current_state(cpu, CPU_STATE_INACTIVE); + cede_latency_hint = 2; + cede_on_offline(cede_latency_hint); + + } + +out_bug: /* Should never get here
[PATCH v4 0/4] pseries: Add cede support for cpu-offline
Hi, This is version 4 of patch series that provides a framework to choose the state a pseries CPU must be put to when it is offlined. Previous versions can be found here: Version 3: http://lkml.org/lkml/2009/9/15/164 Version 2: http://lkml.org/lkml/2009/8/28/102 Version 1: http://lkml.org/lkml/2009/8/6/236 This patch series differs considerably from the previous iterations in that: - It is based on top of Nathan Fontenot's pseries kernel handling of dynamic logical partitioning v2 patches found here: http://lkml.org/lkml/2009/9/18/173 - It does away with the sysfs tunables available_hotplug_states and currrent_hotplug_state, thereby refraining from exposing any platform specific option to the userspace. - It provides a framework that discovers the support for ceding the vcpu to the hypervisor with a latency specifier value. When such a support is present, it would put the offlined CPUs into a ceded state while setting the cede latency specifier value in the VPA indicating the latency expectation of the guest OS. Hypervisor may use this for better energy savings. - On platforms which don't have support for ceding with a latency specifier hint, it puts the offlined CPUs into the rtas_stop_self() state. - Currently, Nathan's patches have provided sysfs interfaces /sys/devices/system/cpu/{probe, release} to dynamically allocate and deallocate resources respectively. The current process of allocating and deallocating CPUs would require two steps, namely: - Write to the probe/release file - Write to the online file. This patch-series combines these two operations such that when the user writes to the probe, the CPUs would not only get allocated to the partition but also onlined. Similarly, when the user writes to release, the CPUs would be offlined before they are deallocated. Thus, the user will now have to play with only one set of sysfs tunables for performing deallocate and the sysfs tunable online can be used for the purpose of deactivating the CPU while still letting the LPAR own it. - It provides serializations for handling the writes the the online file against the writes to probe/release pair so that at any given instant, the user could either be deallocating the CPUs or deactivating it, but not both. With this approach, I hope that the objection regarding the layering violation that was raised by Peter for previous approaches is addressed, while also providing clean interfaces for the userspace tools to distinguish between the deallocation and deactivation without exposing any platform specific details. The patchset has been tested on the available pseries platforms and it works as per the expectations. Comments are very much appreciated. --- Gautham R Shenoy (4): pseries: Serialize cpu hotplug operations during deactivate Vs deallocate pseries: Add code to online/offline CPUs of a DLPAR node. pSeries: Add hooks to put the CPU into an appropriate offline state pSeries: extended_cede_processor() helper function. arch/powerpc/include/asm/lppaca.h |9 + arch/powerpc/platforms/pseries/dlpar.c | 129 ++- arch/powerpc/platforms/pseries/hotplug-cpu.c| 157 ++- arch/powerpc/platforms/pseries/offline_states.h | 18 +++ arch/powerpc/platforms/pseries/plpar_wrappers.h | 22 +++ arch/powerpc/platforms/pseries/smp.c| 19 +++ arch/powerpc/xmon/xmon.c|3 drivers/base/cpu.c |2 include/linux/cpu.h | 13 ++ 9 files changed, 356 insertions(+), 16 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_states.h -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4 2/4] pSeries: Add hooks to put the CPU into an appropriate offline state
When a CPU is offlined on POWER currently, we call rtas_stop_self() and hand the CPU back to the resource pool. This path is used for DLPAR which will cause a change in the LPAR configuration which will be visible outside. This patch changes the default state a CPU is put into when it is offlined. On platforms which support ceding the processor to the hypervisor with latency hint specifier value, during a cpu offline operation, instead of calling rtas_stop_self(), we cede the vCPU to the hypervisor while passing a latency hint specifier value. The Hypervisor can use this hint to provide better energy savings. Also, during the offline operation, the control of the vCPU remains with the LPAR as oppposed to returning it to the resource pool. The patch achieves this by creating an infrastructure to set the preferred_offline_state() which can be either - CPU_STATE_OFFLINE: which is the current behaviour of calling rtas_stop_self() - CPU_STATE_INACTIVE: which cedes the vCPU to the hypervisor with the latency hint specifier. The codepath which wants to perform a DLPAR operation can set the preferred_offline_state() of a CPU to CPU_STATE_OFFLINE before invoking cpu_down(). Signed-off-by: Gautham R Shenoy e...@in.ibm.com Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- arch/powerpc/platforms/pseries/hotplug-cpu.c| 157 ++- arch/powerpc/platforms/pseries/offline_states.h | 18 +++ arch/powerpc/platforms/pseries/smp.c| 19 +++ 3 files changed, 185 insertions(+), 9 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_states.h diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index ebff6d9..8a47db4 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -30,6 +30,7 @@ #include asm/pSeries_reconfig.h #include xics.h #include plpar_wrappers.h +#include offline_states.h /* This version can't take the spinlock, because it never returns */ static struct rtas_args rtas_stop_self_args = { @@ -39,6 +40,37 @@ static struct rtas_args rtas_stop_self_args = { .rets = rtas_stop_self_args.args[0], }; +static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = + CPU_STATE_OFFLINE; +static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE; + +static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; + +enum cpu_state_vals get_cpu_current_state(int cpu) +{ + return per_cpu(current_state, cpu); +} + +void set_cpu_current_state(int cpu, enum cpu_state_vals state) +{ + per_cpu(current_state, cpu) = state; +} + +enum cpu_state_vals get_preferred_offline_state(int cpu) +{ + return per_cpu(preferred_offline_state, cpu); +} + +void set_preferred_offline_state(int cpu, enum cpu_state_vals state) +{ + per_cpu(preferred_offline_state, cpu) = state; +} + +void set_default_offline_state(int cpu) +{ + per_cpu(preferred_offline_state, cpu) = default_offline_state; +} + static void rtas_stop_self(void) { struct rtas_args *args = rtas_stop_self_args; @@ -56,11 +88,66 @@ static void rtas_stop_self(void) static void pseries_mach_cpu_die(void) { + unsigned int cpu = smp_processor_id(); + unsigned int hwcpu = hard_smp_processor_id(); + u8 cede_latency_hint = 0; + local_irq_disable(); idle_task_exit(); xics_teardown_cpu(); - unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow())); - rtas_stop_self(); + + if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { + set_cpu_current_state(cpu, CPU_STATE_INACTIVE); + cede_latency_hint = 2; + + get_lppaca()-idle = 1; + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 1; + + printk(KERN_INFO + cpu %u (hwid %u) ceding for offline with hint %d\n, + cpu, hwcpu, cede_latency_hint); + while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) { + extended_cede_processor(cede_latency_hint); + printk(KERN_INFO cpu %u (hwid %u) returned from cede.\n, + cpu, hwcpu); + printk(KERN_INFO + Decrementer value = %x Timebase value = %llx\n, + get_dec(), get_tb()); + } + + printk(KERN_INFO cpu %u (hwid %u) got prodded to go online\n, + cpu, hwcpu); + + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 0; + get_lppaca()-idle = 0; + } + + if (get_preferred_offline_state(cpu) == CPU_STATE_ONLINE) { + unregister_slb_shadow(hwcpu, __pa(get_slb_shadow
[PATCH v4 3/4] pseries: Add code to online/offline CPUs of a DLPAR node.
Currently the cpu-allocation/deallocation on pSeries is a two step process from the Userspace. - Set the indicators and update the device tree by writing to the sysfs tunable probe during allocation and release during deallocation. - Online / Offline the CPUs of the allocated/would_be_deallocated node by writing to the sysfs tunable online. This patch adds kernel code to online/offline the CPUs soon_after/just_before they have been allocated/would_be_deallocated. This way, the userspace tool that performs DLPAR operations would only have to deal with one set of sysfs tunables namely probe and release. Signed-off-by: Gautham R Shenoy e...@in.ibm.com Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- arch/powerpc/platforms/pseries/dlpar.c | 103 1 files changed, 102 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 84d156b..9752386 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -20,7 +20,7 @@ #include linux/sysdev.h #include linux/sysfs.h #include linux/cpu.h - +#include offline_states.h #include asm/prom.h #include asm/machdep.h @@ -357,6 +357,98 @@ int remove_device_tree_nodes(struct device_node *dn) return rc; } +int online_node_cpus(struct device_node *dn) +{ + int rc = 0; + unsigned int cpu; + int len, nthreads, i; + const u32 *intserv; + + intserv = of_get_property(dn, ibm,ppc-interrupt-server#s, len); + if (!intserv) + return -EINVAL; + + nthreads = len / sizeof(u32); + + cpu_maps_update_begin(); + for (i = 0; i nthreads; i++) { + for_each_present_cpu(cpu) { + if (get_hard_smp_processor_id(cpu) != intserv[i]) + continue; + BUG_ON(get_cpu_current_state(cpu) + != CPU_STATE_OFFLINE); + cpu_maps_update_done(); + rc = cpu_up(cpu); + if (rc) + goto out; + cpu_maps_update_begin(); + + break; + } + if (cpu == num_possible_cpus()) + printk(KERN_WARNING Could not find cpu to online + with physical id 0x%x\n, intserv[i]); + } + cpu_maps_update_done(); + +out: + return rc; + +} + +int offline_node_cpus(struct device_node *dn) +{ + int rc = 0; + unsigned int cpu; + int len, nthreads, i; + const u32 *intserv; + + intserv = of_get_property(dn, ibm,ppc-interrupt-server#s, len); + if (!intserv) + return -EINVAL; + + nthreads = len / sizeof(u32); + + cpu_maps_update_begin(); + for (i = 0; i nthreads; i++) { + for_each_present_cpu(cpu) { + if (get_hard_smp_processor_id(cpu) != intserv[i]) + continue; + + if (get_cpu_current_state(cpu) == CPU_STATE_OFFLINE) + break; + + if (get_cpu_current_state(cpu) == CPU_STATE_ONLINE) { + cpu_maps_update_done(); + rc = cpu_down(cpu); + if (rc) + goto out; + cpu_maps_update_begin(); + break; + + } + + /* +* The cpu is in CPU_STATE_INACTIVE. +* Upgrade it's state to CPU_STATE_OFFLINE. +*/ + set_preferred_offline_state(cpu, CPU_STATE_OFFLINE); + BUG_ON(plpar_hcall_norets(H_PROD, intserv[i]) + != H_SUCCESS); + __cpu_die(cpu); + break; + } + if (cpu == num_possible_cpus()) + printk(KERN_WARNING Could not find cpu to offline + with physical id 0x%x\n, intserv[i]); + } + cpu_maps_update_done(); + +out: + return rc; + +} + #define DR_ENTITY_SENSE9003 #define DR_ENTITY_PRESENT 1 #define DR_ENTITY_UNUSABLE 2 @@ -591,6 +683,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, if (rc) release_drc(drc_index); + rc = online_node_cpus(dn); + return rc ? rc : count; } @@ -611,6 +705,11 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, return -EINVAL; } + rc = offline_node_cpus(dn); + + if (rc) + goto out; + rc = release_drc(*drc_index); if (rc
[PATCH v4 4/4] pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
Currently the cpu-allocation/deallocation process comprises of two steps: - Set the indicators and to update the device tree with DLPAR node information. - Online/offline the allocated/deallocated CPU. This is achieved by writing to the sysfs tunables probe during allocation and release during deallocation. At the sametime, the userspace can independently online/offline the CPUs of the system using the sysfs tunable online. It is quite possible that when a userspace tool offlines a CPU for the purpose of deallocation and is in the process of updating the device tree, some other userspace tool could bring the CPU back online by writing to the online sysfs tunable thereby causing the deallocate process to fail. The solution to this is to serialize writes to the probe/release sysfs tunable with the writes to the online sysfs tunable. This patch employs a mutex to provide this serialization, which is a no-op on all architectures except PPC_PSERIES Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/dlpar.c | 26 ++ drivers/base/cpu.c |2 ++ include/linux/cpu.h| 13 + 3 files changed, 37 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 9752386..fc261e6 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -644,6 +644,18 @@ static ssize_t memory_release_store(struct class *class, const char *buf, return rc ? -1 : count; } +static DEFINE_MUTEX(pseries_cpu_hotplug_mutex); + +void cpu_hotplug_driver_lock() +{ + mutex_lock(pseries_cpu_hotplug_mutex); +} + +void cpu_hotplug_driver_unlock() +{ + mutex_unlock(pseries_cpu_hotplug_mutex); +} + static ssize_t cpu_probe_store(struct class *class, const char *buf, size_t count) { @@ -656,14 +668,15 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, if (rc) return -EINVAL; + cpu_hotplug_driver_lock(); rc = acquire_drc(drc_index); if (rc) - return rc; + goto out; dn = configure_connector(drc_index); if (!dn) { release_drc(drc_index); - return rc; + goto out; } /* fixup dn name */ @@ -672,7 +685,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, if (!cpu_name) { free_cc_nodes(dn); release_drc(drc_index); - return -ENOMEM; + rc = -ENOMEM; + goto out; } sprintf(cpu_name, /cpus/%s, dn-full_name); @@ -684,6 +698,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, release_drc(drc_index); rc = online_node_cpus(dn); +out: + cpu_hotplug_driver_unlock(); return rc ? rc : count; } @@ -705,6 +721,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, return -EINVAL; } + cpu_hotplug_driver_lock(); rc = offline_node_cpus(dn); if (rc) @@ -713,7 +730,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, rc = release_drc(*drc_index); if (rc) { of_node_put(dn); - return rc; + goto out; } rc = remove_device_tree_nodes(dn); @@ -723,6 +740,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, of_node_put(dn); out: + cpu_hotplug_driver_unlock(); return rc ? rc : count; } diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index e62a4cc..07c3f05 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -35,6 +35,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut struct cpu *cpu = container_of(dev, struct cpu, sysdev); ssize_t ret; + cpu_hotplug_driver_lock(); switch (buf[0]) { case '0': ret = cpu_down(cpu-sysdev.id); @@ -49,6 +50,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut default: ret = -EINVAL; } + cpu_hotplug_driver_unlock(); if (ret = 0) ret = count; diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 4753619..b0ad4e1 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -115,6 +115,19 @@ extern void put_online_cpus(void); #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) int cpu_down(unsigned int cpu); +#ifdef CONFIG_PPC_PSERIES +extern void cpu_hotplug_driver_lock(void); +extern void cpu_hotplug_driver_unlock(void); +#else +static inline void cpu_hotplug_driver_lock(void) +{ +} + +static inline void cpu_hotplug_driver_unlock(void) +{ +} +#endif + #else
[PATCH v5 1/4] pSeries: extended_cede_processor() helper function.
This patch provides an extended_cede_processor() helper function which takes the cede latency hint as an argument. This hint is to be passed on to the hypervisor to cede to the corresponding state on platforms which support it. Signed-off-by: Gautham R Shenoy e...@in.ibm.com Signed-off-by: Arun R Bharadwaj a...@linux.vnet.ibm.com --- arch/powerpc/include/asm/lppaca.h |9 - arch/powerpc/platforms/pseries/plpar_wrappers.h | 22 ++ arch/powerpc/xmon/xmon.c|3 ++- 3 files changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h index f78f65c..14b592d 100644 --- a/arch/powerpc/include/asm/lppaca.h +++ b/arch/powerpc/include/asm/lppaca.h @@ -100,7 +100,14 @@ struct lppaca { // Used to pass parms from the OS to PLIC for SetAsrAndRfid u64 saved_gpr3; // Saved GPR3 x20-x27 u64 saved_gpr4; // Saved GPR4 x28-x2F - u64 saved_gpr5; // Saved GPR5 x30-x37 + union { + u64 saved_gpr5; /* Saved GPR5 x30-x37 */ + struct { + u8 cede_latency_hint; /* x30 */ + u8 reserved[7];/* x31-x36 */ + } fields; + } gpr5_dword; + u8 dtl_enable_mask;// Dispatch Trace Log mask x38-x38 u8 donate_dedicated_cpu; // Donate dedicated CPU cycles x39-x39 diff --git a/arch/powerpc/platforms/pseries/plpar_wrappers.h b/arch/powerpc/platforms/pseries/plpar_wrappers.h index a24a6b2..0603c91 100644 --- a/arch/powerpc/platforms/pseries/plpar_wrappers.h +++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h @@ -9,11 +9,33 @@ static inline long poll_pending(void) return plpar_hcall_norets(H_POLL_PENDING); } +static inline u8 get_cede_latency_hint(void) +{ + return get_lppaca()-gpr5_dword.fields.cede_latency_hint; +} + +static inline void set_cede_latency_hint(u8 latency_hint) +{ + get_lppaca()-gpr5_dword.fields.cede_latency_hint = latency_hint; +} + static inline long cede_processor(void) { return plpar_hcall_norets(H_CEDE); } +static inline long extended_cede_processor(unsigned long latency_hint) +{ + long rc; + u8 old_latency_hint = get_cede_latency_hint(); + + set_cede_latency_hint(latency_hint); + rc = cede_processor(); + set_cede_latency_hint(old_latency_hint); + + return rc; +} + static inline long vpa_call(unsigned long flags, unsigned long cpu, unsigned long vpa) { diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index c6f0a71..57124cf 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -1623,7 +1623,8 @@ static void super_regs(void) ptrLpPaca-saved_srr0, ptrLpPaca-saved_srr1); printf(Saved Gpr3=%.16lx Saved Gpr4=%.16lx \n, ptrLpPaca-saved_gpr3, ptrLpPaca-saved_gpr4); - printf(Saved Gpr5=%.16lx \n, ptrLpPaca-saved_gpr5); + printf(Saved Gpr5=%.16lx \n, + ptrLpPaca-gpr5_dword.saved_gpr5); } #endif ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v5 2/4] pSeries: Add hooks to put the CPU into an appropriate offline state
When a CPU is offlined on POWER currently, we call rtas_stop_self() and hand the CPU back to the resource pool. This path is used for DLPAR which will cause a change in the LPAR configuration which will be visible outside. This patch changes the default state a CPU is put into when it is offlined. On platforms which support ceding the processor to the hypervisor with latency hint specifier value, during a cpu offline operation, instead of calling rtas_stop_self(), we cede the vCPU to the hypervisor while passing a latency hint specifier value. The Hypervisor can use this hint to provide better energy savings. Also, during the offline operation, the control of the vCPU remains with the LPAR as oppposed to returning it to the resource pool. The patch achieves this by creating an infrastructure to set the preferred_offline_state() which can be either - CPU_STATE_OFFLINE: which is the current behaviour of calling rtas_stop_self() - CPU_STATE_INACTIVE: which cedes the vCPU to the hypervisor with the latency hint specifier. The codepath which wants to perform a DLPAR operation can set the preferred_offline_state() of a CPU to CPU_STATE_OFFLINE before invoking cpu_down(). The patch also provides a boot-time command line argument to disable/enable CPU_STATE_INACTIVE. Signed-off-by: Gautham R Shenoy e...@in.ibm.com Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- Documentation/cpu-hotplug.txt |6 + arch/powerpc/platforms/pseries/hotplug-cpu.c| 182 ++- arch/powerpc/platforms/pseries/offline_states.h | 18 ++ arch/powerpc/platforms/pseries/smp.c| 19 ++ 4 files changed, 216 insertions(+), 9 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_states.h diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt index 9d620c1..4d4a644 100644 --- a/Documentation/cpu-hotplug.txt +++ b/Documentation/cpu-hotplug.txt @@ -49,6 +49,12 @@ maxcpus=nRestrict boot time cpus to n. Say if you have 4 cpus, using additional_cpus=n (*) Use this to limit hotpluggable cpus. This option sets cpu_possible_map = cpu_present_map + additional_cpus +cede_offline={off,on} Use this option to disable/enable putting offlined + processors to an extended H_CEDE state on + supported pseries platforms. + If nothing is specified, + cede_offline is set to on. + (*) Option valid only for following architectures - ia64 diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index ebff6d9..6ea4698 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -30,6 +30,7 @@ #include asm/pSeries_reconfig.h #include xics.h #include plpar_wrappers.h +#include offline_states.h /* This version can't take the spinlock, because it never returns */ static struct rtas_args rtas_stop_self_args = { @@ -39,6 +40,55 @@ static struct rtas_args rtas_stop_self_args = { .rets = rtas_stop_self_args.args[0], }; +static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = + CPU_STATE_OFFLINE; +static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE; + +static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; + +static int cede_offline_enabled __read_mostly = 1; + +/* + * Enable/disable cede_offline when available. + */ +static int __init setup_cede_offline(char *str) +{ + if (!strcmp(str, off)) + cede_offline_enabled = 0; + else if (!strcmp(str, on)) + cede_offline_enabled = 1; + else + return 0; + return 1; +} + +__setup(cede_offline=, setup_cede_offline); + +enum cpu_state_vals get_cpu_current_state(int cpu) +{ + return per_cpu(current_state, cpu); +} + +void set_cpu_current_state(int cpu, enum cpu_state_vals state) +{ + per_cpu(current_state, cpu) = state; +} + +enum cpu_state_vals get_preferred_offline_state(int cpu) +{ + return per_cpu(preferred_offline_state, cpu); +} + +void set_preferred_offline_state(int cpu, enum cpu_state_vals state) +{ + per_cpu(preferred_offline_state, cpu) = state; +} + +void set_default_offline_state(int cpu) +{ + per_cpu(preferred_offline_state, cpu) = default_offline_state; +} + static void rtas_stop_self(void) { struct rtas_args *args = rtas_stop_self_args; @@ -56,11 +106,61 @@ static void rtas_stop_self(void) static void pseries_mach_cpu_die(void) { + unsigned int cpu = smp_processor_id(); + unsigned int hwcpu = hard_smp_processor_id(); + u8 cede_latency_hint = 0; + local_irq_disable(); idle_task_exit(); xics_teardown_cpu(); - unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow
[PATCH v5 4/4] pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
Currently the cpu-allocation/deallocation process comprises of two steps: - Set the indicators and to update the device tree with DLPAR node information. - Online/offline the allocated/deallocated CPU. This is achieved by writing to the sysfs tunables probe during allocation and release during deallocation. At the sametime, the userspace can independently online/offline the CPUs of the system using the sysfs tunable online. It is quite possible that when a userspace tool offlines a CPU for the purpose of deallocation and is in the process of updating the device tree, some other userspace tool could bring the CPU back online by writing to the online sysfs tunable thereby causing the deallocate process to fail. The solution to this is to serialize writes to the probe/release sysfs tunable with the writes to the online sysfs tunable. This patch employs a mutex to provide this serialization, which is a no-op on all architectures except PPC_PSERIES Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/dlpar.c | 28 +++- drivers/base/cpu.c |2 ++ include/linux/cpu.h| 13 + 3 files changed, 38 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 8e04a69..b6fc6ab 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -501,6 +501,18 @@ int release_drc(u32 drc_index) } #ifdef CONFIG_HOTPLUG_CPU +static DEFINE_MUTEX(pseries_cpu_hotplug_mutex); + +void cpu_hotplug_driver_lock() +{ + mutex_lock(pseries_cpu_hotplug_mutex); +} + +void cpu_hotplug_driver_unlock() +{ + mutex_unlock(pseries_cpu_hotplug_mutex); +} + static ssize_t cpu_probe_store(struct class *class, const char *buf, size_t count) { @@ -509,18 +521,19 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, char *cpu_name; int rc; + cpu_hotplug_driver_lock(); rc = strict_strtoul(buf, 0, drc_index); if (rc) - return -EINVAL; + goto out; rc = acquire_drc(drc_index); if (rc) - return -EINVAL; + goto out; dn = configure_connector(drc_index); if (!dn) { release_drc(drc_index); - return -EINVAL; + goto out; } /* fixup dn name */ @@ -529,7 +542,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, if (!cpu_name) { free_cc_nodes(dn); release_drc(drc_index); - return -ENOMEM; + rc = -ENOMEM; + goto out; } sprintf(cpu_name, /cpus/%s, dn-full_name); @@ -541,6 +555,8 @@ static ssize_t cpu_probe_store(struct class *class, const char *buf, release_drc(drc_index); rc = online_node_cpus(dn); +out: + cpu_hotplug_driver_unlock(); return rc ? -EINVAL : count; } @@ -562,6 +578,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, return -EINVAL; } + cpu_hotplug_driver_lock(); rc = offline_node_cpus(dn); if (rc) @@ -570,7 +587,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, rc = release_drc(*drc_index); if (rc) { of_node_put(dn); - return -EINVAL; + goto out; } rc = remove_device_tree_nodes(dn); @@ -579,6 +596,7 @@ static ssize_t cpu_release_store(struct class *class, const char *buf, of_node_put(dn); out: + cpu_hotplug_driver_unlock(); return rc ? -EINVAL : count; } diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index e62a4cc..07c3f05 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -35,6 +35,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut struct cpu *cpu = container_of(dev, struct cpu, sysdev); ssize_t ret; + cpu_hotplug_driver_lock(); switch (buf[0]) { case '0': ret = cpu_down(cpu-sysdev.id); @@ -49,6 +50,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut default: ret = -EINVAL; } + cpu_hotplug_driver_unlock(); if (ret = 0) ret = count; diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 4753619..b0ad4e1 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -115,6 +115,19 @@ extern void put_online_cpus(void); #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb) int cpu_down(unsigned int cpu); +#ifdef CONFIG_PPC_PSERIES +extern void cpu_hotplug_driver_lock(void); +extern void cpu_hotplug_driver_unlock(void); +#else +static inline void cpu_hotplug_driver_lock(void
Re: [PATCH v5 0/4] pseries: Add cede support for cpu-offline
On Fri, Oct 30, 2009 at 10:52:43AM +0530, Gautham R Shenoy wrote: Hi, Hi Peter, Did you get a chance to look at this new design ? This is version 5 of patch series that provides a framework to choose the state a pseries CPU must be put to when it is offlined. Previous versions can be found here: Version 4: http://lkml.org/lkml/2009/10/9/59 Version 3: http://lkml.org/lkml/2009/9/15/164 Version 2: http://lkml.org/lkml/2009/8/28/102 Version 1: http://lkml.org/lkml/2009/8/6/236 Changes from the previous version include: - Rebased against Nathan Fontenot's latest pseries kernel handling of dynamic logical paritioning v4 patches found here: http://lkml.org/lkml/2009/10/21/98 - Added boot-time option to disable putting the offlined vcpus into an extended H_CEDE state. - Addressed Ben's comments regarding the if-else sequencing in pseries_mach_cpu_die(). - Addition of comments for pseries_cpu_die() to distinguish it from pseries_mach_cpu_die() Also, - This approach addresses Peter Z's objections regarding layering violations. The user simply offlines the cpu and doesn't worry about what state the CPU should be put into. That part is automatically handled by the kernel. - It does not add any additional sysfs interface instead uses the existing sysfs interface to offline CPUs. - On platforms which do not have support for ceding the vcpu with a latency specifier value, the offlining mechanism defaults to the current method of calling rtas_stop_self(). The patchset has been tested on the available pseries platforms and it works as per the expectations. I believe that the patch set is ready for inclusion. --- Gautham R Shenoy (4): pseries: Serialize cpu hotplug operations during deactivate Vs deallocate pseries: Add code to online/offline CPUs of a DLPAR node. pSeries: Add hooks to put the CPU into an appropriate offline state pSeries: extended_cede_processor() helper function. Documentation/cpu-hotplug.txt |6 + arch/powerpc/include/asm/lppaca.h |9 + arch/powerpc/platforms/pseries/dlpar.c | 129 arch/powerpc/platforms/pseries/hotplug-cpu.c| 182 ++- arch/powerpc/platforms/pseries/offline_states.h | 18 ++ arch/powerpc/platforms/pseries/plpar_wrappers.h | 22 +++ arch/powerpc/platforms/pseries/smp.c| 19 ++ arch/powerpc/xmon/xmon.c|3 drivers/base/cpu.c |2 include/linux/cpu.h | 13 ++ 10 files changed, 387 insertions(+), 16 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_states.h -- Thanks and Regards gautham. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- Thanks and Regards gautham ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] pseries: Don't panic when H_PROD fails during cpu-online.
If an online-attempt on a CPU which has been offlined using H_CEDE with an appropriate cede latency hint fails, don't panic. Instead print the error message and let the __cpu_up() code notify the CPU Hotplug framework of the failure, which in turn can notify the other subsystem through CPU_UP_CANCELED. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/smp.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index 8868c01..b488663 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -144,8 +144,8 @@ static void __devinit smp_pSeries_kick_cpu(int nr) hcpuid = get_hard_smp_processor_id(nr); rc = plpar_hcall_norets(H_PROD, hcpuid); if (rc != H_SUCCESS) - panic(Error: Prod to wake up processor %d Ret= %ld\n, - nr, rc); + printk(KERN_ERR Error: Prod to wake up processor %d\ + Ret= %ld\n, nr, rc); } } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/2] pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
And add the __acquires() and __releases() annotations, while at it. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/dlpar.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 12df9e8..67b7a10 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -346,12 +346,14 @@ int dlpar_release_drc(u32 drc_index) static DEFINE_MUTEX(pseries_cpu_hotplug_mutex); -void cpu_hotplug_driver_lock() +void cpu_hotplug_driver_lock(void) +__acquires(pseries_cpu_hotplug_mutex) { mutex_lock(pseries_cpu_hotplug_mutex); } -void cpu_hotplug_driver_unlock() +void cpu_hotplug_driver_unlock(void) +__releases(pseries_cpu_hotplug_mutex) { mutex_unlock(pseries_cpu_hotplug_mutex); } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/cpufreq: Add pr_warn() on OPAL firmware failures
On Sun, Aug 03, 2014 at 02:54:05PM +0530, Vaidyanathan Srinivasan wrote: @@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id) int i; i = powernv_pstate_info.max - pstate_id; - BUG_ON(i = powernv_pstate_info.nr_pstates || i 0); + if (i = powernv_pstate_info.nr_pstates || i 0) { + pr_warn(PState id %d outside of PState table, + reporting nominal id %d instead\n, + pstate_id, powernv_pstate_info.nominal); + i = powernv_pstate_info.max - powernv_pstate_info.nominal; As of now the default loglevel corresponds to KERN_WARNING so this warning should get printed anyway. However, don't you think it would be better if we make it a pr_err( ) since it's a platform error that's causing the pstate_id to go out of bounds ? Otherwise it looks ok. Acked-by: Gautham R. Shenoy e...@linux.vnet.ibm.com + } return powernv_freqs[i].frequency; } -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/4] powernv:cpufreq: Create pstate_id_to_freq() helper
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 28 1 file changed, 28 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 345501e..d0a8dee 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -37,6 +37,15 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); #define POWERNV_MAX_PSTATES256 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; + +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained @@ -106,9 +115,28 @@ static int init_powernv_pstates(void) powernv_freqs[i].driver_data = 0; powernv_freqs[i].frequency = CPUFREQ_TABLE_END; + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i = powernv_pstate_info.nr_pstates || i 0); + WARN_ON(powernv_freqs[i].driver_data != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/4] powernv:cpufreq: Export nominal frequency via sysfs.
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a driver attribute named cpuinfo_nominal_frequency which creates a sysfs read-only file named cpuinfo_nominal_frequency. Export the frequency corresponding to the nominal_pstate through this interface. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index d0a8dee..c59eb26 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -137,8 +137,30 @@ static unsigned int pstate_id_to_freq(int pstate_id) return powernv_freqs[i].frequency; } +/** + * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by + * the firmware + */ +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, %u\n, nominal_freq); +} + + +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = cpuinfo_nominal_freq, + .mode = 0444, + }, + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, + cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/4] powernv:cpufreq: Export nominal and current frequency via sysfs
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi, On IBM POWERNV platforms[1], presently we do not report the current operating frequency of the processor through the sysfs interface cpuinfo_cur_freq since the cpu frequency driver[1] has not implemented the -get(unsigned int cpu) method. Fix this by implementing the -get(unsigned int cpu) method which will report the frequency corresponding to the pstate_id on PMSR on cpu. Also, export the nominal frequency of the processor through the sysfs interface cpuinfo_nominal_freq by defining it as a driver attribute. The patch series depends on the cpu-frequency driver patch-series[1]. [1] http://comments.gmane.org/gmane.linux.ports.ppc.embedded/67905 Thanks and Regards gautham. Summary: == Gautham R. Shenoy (4): powernv:cpufreq: Create pstate_id_to_freq() helper powernv:cpufreq: Export nominal frequency via sysfs. powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper. powernv:cpufreq: Implement the driver-get() method drivers/cpufreq/powernv-cpufreq.c | 115 -- 1 file changed, 110 insertions(+), 5 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 4/4] powernv:cpufreq: Implement the driver-get() method
From: Gautham R. Shenoy e...@linux.vnet.ibm.com The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a -get(unsigned int cpu) method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which corresponds to the required -get() method. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 41 +++ 1 file changed, 41 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index f0dae6f..5f43e4f 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -218,6 +218,46 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/** + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_get_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + long pstate_id; + int *cur_freq, freq; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + pstate_id = pmspr_val; + pstate_id = pstate_id 56; + WARN_ON(pstate_id 0); + freq = pstate_id_to_freq(pstate_id); + pr_debug(cpu %d pmsr %lx pstate_id %ld frequency %d \n, + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/** + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + cpumask_var_t sibling_mask; + + if (!zalloc_cpumask_var(sibling_mask, GFP_KERNEL)) + return -ENOMEM; + + powernv_cpu_to_core_mask(cpu, sibling_mask); + smp_call_function_any(sibling_mask, powernv_get_cpu_freq, ret_freq, 1); + + free_cpumask_var(sibling_mask); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -310,6 +350,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = powernv-cpufreq, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 14/52] powerpc, sysfs: Fix CPU hotplug callback registration
Hello Ben, On Fri, Mar 07, 2014 at 01:57:31PM +1100, Benjamin Herrenschmidt wrote: On Fri, 2014-02-14 at 13:22 +0530, Srivatsa S. Bhat wrote: Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(foobar_cpu_notifier); put_online_cpus(); This patch breaks a good half of my test configs with: /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c: In function 'topology_init': /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:979:2: error: implicit declaration of function 'cpu_notifier_register_begin' [-Werror=implicit-function-declaration] /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:1004:2: error: implicit declaration of function '__register_cpu_notifier' [-Werror=implicit-function-declaration] /home/benh/linux-powerpc-test/arch/powerpc/kernel/sysfs.c:1006:2: error: implicit declaration of function 'cpu_notifier_register_done' [-Werror=implicit-function-declaration] cc1: all warnings being treated as errors make[2]: *** [arch/powerpc/kernel/sysfs.o] Error 1 make[2]: *** Waiting for unfinished jobs This patch depends on [PATCH v2 02/52] CPU hotplug: Provide lockless versions of callback registration functions of the series (Can be found here: https://lkml.org/lkml/2014/2/14/59). This particular patch defines 'cpu_notifier_register_begin', '__register_cpu_notifier', and 'cpu_notifier_register_done' in the cpu-hotplug core. Are you seeing the build breakage with this patch on ? -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 0/6] powernv:cpufreq: Dynamic cpu-frequency scaling
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi, This is the v2 of the consolidated patchset consisting patches for enabling cpufreq on IBM POWERNV platforms along with some enhancements. The v1 of these patches have been previously submitted on linuxppc-dev [1][2]. - This patchset contains code for the platform driver to support CPU frequency scaling on IBM POWERNV platforms. - In addition to the standard control and status files exposed by the cpufreq core, the patchset exposes the nominal frequency through the file named cpuinfo_nominal_freq. The patchset is based against commit c3bebc71c4bcdafa24b506adf0c1de3c1f77e2e0 of the mainline tree. [1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115244.html [2]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/115703.html Gautham R. Shenoy (3): powernv:cpufreq: Create pstate_id_to_freq() helper powernv:cpufreq: Export nominal frequency via sysfs. powernv:cpufreq: Implement the driver-get() method Srivatsa S. Bhat (2): powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper. powernv,cpufreq:Add per-core locking to serialize frequency transitions Vaidyanathan Srinivasan (1): powernv: cpufreq driver for powernv platform arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 397 + 6 files changed, 417 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 5/6] powernv:cpufreq: Export nominal frequency via sysfs.
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 0ecd163..183bbc4 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -142,8 +142,30 @@ static unsigned int pstate_id_to_freq(int pstate_id) return powernv_freqs[i].frequency; } +/** + * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by + * the firmware + */ +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, %u\n, nominal_freq); +} + + +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = cpuinfo_nominal_freq, + .mode = 0444, + }, + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, + cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 1/6] powernv: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Anton Blanchard an...@samba.org Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 277 + 6 files changed, 297 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..1fe12b1 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,7 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ default y config PPC_POWERNV_RTAS diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 4b029c0..4ba1632 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS choice prompt Default CPUFreq governor default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help This option sets which CPUFreq governor shall be loaded at diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..93f8689 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,16 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate CPU frequency scaling for IBM POWERNV platform + depends on PPC_POWERNV + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ab1551f --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,277 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Author: Vaidyanathan Srinivasan svaidy at linux.vnet.ibm.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope
[PATCH v2 3/6] powernv, cpufreq:Add per-core locking to serialize frequency transitions
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Using a global mutex lock would needlessly serialize _all_ frequency transitions in the system (across all cores). So introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 4cad727..4c2e8ca 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -24,8 +24,15 @@ #include linux/of.h #include asm/cputhreads.h -/* FIXME: Make this per-core */ -static DEFINE_MUTEX(freq_switch_mutex); +/* Per-Core locking for frequency transitions */ +static DEFINE_PER_CPU(struct mutex, freq_switch_lock); + +#define lock_core_freq(cpu)\ + mutex_lock(per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); +#define unlock_core_freq(cpu) \ + mutex_unlock(per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); #define POWERNV_MAX_PSTATES256 @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, freqs.new = powernv_freqs[new_index].frequency; freqs.cpu = policy-cpu; - mutex_lock(freq_switch_mutex); + lock_core_freq(policy-cpu); cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE); pr_debug(setting frequency for cpu %d to %d kHz index %d pstate %d, @@ -245,7 +252,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, rc = powernv_set_freq(policy-cpus, new_index); cpufreq_notify_transition(policy, freqs, CPUFREQ_POSTCHANGE); - mutex_unlock(freq_switch_mutex); + unlock_core_freq(policy-cpu); return rc; } @@ -262,7 +269,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = { static int __init powernv_cpufreq_init(void) { - int rc = 0; + int cpu, rc = 0; /* Discover pstates from device tree and init */ @@ -272,6 +279,10 @@ static int __init powernv_cpufreq_init(void) pr_info(powernv-cpufreq disabled\n); return rc; } + /* Init per-core mutex */ + for_each_possible_cpu(cpu) { + mutex_init(per_cpu(freq_switch_lock, cpu)); + } rc = cpufreq_register_driver(powernv_cpufreq_driver); return rc; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 2/6] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Create a helper method that computes the cpumask corresponding to the thread-siblings of a cpu. Use this for initializing the policy-cpus mask for a given cpu. (Original code written by Srivatsa S. Bhat. Gautham moved this to a helper function!) Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ab1551f..4cad727 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -115,6 +115,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = { /* Helper routines */ +/** + * Sets the bits corresponding to the thread-siblings of cpu in its core + * in 'cpus'. + */ +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus) +{ + int base, i; + + base = cpu_first_thread_sibling(cpu); + + for (i = 0; i threads_per_core; i++) { + cpumask_set_cpu(base + i, cpus); + } + + return; +} + /* Access helpers to power mgt SPR */ static inline unsigned long get_pmspr(unsigned long sprn) @@ -180,13 +197,8 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index) static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) { - int base, i; - #ifdef CONFIG_SMP - base = cpu_first_thread_sibling(policy-cpu); - - for (i = 0; i threads_per_core; i++) - cpumask_set_cpu(base + i, policy-cpus); + powernv_cpu_to_core_mask(policy-cpu, policy-cpus); #endif policy-cpuinfo.transition_latency = 25000; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 6/6] powernv:cpufreq: Implement the driver-get() method
From: Gautham R. Shenoy e...@linux.vnet.ibm.com The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a -get(unsigned int cpu) method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which reads the local pstate from the PMSR and returns the corresponding frequency. Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 48 +++ 1 file changed, 48 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 183bbc4..6f3b6e1 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -223,6 +223,53 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val 48) 0xFF; + pstate_id = local_pstate_id; + + freq = pstate_id_to_freq(pstate_id); + pr_debug(cpu %d pmsr %lx pstate_id %d frequency %d \n, + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/* + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + cpumask_var_t sibling_mask; + + if (unlikely(!zalloc_cpumask_var(sibling_mask, GFP_KERNEL))) { + smp_call_function_single(cpu, powernv_read_cpu_freq, + ret_freq, 1); + return ret_freq; + } + + powernv_cpu_to_core_mask(cpu, sibling_mask); + smp_call_function_any(sibling_mask, powernv_read_cpu_freq, + ret_freq, 1); + + free_cpumask_var(sibling_mask); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -309,6 +356,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = powernv-cpufreq, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2 4/6] powernv:cpufreq: Create pstate_id_to_freq() helper
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 4c2e8ca..0ecd163 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained * from the firmware passed via device-tree @@ -112,9 +120,28 @@ static int init_powernv_pstates(void) for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) pr_debug(%d: %d\n, i, powernv_freqs[i].frequency); + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i = powernv_pstate_info.nr_pstates || i 0); + WARN_ON(powernv_pstate_ids[i] != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 1/2] powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Currently, the code in setup-common.c for powerpc assumes that all clock rates are same in a smp system. This value is cached in the variable named ppc_proc_freq and is the value that is reported in /proc/cpuinfo. However on the PowerNV platform, the clock rate is same only across the threads of the same core. Hence the value that is reported in /proc/cpuinfo is incorrect on PowerNV platforms. We need a better way to query and report the correct value of the processor clock in /proc/cpuinfo. The patch achieves this by creating a machdep_call named get_proc_freq() which is expected to returns the frequency in Hz. The code in show_cpuinfo() can invoke this method to display the correct clock rate on platforms that have implemented this method. On the other powerpc platforms it can use the value cached in ppc_proc_freq. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/machdep.h | 2 ++ arch/powerpc/kernel/setup-common.c | 16 2 files changed, 14 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index ad3025d..eb1c6d4 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -113,6 +113,8 @@ struct machdep_calls { /* Optional, may be NULL. */ void(*show_cpuinfo)(struct seq_file *m); void(*show_percpuinfo)(struct seq_file *m, int i); + /* Returns the current operating frequency of cpu in Hz */ + unsigned long (*get_proc_freq)(unsigned int cpu); void(*init_IRQ)(void); diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index bc76cc6..bdd3045 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -209,6 +209,7 @@ static int show_cpuinfo(struct seq_file *m, void *v) { unsigned long cpu_id = (unsigned long)v - 1; unsigned int pvr; + unsigned long proc_freq; unsigned short maj; unsigned short min; @@ -260,12 +261,19 @@ static int show_cpuinfo(struct seq_file *m, void *v) #endif /* CONFIG_TAU */ /* -* Assume here that all clock rates are the same in a -* smp system. -- Cort +* Platforms that have variable clock rates, should implement +* the method ppc_md.get_proc_freq() that reports the clock +* rate of a given cpu. The rest can use ppc_proc_freq to +* report the clock rate that is same across all cpus. */ - if (ppc_proc_freq) + if (ppc_md.get_proc_freq) + proc_freq = ppc_md.get_proc_freq(cpu_id); + else + proc_freq = ppc_proc_freq; + + if (proc_freq) seq_printf(m, clock\t\t: %lu.%06luMHz\n, - ppc_proc_freq / 100, ppc_proc_freq % 100); + proc_freq / 100, proc_freq % 100); if (ppc_md.show_percpuinfo != NULL) ppc_md.show_percpuinfo(m, cpu_id); -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 2/2] powerpc: powernv: Implement ppc_md.get_proc_freq()
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Implement a method named pnv_get_proc_freq(unsigned int cpu) which returns the current clock rate on the 'cpu' in Hz to be reported in /proc/cpuinfo. This method uses the value reported by cpufreq when such a value is sane. Otherwise it falls back to old way of reporting the clockrate, i.e. ppc_proc_freq. Set the ppc_md.get_proc_freq() hook to pnv_get_proc_freq() on the PowerNV platform. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/platforms/powernv/setup.c | 21 + 1 file changed, 21 insertions(+) diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c index 110f4fb..423e36d 100644 --- a/arch/powerpc/platforms/powernv/setup.c +++ b/arch/powerpc/platforms/powernv/setup.c @@ -28,6 +28,7 @@ #include linux/bug.h #include linux/cpuidle.h #include linux/pci.h +#include linux/cpufreq.h #include asm/machdep.h #include asm/firmware.h @@ -235,6 +236,25 @@ void powernv_idle(void) } } +/* + * Returns the cpu frequency for 'cpu' in Hz. This is used by + * /proc/cpuinfo + */ +unsigned long pnv_get_proc_freq(unsigned int cpu) +{ + unsigned long ret_freq; + + ret_freq = cpufreq_quick_get(cpu) * 1000ul; + + /* +* If the backend cpufreq driver does not exist, + * then fallback to old way of reporting the clockrate. +*/ + if (!ret_freq) + ret_freq = ppc_proc_freq; + return ret_freq; +} + define_machine(powernv) { .name = PowerNV, .probe = pnv_probe, @@ -242,6 +262,7 @@ define_machine(powernv) { .setup_arch = pnv_setup_arch, .init_IRQ = pnv_init_IRQ, .show_cpuinfo = pnv_show_cpuinfo, + .get_proc_freq = pnv_get_proc_freq, .progress = pnv_progress, .machine_shutdown = pnv_shutdown, .power_save = powernv_idle, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 0/2] powernv: Show the correct clock value in /proc/cpuinfo
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi, Currently, the code in setup-common.c for powerpc assumes that all clock rates are same in a smp system. This value is cached in the variable named ppc_proc_freq and is the value that is reported in /proc/cpuinfo. However on the PowerNV platform, the clock rate is same only across the threads of the same core. Hence the value that is reported in /proc/cpuinfo is incorrect on PowerNV platforms. This patch-series fixes this problem by having /proc/cpuinfo report the value returned by cpufreq_quick_get(cpu) whenever the cpufreq backend driver is available and fallback to the old way of reporting the clock rate in its absence. These patches depend on the patches to enable dynamic cpufrequency scaling on PowerNV that can be found here: http://linuxppc.10917.n7.nabble.com/PATCH-v2-0-6-powernv-cpufreq-Dynamic-cpu-frequency-scaling-td80641.html Gautham R. Shenoy (2): powerpc: powernv: Framework to show the correct clock in /proc/cpuinfo powerpc: powernv: Implement ppc_md.get_proc_freq() arch/powerpc/include/asm/machdep.h | 2 ++ arch/powerpc/kernel/setup-common.c | 16 arch/powerpc/platforms/powernv/setup.c | 21 + 3 files changed, 35 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 2/6] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
On Wed, Mar 19, 2014 at 12:05:20PM +0530, Srivatsa S. Bhat wrote: On 03/19/2014 05:07 AM, Benjamin Herrenschmidt wrote: On Mon, 2014-03-10 at 16:40 +0530, Gautham R. Shenoy wrote: From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Create a helper method that computes the cpumask corresponding to the thread-siblings of a cpu. Use this for initializing the policy-cpus mask for a given cpu. (Original code written by Srivatsa S. Bhat. Gautham moved this to a helper function!) How does that differ from the existing entry in the sibling map which you can obtain with the helper cpu_sibling_mask() ? (You probably want to *copy* the mask of course but I don't see the need of re-creating one from scratch). The intent here was to have a sibling mask that is invariant of CPU hotplug. cpu_sibling_mask is updated on every hotplug operation, and this would break cpufreq, since policy-cpus has to be hotplug invariant. This should have been noted in the changelog of patch 1 as well as this patch. (The earlier (internal) versions of this patchset had the description in the changelogs, but somehow it got dropped accidentally). I'll work with Gautham and ensure that we have this important info included in the changelog. Thanks for pointing it out! I reused that part of the code because I was unaware of cpu_sibling_mask()! For implementing powernv_cpufreq_get(), cpu_sibling_mask() suffices since we are using this mask as a parameter to smp_call_function_any(), and hence are concerned only about the online siblings. I shall fix the changelog to reflect Srivatsa's reason for having a mask that does not vary with cpu-hotplug. Regards, Srivatsa S. Bhat Also, this should have been CCed to the cpufreq mailing list and maybe the relevant maintainer too. Ok. Will do it in the subsequent version. Thanks for the review! Cheers, Ben. Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ab1551f..4cad727 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -115,6 +115,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = { /* Helper routines */ +/** + * Sets the bits corresponding to the thread-siblings of cpu in its core + * in 'cpus'. + */ +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus) +{ + int base, i; + + base = cpu_first_thread_sibling(cpu); + + for (i = 0; i threads_per_core; i++) { + cpumask_set_cpu(base + i, cpus); + } + + return; +} + /* Access helpers to power mgt SPR */ static inline unsigned long get_pmspr(unsigned long sprn) @@ -180,13 +197,8 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index) static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) { - int base, i; - #ifdef CONFIG_SMP - base = cpu_first_thread_sibling(policy-cpu); - - for (i = 0; i threads_per_core; i++) - cpumask_set_cpu(base + i, policy-cpus); + powernv_cpu_to_core_mask(policy-cpu, policy-cpus); #endif policy-cpuinfo.transition_latency = 25000; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 0/5] powernv: Enable Dynamic Frequency
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi, This is the v3 of the consolidated patchset consisting patches for enabling cpufreq on IBM POWERNV platforms along with some enhancements. I have incorporated the review for v2 (which can be found at [1]). - This patchset contains code for the platform driver to support CPU frequency scaling on IBM POWERNV platforms. - In addition to the standard control and status files exposed by the cpufreq core, the patchset exposes the nominal frequency through the file named cpuinfo_nominal_freq. The changes from from v2: - Updated the changelog for powernv: cpufreq driver for powernv platform to record the fact that policy-cpus must be populated with a sibling mask that includes even the offlined siblings. - Dropped the patch named powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper since the driver-get() method can use cpu_sibling_mask(). - Updated powernv:cpufreq: Implement the driver-get() method to use cpu_sibling_mask() in powernv_cpufreq_get() The patchset is based against commit 4907cdca7210c5895311bddcf05a4c85b67d8566 of the mainline tree. [1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/115861.html Gautham R. Shenoy (3): powernv:cpufreq: Create pstate_id_to_freq() helper powernv:cpufreq: Export nominal frequency via sysfs. powernv:cpufreq: Implement the driver-get() method Srivatsa S. Bhat (1): powernv,cpufreq:Add per-core locking to serialize frequency transitions Vaidyanathan Srinivasan (1): powernv: cpufreq driver for powernv platform arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 375 + 6 files changed, 395 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/5] powernv: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The policy-cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy-related_cpus mask which is should not vary on cpu-hotplug. [Change log updated by e...@linux.vnet.ibm.com] Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Anton Blanchard an...@samba.org Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/reg.h | 4 + arch/powerpc/platforms/powernv/Kconfig | 1 + drivers/cpufreq/Kconfig| 1 + drivers/cpufreq/Kconfig.powerpc| 13 ++ drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 277 + 6 files changed, 297 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..1fe12b1 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,7 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ default y config PPC_POWERNV_RTAS diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 4b029c0..4ba1632 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS choice prompt Default CPUFreq governor default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help This option sets which CPUFreq governor shall be loaded at diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..93f8689 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,16 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate CPU frequency scaling for IBM POWERNV platform + depends on PPC_POWERNV + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ab1551f --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,277 @@ +/* + * POWERNV cpufreq driver for the IBM POWER processors + * + * (C) Copyright IBM 2014 + * + * Author: Vaidyanathan Srinivasan svaidy
[PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions
From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Using a global mutex lock would needlessly serialize _all_ frequency transitions in the system (across all cores). So introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ab1551f..66dae0d 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -24,8 +24,15 @@ #include linux/of.h #include asm/cputhreads.h -/* FIXME: Make this per-core */ -static DEFINE_MUTEX(freq_switch_mutex); +/* Per-Core locking for frequency transitions */ +static DEFINE_PER_CPU(struct mutex, freq_switch_lock); + +#define lock_core_freq(cpu)\ + mutex_lock(per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); +#define unlock_core_freq(cpu) \ + mutex_unlock(per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); #define POWERNV_MAX_PSTATES256 @@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, freqs.new = powernv_freqs[new_index].frequency; freqs.cpu = policy-cpu; - mutex_lock(freq_switch_mutex); + lock_core_freq(policy-cpu); cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE); pr_debug(setting frequency for cpu %d to %d kHz index %d pstate %d, @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, rc = powernv_set_freq(policy-cpus, new_index); cpufreq_notify_transition(policy, freqs, CPUFREQ_POSTCHANGE); - mutex_unlock(freq_switch_mutex); + unlock_core_freq(policy-cpu); return rc; } @@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = { static int __init powernv_cpufreq_init(void) { - int rc = 0; + int cpu, rc = 0; /* Discover pstates from device tree and init */ @@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void) pr_info(powernv-cpufreq disabled\n); return rc; } + /* Init per-core mutex */ + for_each_possible_cpu(cpu) { + mutex_init(per_cpu(freq_switch_lock, cpu)); + } rc = cpufreq_register_driver(powernv_cpufreq_driver); return rc; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 66dae0d..e7b0292 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained * from the firmware passed via device-tree @@ -112,9 +120,28 @@ static int init_powernv_pstates(void) for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) pr_debug(%d: %d\n, i, powernv_freqs[i].frequency); + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i = powernv_pstate_info.nr_pstates || i 0); + WARN_ON(powernv_pstate_ids[i] != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 4/5] powernv:cpufreq: Export nominal frequency via sysfs.
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index e7b0292..46bee8a 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -142,8 +142,30 @@ static unsigned int pstate_id_to_freq(int pstate_id) return powernv_freqs[i].frequency; } +/** + * show_cpuinfo_nominal_freq - Show the nominal CPU frequency as indicated by + * the firmware + */ +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, %u\n, nominal_freq); +} + + +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = cpuinfo_nominal_freq, + .mode = 0444, + }, + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, + cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
From: Gautham R. Shenoy e...@linux.vnet.ibm.com The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a -get(unsigned int cpu) method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which reads the local pstate from the PMSR and returns the corresponding frequency. Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 46bee8a..ef6ed8c 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val 48) 0xFF; + pstate_id = local_pstate_id; + + freq = pstate_id_to_freq(pstate_id); + pr_debug(cpu %d pmsr %lx pstate_id %d frequency %d \n, + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/* + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + + smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq, + ret_freq, 1); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = powernv-cpufreq, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions
On Thu, Mar 20, 2014 at 05:40:57PM +0530, Gautham R. Shenoy wrote: From: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Using a global mutex lock would needlessly serialize _all_ frequency transitions in the system (across all cores). So introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. Forgot to add the following line. Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ab1551f..66dae0d 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -24,8 +24,15 @@ #include linux/of.h #include asm/cputhreads.h -/* FIXME: Make this per-core */ -static DEFINE_MUTEX(freq_switch_mutex); +/* Per-Core locking for frequency transitions */ +static DEFINE_PER_CPU(struct mutex, freq_switch_lock); + +#define lock_core_freq(cpu) \ + mutex_lock(per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); +#define unlock_core_freq(cpu)\ + mutex_unlock(per_cpu(freq_switch_lock,\ + cpu_first_thread_sibling(cpu))); #define POWERNV_MAX_PSTATES 256 @@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, freqs.new = powernv_freqs[new_index].frequency; freqs.cpu = policy-cpu; - mutex_lock(freq_switch_mutex); + lock_core_freq(policy-cpu); cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE); pr_debug(setting frequency for cpu %d to %d kHz index %d pstate %d, @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, rc = powernv_set_freq(policy-cpus, new_index); cpufreq_notify_transition(policy, freqs, CPUFREQ_POSTCHANGE); - mutex_unlock(freq_switch_mutex); + unlock_core_freq(policy-cpu); return rc; } @@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = { static int __init powernv_cpufreq_init(void) { - int rc = 0; + int cpu, rc = 0; /* Discover pstates from device tree and init */ @@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void) pr_info(powernv-cpufreq disabled\n); return rc; } + /* Init per-core mutex */ + for_each_possible_cpu(cpu) { + mutex_init(per_cpu(freq_switch_lock, cpu)); + } rc = cpufreq_register_driver(powernv_cpufreq_driver); return rc; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper
On Thu, Mar 20, 2014 at 05:40:58PM +0530, Gautham R. Shenoy wrote: From: Gautham R. Shenoy e...@linux.vnet.ibm.com Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. Forgot to add the following line: Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 66dae0d..e7b0292 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock); static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; +static struct powernv_pstate_info powernv_pstate_info; + /* * Initialize the freq table based on data obtained * from the firmware passed via device-tree @@ -112,9 +120,28 @@ static int init_powernv_pstates(void) for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) pr_debug(%d: %d\n, i, powernv_freqs[i].frequency); + powernv_pstate_info.pstate_min_id = pstate_min; + powernv_pstate_info.pstate_max_id = pstate_max; + powernv_pstate_info.pstate_nominal_id = pstate_nominal; + powernv_pstate_info.nr_pstates = nr_pstates; + return 0; } +/** + * Returns the cpu frequency corresponding to the pstate_id. + */ +static unsigned int pstate_id_to_freq(int pstate_id) +{ + int i; + + i = powernv_pstate_info.pstate_max_id - pstate_id; + + BUG_ON(i = powernv_pstate_info.nr_pstates || i 0); + WARN_ON(powernv_pstate_ids[i] != pstate_id); + return powernv_freqs[i].frequency; +} + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, NULL, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
On Thu, Mar 20, 2014 at 05:41:00PM +0530, Gautham R. Shenoy wrote: From: Gautham R. Shenoy e...@linux.vnet.ibm.com The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a -get(unsigned int cpu) method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which reads the local pstate from the PMSR and returns the corresponding frequency. Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Forgot to add Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 46bee8a..ef6ed8c 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val 48) 0xFF; + pstate_id = local_pstate_id; + + freq = pstate_id_to_freq(pstate_id); + pr_debug(cpu %d pmsr %lx pstate_id %d frequency %d \n, + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; +} + +/* + * Returns the cpu frequency as reported by the firmware for 'cpu'. + * This value is reported through the sysfs file cpuinfo_cur_freq. + */ +unsigned int powernv_cpufreq_get(unsigned int cpu) +{ + int ret_freq; + + smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq, + ret_freq, 1); + return ret_freq; +} + static void set_pstate(void *pstate) { unsigned long val; @@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy, static struct cpufreq_driver powernv_cpufreq_driver = { .verify = powernv_cpufreq_verify, .target = powernv_cpufreq_target, + .get= powernv_cpufreq_get, .init = powernv_cpufreq_cpu_init, .exit = powernv_cpufreq_cpu_exit, .name = powernv-cpufreq, -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 0/5] powernv: Enable Dynamic Frequency
Hi Viresh, On Fri, Mar 21, 2014 at 01:16:03PM +0530, Viresh Kumar wrote: On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy e...@linux.vnet.ibm.com wrote: This is the v3 of the consolidated patchset consisting patches for enabling cpufreq on IBM POWERNV platforms along with some enhancements. This is the first time I saw them. Looks like you never Cc'd linux-pm list. Also, would be better to keep Maintainers in --to field so that they can review them as soon as possible. Otherwise they might stay on lists for a long time.. Yes, I hadn't Cc'ed linux-pm list last time around. Sorry about that. Will keep this in mind the next time around. Also, your suggestion regarding keeping the Maintainers in --to field is well taken. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 4/5] powernv:cpufreq: Export nominal frequency via sysfs.
On Fri, Mar 21, 2014 at 02:17:28PM +0530, Viresh Kumar wrote: +static ssize_t show_cpuinfo_nominal_freq(struct cpufreq_policy *policy, + char *buf) +{ + int nominal_freq; + nominal_freq = pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id); + return sprintf(buf, %u\n, nominal_freq); return sprintf(buf, %u\n, pstate_id_to_freq(powernv_pstate_info.pstate_nominal_id)); Sure. Will fix this. ?? +} + + remove extra blank line. +struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq = { + .attr = { .name = cpuinfo_nominal_freq, + .mode = 0444, + }, Align {} Probably the use of ATTR_RO(cpuinfo_nominal_freq) and renaming show_cpuinfo_nominal_freq to cpuinfo_nominal_freq_show() would be even better. What do you think ? + .show = show_cpuinfo_nominal_freq, +}; + + static struct freq_attr *powernv_cpu_freq_attr[] = { cpufreq_freq_attr_scaling_available_freqs, + cpufreq_freq_attr_cpuinfo_nominal_freq, NULL, }; This needs to be rewritten to include the entries of cpufreq_generic_attr. Thanks for the review. -- Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
Hi Viresh, On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote: On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy e...@linux.vnet.ibm.com wrote: From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Hi Vaidy, diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 4b029c0..4ba1632 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -48,6 +48,7 @@ config CPU_FREQ_STAT_DETAILS choice prompt Default CPUFreq governor default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_ONDEMAND if POWERNV_CPUFREQ Probably we should remove SA1100's entry as well from here. This is not the right way of doing it. Imagine 100 platforms having entries here. If you want it, then select it from your platforms Kconfig. Sure. Will move these bits and the other governor related bits to the Powernv Kconfig. diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..ab1551f --- /dev/null + +#define pr_fmt(fmt)powernv-cpufreq: fmt + +#include linux/module.h +#include linux/cpufreq.h +#include linux/of.h +#include asm/cputhreads.h That's it? Sure? Even if things compile for you, you must explicitly include all the files on which you depend. Ok. + + WARN_ON(len_ids != len_freqs); + nr_pstates = min(len_ids, len_freqs) / sizeof(u32); + WARN_ON(!nr_pstates); Why do you want to continue here? Good point. We might be better off exiting at this point. + pr_debug(NR PStates %d\n, nr_pstates); + for (i = 0; i nr_pstates; i++) { + u32 id = be32_to_cpu(pstate_ids[i]); + u32 freq = be32_to_cpu(pstate_freqs[i]); + + pr_debug(PState id %d freq %d MHz\n, id, freq); + powernv_freqs[i].driver_data = i; I don't think you are using this field at all and this is the field you can use for driver_data and so you can get rid of powernv_pstate_ids[ ]. Using driver_data to record powernv_pstate_ids won't work since powernv_pstate_ids can be negative. So a pstate_id -3 can be confused with CPUFREQ_BOOST_FREQ thereby not displaying the frequency corresponding to pstate id -3. So for now I think we will be retaining powernv_pstate_ids. + powernv_freqs[i].frequency = freq * 1000; /* kHz */ + powernv_pstate_ids[i] = id; + } + /* End of list marker entry */ + powernv_freqs[i].driver_data = 0; Not required. Ok. + powernv_freqs[i].frequency = CPUFREQ_TABLE_END; + + /* Print frequency table */ + for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++) + pr_debug(%d: %d\n, i, powernv_freqs[i].frequency); You have already printed this table earlier.. Fair enough. + + return 0; +} + +static struct freq_attr *powernv_cpu_freq_attr[] = { + cpufreq_freq_attr_scaling_available_freqs, + NULL, +}; Can use this instead: cpufreq_generic_attr? In this patch yes. But later patch introduces an additional attribute for displaying the nominal frequency. Will handle that part in a clean way in the next version. +/* Helper routines */ + +/* Access helpers to power mgt SPR */ + +static inline unsigned long get_pmspr(unsigned long sprn) Looks big enough not be inlined? It is called from one function. It has been defined separately for readability. +{ + switch (sprn) { + case SPRN_PMCR: + return mfspr(SPRN_PMCR); + + case SPRN_PMICR: + return mfspr(SPRN_PMICR); + + case SPRN_PMSR: + return mfspr(SPRN_PMSR); + } + BUG(); +} + +static inline void set_pmspr(unsigned long sprn, unsigned long val) +{ Same here.. Same reason as above. + switch (sprn) { + case SPRN_PMCR: + mtspr(SPRN_PMCR, val); + return; + + case SPRN_PMICR: + mtspr(SPRN_PMICR, val); + return; + + case SPRN_PMSR: + mtspr(SPRN_PMSR, val); + return; + } + BUG(); +} + +static void set_pstate(void *pstate) +{ + unsigned long val; + unsigned long pstate_ul = *(unsigned long *) pstate; Why not sending value only to this routine instead of pointer? Well this function is called via an smp_call_function. so, cannot send a value :( + + val = get_pmspr(SPRN_PMCR); + val = val 0xULL; Maybe a blank line here? Ok. + /* Set both global(bits 56..63) and local(bits 48..55) PStates */ + val = val | (pstate_ul 56) | (pstate_ul 48); here as well? Ok. + pr_debug(Setting cpu %d pmcr to %016lX\n, smp_processor_id(), val
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
On Fri, Mar 21, 2014 at 03:01:29PM +0530, Viresh Kumar wrote: On Thu, Mar 20, 2014 at 5:41 PM, Gautham R. Shenoy e...@linux.vnet.ibm.com wrote: From: Gautham R. Shenoy e...@linux.vnet.ibm.com The current frequency of a cpu is reported through the sysfs file cpuinfo_cur_freq. This requires the driver to implement a -get(unsigned int cpu) method which will return the current operating frequency. Implement a function named powernv_cpufreq_get() which reads the local pstate from the PMSR and returns the corresponding frequency. Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get(). Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 38 ++ 1 file changed, 38 insertions(+) Please merge these fixups with the first patch which is creating the driver. I understand that a different guy has created this patch and so wanted to have a patch on his name but its really difficult to review this way. Heh! Well, that wasn't the reason why this was sent out as a separate patch, but never mind. Though I don't understand why it would be difficult to review the patch though. Better add your signed-off in the first patch instead. Sending such changes once the driver is mainlined looks fine. Sure, this makes sense. diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 46bee8a..ef6ed8c 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val) BUG(); } +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; You don't need cur_freq variable at all.. I don't like it either. But the compiler complains without this hack :-( + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val 48) 0xFF; + pstate_id = local_pstate_id; similarly local_pstate_id well, I am interested in the bits 48..55 of pmspr_val. That's the pstate_id which can be negative. So I'll like to keep local_pstate_id. + + freq = pstate_id_to_freq(pstate_id); + pr_debug(cpu %d pmsr %lx pstate_id %d frequency %d \n, + smp_processor_id(), pmspr_val, pstate_id, freq); + *cur_freq = freq; Move above print here after a blank line. Also remove freq variable as well and use *cur_freq directly.. And then you can rename it to freq as well. We can get rid of freq and use *cur_freq in its place. Thanks for the detailed review. -- Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
On Fri, Mar 21, 2014 at 12:01:07PM +, David Laight wrote: From: Viresh Kumar On 21 March 2014 16:34, Gautham R Shenoy e...@linux.vnet.ibm.com wrote: Heh! Well, that wasn't the reason why this was sent out as a separate patch, but never mind. Though I don't understand why it would be difficult to review the patch though. Because the initial driver wasn't complete earlier. There were 2-3 patches after the first one which are changing what the first patch has added. Nothing else :) +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; You don't need cur_freq variable at all.. I don't like it either. But the compiler complains without this hack :-( Why would the compiler warn for doing this?: *(int *)ret_freq = freq; Because it is very likely to be wrong. In general casts of pointers to integer types are dangerous. In this case why not make the function return the value? Because this function is called via an smp_call_function(). And we need a way of returning the value to the caller. David ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
On Fri, Mar 21, 2014 at 05:15:34PM +0530, Viresh Kumar wrote: On 21 March 2014 16:34, Gautham R Shenoy e...@linux.vnet.ibm.com wrote: Heh! Well, that wasn't the reason why this was sent out as a separate patch, but never mind. Though I don't understand why it would be difficult to review the patch though. Because the initial driver wasn't complete earlier. There were 2-3 patches after the first one which are changing what the first patch has added. Nothing else :) +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; You don't need cur_freq variable at all.. I don't like it either. But the compiler complains without this hack :-( Why would the compiler warn for doing this?: *(int *)ret_freq = freq; The compiler doesn't complain if I do this. + pmspr_val = get_pmspr(SPRN_PMSR); + + /* The local pstate id corresponds bits 48..55 in the PMSR. + * Note: Watch out for the sign! */ + local_pstate_id = (pmspr_val 48) 0xFF; + pstate_id = local_pstate_id; similarly local_pstate_id well, I am interested in the bits 48..55 of pmspr_val. That's the pstate_id which can be negative. So I'll like to keep local_pstate_id. Can you please explain at bit level how getting the value in a s8 first and then into a s32 will help you here? Instead of getting it directly into s32. Consider the case when pmspr = 0x00fe; We are interested in extracting the value 'fe'. And ensure that when we store this value into an int, we get the sign extension right. So the following doesn't work: pstate_id = (pmspr_val 48) 0x; Since we get pstate_id = 0x00fe, which is not the same as 0xfffe. Of course, we could do the following, but I wasn't sure if two levels of type casting helped on the readability front. pstate_id = (int) ((s8)((pmspr_val 48) 0xFF)); -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
Hi Viresh, On Fri, Mar 21, 2014 at 04:24:27PM +0530, Viresh Kumar wrote: On 21 March 2014 16:13, Gautham R Shenoy e...@linux.vnet.ibm.com wrote: On Fri, Mar 21, 2014 at 02:11:32PM +0530, Viresh Kumar wrote: + pr_debug(PState id %d freq %d MHz\n, id, freq); + powernv_freqs[i].driver_data = i; I don't think you are using this field at all and this is the field you can use for driver_data and so you can get rid of powernv_pstate_ids[ ]. Using driver_data to record powernv_pstate_ids won't work since powernv_pstate_ids can be negative. So a pstate_id -3 can be confused with CPUFREQ_BOOST_FREQ thereby not displaying the frequency corresponding to pstate id -3. So for now I think we will be retaining powernv_pstate_ids. As I said earlier, this field is only used by cpufreq drivers and cpufreq core isn't supposed to touch it. CPUFREQ_BOOST_FREQ and other macros are there for .frequency field and not this one. Ok, I had based my code on linus's git tree. I checked the 'pm-cpufreq' branch of Rafael's 'linux-pm' tree and freq_table.c contains the following code snippet in show_available_frequencies: if (show_boost ^ (table[i].driver_data == CPUFREQ_BOOST_FREQ)) continue; I suspect we're talking about different code bases. So could you please tell me which one should I be looking at ? +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) +{ + int base, i; + +#ifdef CONFIG_SMP What will break if you don't have this ifdef here? Without that as well below code should work? Missed this one? + base = cpu_first_thread_sibling(policy-cpu); + + for (i = 0; i threads_per_core; i++) + cpumask_set_cpu(base + i, policy-cpus); +#endif + policy-cpuinfo.transition_latency = 25000; + + policy-cur = powernv_freqs[0].frequency; + cpufreq_frequency_table_get_attr(powernv_freqs, policy-cpu); This doesn't exist anymore. Didn't get this comment! cpufreq_frequency_table_get_attr() routine doesn't exist anymore. +static int powernv_cpufreq_cpu_exit(struct cpufreq_policy *policy) +{ + cpufreq_frequency_table_put_attr(policy-cpu); + return 0; +} You don't need this.. Why not ? Because cpufreq_frequency_table_get_attr() and cpufreq_frequency_table_put_attr() don't exist anymore :) Well, it does in the codebases that I was looking at. May be I've been looking at the wrong place. +module_init(powernv_cpufreq_init); +module_exit(powernv_cpufreq_exit); Place these right after the routines without any blank lines in between. Is this the new convention ? Don't know I have been following this since long time, probably from the time I started with Mainline stuff.. I have seen many maintainers advising this as it make code more readable, specially if these routines are quite big.. Probably it isn't mentioned in coding guidelines but people follow it most of the times. Ok. I was not aware of this. Good to know :-) -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
On Fri, Mar 21, 2014 at 06:42:50PM +0530, Viresh Kumar wrote: On 21 March 2014 18:34, Gautham R Shenoy e...@linux.vnet.ibm.com wrote: Consider the case when pmspr = 0x00fe; We are interested in extracting the value 'fe'. And ensure that when we store this value into an int, we get the sign extension right. So the following doesn't work: pstate_id = (pmspr_val 48) 0x; What about pstate_id = (pmspr_val 48) 0xFF; ?? Nope. Suppose the pmspr_val is contained in the register %rax, and pstate_id corresponds to the address -20(%rbp) then: pstate_id = (pmspr_val 48) 0xFF; would corrspond to shrq$48, %rax// Left shift by 48 bits andl$255, %eax // Mask the lower 32 bits of %rax with 0x00FF movl%eax, -20(%rbp) // Store the lower 32 bits of %rax into pstate_id On the other hand, pstate_id = (int)((s8)((pmspr_val 48) 0xFF)); would correspond to: shrq$48, %rax // Left shift by 48 bits. movsbl %al, %eax // Move the lower 8 bits of %rax to %eax with sign-extension. movl%eax, -20(%rbp) // store the result in pstate_id; So with this, we are getting the sign extension due to the use of movsbl. And if local_pstate_id corresponds to the address -1(%rbp) then: local_pstate_id = (pmspr_val 48) 0xFF; pstate_id = local_pstate_id; would correspond to: shrq$48, %rax// Left shift by 48 bits movb%al, -1(%rbp)//copy the lower 8 bits to local_pstate_id movsbl -1(%rbp), %eax //move the contents of local_pstate_id to %eax with sign extension. movl%eax, -20(%rbp) // Store the result in pstate_id Hope this helps :-) -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/5] powernv: cpufreq driver for powernv platform
On Fri, Mar 21, 2014 at 04:24:27PM +0530, Viresh Kumar wrote: +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) +{ + int base, i; + + base = cpu_first_thread_sibling(policy-cpu); + + for (i = 0; i threads_per_core; i++) + cpumask_set_cpu(base + i, policy-cpus); + policy-cpuinfo.transition_latency = 25000; + + policy-cur = powernv_freqs[0].frequency; + cpufreq_frequency_table_get_attr(powernv_freqs, policy-cpu); This doesn't exist anymore. Ok, I guess the right thing to do at this point is call cpufreq_table_validate_and_show(policy, powernv_freqs); Will fix the code to take care of this. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver-get() method
Hi Ben, On Sat, Mar 22, 2014 at 09:56:30AM +1100, Benjamin Herrenschmidt wrote: On Fri, 2014-03-21 at 16:34 +0530, Gautham R Shenoy wrote: +/* + * Computes the current frequency on this cpu + * and stores the result in *ret_freq. + */ +static void powernv_read_cpu_freq(void *ret_freq) +{ + unsigned long pmspr_val; + s8 local_pstate_id; + int *cur_freq, freq, pstate_id; + + cur_freq = (int *)ret_freq; You don't need cur_freq variable at all.. I don't like it either. But the compiler complains without this hack :-( Casting integers into void * is a recipe for disaster... what is that supposed to be about ? Like I mentioned elsewhere on this thread, we're calling powernv_read_cpu_freq via an smp_call_function(). We use this to obtain the frequency on the cpu where powernv_read_cpu_freq executes and return it to the caller of smp_call_function. We lose all type checking and get exposed to endian issues etc... the day somebody uses a different type on both sides. Yes, I understand the problem now. I'll think of a safer way to pass the return value. Also is freq a frequency ? In this case an int isn't big enough. freq is the frequency stored in the cpufreq_table. The value is in kHz. So, int should be big enough. Cheers, Ben. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4] powernv: Dynamic Frequency Scaling Enablement
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi, This is the v4 of the patchset to enable Dynamic Frequency Scaling on IBM PowerNV Platforms. I have incorporated the review comments from the previous version (can be found at [1]). In this version, * Multiple patches from the previous version have been into a single patch, since the higher numbered patches implemented some helper functions and the driver methods which should have been a part of the first patch to begin with. * Use the generic helpers provided by the cpufreq core available in the latest linux-next tree. * Fix the code to avoid casting integer pointers to void. The patch is based on top of the commit:06ed26d1de59ce7cbbe68378b7e470be169750e5 of the linux-next tree. [1]: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76675.html Vaidyanathan Srinivasan (1): powernv, cpufreq: cpufreq driver for powernv platform arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/include/asm/reg.h| 4 + arch/powerpc/platforms/powernv/Kconfig| 6 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 372 ++ 7 files changed, 393 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy-cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy-related_cpus mask which is should not vary on cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com] * On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. [Authored by srivatsa.b...@linux.vnet.ibm.com] * Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. [Authored by e...@linux.vnet.ibm.com] * Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. [Authored by e...@linux.vnet.ibm.com] * Implement a powernv_cpufreq_get(unsigned int cpu) method which will return the current operating frequency. Export this via the sysfs interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com] [Change log updated by e...@linux.vnet.ibm.com] Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Anton Blanchard an...@samba.org Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/include/asm/reg.h| 4 + arch/powerpc/platforms/powernv/Kconfig| 6 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 372 ++ 7 files changed, 393 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index e9a8b4e..a285d44 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m CONFIG_VIRTUALIZATION=y CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 62771e0..47e6161 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 90c06ec..84f92ca 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define
Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
On Thu, Mar 27, 2014 at 12:09:53PM +0530, Viresh Kumar wrote: Cc'ing Rafael. Thanks. It was a lapse on my part by not Cc'ing Rafael in the first place. On 26 March 2014 22:25, Gautham R. Shenoy e...@linux.vnet.ibm.com wrote: From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy-cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy-related_cpus mask which is should not vary on s/is / Good catch! Will fix this [...] * On POWER systems, the CPU frequency is controlled at a core-level and hence we need to serialize so that only one of the threads in the core switches the core's frequency at a time. Introduce per-core locking to enable finer-grained synchronization and thereby enhance the speed and responsiveness of the cpufreq driver to varying workload demands. The design of per-core locking is very simple and straight-forward: we first define a Per-CPU lock and use the ones that belongs to the first thread sibling of the core. cpu_first_thread_sibling() macro is used to find the *common* lock for all thread siblings belonging to a core. [Authored by srivatsa.b...@linux.vnet.ibm.com] We don't need that after serialization patch of core is accepted. And it should be accepted soon, in one form or other. As of now, I prefer this patch be based on code that is in the -next tree. I'll get rid of the per-core locking once the serialization patch of the core is accepted. [...] --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m CONFIG_VIRTUALIZATION=y CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 62771e0..47e6161 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y don't know how Rafael want this, but probably this part could have gone through ppc tree.. That would mean multiple patches :-) + +#define pr_fmt(fmt)powernv-cpufreq: fmt + +#include linux/module.h +#include linux/cpufreq.h +#include linux/smp.h +#include linux/of.h +#include asm/cputhreads.h I thought I gave a comment on missing headers here? Ok, so these seem to be the missing ones. #include linux/kernel.h #include linux/percpu-defs.h #include linux/mutex.h #include linux/sysfs.h #include linux/cpumask.h #include asm/reg.h Have I missed anything else ? +static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; I though we don't need this anymore? Either Rafael will take my patch as is for the BOOST fixup or we will end up creating .isboost field in struct cpufreq_frequency_table and so you could have used .driver_data here. I mentioned in my reply to your BOOST fixup patch that we would like pstate_ids to be of the type signed int, while the .driver_data is an unsigned int. If we end up using .driver_data, we would have to be careful and not use it directly but reassign it to a signed int (or typecast it) before using it. Not just that, the pr_debugs in the core which are printed during frequency transitions will then end up printing large positive values (since it will interpret negative pstate_ids as unsigned int) in the place of pstate_ids, which would not be particularly useful. +struct powernv_pstate_info { + int pstate_min_id; + int pstate_max_id; + int pstate_nominal_id; + int nr_pstates; +}; +static struct powernv_pstate_info powernv_pstate_info; Maybe write it as this (if you like :), as this is the only instance of this struct): Also
Re: [PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
On Thu, Mar 27, 2014 at 03:29:36PM +0530, Viresh Kumar wrote: On 27 March 2014 15:00, Gautham R Shenoy e...@linux.vnet.ibm.com wrote: As of now, I prefer this patch be based on code that is in the -next tree. I'll get rid of the per-core locking once the serialization patch of the core is accepted. Okay.. Then divide this patch into two parts, second one doing all the serialization stuff and I will review only the first one from V5 :) Maybe mention that in Patch2 as well, that once serialization patches are accepted, its not required. Sure. This would be a good idea. don't know how Rafael want this, but probably this part could have gone through ppc tree.. That would mean multiple patches :-) I wasn't opposed to multiple patches at all, but multiple patches for basic functionality of same driver file. Otherwise separating things into patches is the best receipe. Fair enough! +#define pr_fmt(fmt)powernv-cpufreq: fmt + +#include linux/module.h +#include linux/cpufreq.h +#include linux/smp.h +#include linux/of.h +#include asm/cputhreads.h I thought I gave a comment on missing headers here? Ok, so these seem to be the missing ones. #include linux/kernel.h #include linux/percpu-defs.h #include linux/mutex.h This may shift to Patch 2 .. Ok. #include linux/sysfs.h #include linux/cpumask.h #include asm/reg.h Have I missed anything else ? Not sure :) .. I will see if I can find anymore.. Cool. +static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; I though we don't need this anymore? Either Rafael will take my patch as is for the BOOST fixup or we will end up creating .isboost field in struct cpufreq_frequency_table and so you could have used .driver_data here. I mentioned in my reply to your BOOST fixup patch that we would like pstate_ids to be of the type signed int, while the .driver_data is an unsigned int. If we end up using .driver_data, we would have to be careful and not use it directly but reassign it to a signed int (or typecast it) before using it. Probably many people would want it to be unsigned it, so that they can put some strange hex values there.. Which part of your code conflicts with this right now? I can see only this line in powernv_set_freq() I am not denying the advantage of storing pstate_ids in .driver_data since we will then have all the information in one place. (That said, in the future if we want to export additional information, say pertaining to the voltage, we will have to keep a separate array anyway :-) ) However, as of now, I am wary about reusing a variable provided by the core, whose type is fixed (which is not the type I am interested in) and whose interpretation is ambiguous (since it is also being used as a BOOST indicator). When we have a .is_boost field, or a flags field in the cpufreq_table, and once the .driver_data field is completely opaque, I would be comfortable making use of .driver_data to store the pstate_ids. So I'll send out those patches once the fixes are present in the core. For now, let this code stay as it is. I think we should be ok with the additional overhead of 1kb as of now. freq_data.pstate_id = powernv_pstate_ids[new_index]; And I don't think we will have side effects here. Not just that, the pr_debugs in the core which are printed during frequency transitions will then end up printing large positive values (since it will interpret negative pstate_ids as unsigned int) in the place of pstate_ids, which would not be particularly useful. I have checked it again, those prints shouldn't have used that field. Will fix them. And so you can use that field in your code and I will get core fixed for you.. Like I said earlier, once the core is fixed, I'll be comfortable using the .driver_data field for the purpose I need. Not until then :-) Why do you need to get these from DT? And not find that yourself here instead? Well, I think we can obtain a more accurate value from the DT which would have already computed it. But I'll let vaidy answer this. Can we simply refer to frequency values here for finding min and max pstates? If yes, then probably this is the better place as there can be human errors in DT.. Also those binding are just not required then. And obviously less headache for you as well when you are changing pstate tables. + freq_data = (struct powernv_smp_call_data *)arg; don't need casting here ? Doesn't harm having it there. Just like the #includes :-) Do you ever have these typecasts while you do kmalloc? It also returns void *.. That's the same.. CodingStyle: Chapter 14: Allocating memory: Casting the return value which is a void pointer is redundant. The conversion from void pointer to any other pointer type is guaranteed by the C programming
[PATCH v5 0/3] powernv,cpufreq: Dynamic Frequency Scaling support
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi, This is v5 of the patchset to enable dynamic frequency scaling on IBM PowerNV platforms. This patchset does address all the review comments obtained for v4 (which can be found at [1]). Changes from v4: * Created a separate patch to select the CPUFreq related Config options for PowerNV * Dropped the per-core locking hunks in drivers/cpufreq/powernv-cpufreq.c since the CPUFreq core takes care of the for us after the following commit which is present in linux-next: commit 12478cf0c55e5969f740bb38a24b1a0104ae18d8 Author: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Date: Mon Mar 24 13:35:44 2014 +0530 cpufreq: Make sure frequency transitions are serialized * [PATCH v5 3/3] gets rid of the powernv_pstate_ids[] array that was being used to record the pstate ids. After the following patch it is safe to use cpufreq_frequency_table.driver_data since it is opaque to the cpufreq core: From: Viresh Kumar viresh.ku...@linaro.org Date: 2014-03-28 13:53:47 url: http://marc.info/?l=linux-pmm=139601416804702w=2 cpufreq: create another field .flags in cpufreq_frequency_table The patchset is based on the commit 201544be8c37dffbf069bb5fc9edb5674f8c1754 of the linux-next tree. While all the patches in the patchset apply cleanly on linux-next, [PATCH v5 3/3] requires the Viresh's patch that was mentioned above. Otherwise, the frequency corresponding to pstate id -3 will be omited while reporting the scaling_available_frequencies in sysfs. [1]: http://marc.info/?l=linux-pmm=139585297620612w=2 Gautham R. Shenoy (2): powernv, cpufreq: Select CPUFreq related Kconfig options for powernv powernv,cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids Vaidyanathan Srinivasan (1): powernv, cpufreq: cpufreq driver for powernv platform arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/include/asm/reg.h| 4 + arch/powerpc/platforms/powernv/Kconfig| 6 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 341 ++ 7 files changed, 362 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v5 2/3] powernv, cpufreq: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy-cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy-related_cpus mask which should not vary on cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com] * Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. [Authored by e...@linux.vnet.ibm.com] * Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. [Authored by e...@linux.vnet.ibm.com] * Implement a powernv_cpufreq_get(unsigned int cpu) method which will return the current operating frequency. Export this via the sysfs interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com] [Change log updated by e...@linux.vnet.ibm.com] Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Anton Blanchard an...@samba.org Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/reg.h| 4 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 342 ++ 4 files changed, 355 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 1a36b8e..2189f8f 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..72564b7 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,11 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate CPU frequency scaling for IBM POWERNV platform + depends on PPC_POWERNV + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..e1e5197 --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,342
[PATCH v5 1/3] powernv, cpufreq: Select CPUFreq related Kconfig options for powernv
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Enable CPUFreq for PowerNV. Select performance, powersave, userspace and ondemand governors. Choose ondemand to be the default governor. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com --- arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/platforms/powernv/Kconfig| 6 ++ 3 files changed, 8 insertions(+) diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index 9ea8342b..a905063 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -306,3 +306,4 @@ CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 3c84f9d..58e3dbf 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -301,3 +301,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..c252ee9 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,12 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE default y config PPC_POWERNV_RTAS -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v5 3/3] powernv, cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids
From: Gautham R. Shenoy e...@linux.vnet.ibm.com The .driver_data field in the cpufreq_frequency_table was supposed to be private to the drivers. However at some later point, it was being used to indicate if the particular frequency in the table is the BOOST_FREQUENCY. After patches [1] and [2], the .driver_data is once again private to the driver. Thus we can safely use cpufreq_frequency_table.driver_data to store pstate_ids instead of having to maintain a separate array powernv_pstate_ids[] for this purpose. [1]: Subject: cpufreq: don't print value of .driver_data from core From : Viresh Kumar viresh.kumar@ linaro.org url: http://marc.info/?l=linux-pmm=139601421504709w=2 [2]: Subject: cpufreq: create another field .flags in cpufreq_frequency_table From : Viresh Kumar viresh.ku...@linaro.org url: http://marc.info/?l=linux-pmm=139601416804702w=2 Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index e1e5197..9edccc6 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -33,7 +33,6 @@ #define POWERNV_MAX_PSTATES256 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; -static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; /* * Note: The set of pstates consists of contiguous integers, the @@ -112,7 +111,7 @@ static int init_powernv_pstates(void) pr_debug(PState id %d freq %d MHz\n, id, freq); powernv_freqs[i].frequency = freq * 1000; /* kHz */ - powernv_pstate_ids[i] = id; + powernv_freqs[i].driver_data = id; } /* End of list marker entry */ powernv_freqs[i].frequency = CPUFREQ_TABLE_END; @@ -283,7 +282,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy, { struct powernv_smp_call_data freq_data; - freq_data.pstate_id = powernv_pstate_ids[new_index]; + freq_data.pstate_id = powernv_freqs[new_index].driver_data; /* * Use smp_call_function to send IPI and execute the -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH REPOST v5 2/3] powernv, cpufreq: cpufreq driver for powernv platform
From: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Backend driver to dynamically set voltage and frequency on IBM POWER non-virtualized platforms. Power management SPRs are used to set the required PState. This driver works in conjunction with cpufreq governors like 'ondemand' to provide a demand based frequency and voltage setting on IBM POWER non-virtualized platforms. PState table is obtained from OPAL v3 firmware through device tree. powernv_cpufreq back-end driver would parse the relevant device-tree nodes and initialise the cpufreq subsystem on powernv platform. The code was originally written by sva...@linux.vnet.ibm.com. Over time it was modified to accomodate bug-fixes as well as updates to the the cpu-freq core. Relevant portions of the change logs corresponding to those modifications are noted below: * The policy-cpus needs to be populated in a hotplug-invariant manner instead of using cpu_sibling_mask() which varies with cpu-hotplug. This is because the cpufreq core code copies this content into policy-related_cpus mask which should not vary on cpu-hotplug. [Authored by srivatsa.b...@linux.vnet.ibm.com] * Create a helper routine that can return the cpu-frequency for the corresponding pstate_id. Also, cache the values of the pstate_max, pstate_min and pstate_nominal and nr_pstates in a static structure so that they can be reused in the future to perform any validations. [Authored by e...@linux.vnet.ibm.com] * Create a driver attribute named cpuinfo_nominal_freq which creates a sysfs read-only file named cpuinfo_nominal_freq. Export the frequency corresponding to the nominal_pstate through this interface. Nominal frequency is the highest non-turbo frequency for the platform. This is generally used for setting governor policies from user space for optimal energy efficiency. [Authored by e...@linux.vnet.ibm.com] * Implement a powernv_cpufreq_get(unsigned int cpu) method which will return the current operating frequency. Export this via the sysfs interface cpuinfo_cur_freq by setting powernv_cpufreq_driver.get to powernv_cpufreq_get(). [Authored by e...@linux.vnet.ibm.com] [Change log updated by e...@linux.vnet.ibm.com] Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com Signed-off-by: Anton Blanchard an...@samba.org Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/reg.h| 4 + drivers/cpufreq/Kconfig.powerpc | 8 + drivers/cpufreq/Makefile | 1 + drivers/cpufreq/powernv-cpufreq.c | 342 ++ 4 files changed, 355 insertions(+) create mode 100644 drivers/cpufreq/powernv-cpufreq.c diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 1a36b8e..2189f8f 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -271,6 +271,10 @@ #define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */ #define SPRN_IC0x350 /* Virtual Instruction Count */ #define SPRN_VTB 0x351 /* Virtual Time Base */ +#define SPRN_PMICR 0x354 /* Power Management Idle Control Reg */ +#define SPRN_PMSR 0x355 /* Power Management Status Reg */ +#define SPRN_PMCR 0x374 /* Power Management Control Register */ + /* HFSCR and FSCR bit numbers are the same */ #define FSCR_TAR_LG8 /* Enable Target Address Register */ #define FSCR_EBB_LG7 /* Enable Event Based Branching */ diff --git a/drivers/cpufreq/Kconfig.powerpc b/drivers/cpufreq/Kconfig.powerpc index ca0021a..72564b7 100644 --- a/drivers/cpufreq/Kconfig.powerpc +++ b/drivers/cpufreq/Kconfig.powerpc @@ -54,3 +54,11 @@ config PPC_PASEMI_CPUFREQ help This adds the support for frequency switching on PA Semi PWRficient processors. + +config POWERNV_CPUFREQ + tristate CPU frequency scaling for IBM POWERNV platform + depends on PPC_POWERNV + default y + help +This adds support for CPU frequency switching on IBM POWERNV +platform diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index 7494565..0dbb963 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -86,6 +86,7 @@ obj-$(CONFIG_PPC_CORENET_CPUFREQ) += ppc-corenet-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC)+= pmac32-cpufreq.o obj-$(CONFIG_CPU_FREQ_PMAC64) += pmac64-cpufreq.o obj-$(CONFIG_PPC_PASEMI_CPUFREQ) += pasemi-cpufreq.o +obj-$(CONFIG_POWERNV_CPUFREQ) += powernv-cpufreq.o ## # Other platform drivers diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c new file mode 100644 index 000..e1e5197 --- /dev/null +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -0,0 +1,342
[PATCH REPOST v5 3/3] powernv, cpufreq: Use cpufreq_frequency_table.driver_data to store pstate ids
From: Gautham R. Shenoy e...@linux.vnet.ibm.com The .driver_data field in the cpufreq_frequency_table was supposed to be private to the drivers. However at some later point, it was being used to indicate if the particular frequency in the table is the BOOST_FREQUENCY. After patches [1] and [2], the .driver_data is once again private to the driver. Thus we can safely use cpufreq_frequency_table.driver_data to store pstate_ids instead of having to maintain a separate array powernv_pstate_ids[] for this purpose. [1]: Subject: cpufreq: don't print value of .driver_data from core From : Viresh Kumar viresh.kumar@ linaro.org url: http://marc.info/?l=linux-pmm=139601421504709w=2 [2]: Subject: cpufreq: create another field .flags in cpufreq_frequency_table From : Viresh Kumar viresh.ku...@linaro.org url: http://marc.info/?l=linux-pmm=139601416804702w=2 Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index e1e5197..9edccc6 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -33,7 +33,6 @@ #define POWERNV_MAX_PSTATES256 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; -static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1]; /* * Note: The set of pstates consists of contiguous integers, the @@ -112,7 +111,7 @@ static int init_powernv_pstates(void) pr_debug(PState id %d freq %d MHz\n, id, freq); powernv_freqs[i].frequency = freq * 1000; /* kHz */ - powernv_pstate_ids[i] = id; + powernv_freqs[i].driver_data = id; } /* End of list marker entry */ powernv_freqs[i].frequency = CPUFREQ_TABLE_END; @@ -283,7 +282,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy, { struct powernv_smp_call_data freq_data; - freq_data.pstate_id = powernv_pstate_ids[new_index]; + freq_data.pstate_id = powernv_freqs[new_index].driver_data; /* * Use smp_call_function to send IPI and execute the -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH REPOST v5 1/3] powernv, cpufreq: Select CPUFreq related Kconfig options for powernv
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Enable CPUFreq for PowerNV. Select performance, powersave, userspace and ondemand governors. Choose ondemand to be the default governor. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com --- arch/powerpc/configs/pseries_defconfig| 1 + arch/powerpc/configs/pseries_le_defconfig | 1 + arch/powerpc/platforms/powernv/Kconfig| 6 ++ 3 files changed, 8 insertions(+) diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index 9ea8342b..a905063 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -306,3 +306,4 @@ CONFIG_KVM_BOOK3S_64=m CONFIG_KVM_BOOK3S_64_HV=y CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig index 3c84f9d..58e3dbf 100644 --- a/arch/powerpc/configs/pseries_le_defconfig +++ b/arch/powerpc/configs/pseries_le_defconfig @@ -301,3 +301,4 @@ CONFIG_CRYPTO_LZO=m # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_DEV_NX=y CONFIG_CRYPTO_DEV_NX_ENCRYPT=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 895e8a2..c252ee9 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -11,6 +11,12 @@ config PPC_POWERNV select PPC_UDBG_16550 select PPC_SCOM select ARCH_RANDOM + select CPU_FREQ + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_POWERSAVE + select CPU_FREQ_GOV_USERSPACE + select CPU_FREQ_GOV_ONDEMAND + select CPU_FREQ_GOV_CONSERVATIVE default y config PPC_POWERNV_RTAS -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/4] powernv:cpufreq: Create a powernv_cpu_to_core_mask() helper.
From: Gautham R. Shenoy e...@linux.vnet.ibm.com Move the code that computes the cpumask corresponding to the thread-siblings of a cpu to a new method named powernv_cpu_to_core_mask() so that it can be used by other places in the code. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index c59eb26..f0dae6f 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -166,6 +166,23 @@ static struct freq_attr *powernv_cpu_freq_attr[] = { /* Helper routines */ +/** + * Sets the bits corresponding to the thread-siblings of cpu in its core + * in 'cpus'. + */ +static void powernv_cpu_to_core_mask(unsigned int cpu, cpumask_var_t cpus) +{ + int base, i; + + base = cpu_first_thread_sibling(cpu); + + for (i = 0; i threads_per_core; i++) { + cpumask_set_cpu(base + i, cpus); + } + + return; +} + /* Access helpers to power mgt SPR */ static inline unsigned long get_pmspr(unsigned long sprn) @@ -231,13 +248,10 @@ static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index) static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy) { - int base, i; + int i; #ifdef CONFIG_SMP - base = cpu_first_thread_sibling(policy-cpu); - - for (i = 0; i threads_per_core; i++) - cpumask_set_cpu(base + i, policy-cpus); + powernv_cpu_to_core_mask(policy-cpu, policy-cpus); #endif policy-cpuinfo.transition_latency = 25000; -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] pseries: cpu: Reduce the polling interval in __cpu_up()
Time time taken for a single cpu online operation on a pseries machine is as follows: Dedicated LPAR (POWER6): ~220ms. Shared LPAR (POWER5) : ~240ms. Of this time, approximately 200ms is taken up by __cpu_up(). This is because we poll every 200ms to check if the new cpu has notified it's presence through the cpu_callin_map. We repeat this operation until the new cpu sets the value in cpu_callin_map or 5 seconds elapse, whichever comes earlier. However, using completion_structs instead of polling loops, the time taken by the new processor to indicate it's presence has found to be less than 1ms on pseries. This method however may not work on all powerpc platforms due to the time-base synchronization code. Keeping this in mind, we could reduce msleep polling interval from 200ms to 1ms while retaining the 5 second timeout. With this, the time taken for a cpu online operation changes as follows: Dedicated LPAR (POWER6): 20-25ms. Shared LPAR (POWER5) : 60-80ms. In both these cases, it was found that the code polls through the loop only once indicating that 1ms is a reasonable value, atleast on pseries. The code needs testing on other powerpc platforms. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/kernel/smp.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 65484b2..00c13a1 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -411,9 +411,8 @@ int __cpuinit __cpu_up(unsigned int cpu) * CPUs can take much longer to come up in the * hotplug case. Wait five seconds. */ - for (c = 25; c !cpu_callin_map[cpu]; c--) { - msleep(200); - } + for (c = 5000; c !cpu_callin_map[cpu]; c--) + msleep(1); #endif if (!cpu_callin_map[cpu]) { ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/3] cpu: idle state framework for offline CPUs.
Hi, RFC not for inclusion When we perform a CPU-Offline operation today, we do not put the CPU into the most energy efficient state. On x86, it loops in hlt as opposed to going to one of the low-power C-states. On pSeries, we call rtas_stop_self() and hand over the vCPU back to the resource pool, thereby deallocating the vCPU. Thus, when applications or platforms desire to put a particular CPU to an extended low-power state for a short while, currently they have to piggy-back on scheduler heuristics such as sched_mc_powersavings or play with exclusive Cpusets. The former does a good job based on the workload, but fails to provide any guarentee that the CPU won't be used for the next seconds, while the latter might conflict with the existing CPUsets configurations. There were efforts to alleviate these problems and various proposals have been put forth. They include putting the CPU to the deepest possible idle-state when offlined [1], removing the desired CPU from the topmost-cpuset [2], a driver which forces a high-priority idle thread to run on the desired CPU thereby putting it to idle [3]. In this patch-series, we propose to extend the CPU-Hotplug infrastructure and allow the system administrator to choose the desired state the CPU should go to when it is offlined. We think this approach addresses the concerns about determinism as well as transparency, since CPU-Hotplug already provides notification mechanism which the userspace can listen to for any change in the configuration and correspondingly readjust any previously set cpu-affinities. Also, approaches such as [1] can make use of this extended infrastructure instead of putting the CPU to an arbitrary C-state when it is offlined, thereby providing the system administrator a rope to hang himself with should he feel the need to do so. This patch-series tries to achieve this by implementing an architecture independent framework that exposes sysfs tunables to allow the system-adminstrator to choose the offline-state of a CPU. /sys/devices/system/cpu/cpunumber/available_offline_states and /sys/devices/system/cpu/cpunumber/preferred_offline_states For the purpose of proof-of-concept, we've implemented the backend for pSeries. For pSeries, we define two available_offline_states. They are: deallocate: This is default behaviour which on an offline, deallocates the vCPU by invoking rtas_stop_self() and hands it back to the resource pool. deactivate: This calls H_CEDE, which will request the hypervisor to idle the vCPU in the lowest power mode and give it back as soon as we need it. Any feedback on the patchset will be immensely valuable. References: --- [1] Pallipadi, Venkatesh: x86: Make offline cpus to go to deepest idle state using mwait (URL: http://lkml.org/lkml/2009/5/22/431) [2] Li, Shaohua: cpuset: add new API to change cpuset top group's cpus (URL: http://lkml.org/lkml/2009/5/19/54) [3] Li, Shaohua: new ACPI processor driver to force CPUs idle (URL: http://www.spinics.net/lists/linux-acpi/msg22863.html) Changelog: --- Gautham R Shenoy (3): pSeries: cpu: Cede CPU during a deactivate-offline cpu: Implement cpu-offline-state callbacks for pSeries. cpu: Offline state Framework. arch/powerpc/platforms/pseries/hotplug-cpu.c| 160 ++- arch/powerpc/platforms/pseries/offline_driver.h | 17 ++ arch/powerpc/platforms/pseries/plpar_wrappers.h |6 + arch/powerpc/platforms/pseries/smp.c| 18 ++- drivers/base/cpu.c | 111 include/linux/cpu.h | 15 ++ 6 files changed, 319 insertions(+), 8 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/3] cpu: Implement cpu-offline-state callbacks for pSeries.
This patch implements the callbacks to handle the reads/writes into the sysfs interfaces /sys/devices/system/cpu/cpunumber/available_offline_states and /sys/devices/system/cpu/cpunumber/preferred_offline_state Currently, the patch defines two states which the processor can go to when it is offlined. They are - deallocate: The current behaviour when the cpu is offlined. The CPU would call make an rtas_stop_self() call and hand over the CPU back to the resource pool, thereby effectively deallocating that vCPU from the LPAR. - deactivate: This is expected to cede the processor to the hypervisor, so that on processors which support appropriate low-power states, they can be exploited. This can be considered as an extended tickless idle state. The patch only implements the callbacks which will display the available states, and record the preferred states. The code bits to call rtas_stop_self() or H_CEDE, depending on the preferred_offline_state is implemented in the next patch. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/hotplug-cpu.c| 90 +++ arch/powerpc/platforms/pseries/offline_driver.h | 16 2 files changed, 106 insertions(+), 0 deletions(-) create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index a20ead8..f15de99 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -30,6 +30,95 @@ #include asm/pSeries_reconfig.h #include xics.h #include plpar_wrappers.h +#include offline_driver.h + +struct cpu_offline_state { + enum cpu_state_vals state_val; + const char *state_name; +} pSeries_cpu_offline_states[] = { + {CPU_DEACTIVATE, deactivate}, + {CPU_DEALLOCATE, deallocate}, +}; + +DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = CPU_DEALLOCATE; + +ssize_t pSeries_show_available_states(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + int state; + ssize_t ret = 0; + + for (state = CPU_DEACTIVATE; state CPU_MAX_OFFLINE_STATES; state++) { + if (state == CPU_STATE_ONLINE) + continue; + + if (ret = (ssize_t) ((PAGE_SIZE / sizeof(char)) + - (CPU_STATES_LEN + 2))) + goto out; + ret += scnprintf(buf[ret], CPU_STATES_LEN, %s , + pSeries_cpu_offline_states[state].state_name); + } + +out: + ret += sprintf(buf[ret], \n); + return ret; +} + +ssize_t pSeries_show_preferred_state(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + int state = per_cpu(preferred_offline_state, cpu-sysdev.id); + + return scnprintf(buf, CPU_STATES_LEN, %s\n, + pSeries_cpu_offline_states[state].state_name); +} + +ssize_t pSeries_store_preferred_state(struct sys_device *dev, + struct sysdev_attribute *attr, + const char *buf, size_t count) +{ + struct cpu *cpu = container_of(dev, struct cpu, sysdev); + unsigned int ret = -EINVAL; + char state_name[CPU_STATES_LEN]; + int i; + cpu_maps_update_begin(); + ret = sscanf(buf, %15s, state_name); + + if (ret != 1) { + ret = -EINVAL; + goto out_unlock; + } + + for (i = CPU_DEACTIVATE; i CPU_MAX_OFFLINE_STATES; i++) + if (!strnicmp(state_name, + pSeries_cpu_offline_states[i].state_name, + CPU_STATES_LEN)) + break; + + if (i == CPU_MAX_OFFLINE_STATES) { + ret = -EINVAL; + goto out_unlock; + } + + per_cpu(preferred_offline_state, cpu-sysdev.id) = + pSeries_cpu_offline_states[i].state_val; + ret = 0; + +out_unlock: + cpu_maps_update_done(); + + if (ret) + return ret; + else + return count; +} + +struct cpu_offline_driver pSeries_offline_driver = { + .show_available_states = pSeries_show_available_states, + .show_preferred_state = pSeries_show_preferred_state, + .store_preferred_state = pSeries_store_preferred_state, +}; /* This version can't take the spinlock, because it never returns */ static struct rtas_args rtas_stop_self_args = { @@ -281,6 +370,7 @@ static int __init pseries_cpu_hotplug_init(void) ppc_md.cpu_die = pseries_mach_cpu_die; smp_ops-cpu_disable = pseries_cpu_disable; smp_ops-cpu_die = pseries_cpu_die; + register_cpu_offline_driver(pSeries_offline_driver); /* Processors can be added/removed only on LPAR
[PATCH 1/3] cpu: Offline state Framework.
Provide an interface by which the system administrator can decide what state should the CPU go to when it is offlined. To query the available offline states, on needs to perform a read on: /sys/devices/system/cpu/cpunumber/available_offline_states To query or set the preferred offline state for a particular CPU, one needs to use the sysfs interface /sys/devices/system/cpu/cpunumber/preferred_offline_state This patch implements the architecture independent bits of the cpu-offline-state framework. The architecture specific bits are expected to register the actual code which implements the callbacks when the above mentioned sysfs interfaces are read or written into. Thus the values provided by reading available_offline_states are expected to vary with the architecture. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- drivers/base/cpu.c | 111 +++ include/linux/cpu.h | 15 +++ 2 files changed, 126 insertions(+), 0 deletions(-) diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index e62a4cc..1a63de0 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -56,26 +56,137 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut } static SYSDEV_ATTR(online, 0644, show_online, store_online); +static struct cpu_offline_driver *cpu_offline_driver; +static SYSDEV_ATTR(available_offline_states, 0444, NULL, NULL); +static SYSDEV_ATTR(preferred_offline_state, 0644, NULL, NULL); + +/* Should be called with cpu_add_remove_lock held */ +void cpu_offline_driver_add_cpu(struct sys_device *cpu_sys_dev) +{ + if (!cpu_offline_driver) + return; + + sysdev_create_file(cpu_sys_dev, attr_available_offline_states); + sysdev_create_file(cpu_sys_dev, attr_preferred_offline_state); +} + +/* Should be called with cpu_add_remove_lock held */ +void cpu_offline_driver_remove_cpu(struct sys_device *cpu_sys_dev) +{ + if (!cpu_offline_driver) + return; + + sysdev_remove_file(cpu_sys_dev, attr_available_offline_states); + sysdev_remove_file(cpu_sys_dev, attr_preferred_offline_state); + +} + static void __cpuinit register_cpu_control(struct cpu *cpu) { sysdev_create_file(cpu-sysdev, attr_online); + cpu_offline_driver_add_cpu(cpu-sysdev); } + void unregister_cpu(struct cpu *cpu) { int logical_cpu = cpu-sysdev.id; unregister_cpu_under_node(logical_cpu, cpu_to_node(logical_cpu)); + cpu_offline_driver_remove_cpu(cpu-sysdev); sysdev_remove_file(cpu-sysdev, attr_online); sysdev_unregister(cpu-sysdev); per_cpu(cpu_sys_devices, logical_cpu) = NULL; return; } + +static int __cpuinit +cpu_driver_callback(struct notifier_block *nfb, unsigned long action, + void *hcpu) +{ + struct sys_device *cpu_sysdev = per_cpu(cpu_sys_devices, + (unsigned long)(hcpu)); + + switch (action) { + case CPU_DEAD: + case CPU_DEAD_FROZEN: + cpu_offline_driver_remove_cpu(cpu_sysdev); + break; + + case CPU_ONLINE: + case CPU_ONLINE_FROZEN: + cpu_offline_driver_add_cpu(cpu_sysdev); + break; + default: + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata cpu_driver_notifier = { + .notifier_call = cpu_driver_callback, + .priority = 0 +}; + +int register_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver) +{ + int ret = 0; + cpu_maps_update_begin(); + + if (cpu_offline_driver != NULL) { + ret = -EEXIST; + goto out_unlock; + } + + if (!(arch_cpu_driver-show_available_states + arch_cpu_driver-show_preferred_state + arch_cpu_driver-store_preferred_state)) { + ret = -EINVAL; + goto out_unlock; + } + + attr_available_offline_states.show = + arch_cpu_driver-show_available_states; + attr_preferred_offline_state.show = + arch_cpu_driver-show_preferred_state; + attr_preferred_offline_state.store = + arch_cpu_driver-store_preferred_state; + + cpu_offline_driver = arch_cpu_driver; + +out_unlock: + cpu_maps_update_done(); + if (!ret) + register_cpu_notifier(cpu_driver_notifier); + return ret; +} + +void unregister_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver) +{ + cpu_maps_update_begin(); + + if (!cpu_offline_driver) { + WARN_ON(1); + cpu_maps_update_done(); + return; + } + + cpu_offline_driver = NULL; + attr_available_offline_states.show = NULL; + attr_preferred_offline_state.show = NULL; + attr_preferred_offline_state.store = NULL
[PATCH 3/3] pSeries: cpu: Cede CPU during a deactivate-offline
Implements the pSeries specific code bits to put the CPU into rtas_stop_self() state or H_CEDE state depending on the preferred_offline_state value for that CPU. Signed-off-by: Gautham R Shenoy e...@in.ibm.com --- arch/powerpc/platforms/pseries/hotplug-cpu.c| 70 +-- arch/powerpc/platforms/pseries/offline_driver.h |3 + arch/powerpc/platforms/pseries/plpar_wrappers.h |6 ++ arch/powerpc/platforms/pseries/smp.c| 18 ++ 4 files changed, 88 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index f15de99..5b47d6c 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -41,6 +41,8 @@ struct cpu_offline_state { }; DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = CPU_DEALLOCATE; +DEFINE_PER_CPU(enum cpu_state_vals, cpu_current_state); +DEFINE_PER_CPU(int, cpu_offline_ack); ssize_t pSeries_show_available_states(struct sys_device *dev, struct sysdev_attribute *attr, char *buf) @@ -148,8 +150,47 @@ static void pseries_mach_cpu_die(void) local_irq_disable(); idle_task_exit(); xics_teardown_cpu(); - unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow())); - rtas_stop_self(); + if (__get_cpu_var(preferred_offline_state) == CPU_DEACTIVATE) { + + __get_cpu_var(cpu_offline_ack) = 1; + get_lppaca()-idle = 1; + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 1; + + printk(KERN_INFO cpu %u (hwid %u) ceding\n, + smp_processor_id(), hard_smp_processor_id()); + + while (__get_cpu_var(cpu_current_state) == CPU_DEACTIVATE) { + cede_processor(); + printk(KERN_INFO cpu %u (hwid %u) Returned from cede.\ + Decrementer value: %x. Timebase value:%llx \n, + smp_processor_id(), hard_smp_processor_id(), + get_dec(), get_tb()); + } + + printk(KERN_INFO cpu %u (hwid %u) Received online PROD \n, + smp_processor_id(), hard_smp_processor_id()); + + if (!get_lppaca()-shared_proc) + get_lppaca()-donate_dedicated_cpu = 0; + get_lppaca()-idle = 0; + + unregister_slb_shadow(hard_smp_processor_id(), + __pa(get_slb_shadow())); + + /** +* NOTE: Calling start_secondary() here, is not a very nice +* way of beginning a new context. +* +* We need to reset the stack-pointer. +* Find a cleaner way to do this. +*/ + start_secondary(NULL); + } else { + unregister_slb_shadow(hard_smp_processor_id(), + __pa(get_slb_shadow())); + rtas_stop_self(); + } /* Should never get here... */ BUG(); for(;;); @@ -192,6 +233,10 @@ static int pseries_cpu_disable(void) /* FIXME: abstract this to not be platform specific later on */ xics_migrate_irqs_away(); + __get_cpu_var(cpu_current_state) = + __get_cpu_var(preferred_offline_state); + + __get_cpu_var(cpu_offline_ack) = 0; return 0; } @@ -201,11 +246,22 @@ static void pseries_cpu_die(unsigned int cpu) int cpu_status; unsigned int pcpu = get_hard_smp_processor_id(cpu); - for (tries = 0; tries 25; tries++) { - cpu_status = query_cpu_stopped(pcpu); - if (cpu_status == 0 || cpu_status == -1) - break; - cpu_relax(); + if (per_cpu(preferred_offline_state, cpu) == CPU_DEACTIVATE) { + /* Wait for some Ack */ + for (tries = 0; tries 1; tries++) { + cpu_status = !per_cpu(cpu_offline_ack, cpu); + if (!cpu_status) + break; + cpu_relax(); + } + + } else { + for (tries = 0; tries 25; tries++) { + cpu_status = query_cpu_stopped(pcpu); + if (cpu_status == 0 || cpu_status == -1) + break; + cpu_relax(); + } } if (cpu_status != 0) { printk(Querying DEAD? cpu %i (%i) shows %i\n, diff --git a/arch/powerpc/platforms/pseries/offline_driver.h b/arch/powerpc/platforms/pseries/offline_driver.h index bdae76a..571e085 100644 --- a/arch/powerpc/platforms/pseries/offline_driver.h +++ b/arch/powerpc/platforms/pseries/offline_driver.h @@ -12,5 +12,6 @@ enum
Re: [PATCH 0/3] cpu: idle state framework for offline CPUs.
Hi Shaohua, On Thu, Aug 06, 2009 at 09:58:55AM +0800, Shaohua Li wrote: Hi, On Wed, Aug 05, 2009 at 10:25:53PM +0800, Gautham R Shenoy wrote: In this patch-series, we propose to extend the CPU-Hotplug infrastructure and allow the system administrator to choose the desired state the CPU should go to when it is offlined. We think this approach addresses the concerns about determinism as well as transparency, since CPU-Hotplug already provides notification mechanism which the userspace can listen to for any change in the configuration and correspondingly readjust any previously set cpu-affinities. Peter dislikes any approach (including cpuhotplug) which breaks userspace policy, even userspace can get a notification. I think Peter's problem was more to do with the kernel offlining the CPUs behind the scenes, right ? We don't do that in this patch series. The option to offline the CPUs is very much with the admin. The patch-series only provides the interface that helps the admin choose the state the CPU must reside in when it goes offline. Also, approaches such as [1] can make use of this extended infrastructure instead of putting the CPU to an arbitrary C-state when it is offlined, thereby providing the system administrator a rope to hang himself with should he feel the need to do so. I didn't see the reason why administrator needs to know which state offline cpu should stay. Don't know about powerpc side, but in x86 side, it appears deepest C-state is already preferred. We can still provide a sane default value based on what states are available and what the BIOS limits us to. Thus we can still use the idle-state-offline patch that Venki posted sometime ago, right ? Thanks, Shaohua -- Thanks and Regards gautham ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] book3s_hv_rmhandlers:Pass the correct trap argument to kvmhv_commence_exit
In guest_exit_cont we call kvmhv_commence_exit which expects the trap number as the argument. However r3 doesn't contain the trap number at this point and as a result we would be calling the function with a spurious trap number. Fix this by copying r12 into r3 before calling kvmhv_commence_exit as r12 contains the trap number Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 4d70df2..f0d7c54 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1170,6 +1170,7 @@ mc_cont: bl kvmhv_accumulate_time #endif + mr r3, r12 /* Increment exit count, poke other threads to exit */ bl kvmhv_commence_exit nop -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] book3s_hv_rmhandlers:Pass the correct trap argument to kvmhv_commence_exit
Hi Sam, On Fri, Aug 14, 2015 at 03:07:28PM +1000, Sam Bobroff wrote: On Thu, May 21, 2015 at 01:57:04PM +0530, Gautham R. Shenoy wrote: In guest_exit_cont we call kvmhv_commence_exit which expects the trap number as the argument. However r3 doesn't contain the trap number at this point and as a result we would be calling the function with a spurious trap number. Fix this by copying r12 into r3 before calling kvmhv_commence_exit as r12 contains the trap number Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com Hi Gautham, I agree with your logic: r3 is quite clearly corrupted in that path. So: Reviewed-by: Sam Bobroff sam.bobr...@au1.ibm.com Just one comment: Do you have a case of this causing some visible problem due to the corrupted trap number? (I'll test the patch if you do.) Actually no! I found this bug while reviewing the code for some other issue. Cheers, Sam. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: powerpc: Add an inline function to update HID0
Hi Michael, On Tue, Aug 04, 2015 at 08:08:58PM +1000, Michael Ellerman wrote: On Tue, 2015-04-08 at 08:30:58 UTC, Gautham R. Shenoy wrote: Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create a function name update_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/kvm_ppc.h | 11 +++ Why is it in there? It's not KVM related per se. Ok. Will fix this. Where should it go? I think reg.h would be best, ideally near the definition for HID0, though that's probably not possible because of ASSEMBLY requirements. So at the bottom of reg.h ? diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index c6ef05b..325f1d6 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -685,4 +685,15 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb) extern void xics_wake_cpu(int cpu); +static inline void update_hid0(unsigned long hid0) +{ + /* +* The HID0 update should at the very least be preceded by a +* a SYNC instruction followed by an ISYNC instruction +*/ + mb(); + mtspr(SPRN_HID0, hid0); + isync(); That's going to turn into three separate inline asm blocks, which is maybe a bit unfortunate. Have you checked the generated code is what we want, ie. just sync, mtspr, isync ? Yes, the objdump of subcore.o shows exactly these three instructions: 7c 00 04 ac sync 7c 70 fb a6 mtspr 1008,r3 4c 00 01 2c isync cheers -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powerpc: Add an inline function to update HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create an inline function name update_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- [v1-- v2: Moved defn of update_hid0 to reg.h from kvm_ppc.h] arch/powerpc/include/asm/reg.h | 13 + arch/powerpc/platforms/powernv/subcore.c | 4 ++-- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index af56b5c..beb98ac 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -12,6 +12,8 @@ #include linux/stringify.h #include asm/cputable.h +#include asm/barrier.h +#include asm/synch.h /* Pickup Book E specific registers. */ #if defined(CONFIG_BOOKE) || defined(CONFIG_40x) @@ -1281,6 +1283,17 @@ struct pt_regs; extern void ppc_save_regs(struct pt_regs *regs); +static inline void update_hid0(unsigned long hid0) +{ + /* +* The HID0 update should at the very least be preceded by a +* a SYNC instruction followed by an ISYNC instruction +*/ + mb(); + mtspr(SPRN_HID0, hid0); + isync(); +} + #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_REG_H */ diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c index f60f80a..37e77bf 100644 --- a/arch/powerpc/platforms/powernv/subcore.c +++ b/arch/powerpc/platforms/powernv/subcore.c @@ -190,7 +190,7 @@ static void unsplit_core(void) hid0 = mfspr(SPRN_HID0); hid0 = ~HID0_POWER8_DYNLPARDIS; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); while (mfspr(SPRN_HID0) mask) @@ -227,7 +227,7 @@ static void split_core(int new_mode) /* Write new mode */ hid0 = mfspr(SPRN_HID0); hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); /* Wait for it to happen */ -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: Add an inline function to update HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create a function name update_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- arch/powerpc/include/asm/kvm_ppc.h | 11 +++ arch/powerpc/platforms/powernv/subcore.c | 4 ++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index c6ef05b..325f1d6 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -685,4 +685,15 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb) extern void xics_wake_cpu(int cpu); +static inline void update_hid0(unsigned long hid0) +{ + /* +* The HID0 update should at the very least be preceded by a +* a SYNC instruction followed by an ISYNC instruction +*/ + mb(); + mtspr(SPRN_HID0, hid0); + isync(); +} + #endif /* __POWERPC_KVM_PPC_H__ */ diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c index f60f80a..37e77bf 100644 --- a/arch/powerpc/platforms/powernv/subcore.c +++ b/arch/powerpc/platforms/powernv/subcore.c @@ -190,7 +190,7 @@ static void unsplit_core(void) hid0 = mfspr(SPRN_HID0); hid0 = ~HID0_POWER8_DYNLPARDIS; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); while (mfspr(SPRN_HID0) mask) @@ -227,7 +227,7 @@ static void split_core(int new_mode) /* Write new mode */ hid0 = mfspr(SPRN_HID0); hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value; - mtspr(SPRN_HID0, hid0); + update_hid0(hid0); update_hid_in_slw(hid0); /* Wait for it to happen */ -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: powerpc: Add an inline function to update HID0
Hi Segher, Thanks for the suggestions. I will rename the function to update_power8_hid0() and use asm volatile. On Tue, Aug 04, 2015 at 09:30:57PM -0500, Segher Boessenkool wrote: On Tue, Aug 04, 2015 at 08:08:58PM +1000, Michael Ellerman wrote: +static inline void update_hid0(unsigned long hid0) +{ + /* + * The HID0 update should at the very least be preceded by a + * a SYNC instruction followed by an ISYNC instruction + */ + mb(); + mtspr(SPRN_HID0, hid0); + isync(); That's going to turn into three separate inline asm blocks, which is maybe a bit unfortunate. Have you checked the generated code is what we want, ie. just sync, mtspr, isync ? The mb() is not such a great name anyway: you don't want a memory barrier, you want an actual sync instruction (sync 0, hwsync, whatever the currently preferred spelling is). The function name should also say this is for POWER8 (the required sequences are different for some other processors; and some others might not even _have_ a HID0, or not at 1008). power8_write_hid0 or such? For writing it as one asm, why not just asm volatile(sync ; mtspr %0,%1 ; isync : : i(SPRN_HID0), r(hid0)); instead of the stringify stuff? Segher -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3] powerpc: Add an inline function to update POWER8 HID0
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create an inline function name update_power8_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy e...@linux.vnet.ibm.com --- [v1 -- v2: Moved defn of update_hid0 to reg.h from kvm_ppc.h] [v2 -- v3: Renamed to update_power8_hid0 and used asm volatile] arch/powerpc/include/asm/reg.h | 9 + arch/powerpc/platforms/powernv/subcore.c | 4 ++-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index af56b5c..1245d99 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -1281,6 +1281,15 @@ struct pt_regs; extern void ppc_save_regs(struct pt_regs *regs); +static inline void update_power8_hid0(unsigned long hid0) +{ + /* +* The HID0 update on Power8 should at the very least be +* preceded by a a SYNC instruction followed by an ISYNC +* instruction +*/ + asm volatile(sync; mtspr %0,%1; isync:: i(SPRN_HID0), r(hid0)); +} #endif /* __ASSEMBLY__ */ #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_REG_H */ diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c index f60f80a..503a73f 100644 --- a/arch/powerpc/platforms/powernv/subcore.c +++ b/arch/powerpc/platforms/powernv/subcore.c @@ -190,7 +190,7 @@ static void unsplit_core(void) hid0 = mfspr(SPRN_HID0); hid0 = ~HID0_POWER8_DYNLPARDIS; - mtspr(SPRN_HID0, hid0); + update_power8_hid0(hid0); update_hid_in_slw(hid0); while (mfspr(SPRN_HID0) mask) @@ -227,7 +227,7 @@ static void split_core(int new_mode) /* Write new mode */ hid0 = mfspr(SPRN_HID0); hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value; - mtspr(SPRN_HID0, hid0); + update_power8_hid0(hid0); update_hid_in_slw(hid0); /* Wait for it to happen */ -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] book3s_hv: Handle H_DOORBELL on the guest exit path
Currently a CPU running a guest can receive a H_DOORBELL in the following two cases: 1) When the CPU is napping due to CEDE or there not being a guest vcpu. 2) The CPU is running the guest vcpu. Case 1), the doorbell message is not cleared since we were waking up from nap. Hence when the EE bit gets set on transition from guest to host, the H_DOORBELL interrupt is delivered to the host and the corresponding handler is invoked. However in Case 2), the message gets cleared by the action of taking the H_DOORBELL interrupt. Since the CPU was running a guest, instead of invoking the doorbell handler, the code invokes the second-level interrupt handler to switch the context from the guest to the host. At this point the setting of the EE bit doesn't result in the CPU getting the doorbell interrupt since it has already been delivered once. So, the handler for this doorbell is never invoked! This causes softlockups if the missed DOORBELL was as IPI sent from a sibling subcore CPU. This patch fixes it by explitly invoking the doorbell handler on the exit path if the exit reason is H_DOORBELL similar to the way an EXTERNAL interrupt is handled. Since this will also handle Case 1), we can unconditionally clear the doorbell message in kvmppc_check_wake_reason. Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index b98889e..106c7f9 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -150,6 +150,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) cmpwi cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL beq 11f + cmpwi r12, BOOK3S_INTERRUPT_H_DOORBELL + beq 15f /* Invoke the H_DOORBELL handler */ cmpwi cr2, r12, BOOK3S_INTERRUPT_HMI beq cr2, 14f/* HMI check */ @@ -174,6 +176,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) mtspr SPRN_HSRR1, r7 b hmi_exception_after_realmode +15:mtspr SPRN_HSRR0, r8 + mtspr SPRN_HSRR1, r7 + ba0xe80 + kvmppc_primary_no_guest: /* We handle this much like a ceded vcpu */ /* put the HDEC into the DEC, since HDEC interrupts don't wake us */ @@ -2436,14 +2442,19 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) /* hypervisor doorbell */ 3: li r12, BOOK3S_INTERRUPT_H_DOORBELL + + /* +* Clear the doorbell as we will invoke the handler +* explicitly in the guest exit path. +*/ + lis r6, (PPC_DBELL_SERVER << (63-36))@h + PPC_MSGCLR(6) /* see if it's a host IPI */ li r3, 1 lbz r0, HSTATE_HOST_IPI(r13) cmpwi r0, 0 bnelr - /* if not, clear it and return -1 */ - lis r6, (PPC_DBELL_SERVER << (63-36))@h - PPC_MSGCLR(6) + /* if not, return -1 */ li r3, -1 blr -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RESEND v4 3/4] cpufreq: powernv: Add a trace print for the throttle event
Hi Shilpa, Just saw this resend! On Tue, Jan 12, 2016 at 04:24:26AM -0600, Shilpasri G Bhat wrote: > Record the throttle event with a trace print replacing the printk, > except for events like throttling below nominal and occ reset > event which print a warning message. > > Signed-off-by: Shilpasri G Bhat> --- [..snip..] > > -static void powernv_cpufreq_throttle_check(void *data) > +static void powernv_cpufreq_check_pmax(void) ^^^ This function only contains code moved from powernv_cpufreq_throttle_check with pr_crit/pr_warns replaced by trace_powernv_throttle. Furthermore, it is not called from any other place. Given that the original function was ~60 lines do we really need to split it into two separate functions ? If yes, could it be an inline function ? > { > unsigned int cpu = smp_processor_id(); > unsigned int chip_id = pir_to_chip_id(hard_smp_processor_id()); > - unsigned long pmsr; > int pmsr_pmax, i; > > - pmsr = get_pmspr(SPRN_PMSR); > + pmsr_pmax = (s8)PMSR_MAX(get_pmspr(SPRN_PMSR)); > > for (i = 0; i < nr_chips; i++) > if (chips[i].id == chip_id) > break; > > - /* Check for Pmax Capping */ > - pmsr_pmax = (s8)PMSR_MAX(pmsr); > if (pmsr_pmax != powernv_pstate_info.max) { > if (chips[i].throttled) > - goto next; > + return; > + > chips[i].throttled = true; > if (pmsr_pmax < powernv_pstate_info.nominal) > - pr_crit("CPU %d on Chip %u has Pmax reduced below > nominal frequency (%d < %d)\n", > - cpu, chips[i].id, pmsr_pmax, > - powernv_pstate_info.nominal); > - else > - pr_info("CPU %d on Chip %u has Pmax reduced below turbo > frequency (%d < %d)\n", > - cpu, chips[i].id, pmsr_pmax, > - powernv_pstate_info.max); > + pr_warn_once("CPU %d on Chip %u has Pmax reduced below > nominal frequency (%d < %d)\n", > + cpu, chips[i].id, pmsr_pmax, > + powernv_pstate_info.nominal); > + > + trace_powernv_throttle(chips[i].id, > +throttle_reason[chips[i].throt_reason], > +pmsr_pmax); > } else if (chips[i].throttled) { > chips[i].throttled = false; > - pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu, > - chips[i].id, pmsr_pmax); > + trace_powernv_throttle(chips[i].id, > +throttle_reason[chips[i].throt_reason], > +pmsr_pmax); > } > +} > + > +static void powernv_cpufreq_throttle_check(void *data) > +{ > + unsigned long pmsr; > + > + pmsr = get_pmspr(SPRN_PMSR); > + > + /* Check for Pmax Capping */ > + powernv_cpufreq_check_pmax(); If you want to retain this function, you could pass pmsr as an argument instead of computing it afresh in powernv_cpufreq_check_pmax() > /* Check if Psafe_mode_active is set in PMSR. */ > -next: > if (pmsr & PMSR_PSAFE_ENABLE) { > throttled = true; > pr_info("Pstate set to safe frequency\n"); -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RESEND v4 4/4] cpufreq: powernv: Add sysfs attributes to show throttle stats
Hi Shilpa, On Tue, Jan 12, 2016 at 04:24:27AM -0600, Shilpasri G Bhat wrote: > +static inline int get_chip_index(struct kobject *kobj) Probably have "get_chip_index(int id)". See the reason below. > +{ > + int i, id; > + > + i = kstrtoint(kobj->name + 4, 0, ); > + if (i) > + return i; > + > + for (i = 0; i < nr_chips; i++) > + if (chips[i].id == id) > + return i; This pattern to obtain a chip index from the chip id is repeated in multiple place inside this file. Might be worthwhile to move this to a helper function, i.e get_chip_index(id)! > + return -EINVAL; > +} > + > +static ssize_t throttle_freq_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + int i, count = 0, id; > + We obtain the id from kobj here and then obtain the index from id via the function below. > + id = get_chip_index(kobj); > + if (id < 0) > + return id; > + > + for (i = 0; i < powernv_pstate_info.nr_pstates; i++) > + count += sprintf([count], "%d %d\n", > +powernv_freqs[i].frequency, > +chips[id].pstate_stat[i]); > + > + return count; > +} > + > +static struct kobj_attribute attr_throttle_frequencies = > +__ATTR(throttle_frequencies, 0444, throttle_freq_show, NULL); > + -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/2] workqueue:Fix affinity of an unbound worker of a node with 1 online CPU
With commit e9d867a67fd03ccc ("sched: Allow per-cpu kernel threads to run on online && !active"), __set_cpus_allowed_ptr() expects that only strict per-cpu kernel threads can have affinity to an online CPU which is not yet active. This assumption is currently broken in the CPU_ONLINE notification handler for the workqueues where restore_unbound_workers_cpumask() calls set_cpus_allowed_ptr() when the first cpu in the unbound worker's pool->attr->cpumask comes online. Since set_cpus_allowed_ptr() is called with pool->attr->cpumask in which only one CPU is online which is not yet active, we get the following WARN_ON during an CPU online operation. [ cut here ] WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166 __set_cpus_allowed_ptr+0x228/0x2e0 Modules linked in: CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4 <..snip..> Call Trace: [c00f273ff920] [c010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable) [c00f273ffac0] [c00ed4b0] workqueue_cpu_up_callback+0x2c0/0x470 [c00f273ffb70] [c00f5c58] notifier_call_chain+0x98/0x100 [c00f273ffbc0] [c00c5ed0] __cpu_notify+0x70/0xe0 [c00f273ffc00] [c00c6028] notify_online+0x38/0x50 [c00f273ffc30] [c00c5214] cpuhp_invoke_callback+0x84/0x250 [c00f273ffc90] [c00c562c] cpuhp_up_callbacks+0x5c/0x120 [c00f273ffce0] [c00c64d4] cpuhp_thread_fun+0x184/0x1c0 [c00f273ffd20] [c00fa050] smpboot_thread_fn+0x290/0x2a0 [c00f273ffd80] [c00f45b0] kthread+0x110/0x130 [c00f273ffe30] [c0009570] ret_from_kernel_thread+0x5c/0x6c ---[ end trace 00f1456578b2a3b2 ]--- This patch sets the affinity of the worker to a) the only online CPU in the cpumask of the worker pool when it comes online. b) the cpumask of the worker pool when the second CPU in the pool's cpumask comes online. Reported-by: Abdul Haleem <abdha...@linux.vnet.ibm.com> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Cc: Tejun Heo <hte...@gmail.com> Cc: Michael Ellerman <m...@ellerman.id.au> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> --- kernel/workqueue.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index e412794..1199f73 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4586,7 +4586,7 @@ static void rebind_workers(struct worker_pool *pool) * * An unbound pool may end up with a cpumask which doesn't have any online * CPUs. When a worker of such pool get scheduled, the scheduler resets - * its cpus_allowed. If @cpu is in @pool's cpumask which didn't have any + * its cpus_allowed. If @cpu is in @pool's cpumask which had at most one * online CPU before, cpus_allowed of all its workers should be restored. */ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) @@ -4600,15 +4600,26 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) if (!cpumask_test_cpu(cpu, pool->attrs->cpumask)) return; - /* is @cpu the only online CPU? */ cpumask_and(, pool->attrs->cpumask, cpu_online_mask); - if (cpumask_weight() != 1) + + /* +* The affinity needs to be set +* a) to @cpu when that is the only online CPU in +*pool->attrs->cpumask. +* b) to pool->attrs->cpumask when exactly two CPUs in +*pool->attrs->cpumask are online. This affinity will be +*retained when subsequent CPUs come online. +*/ + if (cpumask_weight() > 2) return; + if (cpumask_weight() == 2) + cpumask_copy(, pool->attrs->cpumask); + /* as we're called from CPU_ONLINE, the following shouldn't fail */ for_each_pool_worker(worker, pool) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, - pool->attrs->cpumask) < 0); + ) < 0); } /* -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] workqueue: Move wq_update_unbound_numa() to the beginning of CPU_ONLINE
Currently in the CPU_ONLINE workqueue handler, the restore_unbound_workers_cpumask() will never call set_cpus_allowed_ptr() for a newly created unbound worker thread. This is because the function which creates a new unbound worker thread when the first CPU in the node comes online [wq_update_unbound_numa()] is invoked after the call to restore_unbound_workers_cpumask(). Thus the affinity is never set for this worker thread when the first CPU in the node comes online. Furthermore, due to an optimization in restore_unbound_workers_cpumask(), set_cpus_allowed_ptr() is not called when subsequent CPUs in the node come online since it assumes that the affinity would have been set when the first CPU has come online. This patch fixes this issue by invoking wq_update_unbound_numa() before the calling restore_unbound_workers_cpumask(). Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Cc: Tejun Heo <hte...@gmail.com> Cc: Michael Ellerman <m...@ellerman.id.au> Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> --- kernel/workqueue.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index e1c0e99..e412794 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4638,6 +4638,10 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(_pool_mutex); + /* update NUMA affinity of unbound workqueues */ + list_for_each_entry(wq, , list) + wq_update_unbound_numa(wq, cpu, true); + for_each_pool(pool, pi) { mutex_lock(>attach_mutex); @@ -4649,10 +4653,6 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, mutex_unlock(>attach_mutex); } - /* update NUMA affinity of unbound workqueues */ - list_for_each_entry(wq, , list) - wq_update_unbound_numa(wq, cpu, true); - mutex_unlock(_pool_mutex); break; } -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/2] Fix CPU Online handling for unbounded worker threads
Hi, This patchset fixes a couple of issues in the CPU_ONLINE notification handling for the workqueues with respect to unbounded worker threads. Patch 1 ensures that the affinity of a unbound worker thread associated with a node whose very first CPU has come online is set correctly. In the existing code path we will never call set_cpus_allowed_ptr() for unbound worker threads that have been created on a CPU Online operation after boot. Patch 2 fixes the following WARN_ON() reported by Abdul when set_cpus_allowed_ptr() for an unbound worker thread is invoked when only one of the CPUs in its cpumask is online but not yet active. [ cut here ] WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166 __set_cpus_allowed_ptr+0x21c/0x290 Modules linked in: CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest #1 task: c00f27284200 ti: c00f273fc000 task.ti: c00f273fc000 NIP: c010488c LR: c0104874 CTR: REGS: c00f273ff7d0 TRAP: 0700 Not tainted (4.6.0-autotest) MSR: 900100029033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]> CR: 28002804 XER: 2000 CFAR: c05b0888 SOFTE: 0 GPR00: c010478c c00f273ffa50 c13ce400 GPR04: c140ed98 0800 c007f64d9408 GPR08: 0028 c140ee90 0020 GPR12: 2200 cfb96800 c00f44a8 c007fa158480 GPR16: c007fc621a70 c00f2721f800 0001 GPR20: c1571ef0 c134879f c12bc510 GPR24: 0100 c140ea98 c007f64d9408 GPR28: c007fbc21c00 ffea c00f2728 NIP [c010488c] __set_cpus_allowed_ptr+0x21c/0x290 LR [c0104874] __set_cpus_allowed_ptr+0x204/0x290 Call Trace: [c00f273ffa50] [c010478c] __set_cpus_allowed_ptr+0x11c/0x290 (unreliable) [c00f273ffac0] [c00ed4b0] workqueue_cpu_up_callback+0x2c0/0x470 [c00f273ffb70] [c00f5c58] notifier_call_chain+0x98/0x100 [c00f273ffbc0] [c00c5ed0] __cpu_notify+0x70/0xe0 [c00f273ffc00] [c00c6028] notify_online+0x38/0x50 [c00f273ffc30] [c00c5214] cpuhp_invoke_callback+0x84/0x250 [c00f273ffc90] [c00c562c] cpuhp_up_callbacks+0x5c/0x120 [c00f273ffce0] [c00c64d4] cpuhp_thread_fun+0x184/0x1c0 [c00f273ffd20] [c00fa050] smpboot_thread_fn+0x290/0x2a0 [c00f273ffd80] [c00f45b0] kthread+0x110/0x130 [c00f273ffe30] [c0009570] ret_from_kernel_thread+0x5c/0x6c Instruction dump: 419eff3c 3d420004 38a00800 388a0998 7f63db78 484abfa1 6000 2fa3 409eff1c 813f0378 2f890001 419eff10 <0fe0> 4b08 6000 6000 ---[ end trace cbc1c5cfbc9591d0 ]--- The patches are based on 4.7-rc2. I have tested the patches on a multi-node x86_64 and a ppc64 Gautham R. Shenoy (2): workqueue: Move wq_update_unbound_numa() to the beginning of CPU_ONLINE workqueue:Fix affinity of an unbound worker of a node with 1 online CPU kernel/workqueue.c | 27 +++ 1 file changed, 19 insertions(+), 8 deletions(-) -- 1.9.3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 0/2] Fix CPU Online handling for unbounded worker threads
Hi Peter, Thomas, On Tue, Jun 07, 2016 at 08:44:01PM +0530, Gautham R. Shenoy wrote: > Hi, > > This patchset fixes a couple of issues in the CPU_ONLINE notification > handling for the workqueues with respect to unbounded worker threads. Any thoughts on these patches ? They fix a race which was causing WARN_ON() to be consistently reproduced on POWER machines since 4.6. Could you please review these patches ? -- Thanks and Regard gautham. > > Patch 1 ensures that the affinity of a unbound worker thread > associated with a node whose very first CPU has come online is set > correctly. In the existing code path we will never call > set_cpus_allowed_ptr() for unbound worker threads that have been > created on a CPU Online operation after boot. > > Patch 2 fixes the following WARN_ON() reported by Abdul when > set_cpus_allowed_ptr() for an unbound worker thread is invoked when > only one of the CPUs in its cpumask is online but not yet active. > > [ cut here ] > WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166 > __set_cpus_allowed_ptr+0x21c/0x290 > Modules linked in: > CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest #1 > task: c00f27284200 ti: c00f273fc000 task.ti: c00f273fc000 > NIP: c010488c LR: c0104874 CTR: > REGS: c00f273ff7d0 TRAP: 0700 Not tainted (4.6.0-autotest) > MSR: 900100029033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]> CR: 28002804 XER: > 2000 > CFAR: c05b0888 SOFTE: 0 > GPR00: c010478c c00f273ffa50 c13ce400 > GPR04: c140ed98 0800 c007f64d9408 > GPR08: 0028 c140ee90 0020 > GPR12: 2200 cfb96800 c00f44a8 c007fa158480 > GPR16: c007fc621a70 c00f2721f800 0001 > GPR20: c1571ef0 c134879f c12bc510 > GPR24: 0100 c140ea98 c007f64d9408 > GPR28: c007fbc21c00 ffea c00f2728 > NIP [c010488c] __set_cpus_allowed_ptr+0x21c/0x290 > LR [c0104874] __set_cpus_allowed_ptr+0x204/0x290 > Call Trace: > [c00f273ffa50] [c010478c] __set_cpus_allowed_ptr+0x11c/0x290 > (unreliable) > [c00f273ffac0] [c00ed4b0] workqueue_cpu_up_callback+0x2c0/0x470 > [c00f273ffb70] [c00f5c58] notifier_call_chain+0x98/0x100 > [c00f273ffbc0] [c00c5ed0] __cpu_notify+0x70/0xe0 > [c00f273ffc00] [c00c6028] notify_online+0x38/0x50 > [c00f273ffc30] [c00c5214] cpuhp_invoke_callback+0x84/0x250 > [c00f273ffc90] [c00c562c] cpuhp_up_callbacks+0x5c/0x120 > [c00f273ffce0] [c00c64d4] cpuhp_thread_fun+0x184/0x1c0 > [c00f273ffd20] [c00fa050] smpboot_thread_fn+0x290/0x2a0 > [c00f273ffd80] [c00f45b0] kthread+0x110/0x130 > [c00f273ffe30] [c0009570] ret_from_kernel_thread+0x5c/0x6c > Instruction dump: > 419eff3c 3d420004 38a00800 388a0998 7f63db78 484abfa1 6000 2fa3 > 409eff1c 813f0378 2f890001 419eff10 <0fe0> 4b08 60000000 6000 > ---[ end trace cbc1c5cfbc9591d0 ]--- > > The patches are based on 4.7-rc2. I have tested the patches on a > multi-node x86_64 and a ppc64 > > Gautham R. Shenoy (2): > workqueue: Move wq_update_unbound_numa() to the beginning of > CPU_ONLINE > workqueue:Fix affinity of an unbound worker of a node with 1 online > CPU > > kernel/workqueue.c | 27 +++ > 1 file changed, 19 insertions(+), 8 deletions(-) > > -- > 1.9.3 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v4 08/10] powerpc/powernv: Add platform support for stop instruction
Hi Shreyas, On Tue, May 24, 2016 at 06:45:12PM +0530, Shreyas B. Prabhu wrote: > POWER ISA v3 defines a new idle processor core mechanism. In summary, > a) new instruction named stop is added. This instruction replaces > instructions like nap, sleep, rvwinkle. > b) new per thread SPR named Processor Stop Status and Control Register > (PSSCR) is added which controls the behavior of stop instruction. > > PSSCR layout: > -- > | PLS | /// | SD | ESL | EC | PSLL | /// | TR | MTL | RL | > -- > 0 4 41 4243 44 4854 5660 > > PSSCR key fields: > Bits 0:3 - Power-Saving Level Status. This field indicates the lowest > power-saving state the thread entered since stop instruction was last > executed. > > Bit 42 - Enable State Loss > 0 - No state is lost irrespective of other fields > 1 - Allows state loss > > Bits 44:47 - Power-Saving Level Limit > This limits the power-saving level that can be entered into. > > Bits 60:63 - Requested Level > Used to specify which power-saving level must be entered on executing > stop instruction > > This patch adds support for stop instruction and PSSCR handling. This version looks good to me. Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] workqueue:Fix affinity of an unbound worker of a node with 1 online CPU
Hi Peter, On Tue, Jun 14, 2016 at 01:22:34PM +0200, Peter Zijlstra wrote: > On Tue, Jun 07, 2016 at 08:44:03PM +0530, Gautham R. Shenoy wrote: > > I'm still puzzled why we don't see this on x86. Afaict there's nothing > PPC specific about this. You are right. On PPC, at boot time we hit the WARN_ON like once in 5 times. Using some debug prints, I have verified that these are instances when the workqueue subsystem gets initialized before all the CPUs come online. On x86, I have never been able to hit this since it appears that every time the workqueues get initialized only after all the CPUs have come online. PPC doesn't uses any specific unbound workqueue early in the boot. The unbound workqueues causing the WARN_ON() were the "events_unbound" workqueue which was created by workqueue_init(). = [WQ] Creating Unbound workers for WQ events_unbound,cpumask 0-127. online mask 0 [WQ] Creating Unbound workers for WQ events_unbound,cpumask 0-31. online mask 0 [WQ] Creating Unbound workers for WQ events_unbound,cpumask 32-63. online mask 0 [WQ] Creating Unbound workers for WQ events_unbound,cpumask 64-95. online mask 0 [WQ] Creating Unbound workers for WQ events_unbound,cpumask 96-127. online mask 0 = Also, with the first patch in the series (which ensures that restore_unbound_workers are called *after* the new workers for the newly onlined CPUs are created) and without this one, you can reproduce this WARN_ON on both x86 and PPC by offlining all the CPUs of a node and bringing just one of them online. So essentially the BUG fixed by the previous patch is currently hiding this BUG which is why we are not able to reproduce this WARN_ON() with CPU-hotplug once the system has booted. > > > This patch sets the affinity of the worker to > > a) the only online CPU in the cpumask of the worker pool when it comes > >online. > > b) the cpumask of the worker pool when the second CPU in the pool's > >cpumask comes online. > > This basically works around the WARN conditions, which I suppose is fair > enough, but I would like a note here to revisit this once the whole cpu > hotplug rework has settled. > Sure. > The real problem is that workqueues seem to want to create worker > threads before there's anybody who would use them or something like > that. I am not sure about that. The workqueue creates unbound workers for a node via wq_update_unbound_numa() whenever the first CPU of every node comes online. So that seems legitimate. It then tries to affine these workers to the cpumask of that node. Again this seems right. As an optimization, it does this only when the first CPU of the node comes online. Since this online CPU is not yet active, and since nr_cpus_allowed > 1, we will hit the WARN_ON(). However, I agree with you that during boot-up, the workqueue subsystem needs to create unbound worker threads for only the online CPUs (instead of all possible CPUs as it currently does!) and let the CPU_ONLINE notification take care of creating the remaining workers when they are really required. > > Or is that what PPC does funny? Use an unbound workqueue this early in > cpu bringup? Like I pointed out above, PPC doesn't use an unbound workqueue early in the CPU bring up. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] workqueue:Fix affinity of an unbound worker of a node with 1 online CPU
On Wed, Jun 15, 2016 at 01:32:49PM +0200, Peter Zijlstra wrote: > On Wed, Jun 15, 2016 at 03:49:36PM +0530, Gautham R Shenoy wrote: > > > Also, with the first patch in the series (which ensures that > > restore_unbound_workers are called *after* the new workers for the > > newly onlined CPUs are created) and without this one, you can > > reproduce this WARN_ON on both x86 and PPC by offlining all the CPUs > > of a node and bringing just one of them online. > > Ah good. > > > I am not sure about that. The workqueue creates unbound workers for a > > node via wq_update_unbound_numa() whenever the first CPU of every node > > comes online. So that seems legitimate. It then tries to affine these > > workers to the cpumask of that node. Again this seems right. As an > > optimization, it does this only when the first CPU of the node comes > > online. Since this online CPU is not yet active, and since > > nr_cpus_allowed > 1, we will hit the WARN_ON(). > > So I had another look and isn't the below a much simpler solution? > > It seems to work on my x86 with: > > for i in /sys/devices/system/cpu/cpu*/online ; do echo 0 > $i ; done > for i in /sys/devices/system/cpu/cpu*/online ; do echo 1 > $i ; done > > without complaint. Yup. This will work on PPC as well. We will no longer have the optimization in restore_unbound_workers_cpumask() but I suppose we don't lose much by resetting the affinity every time a CPU in the pool->attr->cpumask comes online. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] workqueue: Move wq_update_unbound_numa() to the beginning of CPU_ONLINE
On Tue, Jun 21, 2016 at 09:47:19PM +0200, Peter Zijlstra wrote: > On Tue, Jun 21, 2016 at 03:43:56PM -0400, Tejun Heo wrote: > > On Tue, Jun 21, 2016 at 09:37:09PM +0200, Peter Zijlstra wrote: > > > Hurm.. So I've applied it, just to get this issue sorted, but I'm not > > > entirely sure I like it. > > > > > > I think I prefer ego's version because that makes it harder to get stuff > > > to run on !active,online cpus. I think we really want to be careful what > > > gets to run during that state. > > > > The original patch just did set_cpus_allowed one more time late enough > > so that the target kthread (in most cases) doesn't have to go through > > fallback rq selection afterwards. I don't know what the long term > > solution is but CPU_ONLINE callbacks should be able to bind kthreads > > to the new CPU one way or the other. > > Fair enough; clearly I need to stare harder. In any case, patch is on > its way to sched/urgent. Thanks Tejun, Peter! > -- Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] workqueue: Move wq_update_unbound_numa() to the beginning of CPU_ONLINE
Hello Tejun, On Wed, Jun 15, 2016 at 11:53:50AM -0400, Tejun Heo wrote: > Hello, > > On Tue, Jun 07, 2016 at 08:44:02PM +0530, Gautham R. Shenoy wrote: > > Currently in the CPU_ONLINE workqueue handler, the > > restore_unbound_workers_cpumask() will never call > > set_cpus_allowed_ptr() for a newly created unbound worker thread. > > Hmmm... did you actually verify that this happens? A new kworker > always gets bound to the cpumask that it's assigned to in > create_worker(). Yes I have verified that this happens despite the fact that create_worker() calls kthread_bind_mask() to bind the worker thread to attrs->cpumask. However, this doesn't seem to be sufficient. Consider the following case of a 2-node POWER machine running 4.7-rc3. CPUs 0-79 belong to the 1st node and CPUs 80-159 belong to the second. == root@fir01:~# uname -r 4.7.0-rc3-vanilla root@fir01:~# numactl -H available: 2 nodes (0,8) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 node 0 size: 65246 MB node 0 free: 64025 MB node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 node 8 size: 65304 MB node 8 free: 64985 MB node distances: node 0 8 0: 10 40 8: 40 10 == If we inspect the unbound worker threads affinity we will observe that the ordered unbound worker threads (pids 6, 1088, 1122) are affined to the online CPUs while the remaining unbound worker threads are affined to the nodemask. == root@fir01:/home/ego# ./pr_unbound_workers_affinity.sh #See [1] below pid 6's current affinity list: 0-159 pid 7's current affinity list: 0-79 pid 1018's current affinity list: 80-159 pid 1054's current affinity list: 80-159 pid 1088's current affinity list: 0-159 pid 1089's current affinity list: 0-79 pid 1090's current affinity list: 80-159 pid 1122's current affinity list: 0-159 pid 1176's current affinity list: 0-79 pid 3683's current affinity list: 0-79 == At this point if we offline all but CPU0, the only unbound workers that exist are the unordered workers and those affined to the first node. == root@fir01:/home/ego# ./cpuhp.sh 0 1 159 #See [2] below root@fir01:/home/ego# ./pr_unbound_workers_affinity.sh pid 6's current affinity list: 0 pid 7's current affinity list: 0 pid 1088's current affinity list: 0 pid 1089's current affinity list: 0 pid 1122's current affinity list: 0 pid 1176's current affinity list: 0 pid 3683's current affinity list: 0 == We now online CPU80 which is the first CPU in the second node. We would expect that an unbound worker thread corresponding to the second node would be created and would have the mask 80-159. However, the newly created workers (pid 4109 and 4110) are affined to CPU0 instead of CPU80! == root@fir01:/home/ego# ./cpuhp.sh 1 80 80 root@fir01:/home/ego# ./pr_unbound_workers_affinity.sh pid 6's current affinity list: 0,80 pid 7's current affinity list: 0 pid 1088's current affinity list: 0,80 pid 1089's current affinity list: 0 pid 1122's current affinity list: 0,80 pid 1176's current affinity list: 0 pid 3683's current affinity list: 0 pid 4109's current affinity list: 0 pid 4110's current affinity list: 0 == Furthermore, if we now bring all the CPUs online, we don't expect new worker threads to be created since the representative for the second node would have been created with CPU80 coming online. However, we do expect that those worker threads are affined to CPUs 80-159. But that's not the case either! == root@fir01:/home/ego# ./cpuhp.sh 1 1 159 root@fir01:/home/ego# ./pr_unbound_workers_affinity.sh pid 6's current affinity list: 0-159 pid 7's current affinity list: 0-79 pid 1088's current affinity list: 0-159 pid 1089's current affinity list: 0-79 pid 1122's current affinity list: 0-159 pid 1176's current affinity list: 0-79 pid 3683's current affinity list: 0-79 pid 4109's current affinity list: 0 pid 4110's current affinity list: 0
Re: [PATCH 1/2] workqueue: Move wq_update_unbound_numa() to the beginning of CPU_ONLINE
Hi Tejun, On Thu, Jun 16, 2016 at 03:35:04PM -0400, Tejun Heo wrote: > Hello, > > So, the issue of the initial worker not having its affinity set > correctly wasn't caused by the order of the operations. Reordering > just made set_cpus_allowed tried one more time late enough so that it > hides the race condition most of the time. The problem is that > CPU_ONLINE callbacks are called while the cpu being onlined is online > but not active and select_fallback_rq() only considers active cpus, so > if a kthread gets scheduled in the meantime and it doesn't have any > cpu which is active in its allowed mask, it's allowed mask gets reset > to cpu_possible_mask. > > Would something like the following make sense? > > Thanks. > -- 8< -- > Subject: [PATCH] sched: allow kthreads to fallback to online && !active cpus > > During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is > online but not active. A CPU_ONLINE callback may create or bind a > kthread so that its cpus_allowed mask only allows the CPU which is > being brought online. The kthread may start executing before the CPU > is made active and can end up in select_fallback_rq(). > > In such cases, the expected behavior is selecting the CPU which is > coming online; however, because select_fallback_rq() only chooses from > active CPUs, it determines that the task doesn't have any viable CPU > in its allowed mask and ends up overriding it to cpu_possible_mask. > > CPU_ONLINE callbacks should be able to put kthreads on the CPU which > is coming online. Update select_fallback_rq() so that it follows > cpu_online() rather than cpu_active() for kthreads. > > Signed-off-by: Tejun Heo <t...@kernel.org> > Reported-by: Gautham R Shenoy <e...@linux.vnet.ibm.com> Hi Tejun, This patch fixes the issue on POWER. I am able to see the worker threads of the unbound workqueues of the newly onlined node with this. Tested-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> > --- > kernel/sched/core.c |4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 017d539..a12e3db 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1536,7 +1536,9 @@ static int select_fallback_rq(int cpu, struct > task_struct *p) > for (;;) { > /* Any allowed, online CPU? */ > for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) { > - if (!cpu_active(dest_cpu)) > + if (!(p->flags & PF_KTHREAD) && !cpu_active(dest_cpu)) > + continue; > + if (!cpu_online(dest_cpu)) > continue; > goto out; > } > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 6/6] cpufreq: powernv: Add sysfs attributes to show throttle stats
Hi Shilpa, A minor nit. On Thu, Jan 28, 2016 at 12:55:41PM +0530, Shilpasri G Bhat wrote: [..snip..] > + > +What: > /sys/devices/system/cpu/cpufreq/chip*/throttle_reasons/ > +Date:Jan 2016 > +Contact: Linux kernel mailing list> + Linux for PowerPC mailing list > +Description: CPU Frequency throttle reason stat for the chip > + > + This directory contains throttle reason files. Each file gives > + the total number of times the max frequency is throttled, except > + for 'unthrottle_count', which gives the total number of times > + the max frequency is unthrottled after being throttled. Below > + are the reason attributes. > + > + cpu_over_temperature: Throttled due to cpu over temperature > + > + occ_reset: Throttled due to reset of OCC > + > + over_current: Throttled due to over current Overcurrent is a single word. No need of the extra space. You could fix that and add my Reviewed-by. -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 1/6] cpufreq: powernv: Free 'chips' on module exit
On Thu, Jan 28, 2016 at 12:55:36PM +0530, Shilpasri G Bhat wrote: > This will free the dynamically allocated memory of'chips' on > module exit. > > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 3/6] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path
On Thu, Jan 28, 2016 at 12:55:38PM +0530, Shilpasri G Bhat wrote: > cpu_to_chip_id() does a DT walk through to find out the chip id by > taking a contended device tree lock. This adds an unnecessary overhead > in a hot path. So instead of calling cpu_to_chip_id() everytime cache > the chip ids for all cores in the array 'core_to_chip_map' and use it > in the hotpath. > > Reported-by: Anton Blanchard <an...@samba.org> > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 5/6] cpufreq: powernv: Replace pr_info with trace print for throttle event
On Thu, Jan 28, 2016 at 12:55:40PM +0530, Shilpasri G Bhat wrote: > Currently we use printk message to notify the throttle event. But this > can flood the console if the cpu is throttled frequently. So replace the > printk with the tracepoint to notify the throttle event. And also events > like throttle below nominal frequency and OCC_RESET are reduced to > pr_warn/pr_warn_once as pointed by MFG to not mark them as critical > messages. This patch adds 'throttle_reason' to struct chip to store the > throttle reason. > > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v8 6/6] cpufreq: powernv: Add sysfs attributes to show throttle stats
Hi Viresh, > > What I can suggest is: > - Move this directory inside cpuX/cpufreq/ directory, in a similar way > as to how we create 'stats' directory today. > - You can then get policy->cpu, to get chip->id out of it. > - The only disadvantage here is that the same chip directory will be > replicated in multiple policies, but that makes it more readable. Thinking about it, having a sysfs group attached to a policy kobject looks ok if replication of the same chip information across multiple policies is not objectionable. Regarding the table-format, it breaks the sysfs's one-value-per-file rule. So I would still prefer each throttle reason being a separate file which gives the number of times the chip frequency was throttled due to that reason. We can live without the per-frequency throttle stats listed in the throttle_status. So, would the following be sysfs group structure be acceptable? $ls -1 /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/ unthrottle powercap overtemp supply_fault overcurrent occ_reset turbo_stat sub_turbo_stat -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v5 5/5] cpufreq: powernv: Add sysfs attributes to show throttle stats
On Thu, Jan 21, 2016 at 03:08:59PM +0530, Shilpasri G Bhat wrote: > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v6 2/5] cpufreq: powernv: Remove cpu_to_chip_id() from hot-path
Hello Balbir, On Sat, Jan 23, 2016 at 07:59:20PM +1100, Balbir Singh wrote: > On Fri, 22 Jan 2016 12:49:02 +0530 > Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> wrote: > > > cpu_to_chip_id() does a DT walk through to find out the chip id by > > taking a contended device tree lock. This adds an unnecessary overhead > > in a hot path. So instead of calling cpu_to_chip_id() everytime cache > > the chip ids for all cores in the array 'core_to_chip_map' and use it > > in the hotpath. > > > > Reported-by: Anton Blanchard <an...@samba.org> > > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> > > Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> > > snip > > Does the core_to_chip_map need to be updated/refreshed on/after/ a > cpu (core) hotplug? I presume id's don't change No, the id's don't change on cpu/core hotplug. The core_to_chip_map is initialized in init_chip_info() where we use for_each_possible_cpu(). Thanks for the review! > > Balbir > -- Thanks and Regards gautham. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 2/9] powerpc/kvm: make hypervisor state restore a function
Hi Shreyas, On Tue, May 03, 2016 at 01:54:31PM +0530, Shreyas B. Prabhu wrote: > In the current code, when the thread wakes up in reset vector, some > of the state restore code and check for whether a thread needs to > branch to kvm is duplicated. Reorder the code such that this > duplication is avoided. > > At a higher level this is what the change looks like- I have manually verified that the code flow in the new patch is has the same effect as whatever we were doing earlier. There a couple of comments inline. > diff --git a/arch/powerpc/kernel/exceptions-64s.S > b/arch/powerpc/kernel/exceptions-64s.S > index 7716ceb..7ebfbb0 100644 > --- a/arch/powerpc/kernel/exceptions-64s.S > +++ b/arch/powerpc/kernel/exceptions-64s.S > @@ -107,25 +107,8 @@ BEGIN_FTR_SECTION > beq 9f > > cmpwi cr3,r13,2 > + bl power7_restore_hyp_resource > > - /* > - * Check if last bit of HSPGR0 is set. This indicates whether we are > - * waking up from winkle. > - */ > - GET_PACA(r13) > - clrldi r5,r13,63 > - clrrdi r13,r13,1 > - cmpwi cr4,r5,1 > - mtspr SPRN_HSPRG0,r13 > - > - lbz r0,PACA_THREAD_IDLE_STATE(r13) > - cmpwi cr2,r0,PNV_THREAD_NAP > - bgt cr2,8f /* Either sleep or Winkle */ > - > - /* Waking up from nap should not cause hypervisor state loss */ > - bgt cr3,. > - > - /* Waking up from nap */ > li r0,PNV_THREAD_RUNNING > stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ > > @@ -143,13 +126,9 @@ BEGIN_FTR_SECTION > > /* Return SRR1 from power7_nap() */ > mfspr r3,SPRN_SRR1 > - beq cr3,2f > - b power7_wakeup_noloss > -2: b power7_wakeup_loss > - > - /* Fast Sleep wakeup on PowerNV */ > -8: GET_PACA(r13) In the old code, we do a GET_PACA(r13) before invoking the power7_wakeup_tb_loss. In the new code we don't. Can you explain this omission ? [..snip..] > @@ -420,33 +451,9 @@ common_exit: > > hypervisor_state_restored: > > - li r5,PNV_THREAD_RUNNING > - stb r5,PACA_THREAD_IDLE_STATE(r13) > - > mtspr SPRN_SRR1,r16 > -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE > - li r0,KVM_HWTHREAD_IN_KERNEL > - stb r0,HSTATE_HWTHREAD_STATE(r13) > - /* Order setting hwthread_state vs. testing hwthread_req */ > - sync > - lbz r0,HSTATE_HWTHREAD_REQ(r13) > - cmpwi r0,0 > - beq 6f > - b kvm_start_guest > -6: > -#endif > - > - REST_NVGPRS(r1) > - REST_GPR(2, r1) > - ld r3,_CCR(r1) > - ld r4,_MSR(r1) > - ld r5,_NIP(r1) > - addir1,r1,INT_FRAME_SIZE > - mtcrr3 > - mfspr r3,SPRN_SRR1/* Return SRR1 */ > - mtspr SPRN_SRR1,r4 > - mtspr SPRN_SRR0,r5 > - rfid > + mtlrr17 > + blr Perhaps you could add a comment against this blr to indicate that we go back to the reset vector right after the call to power7_restore_hyp_resource. > > fastsleep_workaround_at_exit: > li r3,1 > -- > 2.4.11 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 6/9] powerpc/powernv: set power_save func after the idle states are initialized
On Tue, May 03, 2016 at 01:54:35PM +0530, Shreyas B. Prabhu wrote: > pnv_init_idle_states discovers supported idle states from the > device tree and does the required initialization. Set power_save > function pointer only after this initialization is done > > Signed-off-by: Shreyas B. Prabhy <shre...@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> > --- > arch/powerpc/platforms/powernv/idle.c | 3 +++ > arch/powerpc/platforms/powernv/setup.c | 2 +- > 2 files changed, 4 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index fcc8b68..fbb09fb 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -285,6 +285,9 @@ static int __init pnv_init_idle_states(void) > } > > pnv_alloc_idle_core_states(); > + > + if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED) > + ppc_md.power_save = power7_idle; > out_free: > kfree(flags); > out: > diff --git a/arch/powerpc/platforms/powernv/setup.c > b/arch/powerpc/platforms/powernv/setup.c > index 1acb0c7..c9685b6 100644 > --- a/arch/powerpc/platforms/powernv/setup.c > +++ b/arch/powerpc/platforms/powernv/setup.c > @@ -312,7 +312,7 @@ define_machine(powernv) { > .get_proc_freq = pnv_get_proc_freq, > .progress = pnv_progress, > .machine_shutdown = pnv_shutdown, > - .power_save = power7_idle, > + .power_save = NULL, > .calibrate_decr = generic_calibrate_decr, > #ifdef CONFIG_KEXEC > .kexec_cpu_down = pnv_kexec_cpu_down, > -- > 2.4.11 > ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev