Re: [PATCH v6 02/27] ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
On Tue, Nov 5, 2024 at 7:38 PM Ira Weiny wrote: > > The Coherent Device Attribute Table (CDAT) Device Scoped Memory Affinity > Structure (DSMAS) version 1.04 [1] defines flags to indicate if a DPA range > is read only and/or shared. > > Add read only and shareable flag definitions. > > This change was merged in ACPI via PR 976.[2] s/ACPI/ACPICA/ > > Link: > https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.04%20published_0.pdf > [1] > Link: https://github.com/acpica/acpica/pull/976 [2] > Cc: Robert Moore > Cc: Len Brown > Cc: Rafael J. Wysocki > Cc: linux-a...@vger.kernel.org > Cc: acpica-de...@lists.linux.dev > Signed-off-by: Ira Weiny With the above change: Acked-by: Rafael J. Wysocki > --- > include/acpi/actbl1.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h > index > 199afc2cd122ca8b383b1c9286f8c8cc33842fae..387fc821703a80b324637743f0d5afe03b8d7943 > 100644 > --- a/include/acpi/actbl1.h > +++ b/include/acpi/actbl1.h > @@ -403,6 +403,8 @@ struct acpi_cdat_dsmas { > /* Flags for subtable above */ > > #define ACPI_CDAT_DSMAS_NON_VOLATILE(1 << 2) > +#define ACPI_CDAT_DSMAS_SHAREABLE (1 << 3) > +#define ACPI_CDAT_DSMAS_READ_ONLY (1 << 6) > > /* Subtable 1: Device scoped Latency and Bandwidth Information Structure > (DSLBIS) */ > > > -- > 2.47.0 > >
Re: [PATCH v4 1/5] cpufreq: Add a cpufreq pressure feedback for the scheduler
On Tue, Jan 9, 2024 at 5:47 PM Vincent Guittot wrote: > > Provide to the scheduler a feedback about the temporary max available > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be > filtered as the pressure will happen for dozens ms or more. > > Signed-off-by: Vincent Guittot Acked-by: Rafael J. Wysocki and I think I've given the tag on this patch already. > --- > drivers/cpufreq/cpufreq.c | 36 > include/linux/cpufreq.h | 10 ++ > 2 files changed, 46 insertions(+) > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 44db4f59c4cc..f4eee3d107f1 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -2563,6 +2563,40 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, > unsigned int cpu) > } > EXPORT_SYMBOL(cpufreq_get_policy); > > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure); > + > +/** > + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs > + * @policy: cpufreq policy of the CPUs. > + * > + * Update the value of cpufreq pressure for all @cpus in the policy. > + */ > +static void cpufreq_update_pressure(struct cpufreq_policy *policy) > +{ > + unsigned long max_capacity, capped_freq, pressure; > + u32 max_freq; > + int cpu; > + > + cpu = cpumask_first(policy->related_cpus); > + max_freq = arch_scale_freq_ref(cpu); > + capped_freq = policy->max; > + > + /* > +* Handle properly the boost frequencies, which should simply clean > +* the cpufreq pressure value. > +*/ > + if (max_freq <= capped_freq) { > + pressure = 0; > + } else { > + max_capacity = arch_scale_cpu_capacity(cpu); > + pressure = max_capacity - > + mult_frac(max_capacity, capped_freq, max_freq); > + } > + > + for_each_cpu(cpu, policy->related_cpus) > + WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure); > +} > + > /** > * cpufreq_set_policy - Modify cpufreq policy parameters. > * @policy: Policy object to modify. > @@ -2618,6 +2652,8 @@ static int cpufreq_set_policy(struct cpufreq_policy > *policy, > policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H); > trace_cpu_frequency_limits(policy); > > + cpufreq_update_pressure(policy); > + > policy->cached_target_freq = UINT_MAX; > > pr_debug("new min and max freqs are %u - %u kHz\n", > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h > index afda5f24d3dd..b1d97edd3253 100644 > --- a/include/linux/cpufreq.h > +++ b/include/linux/cpufreq.h > @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct > cpufreq_policy *policy); > void cpufreq_enable_fast_switch(struct cpufreq_policy *policy); > void cpufreq_disable_fast_switch(struct cpufreq_policy *policy); > bool has_target_index(void); > + > +DECLARE_PER_CPU(unsigned long, cpufreq_pressure); > +static inline unsigned long cpufreq_get_pressure(int cpu) > +{ > + return per_cpu(cpufreq_pressure, cpu); > +} > #else > static inline unsigned int cpufreq_get(unsigned int cpu) > { > @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void) > return false; > } > static inline void disable_cpufreq(void) { } > +static inline unsigned long cpufreq_get_pressure(int cpu) > +{ > + return 0; > +} > #endif > > #ifdef CONFIG_CPU_FREQ_STAT > -- > 2.34.1 >
Re: [PATCH 05/21] ACPI: Move ACPI_HOTPLUG_CPU to be disabled on arm64 and riscv
On Tue, Nov 21, 2023 at 2:44 PM Russell King wrote: > > From: James Morse > > Neither arm64 nor riscv support physical hotadd of CPUs that were not > present at boot. For arm64 much of the platform description is in static > tables which do not have update methods. arm64 does support HOTPLUG_CPU, > which is backed by a firmware interface to turn CPUs on and off. > > acpi_processor_hotadd_init() and acpi_processor_remove() are for adding > and removing CPUs that were not present at boot. arm64 systems that do this > are not supported as there is currently insufficient information in the > platform description. (e.g. did the GICR get removed too?) > > arm64 currently relies on the MADT enabled flag check in map_gicc_mpidr() > to prevent CPUs that were not described as present at boot from being > added to the system. Similarly, riscv relies on the same check in > map_rintc_hartid(). Both architectures also rely on the weak 'always fails' > definitions of acpi_map_cpu() and arch_register_cpu(). > > Subsequent changes will redefine ACPI_HOTPLUG_CPU as making possible > CPUs present. Neither arm64 nor riscv support this. > > Disable ACPI_HOTPLUG_CPU for arm64 and riscv by removing 'default y' and > selecting it on the other three ACPI architectures. This allows the weak > definitions of some symbols to be removed. > > Signed-off-by: James Morse > Reviewed-by: Shaoqin Huang > Reviewed-by: Gavin Shan > Signed-off-by: Russell King (Oracle) I can apply this if it gets ACKs from the maintainers of the affected architectures. > --- > Changes since RFC: > * Expanded conditions to avoid ACPI_HOTPLUG_CPU being enabled when >HOTPLUG_CPU isn't. > Changes since RFC v3: > * Dropped ia64 changes > --- > arch/loongarch/Kconfig| 1 + > arch/x86/Kconfig | 1 + > drivers/acpi/Kconfig | 1 - > drivers/acpi/acpi_processor.c | 18 -- > 4 files changed, 2 insertions(+), 19 deletions(-) > > diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig > index ee123820a476..331becb2cb4f 100644 > --- a/arch/loongarch/Kconfig > +++ b/arch/loongarch/Kconfig > @@ -5,6 +5,7 @@ config LOONGARCH > select ACPI > select ACPI_GENERIC_GSI if ACPI > select ACPI_MCFG if ACPI > + select ACPI_HOTPLUG_CPU if ACPI_PROCESSOR && HOTPLUG_CPU > select ACPI_PPTT if ACPI > select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI > select ARCH_BINFMT_ELF_STATE > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 3762f41bb092..dbdcfc708369 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -59,6 +59,7 @@ config X86 > # > select ACPI_LEGACY_TABLES_LOOKUPif ACPI > select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI > + select ACPI_HOTPLUG_CPU if ACPI_PROCESSOR && > HOTPLUG_CPU > select ARCH_32BIT_OFF_T if X86_32 > select ARCH_CLOCKSOURCE_INIT > select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE > diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig > index f819e760ff19..a3acfc750fce 100644 > --- a/drivers/acpi/Kconfig > +++ b/drivers/acpi/Kconfig > @@ -310,7 +310,6 @@ config ACPI_HOTPLUG_CPU > bool > depends on ACPI_PROCESSOR && HOTPLUG_CPU > select ACPI_CONTAINER > - default y > > config ACPI_PROCESSOR_AGGREGATOR > tristate "Processor Aggregator" > diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c > index 0f5218e361df..4fe2ef54088c 100644 > --- a/drivers/acpi/acpi_processor.c > +++ b/drivers/acpi/acpi_processor.c > @@ -184,24 +184,6 @@ static void __init acpi_pcc_cpufreq_init(void) {} > > /* Initialization */ > #ifdef CONFIG_ACPI_HOTPLUG_CPU > -int __weak acpi_map_cpu(acpi_handle handle, > - phys_cpuid_t physid, u32 acpi_id, int *pcpu) > -{ > - return -ENODEV; > -} > - > -int __weak acpi_unmap_cpu(int cpu) > -{ > - return -ENODEV; > -} > - > -int __weak arch_register_cpu(int cpu) > -{ > - return -ENODEV; > -} > - > -void __weak arch_unregister_cpu(int cpu) {} > - > static int acpi_processor_hotadd_init(struct acpi_processor *pr) > { > unsigned long long sta; > -- > 2.30.2 >
Re: [PATCH 02/21] x86: intel_epb: Don't rely on link order
On Tue, Nov 21, 2023 at 2:44 PM Russell King wrote: > > From: James Morse > > intel_epb_init() is called as a subsys_initcall() to register cpuhp > callbacks. The callbacks make use of get_cpu_device() which will return > NULL unless register_cpu() has been called. register_cpu() is called > from topology_init(), which is also a subsys_initcall(). > > This is fragile. Moving the register_cpu() to a different > subsys_initcall() leads to a NULL dereference during boot. > > Make intel_epb_init() a late_initcall(), user-space can't provide a > policy before this point anyway. > > Signed-off-by: James Morse > Reviewed-by: Gavin Shan > Signed-off-by: Russell King (Oracle) Acked-by: Rafael J. Wysocki and I'd suggest sending this separately to the x86 list. > --- > subsys_initcall_sync() would be an option, but moving the register_cpu() > calls into ACPI also means adding a safety net for CPUs that are online > but not described properly by firmware. This lives in subsys_initcall_sync(). > --- > arch/x86/kernel/cpu/intel_epb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/intel_epb.c b/arch/x86/kernel/cpu/intel_epb.c > index e4c3ba91321c..f18d35fe27a9 100644 > --- a/arch/x86/kernel/cpu/intel_epb.c > +++ b/arch/x86/kernel/cpu/intel_epb.c > @@ -237,4 +237,4 @@ static __init int intel_epb_init(void) > cpuhp_remove_state(CPUHP_AP_X86_INTEL_EPB_ONLINE); > return ret; > } > -subsys_initcall(intel_epb_init); > +late_initcall(intel_epb_init); > -- > 2.30.2 > >
Re: [RFC PATCH v3 00/39] ACPI/arm64: add support for virtual cpuhotplug
On Tue, Oct 24, 2023 at 5:15 PM Russell King (Oracle) wrote: > > Hi, > > I'm posting James' patch set updated with most of the review comments > from his RFC v2 series back in September. Individual patches have a > changelog attached at the bottom of the commit message. Those which > I have finished updating have my S-o-b on them, those which still have > outstanding review comments from RFC v2 do not. In some of these cases > I've asked questions and am waiting for responses. > > I'm posting this as RFC v3 because there's still some unaddressed > comments and it's clearly not ready for merging. Even if it was ready > to be merged, it is too late in this development cycle to be taking > this change in, so there would be little point posting it non-RFC. > Also James stated that he's waiting for confirmation from the > Kubernetes/Kata folk - I have no idea what the status is there. > > I will be sending each patch individually to a wider audience > appropriate for that patch - apologies to those missing out on this > cover message. I have added more mailing lists to the series with the > exception of the acpica list in a hope of this cover message also > reaching those folk. > > The changes that aren't included are: > > 1. Updates for my patch that was merged via Thomas (thanks!): >c4dd854f740c cpu-hotplug: Provide prototypes for arch CPU registration >rather than having this change spread through James' patches. > > 2. New patch - simplification of PA-RISC's smp_prepare_boot_cpu() > > 3. Moved "ACPI: Use the acpi_device_is_present() helper in more places" >and "ACPI: Rename acpi_scan_device_not_present() to be about >enumeration" to the beginning of the series - these two patches are >already queued up for merging into 6.7. > > 4. Moved "arm64, irqchip/gic-v3, ACPI: Move MADT GICC enabled check into >a helper" to the beginning of the series, which has been submitted, >but as yet the fate of that posting isn't known. > > The first four patches in this series are provided for completness only. > > There is an additional patch in James' git tree that isn't in the set > of patches that James posted: "ACPI: processor: Only call > arch_unregister_cpu() if HOTPLUG_CPU is selected" which looks to me to > be a workaround for arch_unregister_cpu() being under the ifdef. I've > commented on this on the RFC v2 posting making a suggestion, but as yet > haven't had any response. > > I've included almost all of James' original covering body below the > diffstat. > > The reason that I'm doing this is to help move this code forward so > hopefully it can be merged - which is why I have been keen to dig out > from James' patches anything that can be merged and submit it > separately, since this is a feature for which some users have a > definite need for. I've gone through the series and there is at least one thing in it that concerns me a lot and some others that at least appear to be really questionable. I need more time to send comments which I'm not going to do before the 6.7 merge window (sorry), but from what I can say right now, this is not looking good. Thanks!
Re: [PATCH 18/39] ACPI: Only enumerate enabled (or functional) devices
On Tue, Oct 24, 2023 at 5:17 PM Russell King wrote: > > From: James Morse > > Today the ACPI enumeration code 'visits' all devices that are present. > > This is a problem for arm64, where CPUs are always present, but not > always enabled. When a device-check occurs because the firmware-policy > has changed and a CPU is now enabled, the following error occurs: > | acpi ACPI0007:48: Enumeration failure > > This is ultimately because acpi_dev_ready_for_enumeration() returns > true for a device that is not enabled. The ACPI Processor driver > will not register such CPUs as they are not 'decoding their resources'. > > Change acpi_dev_ready_for_enumeration() to also check the enabled bit. > ACPI allows a device to be functional instead of maintaining the > present and enabled bit. Make this behaviour an explicit check with > a reference to the spec, and then check the present and enabled bits. > This is needed to avoid enumerating present && functional devices that > are not enabled. > > Signed-off-by: James Morse > Signed-off-by: Russell King (Oracle) > --- > If this change causes problems on deployed hardware, I suggest an TBH, I am expecting problems to be there. If something has been interpreted in a specific way for several years, then changing that interpretation is just incompatible with the entire installed base, at least potentially. It is not even possible to estimate the potential adverse impact of this change, as it causes a firmware-provided bit that has never been taken into account so far to become meaningful and it does so for every device in the system. It will be very hard to convince me that this change is a good idea. > arch opt-in: ACPI_IGNORE_STA_ENABLED, that causes > acpi_dev_ready_for_enumeration() to only check the present bit. But this can work as long as the given arch does not care about platforms in which the "enabled" bit may not be set as expected for some devices. > > Changes since RFC v2: > * Incorporate comment suggestion by Gavin Shan. > Other review comments from Jonathan Cameron not yet addressed. > --- > drivers/acpi/device_pm.c| 2 +- > drivers/acpi/device_sysfs.c | 2 +- > drivers/acpi/internal.h | 1 - > drivers/acpi/property.c | 2 +- > drivers/acpi/scan.c | 24 ++-- > 5 files changed, 17 insertions(+), 14 deletions(-) > > diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c > index f007116a8427..76c38478a502 100644 > --- a/drivers/acpi/device_pm.c > +++ b/drivers/acpi/device_pm.c > @@ -313,7 +313,7 @@ int acpi_bus_init_power(struct acpi_device *device) > return -EINVAL; > > device->power.state = ACPI_STATE_UNKNOWN; > - if (!acpi_device_is_present(device)) { > + if (!acpi_dev_ready_for_enumeration(device)) { > device->flags.initialized = false; > return -ENXIO; > } > diff --git a/drivers/acpi/device_sysfs.c b/drivers/acpi/device_sysfs.c > index b9bbf0746199..16e586d74aa2 100644 > --- a/drivers/acpi/device_sysfs.c > +++ b/drivers/acpi/device_sysfs.c > @@ -141,7 +141,7 @@ static int create_pnp_modalias(const struct acpi_device > *acpi_dev, char *modalia > struct acpi_hardware_id *id; > > /* Avoid unnecessarily loading modules for non present devices. */ > - if (!acpi_device_is_present(acpi_dev)) > + if (!acpi_dev_ready_for_enumeration(acpi_dev)) > return 0; > > /* > diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h > index 866c7c4ed233..a1b45e345bcc 100644 > --- a/drivers/acpi/internal.h > +++ b/drivers/acpi/internal.h > @@ -107,7 +107,6 @@ int acpi_device_setup_files(struct acpi_device *dev); > void acpi_device_remove_files(struct acpi_device *dev); > void acpi_device_add_finalize(struct acpi_device *device); > void acpi_free_pnp_ids(struct acpi_device_pnp *pnp); > -bool acpi_device_is_present(const struct acpi_device *adev); > bool acpi_device_is_battery(struct acpi_device *adev); > bool acpi_device_is_first_physical_node(struct acpi_device *adev, > const struct device *dev); > diff --git a/drivers/acpi/property.c b/drivers/acpi/property.c > index 413e4fcadcaf..e03f00b98701 100644 > --- a/drivers/acpi/property.c > +++ b/drivers/acpi/property.c > @@ -1418,7 +1418,7 @@ static bool acpi_fwnode_device_is_available(const > struct fwnode_handle *fwnode) > if (!is_acpi_device_node(fwnode)) > return false; > > - return acpi_device_is_present(to_acpi_device_node(fwnode)); > + return acpi_dev_ready_for_enumeration(to_acpi_device_node(fwnode)); > } > > static const void * > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c > index 17ab875a7d4e..06e9bb4a633f 100644 > --- a/drivers/acpi/scan.c > +++ b/drivers/acpi/scan.c > @@ -304,7 +304,7 @@ static int acpi_scan_device_check(struct acpi_device > *adev) > int error; > > acpi_bus_get_status(adev); > - if (acpi_device
Re: [PATCH 8/9] acpi: Use built-in RCU list checking for acpi_ioremaps list (v1)
On Mon, Jul 15, 2019 at 4:43 PM Joel Fernandes (Google) wrote: > > list_for_each_entry_rcu has built-in RCU and lock checking. Make use of > it for acpi_ioremaps list traversal. > > Signed-off-by: Joel Fernandes (Google) Acked-by: Rafael J. Wysocki > --- > drivers/acpi/osl.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > index 9c0edf2fc0dd..2f9d0d20b836 100644 > --- a/drivers/acpi/osl.c > +++ b/drivers/acpi/osl.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -80,6 +81,7 @@ struct acpi_ioremap { > > static LIST_HEAD(acpi_ioremaps); > static DEFINE_MUTEX(acpi_ioremap_lock); > +#define acpi_ioremap_lock_held() lock_is_held(&acpi_ioremap_lock.dep_map) > > static void __init acpi_request_region (struct acpi_generic_address *gas, > unsigned int length, char *desc) > @@ -206,7 +208,7 @@ acpi_map_lookup(acpi_physical_address phys, acpi_size > size) > { > struct acpi_ioremap *map; > > - list_for_each_entry_rcu(map, &acpi_ioremaps, list) > + list_for_each_entry_rcu(map, &acpi_ioremaps, list, > acpi_ioremap_lock_held()) > if (map->phys <= phys && > phys + size <= map->phys + map->size) > return map; > @@ -249,7 +251,7 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size) > { > struct acpi_ioremap *map; > > - list_for_each_entry_rcu(map, &acpi_ioremaps, list) > + list_for_each_entry_rcu(map, &acpi_ioremaps, list, > acpi_ioremap_lock_held()) > if (map->virt <= virt && > virt + size <= map->virt + map->size) > return map; > -- > 2.22.0.510.g264f2c817a-goog >
Re: [PATCH] Documentation: cpu-freq: Convert core.txt file to ReST format
On Fri, Jul 5, 2019 at 11:04 PM Shreeya Patel wrote: > > Convert core file to ReST format, in order to allow it to > be parsed by Sphinx. Make a minor change of correcting the wrong > function name cpufreq_put_cpu to cpufreq_cpu_put. > Also create an index.rst file in cpu-freq and add it's entry > in the main Documentation/index.rst file. > > Signed-off-by: Shreeya Patel I've said "no" no three previous attempts and this one is not different. I don't want to have anything .rst in Documentation/cpu-freq/. There is a *new* admin-guide doc for cpufreq already and what is missing is a *new* driver-api one. Thanks!
Re: [PATCH 41/43] docs: extcon: convert it to ReST and move to acpi dir
On Friday, June 28, 2019 2:20:37 PM CEST Mauro Carvalho Chehab wrote: > The intel-int3496.txt file is a documentation for an ACPI driver. > > There's no reason to keep it on a separate directory. > > So, instead of keeping it on some random location, move it > to a sub-directory inside the ACPI documentation dir, > renaming it to .rst. > > Signed-off-by: Mauro Carvalho Chehab > --- > .../acpi/extcon-intel-int3496.rst} | 14 ++ > Documentation/firmware-guide/acpi/index.rst| 1 + > MAINTAINERS| 6 +++--- > 3 files changed, 14 insertions(+), 7 deletions(-) > rename Documentation/{extcon/intel-int3496.txt => > firmware-guide/acpi/extcon-intel-int3496.rst} (66%) > > diff --git a/Documentation/extcon/intel-int3496.txt > b/Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > similarity index 66% > rename from Documentation/extcon/intel-int3496.txt > rename to Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > index 8155dbc7fad3..5137ca834b54 100644 > --- a/Documentation/extcon/intel-int3496.txt > +++ b/Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > @@ -1,5 +1,6 @@ > += > Intel INT3496 ACPI device extcon driver documentation > -- > += > > The Intel INT3496 ACPI device extcon driver is a driver for ACPI > devices with an acpi-id of INT3496, such as found for example on > @@ -13,15 +14,20 @@ between an USB host and an USB peripheral controller. > The ACPI devices exposes this functionality by returning an array with up > to 3 gpio descriptors from its ACPI _CRS (Current Resource Settings) call: > > -Index 0: The input gpio for the id-pin, this is always present and valid > -Index 1: The output gpio for enabling Vbus output from the device to the otg > +=== > = > +Index 0 The input gpio for the id-pin, this is always present and valid > +Index 1 The output gpio for enabling Vbus output from the device to the otg > port, write 1 to enable the Vbus output (this gpio descriptor may > be absent or invalid) > -Index 2: The output gpio for muxing of the data pins between the USB host and > +Index 2 The output gpio for muxing of the data pins between the USB host and > the USB peripheral controller, write 1 to mux to the peripheral > controller > +=== > = > > There is a mapping between indices and GPIO connection IDs as follows > + > + === === > id index 0 > vbusindex 1 > mux index 2 > + === === > diff --git a/Documentation/firmware-guide/acpi/index.rst > b/Documentation/firmware-guide/acpi/index.rst > index ae609eec4679..90c90d42d9ad 100644 > --- a/Documentation/firmware-guide/acpi/index.rst > +++ b/Documentation/firmware-guide/acpi/index.rst > @@ -24,3 +24,4 @@ ACPI Support > acpi-lid > lpit > video_extension > + extcon-intel-int3496 > diff --git a/MAINTAINERS b/MAINTAINERS > index fd6fab0dec77..2cf8abf6d48e 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -321,7 +321,7 @@ F:drivers/pnp/pnpacpi/ > F: include/linux/acpi.h > F: include/linux/fwnode.h > F: include/acpi/ > -F: Documentation/acpi/ > +F: Documentation/firmware-guide/acpi/ > F: Documentation/ABI/testing/sysfs-bus-acpi > F: Documentation/ABI/testing/configfs-acpi > F: drivers/pci/*acpi* > @@ -4896,7 +4896,7 @@ S: Maintained > F: Documentation/ > F: scripts/kernel-doc > X: Documentation/ABI/ > -X: Documentation/acpi/ > +X: Documentation/firmware-guide/acpi/ > X: Documentation/devicetree/ > X: Documentation/i2c/ > X: Documentation/media/ > @@ -6073,7 +6073,7 @@ S: Maintained > F: drivers/extcon/ > F: include/linux/extcon/ > F: include/linux/extcon.h > -F: Documentation/extcon/ > +F: Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > F: Documentation/devicetree/bindings/extcon/ > > EXYNOS DP DRIVER > Applied, thanks!
Re: [PATCH] MAINTAINERS: Update for Intel Speed Select Technology
On Wednesday, July 3, 2019 3:53:31 AM CEST Srinivas Pandruvada wrote: > Added myself as the maintainer. > > Signed-off-by: Srinivas Pandruvada Acked-by: Rafael J. Wysocki > --- > MAINTAINERS | 8 > 1 file changed, 8 insertions(+) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 5cfbea4ce575..b6ed7958372d 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -8101,6 +8101,14 @@ S: Supported > F: drivers/infiniband/hw/i40iw/ > F: include/uapi/rdma/i40iw-abi.h > > +INTEL SPEED SELECT TECHNOLOGY > +M: Srinivas Pandruvada > +L: platform-driver-...@vger.kernel.org > +S: Maintained > +F: drivers/platform/x86/intel_speed_select_if/ > +F: tools/power/x86/intel-speed-select/ > +F: include/uapi/linux/isst_if.h > + > INTEL TELEMETRY DRIVER > M: Rajneesh Bhardwaj > M: "David E. Box" >
Re: [UPDATE][PATCH 10/10] tools/power/x86: A tool to validate Intel Speed Select commands
On Sunday, June 30, 2019 7:14:08 PM CEST Srinivas Pandruvada wrote: > The Intel(R) Speed select technologies contains four features. > > Performance profile:An non architectural mechanism that allows multiple > optimized performance profiles per system via static and/or dynamic > adjustment of core count, workload, Tjmax, and TDP, etc. aka ISS > in the documentation. > > Base Frequency: Enables users to increase guaranteed base frequency on > certain cores (high priority cores) in exchange for lower base frequency > on remaining cores (low priority cores). aka PBF in the documenation. > > Turbo frequency: Enables the ability to set different turbo ratio limits > to cores based on priority. aka FACT in the documentation. > > Core power: An Interface that allows user to define per core/tile > priority. > > There is a multi level help for commands and options. This can be used > to check required arguments for each feature and commands for the > feature. > > To start navigating the features start with > > $sudo intel-speed-select --help > > For help on a specific feature for example > $sudo intel-speed-select perf-profile --help > > To get help for a command for a feature for example > $sudo intel-speed-select perf-profile get-lock-status --help > > Signed-off-by: Srinivas Pandruvada Acked-by: Rafael J. Wysocki > --- > Updates: > - Copied Makefile from tools/gpio and moified the Makefile here > - Added entry to tools/build/Makefile > - Rename directory to match the executable name > - Fix one error message > > tools/Makefile| 12 +- > tools/power/x86/intel-speed-select/Build |1 + > tools/power/x86/intel-speed-select/Makefile | 56 + > .../x86/intel-speed-select/isst-config.c | 1607 + > .../power/x86/intel-speed-select/isst-core.c | 721 > .../x86/intel-speed-select/isst-display.c | 479 + > tools/power/x86/intel-speed-select/isst.h | 231 +++ > 7 files changed, 3102 insertions(+), 5 deletions(-) > create mode 100644 tools/power/x86/intel-speed-select/Build > create mode 100644 tools/power/x86/intel-speed-select/Makefile > create mode 100644 tools/power/x86/intel-speed-select/isst-config.c > create mode 100644 tools/power/x86/intel-speed-select/isst-core.c > create mode 100644 tools/power/x86/intel-speed-select/isst-display.c > create mode 100644 tools/power/x86/intel-speed-select/isst.h > > diff --git a/tools/Makefile b/tools/Makefile > index 3dfd72ae6c1a..68defd7ecf5d 100644 > --- a/tools/Makefile > +++ b/tools/Makefile > @@ -19,6 +19,7 @@ help: > @echo ' gpio - GPIO tools' > @echo ' hv - tools used when in Hyper-V clients' > @echo ' iio- IIO tools' > + @echo ' intel-speed-select - Intel Speed Select tool' > @echo ' kvm_stat - top-like utility for displaying kvm > statistics' > @echo ' leds - LEDs tools' > @echo ' liblockdep - user-space wrapper for kernel > locking-validator' > @@ -82,7 +83,7 @@ perf: FORCE > selftests: FORCE > $(call descend,testing/$@) > > -turbostat x86_energy_perf_policy: FORCE > +turbostat x86_energy_perf_policy intel-speed-select: FORCE > $(call descend,power/x86/$@) > > tmon: FORCE > @@ -115,7 +116,7 @@ liblockdep_install: > selftests_install: > $(call descend,testing/$(@:_install=),install) > > -turbostat_install x86_energy_perf_policy_install: > +turbostat_install x86_energy_perf_policy_install intel-speed-select_install: > $(call descend,power/x86/$(@:_install=),install) > > tmon_install: > @@ -132,7 +133,7 @@ install: acpi_install cgroup_install cpupower_install > gpio_install \ > perf_install selftests_install turbostat_install usb_install \ > virtio_install vm_install bpf_install > x86_energy_perf_policy_install \ > tmon_install freefall_install objtool_install kvm_stat_install \ > - wmi_install pci_install debugging_install > + wmi_install pci_install debugging_install > intel-speed-select_install > > acpi_clean: > $(call descend,power/acpi,clean) > @@ -162,7 +163,7 @@ perf_clean: > selftests_clean: > $(call descend,testing/$(@:_clean=),clean) > > -turbostat_clean x86_energy_perf_policy_clean: > +turbostat_clean x86_energy_perf_policy_clean intel-speed-select_clean: > $(call descend,power/x86/$(@:_clean=),clean) > > tmon_clean: > @@ -178,6 +179,7 @@ clean: acpi_clean cgroup_clean cp
Re: [PATCH v1 20/22] docs: extcon: move it to acpi dir and convert it to ReST
On Wed, Jun 19, 2019 at 11:59 AM Rafael J. Wysocki wrote: > > On Tuesday, June 18, 2019 11:05:44 PM CEST Mauro Carvalho Chehab wrote: > > The intel-int3496.txt file is a documentation for an ACPI driver. > > > > There's no reason to keep it on a separate directory. > > > > So, instead of keeping it on some random location, move it > > to a sub-directory inside the ACPI documentation dir. > > > > For now, keep it with .txt extension, in order to avoid > > Sphinx build noise. A later patch should change it to .rst > > and movin it to be together with other acpi docs. > > > > Signed-off-by: Mauro Carvalho Chehab > > Acked-by: Rafael J. Wysocki Which said, the changelog appears to be outdated, because the format of the doc *does* change. > > --- > > .../acpi/extcon-intel-int3496.rst} | 14 ++ > > Documentation/firmware-guide/acpi/index.rst| 1 + > > MAINTAINERS| 6 +++--- > > 3 files changed, 14 insertions(+), 7 deletions(-) > > rename Documentation/{extcon/intel-int3496.txt => > > firmware-guide/acpi/extcon-intel-int3496.rst} (66%) > > > > diff --git a/Documentation/extcon/intel-int3496.txt > > b/Documentation/firmware-guide/acpi/extcon-intel-int3496.rst
Re: [PATCH v1 20/22] docs: extcon: move it to acpi dir and convert it to ReST
On Tuesday, June 18, 2019 11:05:44 PM CEST Mauro Carvalho Chehab wrote: > The intel-int3496.txt file is a documentation for an ACPI driver. > > There's no reason to keep it on a separate directory. > > So, instead of keeping it on some random location, move it > to a sub-directory inside the ACPI documentation dir. > > For now, keep it with .txt extension, in order to avoid > Sphinx build noise. A later patch should change it to .rst > and movin it to be together with other acpi docs. > > Signed-off-by: Mauro Carvalho Chehab Acked-by: Rafael J. Wysocki Or please let me know if you want me to pick up this one. > --- > .../acpi/extcon-intel-int3496.rst} | 14 ++ > Documentation/firmware-guide/acpi/index.rst| 1 + > MAINTAINERS| 6 +++--- > 3 files changed, 14 insertions(+), 7 deletions(-) > rename Documentation/{extcon/intel-int3496.txt => > firmware-guide/acpi/extcon-intel-int3496.rst} (66%) > > diff --git a/Documentation/extcon/intel-int3496.txt > b/Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > similarity index 66% > rename from Documentation/extcon/intel-int3496.txt > rename to Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > index 8155dbc7fad3..5137ca834b54 100644 > --- a/Documentation/extcon/intel-int3496.txt > +++ b/Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > @@ -1,5 +1,6 @@ > += > Intel INT3496 ACPI device extcon driver documentation > -- > += > > The Intel INT3496 ACPI device extcon driver is a driver for ACPI > devices with an acpi-id of INT3496, such as found for example on > @@ -13,15 +14,20 @@ between an USB host and an USB peripheral controller. > The ACPI devices exposes this functionality by returning an array with up > to 3 gpio descriptors from its ACPI _CRS (Current Resource Settings) call: > > -Index 0: The input gpio for the id-pin, this is always present and valid > -Index 1: The output gpio for enabling Vbus output from the device to the otg > +=== > = > +Index 0 The input gpio for the id-pin, this is always present and valid > +Index 1 The output gpio for enabling Vbus output from the device to the otg > port, write 1 to enable the Vbus output (this gpio descriptor may > be absent or invalid) > -Index 2: The output gpio for muxing of the data pins between the USB host and > +Index 2 The output gpio for muxing of the data pins between the USB host and > the USB peripheral controller, write 1 to mux to the peripheral > controller > +=== > = > > There is a mapping between indices and GPIO connection IDs as follows > + > + === === > id index 0 > vbusindex 1 > mux index 2 > + === === > diff --git a/Documentation/firmware-guide/acpi/index.rst > b/Documentation/firmware-guide/acpi/index.rst > index ae609eec4679..90c90d42d9ad 100644 > --- a/Documentation/firmware-guide/acpi/index.rst > +++ b/Documentation/firmware-guide/acpi/index.rst > @@ -24,3 +24,4 @@ ACPI Support > acpi-lid > lpit > video_extension > + extcon-intel-int3496 > diff --git a/MAINTAINERS b/MAINTAINERS > index e07cbd44d48a..b7c81bd0f8e8 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -321,7 +321,7 @@ F:drivers/pnp/pnpacpi/ > F: include/linux/acpi.h > F: include/linux/fwnode.h > F: include/acpi/ > -F: Documentation/acpi/ > +F: Documentation/firmware-guide/acpi/ > F: Documentation/ABI/testing/sysfs-bus-acpi > F: Documentation/ABI/testing/configfs-acpi > F: drivers/pci/*acpi* > @@ -4881,7 +4881,7 @@ S: Maintained > F: Documentation/ > F: scripts/kernel-doc > X: Documentation/ABI/ > -X: Documentation/acpi/ > +X: Documentation/firmware-guide/acpi/ > X: Documentation/devicetree/ > X: Documentation/i2c/ > X: Documentation/media/ > @@ -6057,7 +6057,7 @@ S: Maintained > F: drivers/extcon/ > F: include/linux/extcon/ > F: include/linux/extcon.h > -F: Documentation/extcon/ > +F: Documentation/firmware-guide/acpi/extcon-intel-int3496.rst > F: Documentation/devicetree/bindings/extcon/ > > EXYNOS DP DRIVER >
Re: [PATCH 01/14] ABI: fix some syntax issues at the ABI database
On Fri, Jun 14, 2019 at 4:04 AM Mauro Carvalho Chehab wrote: > > From: Mauro Carvalho Chehab > > On those three files, the ABI representation described at > README are violated. > > - at sysfs-bus-iio-proximity-as3935: > a ':' character is missing after "What" > > - at sysfs-class-devfreq: > there's a typo at Description > > - at sysfs-class-cxl, it is using the ":" character at a > file preamble, causing it to be misinterpreted as a > tag. > > - On the other files, instead of "What", they use "Where". > > Signed-off-by: Mauro Carvalho Chehab > Signed-off-by: Mauro Carvalho Chehab Acked-by: Rafael J. Wysocki > --- > Documentation/ABI/testing/pstore | 2 +- > .../sysfs-bus-event_source-devices-format | 2 +- > .../ABI/testing/sysfs-bus-i2c-devices-hm6352 | 6 ++--- > .../ABI/testing/sysfs-bus-iio-distance-srf08 | 4 ++-- > .../testing/sysfs-bus-iio-proximity-as3935| 4 ++-- > .../ABI/testing/sysfs-bus-pci-devices-cciss | 22 +-- > .../testing/sysfs-bus-usb-devices-usbsevseg | 12 +- > Documentation/ABI/testing/sysfs-class-cxl | 6 ++--- > Documentation/ABI/testing/sysfs-class-devfreq | 2 +- > .../ABI/testing/sysfs-class-powercap | 2 +- > Documentation/ABI/testing/sysfs-kernel-fscaps | 2 +- > .../ABI/testing/sysfs-kernel-vmcoreinfo | 2 +- > 12 files changed, 33 insertions(+), 33 deletions(-) > > diff --git a/Documentation/ABI/testing/pstore > b/Documentation/ABI/testing/pstore > index 5fca9f5e10a3..8d6e48f4e8ef 100644 > --- a/Documentation/ABI/testing/pstore > +++ b/Documentation/ABI/testing/pstore > @@ -1,4 +1,4 @@ > -Where: /sys/fs/pstore/... (or /dev/pstore/...) > +What: /sys/fs/pstore/... (or /dev/pstore/...) > Date: March 2011 > Kernel Version: 2.6.39 > Contact: tony.l...@intel.com > diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format > b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format > index 77f47ff5ee02..b6f8748e0200 100644 > --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-format > +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-format > @@ -1,4 +1,4 @@ > -Where: /sys/bus/event_source/devices//format > +What: /sys/bus/event_source/devices//format > Date: January 2012 > Kernel Version: 3.3 > Contact: Jiri Olsa > diff --git a/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 > b/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 > index feb2e4a87075..29bd447e50a0 100644 > --- a/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 > +++ b/Documentation/ABI/testing/sysfs-bus-i2c-devices-hm6352 > @@ -1,18 +1,18 @@ > -Where: /sys/bus/i2c/devices/.../heading0_input > +What: /sys/bus/i2c/devices/.../heading0_input > Date: April 2010 > Kernel Version: 2.6.36? > Contact: alan@intel.com > Description: Reports the current heading from the compass as a floating > point value in degrees. > > -Where: /sys/bus/i2c/devices/.../power_state > +What: /sys/bus/i2c/devices/.../power_state > Date: April 2010 > Kernel Version: 2.6.36? > Contact: alan@intel.com > Description: Sets the power state of the device. 0 sets the device into > sleep mode, 1 wakes it up. > > -Where: /sys/bus/i2c/devices/.../calibration > +What: /sys/bus/i2c/devices/.../calibration > Date: April 2010 > Kernel Version: 2.6.36? > Contact: alan@intel.com > diff --git a/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 > b/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 > index 0a1ca1487fa9..a133fd8d081a 100644 > --- a/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 > +++ b/Documentation/ABI/testing/sysfs-bus-iio-distance-srf08 > @@ -1,4 +1,4 @@ > -What /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity > +What: /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity > Date: January 2017 > KernelVersion: 4.11 > Contact: linux-...@vger.kernel.org > @@ -6,7 +6,7 @@ Description: > Show or set the gain boost of the amp, from 0-31 range. > default 31 > > -What /sys/bus/iio/devices/iio:deviceX/sensor_max_range > +What: /sys/bus/iio/devices/iio:deviceX/sensor_max_range > Date: January 2017 > KernelVersion: 4.11 > Contact: linux-...@vger.kernel.org > diff --git a/Documentation/ABI/testing/sysfs-bus-iio-proximity-as3935 > b/Documentation/ABI/tes
Re: [PATCH v3 07/33] docs: cpu-freq: convert docs to ReST and rename to *.rst
On Sun, Jun 9, 2019 at 4:30 AM Mauro Carvalho Chehab wrote: > > The conversion is actually: > - add blank lines and identation in order to identify paragraphs; > - fix tables markups; > - add some lists markups; > - mark literal blocks; > - adjust title markups. > > At its new index.rst, let's add a :orphan: while this is not linked to > the main index.rst file, in order to avoid build warnings. > > Signed-off-by: Mauro Carvalho Chehab I have said "no" to this already twice. How many times do I still need to repeat that? There already is some cpufreq documentation under admin-guide in the .rst format and the rest is obsolete. It is going to be replaced with something new and more up to date. The API docs need to go under driver-api and conversions like this don't help. Indeed, they are counter-prodictive in my view. DIsappointed, Rafael
Re: [PATCH 01/22] ABI: sysfs-devices-system-cpu: point to the right docs
On Thursday, May 30, 2019 1:23:32 AM CEST Mauro Carvalho Chehab wrote: > The cpuidle doc was split on two, one at the admin guide > and another one at the driver API guide. Instead of pointing > to a non-existent file, point to both (admin guide being > the first one). > > Signed-off-by: Mauro Carvalho Chehab Acked-by: Rafael J. Wysocki > --- > Documentation/ABI/testing/sysfs-devices-system-cpu | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu > b/Documentation/ABI/testing/sysfs-devices-system-cpu > index 1528239f69b2..87478ac6c2af 100644 > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu > @@ -137,7 +137,8 @@ Description: Discover cpuidle policy and mechanism > current_governor: (RW) displays current idle policy. Users can > switch the governor at runtime by writing to this file. > > - See files in Documentation/cpuidle/ for more information. > + See Documentation/admin-guide/pm/cpuidle.rst and > + Documentation/driver-api/pm/cpuidle.rst for more information. > > > What:/sys/devices/system/cpu/cpuX/cpuidle/stateN/name >
Re: [PATCH 10/10] docs: fix broken documentation links
On Mon, May 20, 2019 at 4:48 PM Mauro Carvalho Chehab wrote: > > Mostly due to x86 and acpi conversion, several documentation > links are still pointing to the old file. Fix them. > > Signed-off-by: Mauro Carvalho Chehab For the ACPI part: Acked-by: Rafael J. Wysocki
Re: [PATCH 2/2] sched: Document Energy Aware Scheduling
On Fri, Jan 18, 2019 at 12:11 PM Quentin Perret wrote: > > On Friday 18 Jan 2019 at 12:01:35 (+0100), Rafael J. Wysocki wrote: > > On Fri, Jan 18, 2019 at 11:58 AM Rafael J. Wysocki > > wrote: > > > > > > On Fri, Jan 18, 2019 at 11:34 AM Quentin Perret > > > wrote: > > > > > > > > Hi Rafael, > > > > > > > > On Friday 18 Jan 2019 at 10:57:08 (+0100), Rafael J. Wysocki wrote: > > > > > On Fri, Jan 18, 2019 at 10:16 AM Quentin Perret > > > > > wrote: > > > > > > > > > > > > Hi Juri, > > > > > > > > > > > > On Thursday 17 Jan 2019 at 16:51:17 (+0100), Juri Lelli wrote: > > > > > > > On 10/01/19 11:05, Quentin Perret wrote: > > > > > > [...] > > > > > > > > +The idea behind introducing an EM is to allow the scheduler to > > > > > > > > evaluate the > > > > > > > > +implications of its decisions rather than blindly applying > > > > > > > > energy-saving > > > > > > > > +techniques that may have positive effects only on some > > > > > > > > platforms. At the same > > > > > > > > +time, the EM must be as simple as possible to minimize the > > > > > > > > scheduler latency > > > > > > > > +impact. > > > > > > > > + > > > > > > > > +In short, EAS changes the way CFS tasks are assigned to CPUs. > > > > > > > > When it is time > > > > > > > > > > > > > > Not sure if we want to remark the fact that EAS is looking at CFS > > > > > > > tasks > > > > > > > only ATM. > > > > > > > > > > > > Oh, what's wrong about mentioning it ? I mean, it is a fact ATM ... > > > > > > > > > > But it won't hurt to mention that it may cover other scheduling > > > > > classes in the future. IOW, the scope limit is not fundamental. > > > > > > > > Agreed, I can do that. > > > > > > > > > > > > +for the scheduler to decide where a task should run (during > > > > > > > > wake-up), the EM > > > > > > > > +is used to break the tie between several good CPU candidates > > > > > > > > and pick the one > > > > > > > > +that is predicted to yield the best energy consumption without > > > > > > > > harming the > > > > > > > > +system's throughput. The predictions made by EAS rely on > > > > > > > > specific elements of > > > > > > > > +knowledge about the platform's topology, which include the > > > > > > > > 'capacity' of CPUs, > > > > > > > > > > > > > > Add a reference to DT bindings docs defining 'capacity' (or > > > > > > > define it > > > > > > > somewhere)? > > > > > > > > > > > > Right, I can mention this is defined in the next section. But are > > > > > > you > > > > > > sure about the reference to the DT bindings ? They're arm-specific > > > > > > right ? > > > > > > Maybe I can give that as an example or something ... > > > > > > > > > > Example sounds right. > > > > > > > > > > You also can point to the section below from here. > > > > > > > > Sounds good. > > > > > > > > > Side note: If the doc is in the .rst format (which Peter won't like > > > > > I'm sure :-)), you can actually use cross-references in it and you get > > > > > a translation to an HTML doc (hosted at kernel.org) for free and the > > > > > cross-references become clickable links in that. > > > > > > > > Right, I personally don't mind the .rst format, but the existing files > > > > in Documentation/power/ and Documentation/scheduler/ are good old txt > > > > files so I just wanted to keep things consistent. > > > > > > In fact, Documentation/power/ is under a slow on-going transition to > > > .rst (due to the benefits mentioned above). > > > > > > > I don't mind converting to rst if necessary :-) > > > > > > It is not necessary, but maybe worth considering. > > > > That said, as this is targeted at Documentation/scheduler/, being > > consistent with the other material in there is probably more > > important. > > Right. Patch 01/02 is targeted at Documentation/power/ though. So if > that makes your life easier I can turn that one into a .rst file, no > problem at all. Yes, I'd prefer it that way. And please put it into Documentation/driver-api/pm/.
Re: [PATCH 2/2] sched: Document Energy Aware Scheduling
On Fri, Jan 18, 2019 at 11:58 AM Rafael J. Wysocki wrote: > > On Fri, Jan 18, 2019 at 11:34 AM Quentin Perret > wrote: > > > > Hi Rafael, > > > > On Friday 18 Jan 2019 at 10:57:08 (+0100), Rafael J. Wysocki wrote: > > > On Fri, Jan 18, 2019 at 10:16 AM Quentin Perret > > > wrote: > > > > > > > > Hi Juri, > > > > > > > > On Thursday 17 Jan 2019 at 16:51:17 (+0100), Juri Lelli wrote: > > > > > On 10/01/19 11:05, Quentin Perret wrote: > > > > [...] > > > > > > +The idea behind introducing an EM is to allow the scheduler to > > > > > > evaluate the > > > > > > +implications of its decisions rather than blindly applying > > > > > > energy-saving > > > > > > +techniques that may have positive effects only on some platforms. > > > > > > At the same > > > > > > +time, the EM must be as simple as possible to minimize the > > > > > > scheduler latency > > > > > > +impact. > > > > > > + > > > > > > +In short, EAS changes the way CFS tasks are assigned to CPUs. When > > > > > > it is time > > > > > > > > > > Not sure if we want to remark the fact that EAS is looking at CFS > > > > > tasks > > > > > only ATM. > > > > > > > > Oh, what's wrong about mentioning it ? I mean, it is a fact ATM ... > > > > > > But it won't hurt to mention that it may cover other scheduling > > > classes in the future. IOW, the scope limit is not fundamental. > > > > Agreed, I can do that. > > > > > > > > +for the scheduler to decide where a task should run (during > > > > > > wake-up), the EM > > > > > > +is used to break the tie between several good CPU candidates and > > > > > > pick the one > > > > > > +that is predicted to yield the best energy consumption without > > > > > > harming the > > > > > > +system's throughput. The predictions made by EAS rely on specific > > > > > > elements of > > > > > > +knowledge about the platform's topology, which include the > > > > > > 'capacity' of CPUs, > > > > > > > > > > Add a reference to DT bindings docs defining 'capacity' (or define it > > > > > somewhere)? > > > > > > > > Right, I can mention this is defined in the next section. But are you > > > > sure about the reference to the DT bindings ? They're arm-specific > > > > right ? > > > > Maybe I can give that as an example or something ... > > > > > > Example sounds right. > > > > > > You also can point to the section below from here. > > > > Sounds good. > > > > > Side note: If the doc is in the .rst format (which Peter won't like > > > I'm sure :-)), you can actually use cross-references in it and you get > > > a translation to an HTML doc (hosted at kernel.org) for free and the > > > cross-references become clickable links in that. > > > > Right, I personally don't mind the .rst format, but the existing files > > in Documentation/power/ and Documentation/scheduler/ are good old txt > > files so I just wanted to keep things consistent. > > In fact, Documentation/power/ is under a slow on-going transition to > .rst (due to the benefits mentioned above). > > > I don't mind converting to rst if necessary :-) > > It is not necessary, but maybe worth considering. That said, as this is targeted at Documentation/scheduler/, being consistent with the other material in there is probably more important.
Re: [PATCH 2/2] sched: Document Energy Aware Scheduling
On Fri, Jan 18, 2019 at 11:34 AM Quentin Perret wrote: > > Hi Rafael, > > On Friday 18 Jan 2019 at 10:57:08 (+0100), Rafael J. Wysocki wrote: > > On Fri, Jan 18, 2019 at 10:16 AM Quentin Perret > > wrote: > > > > > > Hi Juri, > > > > > > On Thursday 17 Jan 2019 at 16:51:17 (+0100), Juri Lelli wrote: > > > > On 10/01/19 11:05, Quentin Perret wrote: > > > [...] > > > > > +The idea behind introducing an EM is to allow the scheduler to > > > > > evaluate the > > > > > +implications of its decisions rather than blindly applying > > > > > energy-saving > > > > > +techniques that may have positive effects only on some platforms. At > > > > > the same > > > > > +time, the EM must be as simple as possible to minimize the scheduler > > > > > latency > > > > > +impact. > > > > > + > > > > > +In short, EAS changes the way CFS tasks are assigned to CPUs. When > > > > > it is time > > > > > > > > Not sure if we want to remark the fact that EAS is looking at CFS tasks > > > > only ATM. > > > > > > Oh, what's wrong about mentioning it ? I mean, it is a fact ATM ... > > > > But it won't hurt to mention that it may cover other scheduling > > classes in the future. IOW, the scope limit is not fundamental. > > Agreed, I can do that. > > > > > > +for the scheduler to decide where a task should run (during > > > > > wake-up), the EM > > > > > +is used to break the tie between several good CPU candidates and > > > > > pick the one > > > > > +that is predicted to yield the best energy consumption without > > > > > harming the > > > > > +system's throughput. The predictions made by EAS rely on specific > > > > > elements of > > > > > +knowledge about the platform's topology, which include the > > > > > 'capacity' of CPUs, > > > > > > > > Add a reference to DT bindings docs defining 'capacity' (or define it > > > > somewhere)? > > > > > > Right, I can mention this is defined in the next section. But are you > > > sure about the reference to the DT bindings ? They're arm-specific right ? > > > Maybe I can give that as an example or something ... > > > > Example sounds right. > > > > You also can point to the section below from here. > > Sounds good. > > > Side note: If the doc is in the .rst format (which Peter won't like > > I'm sure :-)), you can actually use cross-references in it and you get > > a translation to an HTML doc (hosted at kernel.org) for free and the > > cross-references become clickable links in that. > > Right, I personally don't mind the .rst format, but the existing files > in Documentation/power/ and Documentation/scheduler/ are good old txt > files so I just wanted to keep things consistent. In fact, Documentation/power/ is under a slow on-going transition to .rst (due to the benefits mentioned above). > I don't mind converting to rst if necessary :-) It is not necessary, but maybe worth considering.
Re: [PATCH 2/2] sched: Document Energy Aware Scheduling
On Fri, Jan 18, 2019 at 10:16 AM Quentin Perret wrote: > > Hi Juri, > > On Thursday 17 Jan 2019 at 16:51:17 (+0100), Juri Lelli wrote: > > On 10/01/19 11:05, Quentin Perret wrote: > [...] > > > +The idea behind introducing an EM is to allow the scheduler to evaluate > > > the > > > +implications of its decisions rather than blindly applying energy-saving > > > +techniques that may have positive effects only on some platforms. At the > > > same > > > +time, the EM must be as simple as possible to minimize the scheduler > > > latency > > > +impact. > > > + > > > +In short, EAS changes the way CFS tasks are assigned to CPUs. When it is > > > time > > > > Not sure if we want to remark the fact that EAS is looking at CFS tasks > > only ATM. > > Oh, what's wrong about mentioning it ? I mean, it is a fact ATM ... But it won't hurt to mention that it may cover other scheduling classes in the future. IOW, the scope limit is not fundamental. > > > +for the scheduler to decide where a task should run (during wake-up), > > > the EM > > > +is used to break the tie between several good CPU candidates and pick > > > the one > > > +that is predicted to yield the best energy consumption without harming > > > the > > > +system's throughput. The predictions made by EAS rely on specific > > > elements of > > > +knowledge about the platform's topology, which include the 'capacity' of > > > CPUs, > > > > Add a reference to DT bindings docs defining 'capacity' (or define it > > somewhere)? > > Right, I can mention this is defined in the next section. But are you > sure about the reference to the DT bindings ? They're arm-specific right ? > Maybe I can give that as an example or something ... Example sounds right. You also can point to the section below from here. Side note: If the doc is in the .rst format (which Peter won't like I'm sure :-)), you can actually use cross-references in it and you get a translation to an HTML doc (hosted at kernel.org) for free and the cross-references become clickable links in that.
Re: [PATCH] Documentation: driver-api: PM: Add cpuidle document
On Thu, Jan 17, 2019 at 11:56 AM Ulf Hansson wrote: > > On Wed, 16 Jan 2019 at 23:10, Rafael J. Wysocki wrote: > > > > On Wed, Jan 9, 2019 at 11:54 AM Rafael J. Wysocki > > wrote: > > > > > > From: Rafael J. Wysocki > > > > > > Replace the remaining documents under Documentation/cpuidle/ > > > with one more complete governor and driver API document for cpuidle > > > under Documentation/driver-api/pm/. > > > > > > Signed-off-by: Rafael J. Wysocki > > > > In the face of the lack of any feedback I'm assuming that this is fine > > by everyone. > > I have look through it, looks good and sorry for the delay. No worries. > Feel free to add: > > Reviewed-by: Ulf Hansson Thank you!
Re: [PATCH] Documentation: driver-api: PM: Add cpuidle document
On Wed, Jan 9, 2019 at 11:54 AM Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > Replace the remaining documents under Documentation/cpuidle/ > with one more complete governor and driver API document for cpuidle > under Documentation/driver-api/pm/. > > Signed-off-by: Rafael J. Wysocki In the face of the lack of any feedback I'm assuming that this is fine by everyone. > --- > > On top of https://patchwork.kernel.org/patch/10747185/ > > --- > Documentation/cpuidle/driver.txt| 37 > Documentation/cpuidle/governor.txt | 28 --- > Documentation/driver-api/pm/cpuidle.rst | 282 > > Documentation/driver-api/pm/index.rst |7 > MAINTAINERS |1 > 5 files changed, 287 insertions(+), 68 deletions(-) > > Index: linux-pm/Documentation/driver-api/pm/index.rst > === > --- linux-pm.orig/Documentation/driver-api/pm/index.rst > +++ linux-pm/Documentation/driver-api/pm/index.rst > @@ -1,9 +1,10 @@ > -=== > -Device Power Management > -=== > +=== > +CPU and Device Power Management > +=== > > .. toctree:: > > + cpuidle > devices > notifiers > types > Index: linux-pm/Documentation/driver-api/pm/cpuidle.rst > === > --- /dev/null > +++ linux-pm/Documentation/driver-api/pm/cpuidle.rst > @@ -0,0 +1,282 @@ > +.. |struct cpuidle_governor| replace:: :c:type:`struct cpuidle_governor > ` > +.. |struct cpuidle_device| replace:: :c:type:`struct cpuidle_device > ` > +.. |struct cpuidle_driver| replace:: :c:type:`struct cpuidle_driver > ` > +.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state > ` > + > + > +CPU Idle Time Management > + > + > +:: > + > + Copyright (c) 2019 Intel Corp., Rafael J. Wysocki > > + > + > +CPU Idle Time Management Subsystem > +== > + > +Every time one of the logical CPUs in the system (the entities that appear to > +fetch and execute instructions: hardware threads, if present, or processor > +cores) is idle after an interrupt or equivalent wakeup event, which means > that > +there are no tasks to run on it except for the special "idle" task associated > +with it, there is an opportunity to save energy for the processor that it > +belongs to. That can be done by making the idle logical CPU stop fetching > +instructions from memory and putting some of the processor's functional units > +depended on by it into an idle state in which they will draw less power. > + > +However, there may be multiple different idle states that can be used in > such a > +situation in principle, so it may be necessary to find the most suitable one > +(from the kernel perspective) and ask the processor to use (or "enter") that > +particular idle state. That is the role of the CPU idle time management > +subsystem in the kernel, called ``CPUIdle``. > + > +The design of ``CPUIdle`` is modular and based on the code duplication > avoidance > +principle, so the generic code that in principle need not depend on the > hardware > +or platform design details in it is separate from the code that interacts > with > +the hardware. It generally is divided into three categories of functional > +units: *governors* responsible for selecting idle states to ask the processor > +to enter, *drivers* that pass the governors' decisions on to the hardware and > +the *core* providing a common framework for them. > + > + > +CPU Idle Time Governors > +=== > + > +A CPU idle time (``CPUIdle``) governor is a bundle of policy code invoked > when > +one of the logical CPUs in the system turns out to be idle. Its role is to > +select an idle state to ask the processor to enter in order to save some > energy. > + > +``CPUIdle`` governors are generic and each of them can be used on any > hardware > +platform that the Linux kernel can run on. For this reason, data structures > +operated on by them cannot depend on any hardware architecture or platform > +design details as well. > + > +The governor itself is represented by a |struct cpuidle_governor| object > +containing four callback pointers, :c:member:`enable`, :c:member:`disable`, > +:c:member:`select`, :c:member:`reflect`, a :c:member:`rating` field described > +below, and a name (string) used for identifying it. > + > +For the governor to be available at all, that object
Re: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On Mon, Jan 14, 2019 at 11:39 AM Daniel Lezcano wrote: > > > Hi Rafael, > > sorry for the delay. > > On 10/01/2019 11:20, Rafael J. Wysocki wrote: > > [ ... ] > > >>> if (entered_state >= 0) { > >>> + s64 diff, delay = drv->states[entered_state].exit_latency; > >>> + int i; > >>> + > >>> /* > >>>* Update cpuidle counters > >>>* This can be moved to within driver enter routine, > >>> @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d > >>> dev->last_residency = (int)diff; > >> > >> Shouldn't we subtract the 'delay' from the computed 'diff' in any case ? > > > > No. > > > >> Otherwise the 'last_residency' accumulates the effective sleep time and > >> the time to wakeup. We are interested in the sleep time only for > >> prediction and metrics no ? > > > > Yes, but 'delay' is the worst-case latency and not the actual one > > experienced, most of the time, and (on average) we would underestimate > > the sleep time if it was always subtracted. > > IMO, the exit latency is more or less constant for the cpu power down > state. When it is the cluster power down state, the first cpu waking up > has the worst latency, then the others have the same has the cpu power > down state. > > If we can model that, the gray area you mention below can be reduced. > There are platform where the exit latency is very high [1] and not > taking it into account will give very wrong metrics. That is kind of a special case, though, and there is no way for the cpuidle core do distinguish it from all of the other cases. > > The idea here is to only count the wakeup as 'above' if the total > > 'last_residency' is below the target residency of the idle state that > > was asked for (as in that case we know for certain that the CPU has > > been woken up too early) and to only count it as 'below' if the > > difference between 'last_residency' and 'delay' is greater than or > > equal to the target residency of a deeper idle state (as in that case > > we know for certain that the CPU has been woken up too late). > > > > Of course, this means that there is a "gray area" in which we are not > > really sure if the sleep time has matched the idle state that was > > asked for, but there's not much we can do about that IMO. > > There is another aspect of the metric which can be improved, the 'above' > and the 'below' give an rough indication about the correctness of the > prediction but they don't tell us which idle state we should have > selected (we can be constantly choosing state3 instead of state1 for > example). > > It would be nice to add a 'missed' field for each idle states, so when > we check if there is a 'above' or a 'below' condition, we increment the > idle state 'missed' field for the idle state we should have selected. That's a governor's job however. That's why there is the ->reflect governor callback after all, among other things.
Re: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On Thu, Jan 10, 2019 at 10:53 AM Daniel Lezcano wrote: > > On 10/12/2018 12:30, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Add two new metrics for CPU idle states, "above" and "below", to count > > the number of times the given state had been asked for (or entered > > from the kernel's perspective), but the observed idle duration turned > > out to be too short or too long for it (respectively). > > > > These metrics help to estimate the quality of the CPU idle governor > > in use. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > > > -> v2: Fix a leftover in the documentation from the previous versions > >of the patch and a typo in the changelog. > > > > --- > > Documentation/ABI/testing/sysfs-devices-system-cpu |7 > > Documentation/admin-guide/pm/cpuidle.rst | 10 ++ > > drivers/cpuidle/cpuidle.c | 31 > > - > > drivers/cpuidle/sysfs.c|6 > > include/linux/cpuidle.h|2 + > > 5 files changed, 55 insertions(+), 1 deletion(-) > > > > Index: linux-pm/drivers/cpuidle/cpuidle.c > > === > > --- linux-pm.orig/drivers/cpuidle/cpuidle.c > > +++ linux-pm/drivers/cpuidle/cpuidle.c > > @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d > > struct cpuidle_state *target_state = &drv->states[index]; > > bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP); > > ktime_t time_start, time_end; > > - s64 diff; > > > > /* > >* Tell the time framework to switch to a broadcast timer because our > > @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d > > local_irq_enable(); > > > > if (entered_state >= 0) { > > + s64 diff, delay = drv->states[entered_state].exit_latency; > > + int i; > > + > > /* > >* Update cpuidle counters > >* This can be moved to within driver enter routine, > > @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d > > dev->last_residency = (int)diff; > > Shouldn't we subtract the 'delay' from the computed 'diff' in any case ? No. > Otherwise the 'last_residency' accumulates the effective sleep time and > the time to wakeup. We are interested in the sleep time only for > prediction and metrics no ? Yes, but 'delay' is the worst-case latency and not the actual one experienced, most of the time, and (on average) we would underestimate the sleep time if it was always subtracted. The idea here is to only count the wakeup as 'above' if the total 'last_residency' is below the target residency of the idle state that was asked for (as in that case we know for certain that the CPU has been woken up too early) and to only count it as 'below' if the difference between 'last_residency' and 'delay' is greater than or equal to the target residency of a deeper idle state (as in that case we know for certain that the CPU has been woken up too late). Of course, this means that there is a "gray area" in which we are not really sure if the sleep time has matched the idle state that was asked for, but there's not much we can do about that IMO.
[PATCH] Documentation: driver-api: PM: Add cpuidle document
From: Rafael J. Wysocki Replace the remaining documents under Documentation/cpuidle/ with one more complete governor and driver API document for cpuidle under Documentation/driver-api/pm/. Signed-off-by: Rafael J. Wysocki --- On top of https://patchwork.kernel.org/patch/10747185/ --- Documentation/cpuidle/driver.txt| 37 Documentation/cpuidle/governor.txt | 28 --- Documentation/driver-api/pm/cpuidle.rst | 282 Documentation/driver-api/pm/index.rst |7 MAINTAINERS |1 5 files changed, 287 insertions(+), 68 deletions(-) Index: linux-pm/Documentation/driver-api/pm/index.rst === --- linux-pm.orig/Documentation/driver-api/pm/index.rst +++ linux-pm/Documentation/driver-api/pm/index.rst @@ -1,9 +1,10 @@ -=== -Device Power Management -=== +=== +CPU and Device Power Management +=== .. toctree:: + cpuidle devices notifiers types Index: linux-pm/Documentation/driver-api/pm/cpuidle.rst === --- /dev/null +++ linux-pm/Documentation/driver-api/pm/cpuidle.rst @@ -0,0 +1,282 @@ +.. |struct cpuidle_governor| replace:: :c:type:`struct cpuidle_governor ` +.. |struct cpuidle_device| replace:: :c:type:`struct cpuidle_device ` +.. |struct cpuidle_driver| replace:: :c:type:`struct cpuidle_driver ` +.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state ` + + +CPU Idle Time Management + + +:: + + Copyright (c) 2019 Intel Corp., Rafael J. Wysocki + + +CPU Idle Time Management Subsystem +== + +Every time one of the logical CPUs in the system (the entities that appear to +fetch and execute instructions: hardware threads, if present, or processor +cores) is idle after an interrupt or equivalent wakeup event, which means that +there are no tasks to run on it except for the special "idle" task associated +with it, there is an opportunity to save energy for the processor that it +belongs to. That can be done by making the idle logical CPU stop fetching +instructions from memory and putting some of the processor's functional units +depended on by it into an idle state in which they will draw less power. + +However, there may be multiple different idle states that can be used in such a +situation in principle, so it may be necessary to find the most suitable one +(from the kernel perspective) and ask the processor to use (or "enter") that +particular idle state. That is the role of the CPU idle time management +subsystem in the kernel, called ``CPUIdle``. + +The design of ``CPUIdle`` is modular and based on the code duplication avoidance +principle, so the generic code that in principle need not depend on the hardware +or platform design details in it is separate from the code that interacts with +the hardware. It generally is divided into three categories of functional +units: *governors* responsible for selecting idle states to ask the processor +to enter, *drivers* that pass the governors' decisions on to the hardware and +the *core* providing a common framework for them. + + +CPU Idle Time Governors +=== + +A CPU idle time (``CPUIdle``) governor is a bundle of policy code invoked when +one of the logical CPUs in the system turns out to be idle. Its role is to +select an idle state to ask the processor to enter in order to save some energy. + +``CPUIdle`` governors are generic and each of them can be used on any hardware +platform that the Linux kernel can run on. For this reason, data structures +operated on by them cannot depend on any hardware architecture or platform +design details as well. + +The governor itself is represented by a |struct cpuidle_governor| object +containing four callback pointers, :c:member:`enable`, :c:member:`disable`, +:c:member:`select`, :c:member:`reflect`, a :c:member:`rating` field described +below, and a name (string) used for identifying it. + +For the governor to be available at all, that object needs to be registered +with the ``CPUIdle`` core by calling :c:func:`cpuidle_register_governor()` with +a pointer to it passed as the argument. If successful, that causes the core to +add the governor to the global list of available governors and, if it is the +only one in the list (that is, the list was empty before) or the value of its +:c:member:`rating` field is greater than the value of that field for the +governor currently in use, or the name of the new governor was passed to the +kernel as the value of the ``cpuidle.governor=`` command line parameter, the new +governor will be used from that point on (there can be only one ``CPUIdle`` +governor in use at a time). Also, if ``cpuidle_sysfs_switch`` is passed
Re: [PATCH 1/1] doc: trace: fix reference to cpuidle documentation file
On Wednesday, January 9, 2019 12:56:51 AM CET Otto Sabart wrote: > > --LZvS9be/3tNcYl/X > Content-Type: text/plain; charset=us-ascii > Content-Disposition: inline > Content-Transfer-Encoding: quoted-printable > > Old cpuidle/sysfs.txt file was replaced in aa5eee355b46. So, refer to > an updated file. > > Fixes: aa5eee355b46 ("Documentation: admin-guide: PM: Add cpuidle document") > Signed-off-by: Otto Sabart > --- > Documentation/trace/coresight-cpu-debug.txt | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/tr= > ace/coresight-cpu-debug.txt > index 89ab09e78e8d..f07e38094b40 100644 > --- a/Documentation/trace/coresight-cpu-debug.txt > +++ b/Documentation/trace/coresight-cpu-debug.txt > @@ -165,7 +165,7 @@ Do some work... > The same can also be done from an application program. > =20 > Disable specific CPU's specific idle state from cpuidle sysfs (see > -Documentation/cpuidle/sysfs.txt): > +Documentation/admin-guide/pm/cpuidle.rst): > # echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable Applied, thanks!
[PATCH] cpuidle / Documentation: Update cpuidle MAINTAINERS entry
From: Rafael J. Wysocki Update the MAINTAINERS entry for cpuidle by making it clear that it is not just drivers and adding a documentation record to it. Signed-off-by: Rafael J. Wysocki --- MAINTAINERS |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-pm/MAINTAINERS === --- linux-pm.orig/MAINTAINERS +++ linux-pm/MAINTAINERS @@ -3945,13 +3945,14 @@ S: Supported F: drivers/cpuidle/cpuidle-exynos.c F: arch/arm/mach-exynos/pm.c -CPUIDLE DRIVERS +CPU IDLE TIME MANAGEMENT FRAMEWORK M: "Rafael J. Wysocki" M: Daniel Lezcano L: linux...@vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git B: https://bugzilla.kernel.org +F: Documentation/admin-guide/pm/cpuidle.rst F: drivers/cpuidle/* F: include/linux/cpuidle.h
[PATCH] cpufreq / Documentation: Update cpufreq MAINTAINERS entry
From: Rafael J. Wysocki Update the MAINTAINERS entry for cpufreq by making it clear that it is not just drivers and adding current documentation records to it. Signed-off-by: Rafael J. Wysocki --- MAINTAINERS |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-pm/MAINTAINERS === --- linux-pm.orig/MAINTAINERS +++ linux-pm/MAINTAINERS @@ -3888,7 +3888,7 @@ L:net...@vger.kernel.org S: Maintained F: drivers/net/ethernet/ti/cpmac.c -CPU FREQUENCY DRIVERS +CPU FREQUENCY SCALING FRAMEWORK M: "Rafael J. Wysocki" M: Viresh Kumar L: linux...@vger.kernel.org @@ -3896,6 +3896,8 @@ S:Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git T: git git://git.linaro.org/people/vireshk/linux.git (For ARM Updates) B: https://bugzilla.kernel.org +F: Documentation/admin-guide/pm/cpufreq.rst +F: Documentation/admin-guide/pm/intel_pstate.rst F: Documentation/cpu-freq/ F: Documentation/devicetree/bindings/cpufreq/ F: drivers/cpufreq/
Re: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On Wed, Dec 12, 2018 at 10:57 AM Ulf Hansson wrote: > > On Wed, 12 Dec 2018 at 10:46, Peter Zijlstra wrote: > > > > On Tue, Dec 11, 2018 at 10:51:48AM +0100, Rafael J. Wysocki wrote: > > > On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra > > > wrote: > > > > > > Dunno; it could be cold cachelines, at which point it can be fairly > > > > expensive. Also, being stuck with API is fairly horrible if you want to > > > > 'fix' it. > > > > > > All of the cache lines involved should've been touched earlier in this > > > code path by the governor. At least menu and the new one both touch > > > them. > > > > > > The API part I'm not too worried about. I know it is useful and two > > > other people have told that to me already. :-) > > > > Like said on IRC; I mostly wanted to raise the issue of overhead due to > > stats and ABI -- it's something I've been bitten by too many times :/ > > > > If you're confident you're hitting the same lines with the already > > extant accouning (time and usage) and you cannot make the whole lot > > conditional because of ABI (bah) then I'll not stand in the way here. > > > > I agree the numbers are useful, I'm just weary of overhead. > > I tend to agree. So then why not move this to debugfs, it seems like > it's really there it belongs. No? The "usage" and "time" counters are there in sysfs already and they are ABI, so putting the new ones into debugfs only makes them harder to use for no real gain.
Re: [PATCH 1/2] cpufreq: intel_pstate: Force HWP min perf before offline
On Friday, November 16, 2018 11:24:19 PM CET Srinivas Pandruvada wrote: > Force HWP Request MAX = HWP Request MIN = HWP Capability MIN and EPP to > 0xFF. In this way the performance limits on the offlined CPU will not > influence performance limits on its sibling CPU, which is still online. > > If the sibling CPU is calling for higher performance, it will impact the > max core performance. Here core performance will follow higher of the > performance requests from each sibling. > > Reported-and-tested-by: Chen Yu > Signed-off-by: Srinivas Pandruvada > --- > drivers/cpufreq/intel_pstate.c | 28 ++-- > 1 file changed, 26 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 14f551a3d36e..501ab9f4489a 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -830,6 +830,28 @@ static void intel_pstate_hwp_set(unsigned int cpu) > wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value); > } > > +static void intel_pstate_hwp_force_min_perf(int cpu) > +{ > + u64 value; > + int min_perf; > + > + value = all_cpu_data[cpu]->hwp_req_cached; > + value &= ~GENMASK_ULL(31, 0); > + min_perf = HWP_LOWEST_PERF(all_cpu_data[cpu]->hwp_cap_cached); > + > + /* Set hwp_max = hwp_min */ > + value |= HWP_MAX_PERF(min_perf); > + value |= HWP_MIN_PERF(min_perf); > + > + /* Set EPP/EPB to min */ > + if (static_cpu_has(X86_FEATURE_HWP_EPP)) > + value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE); > + else > + intel_pstate_set_epb(cpu, HWP_EPP_BALANCE_POWERSAVE); > + > + wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value); > +} > + > static int intel_pstate_hwp_save_state(struct cpufreq_policy *policy) > { > struct cpudata *cpu_data = all_cpu_data[policy->cpu]; > @@ -2093,10 +2115,12 @@ static void intel_pstate_stop_cpu(struct > cpufreq_policy *policy) > pr_debug("CPU %d exiting\n", policy->cpu); > > intel_pstate_clear_update_util_hook(policy->cpu); > - if (hwp_active) > + if (hwp_active) { > intel_pstate_hwp_save_state(policy); > - else > + intel_pstate_hwp_force_min_perf(policy->cpu); > + } else { > intel_cpufreq_stop_cpu(policy); > + } > } > > static int intel_pstate_cpu_exit(struct cpufreq_policy *policy) > Both this one and the [2/2] applied, thanks!
Re: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On Mon, Dec 10, 2018 at 11:51 PM Peter Zijlstra wrote: > > On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote: > > On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra wrote: > > > > One question on this; why is this tracked unconditionally? > > > > Because I didn't quite see how to make that conditional in a sensible way. > > Something like: > > if (static_branch_unlikely(__tracepoint_idle_above) || > static_branch_unlikely(__tracepoint_idle_below)) { > > // do stuff that calls trace_idle_above() / > // trace_idle_below(). > > } > > > These things are counters and counting with the help of tracepoints > > isn't particularly convenient (and one needs debugfs to be there to > > use tracepoints and they require root access etc). > > Root only should not be a problem for a developer; and aren't these > numbers only really interesting if you're prodding at the idle governor? What about regression testing? > > > Would not a tracepoint be better?; then there is no overhead in the > > > normal case where nobody gives a crap about these here numbers. > > > > There is an existing tracepoint that in principle could be used to > > produce this information, but it is such a major PITA in practice that > > nobody does that. Guess why. :-) > > Sounds like you need to ship a convenient script or something :-) That would be a very ugly script. Also I'd need to expose drv->states[i].disabled somehow (that's not accessible from user space now). > > Also, the "usage" and "time" counters are there in sysfs, so why not these > > two? > > > > And is the overhead really that horrible? > > Dunno; it could be cold cachelines, at which point it can be fairly > expensive. Also, being stuck with API is fairly horrible if you want to > 'fix' it. All of the cache lines involved should've been touched earlier in this code path by the governor. At least menu and the new one both touch them. The API part I'm not too worried about. I know it is useful and two other people have told that to me already. :-)
Re: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra wrote: > > On Mon, Dec 10, 2018 at 12:30:23PM +0100, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Add two new metrics for CPU idle states, "above" and "below", to count > > the number of times the given state had been asked for (or entered > > from the kernel's perspective), but the observed idle duration turned > > out to be too short or too long for it (respectively). > > > > These metrics help to estimate the quality of the CPU idle governor > > in use. > > > > Signed-off-by: Rafael J. Wysocki > > > @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d > > dev->last_residency = (int)diff; > > dev->states_usage[entered_state].time += dev->last_residency; > > dev->states_usage[entered_state].usage++; > > + > > + if (diff < drv->states[entered_state].target_residency) { > > + for (i = entered_state - 1; i >= 0; i--) { > > + if (drv->states[i].disabled || > > + dev->states_usage[i].disable) > > + continue; > > + > > + /* Shallower states are enabled, so update. */ > > + dev->states_usage[entered_state].above++; > > + break; > > + } > > + } else if (diff > delay) { > > + for (i = entered_state + 1; i < drv->state_count; > > i++) { > > + if (drv->states[i].disabled || > > + dev->states_usage[i].disable) > > + continue; > > + > > + /* > > + * Update if a deeper state would have been a > > + * better match for the observed idle > > duration. > > + */ > > + if (diff - delay >= > > drv->states[i].target_residency) > > + > > dev->states_usage[entered_state].below++; > > + > > + break; > > + } > > + } > > One question on this; why is this tracked unconditionally? Because I didn't quite see how to make that conditional in a sensible way. These things are counters and counting with the help of tracepoints isn't particularly convenient (and one needs debugfs to be there to use tracepoints and they require root access etc). > Would not a tracepoint be better?; then there is no overhead in the > normal case where nobody gives a crap about these here numbers. There is an existing tracepoint that in principle could be used to produce this information, but it is such a major PITA in practice that nobody does that. Guess why. :-) Also, the "usage" and "time" counters are there in sysfs, so why not these two? And is the overhead really that horrible?
[PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
From: Rafael J. Wysocki Add two new metrics for CPU idle states, "above" and "below", to count the number of times the given state had been asked for (or entered from the kernel's perspective), but the observed idle duration turned out to be too short or too long for it (respectively). These metrics help to estimate the quality of the CPU idle governor in use. Signed-off-by: Rafael J. Wysocki --- -> v2: Fix a leftover in the documentation from the previous versions of the patch and a typo in the changelog. --- Documentation/ABI/testing/sysfs-devices-system-cpu |7 Documentation/admin-guide/pm/cpuidle.rst | 10 ++ drivers/cpuidle/cpuidle.c | 31 - drivers/cpuidle/sysfs.c|6 include/linux/cpuidle.h|2 + 5 files changed, 55 insertions(+), 1 deletion(-) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d struct cpuidle_state *target_state = &drv->states[index]; bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP); ktime_t time_start, time_end; - s64 diff; /* * Tell the time framework to switch to a broadcast timer because our @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d local_irq_enable(); if (entered_state >= 0) { + s64 diff, delay = drv->states[entered_state].exit_latency; + int i; + /* * Update cpuidle counters * This can be moved to within driver enter routine, @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d dev->last_residency = (int)diff; dev->states_usage[entered_state].time += dev->last_residency; dev->states_usage[entered_state].usage++; + + if (diff < drv->states[entered_state].target_residency) { + for (i = entered_state - 1; i >= 0; i--) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* Shallower states are enabled, so update. */ + dev->states_usage[entered_state].above++; + break; + } + } else if (diff > delay) { + for (i = entered_state + 1; i < drv->state_count; i++) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* +* Update if a deeper state would have been a +* better match for the observed idle duration. +*/ + if (diff - delay >= drv->states[i].target_residency) + dev->states_usage[entered_state].below++; + + break; + } + } } else { dev->last_residency = 0; } Index: linux-pm/include/linux/cpuidle.h === --- linux-pm.orig/include/linux/cpuidle.h +++ linux-pm/include/linux/cpuidle.h @@ -33,6 +33,8 @@ struct cpuidle_state_usage { unsigned long long disable; unsigned long long usage; unsigned long long time; /* in US */ + unsigned long long above; /* Number of times it's been too deep */ + unsigned long long below; /* Number of times it's been too shallow */ #ifdef CONFIG_SUSPEND unsigned long long s2idle_usage; unsigned long long s2idle_time; /* in US */ Index: linux-pm/drivers/cpuidle/sysfs.c === --- linux-pm.orig/drivers/cpuidle/sysfs.c +++ linux-pm/drivers/cpuidle/sysfs.c @@ -301,6 +301,8 @@ define_show_state_str_function(name) define_show_state_str_function(desc) define_show_state_ull_function(disable) define_store_state_ull_function(disable) +define_show_state_ull_function(above) +define_show_state_ull_function(below) define_one_state_ro(name, show_state_name); define_one_state_ro(desc, show_state_desc); @@ -310,6 +312,8 @@ define_one_state_ro(power, show_state_po define_one_state_ro(usage, show_state_usage); define_one_state_ro(time, show_state_time); define_one_state_rw(disable, show_state_disable, store_state_disab
Re: [PATCH] cpuidle: Add 'above' and 'below' idle state metrics
On Fri, Dec 7, 2018 at 1:57 PM Lorenzo Pieralisi wrote: > > On Fri, Dec 07, 2018 at 12:57:00PM +0100, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > Subject: [PATCH] cpuidle: Add 'above' and 'below' idle state metrics > > > > Add two new metrics for CPU idle states, "above" and "below", to count > > the number of times the given state had been asked for (or entered > > from the kernel's perspective), but the observed idle duration turned > > out to be too short or too long for it (respectively). > > > > These mertics help to estimate the quality of the CPU idle governor > > s/mertics/metrics Right, thanks! > > in use. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > > > This is a replacement for https://patchwork.kernel.org/patch/10714995/ with > > the metrics renamed and some documentation confusion cleaned up. Thanks! > > > > --- > > Documentation/ABI/testing/sysfs-devices-system-cpu |7 > > Documentation/admin-guide/pm/cpuidle.rst | 10 ++ > > drivers/cpuidle/cpuidle.c | 31 > > - > > drivers/cpuidle/sysfs.c|6 > > include/linux/cpuidle.h|2 + > > 5 files changed, 55 insertions(+), 1 deletion(-) > > > > Index: linux-pm/drivers/cpuidle/cpuidle.c > > === > > --- linux-pm.orig/drivers/cpuidle/cpuidle.c > > +++ linux-pm/drivers/cpuidle/cpuidle.c > > @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d > > struct cpuidle_state *target_state = &drv->states[index]; > > bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP); > > ktime_t time_start, time_end; > > - s64 diff; > > > > /* > >* Tell the time framework to switch to a broadcast timer because our > > @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d > > local_irq_enable(); > > > > if (entered_state >= 0) { > > + s64 diff, delay = drv->states[entered_state].exit_latency; > > I am probably pointing out something that has been already debated, > apologies if so. > > exit_latency is the *worst* case exit latency for idle states that involve > multiple CPUs, we can't say for certain it is the latency that was > actually experienced by the idle state exit. Right. > It can be microseconds (eg CPU resume) vs milliseconds (eg groups of > cpus resume). > > I think the current approach (which may only understimate the "below" by > substracting the worst case value) is reasonable but I pointed this out > since I do not know how these stats will be used. This is on purpose. I want to count the cases when the selected state has been off for certain. Thanks, Rafael
Re: [PATCH] cpuidle: Add 'above' and 'below' idle state metrics
On Friday, December 7, 2018 1:02:05 PM CET Quentin Perret wrote: > Hi Rafael, > > On Friday 07 Dec 2018 at 12:57:00 (+0100), Rafael J. Wysocki wrote: > > --- linux-pm.orig/Documentation/ABI/testing/sysfs-devices-system-cpu > > +++ linux-pm/Documentation/ABI/testing/sysfs-devices-system-cpu > > @@ -145,6 +145,8 @@ What: /sys/devices/system/cpu/cpuX/cpui > > /sys/devices/system/cpu/cpuX/cpuidle/stateN/power > > /sys/devices/system/cpu/cpuX/cpuidle/stateN/time > > /sys/devices/system/cpu/cpuX/cpuidle/stateN/usage > > + /sys/devices/system/cpu/cpuX/cpuidle/stateN/high > > That should be s/high/above I suppose. Right, thanks for spotting this. :-) > > + /sys/devices/system/cpu/cpuX/cpuidle/stateN/below > > Other than that this seems really useful :-) Thanks!
[PATCH] cpuidle: Add 'above' and 'below' idle state metrics
From: Rafael J. Wysocki Subject: [PATCH] cpuidle: Add 'above' and 'below' idle state metrics Add two new metrics for CPU idle states, "above" and "below", to count the number of times the given state had been asked for (or entered from the kernel's perspective), but the observed idle duration turned out to be too short or too long for it (respectively). These mertics help to estimate the quality of the CPU idle governor in use. Signed-off-by: Rafael J. Wysocki --- This is a replacement for https://patchwork.kernel.org/patch/10714995/ with the metrics renamed and some documentation confusion cleaned up. Thanks! --- Documentation/ABI/testing/sysfs-devices-system-cpu |7 Documentation/admin-guide/pm/cpuidle.rst | 10 ++ drivers/cpuidle/cpuidle.c | 31 - drivers/cpuidle/sysfs.c|6 include/linux/cpuidle.h|2 + 5 files changed, 55 insertions(+), 1 deletion(-) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d struct cpuidle_state *target_state = &drv->states[index]; bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP); ktime_t time_start, time_end; - s64 diff; /* * Tell the time framework to switch to a broadcast timer because our @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d local_irq_enable(); if (entered_state >= 0) { + s64 diff, delay = drv->states[entered_state].exit_latency; + int i; + /* * Update cpuidle counters * This can be moved to within driver enter routine, @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d dev->last_residency = (int)diff; dev->states_usage[entered_state].time += dev->last_residency; dev->states_usage[entered_state].usage++; + + if (diff < drv->states[entered_state].target_residency) { + for (i = entered_state - 1; i >= 0; i--) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* Shallower states are enabled, so update. */ + dev->states_usage[entered_state].above++; + break; + } + } else if (diff > delay) { + for (i = entered_state + 1; i < drv->state_count; i++) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* +* Update if a deeper state would have been a +* better match for the observed idle duration. +*/ + if (diff - delay >= drv->states[i].target_residency) + dev->states_usage[entered_state].below++; + + break; + } + } } else { dev->last_residency = 0; } Index: linux-pm/include/linux/cpuidle.h === --- linux-pm.orig/include/linux/cpuidle.h +++ linux-pm/include/linux/cpuidle.h @@ -33,6 +33,8 @@ struct cpuidle_state_usage { unsigned long long disable; unsigned long long usage; unsigned long long time; /* in US */ + unsigned long long above; /* Number of times it's been too deep */ + unsigned long long below; /* Number of times it's been too shallow */ #ifdef CONFIG_SUSPEND unsigned long long s2idle_usage; unsigned long long s2idle_time; /* in US */ Index: linux-pm/drivers/cpuidle/sysfs.c === --- linux-pm.orig/drivers/cpuidle/sysfs.c +++ linux-pm/drivers/cpuidle/sysfs.c @@ -301,6 +301,8 @@ define_show_state_str_function(name) define_show_state_str_function(desc) define_show_state_ull_function(disable) define_store_state_ull_function(disable) +define_show_state_ull_function(above) +define_show_state_ull_function(below) define_one_state_ro(name, show_state_name); define_one_state_ro(desc, show_state_desc); @@ -310,6 +312,8 @@ define_one_state_ro(power, show_state_po define_one_state_ro(usage, s
Re: [PATCH] cpuidle: Add 'high' and 'low' idle state metrics
On Thu, Dec 6, 2018 at 12:08 AM Doug Smythies wrote: > > Hi Rafael, > > On 2018.12.03 04:32 Rafael J. Wysocki wrote: > > > Add two new metrics for CPU idle states, "high" and "low", to count > > the number of times the given state had been asked for (or entered > > from the kernel's perspective), but the observed idle duration turned > > out to be too high or too low for it (respectively). > > I wonder about the "high" "low" terminology here. I took these names, because they are concise and simple. I could use "below" and "above" respectively I guess. What about these? > > These mertics help to estimat the quality of the CPU idle governor > > in use. > > Yes, very useful. Thanks. > > Here the terms are mixed with "deep" and "shallow" > > > + unsigned long long high; /* Number of times it's been too deep */ > > + unsigned long long low; /* Number of times it's been too > > shallow */ > > > +``high`` > > + Total number of times this idle state had been asked for, but the > > + observed idle duration was too short to match its target residency. > > + > > O.K. To avoid ambiguity, how about naming them "too_short" and "too_long"? Well, I guess the "too_" prefix may be skipped, so "short" and "long" could be used, but that's about the state, not about the idle duration. The state cannot be "too long" or "too short", arguably. It might be "too deep" or "too shallow". > > +``low`` > > + Total number of times this idle state had been asked for, but a deeper > > + idle state would have been a better match for the observed idle > > duration. > > Even though I read the patch, by the time I actually looked at the numbers I > had > forgotten the meaning. I had look at idle state 0 and 4 (the deepest for my > computer) > to figure it out: > > doug@s15:~/c$ cat /sys/devices/system/cpu/cpu0/cpuidle/state0/low > 259871 > doug@s15:~/c$ cat /sys/devices/system/cpu/cpu0/cpuidle/state0/high > 0 > doug@s15:~/c$ cat /sys/devices/system/cpu/cpu0/cpuidle/state4/low > 0 > doug@s15:~/c$ cat /sys/devices/system/cpu/cpu0/cpuidle/state4/high > 5584 > > Because state 0 can not be too short and state 4 can not be too long. Where "state" actually means "the target residency of state" I suppose? :-)
[PATCH] cpuidle: Add cpuidle.governor= command line parameter
From: Rafael J. Wysocki Add cpuidle.governor= command line parameter to allow the default cpuidle governor to be replaced. That is useful, for example, if someone running a tickful kernel wants to use the menu governor on it. Signed-off-by: Rafael J. Wysocki --- On top of https://patchwork.kernel.org/patch/10714995/ and https://patchwork.kernel.org/patch/10705317/ --- Documentation/admin-guide/kernel-parameters.txt |3 +++ Documentation/admin-guide/pm/cpuidle.rst|7 +++ drivers/cpuidle/cpuidle.c |1 + drivers/cpuidle/cpuidle.h |1 + drivers/cpuidle/governor.c |9 +++-- 5 files changed, 19 insertions(+), 2 deletions(-) Index: linux-pm/drivers/cpuidle/governor.c === --- linux-pm.orig/drivers/cpuidle/governor.c +++ linux-pm/drivers/cpuidle/governor.c @@ -11,10 +11,13 @@ #include #include #include +#include #include #include "cpuidle.h" +char param_governor[CPUIDLE_NAME_LEN]; + LIST_HEAD(cpuidle_governors); struct cpuidle_governor *cpuidle_curr_governor; @@ -86,9 +89,11 @@ int cpuidle_register_governor(struct cpu mutex_lock(&cpuidle_lock); if (__cpuidle_find_governor(gov->name) == NULL) { ret = 0; - list_add_tail(&gov->governor_list, &cpuidle_governors); if (!cpuidle_curr_governor || - cpuidle_curr_governor->rating < gov->rating) + !strncasecmp(param_governor, gov->name, CPUIDLE_NAME_LEN) || + (cpuidle_curr_governor->rating < gov->rating && +strncasecmp(param_governor, cpuidle_curr_governor->name, +CPUIDLE_NAME_LEN))) cpuidle_switch_governor(gov); } mutex_unlock(&cpuidle_lock); Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -731,4 +731,5 @@ static int __init cpuidle_init(void) } module_param(off, int, 0444); +module_param_string(governor, param_governor, CPUIDLE_NAME_LEN, 0444); core_initcall(cpuidle_init); Index: linux-pm/drivers/cpuidle/cpuidle.h === --- linux-pm.orig/drivers/cpuidle/cpuidle.h +++ linux-pm/drivers/cpuidle/cpuidle.h @@ -7,6 +7,7 @@ #define __DRIVER_CPUIDLE_H /* For internal use only */ +extern char param_governor[]; extern struct cpuidle_governor *cpuidle_curr_governor; extern struct list_head cpuidle_governors; extern struct list_head cpuidle_detected_devices; Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst === --- linux-pm.orig/Documentation/admin-guide/pm/cpuidle.rst +++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst @@ -576,6 +576,13 @@ processors implementing the architecture however, so it is rather crude and not very energy-efficient. For this reason, it is not recommended for production use. +The ``cpuidle.governor=`` kernel command line switch allows the ``CPUIdle`` +governor to use to be specified. It has to be appended with a string matching +the name of an available governor (e.g. ``cpuidle.governor=menu``) and that +governor will be used instead of the default one. It is possible to force +the ``menu`` governor to be used on the systems that use the ``ladder`` governor +by default this way, for example. + The other kernel command line parameters controlling CPU idle time management described below are only relevant for the *x86* architecture and some of them affect Intel processors only. Index: linux-pm/Documentation/admin-guide/kernel-parameters.txt === --- linux-pm.orig/Documentation/admin-guide/kernel-parameters.txt +++ linux-pm/Documentation/admin-guide/kernel-parameters.txt @@ -674,6 +674,9 @@ cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system + cpuidle.governor= + [CPU_IDLE] Name of the cpuidle governor to use. + cpufreq.off=1 [CPU_FREQ] disable the cpufreq sub-system
[PATCH v2] cpuidle: Add 'high' and 'low' idle state metrics
From: Rafael J. Wysocki Subject: [PATCH] cpuidle: Add 'high' and 'low' idle state metrics Add two new metrics for CPU idle states, "high" and "low", to count the number of times the given state had been asked for (or entered from the kernel's perspective), but the observed idle duration turned out to be too high or too low for it (respectively). These mertics help to estimate the quality of the CPU idle governor in use. Signed-off-by: Rafael J. Wysocki --- -> v2: Compute the "low" metric as the number of times the entered state has been too shallow for sure (and not just probably) by taking that state's exit latency into account when computing it (the "high" metric computation did that in the previous version too). --- Documentation/ABI/testing/sysfs-devices-system-cpu |7 Documentation/admin-guide/pm/cpuidle.rst | 10 ++ drivers/cpuidle/cpuidle.c | 31 - drivers/cpuidle/sysfs.c|6 include/linux/cpuidle.h|2 + 5 files changed, 55 insertions(+), 1 deletion(-) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -202,7 +202,6 @@ int cpuidle_enter_state(struct cpuidle_d struct cpuidle_state *target_state = &drv->states[index]; bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP); ktime_t time_start, time_end; - s64 diff; /* * Tell the time framework to switch to a broadcast timer because our @@ -248,6 +247,9 @@ int cpuidle_enter_state(struct cpuidle_d local_irq_enable(); if (entered_state >= 0) { + s64 diff, delay = drv->states[entered_state].exit_latency; + int i; + /* * Update cpuidle counters * This can be moved to within driver enter routine, @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d dev->last_residency = (int)diff; dev->states_usage[entered_state].time += dev->last_residency; dev->states_usage[entered_state].usage++; + + if (diff < drv->states[entered_state].target_residency) { + for (i = entered_state - 1; i >= 0; i--) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* Shallower states are enabled, so update. */ + dev->states_usage[entered_state].high++; + break; + } + } else if (diff > delay) { + for (i = entered_state + 1; i < drv->state_count; i++) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* +* Update if a deeper state would have been a +* better match for the observed idle duration. +*/ + if (diff - delay >= drv->states[i].target_residency) + dev->states_usage[entered_state].low++; + + break; + } + } } else { dev->last_residency = 0; } Index: linux-pm/include/linux/cpuidle.h === --- linux-pm.orig/include/linux/cpuidle.h +++ linux-pm/include/linux/cpuidle.h @@ -33,6 +33,8 @@ struct cpuidle_state_usage { unsigned long long disable; unsigned long long usage; unsigned long long time; /* in US */ + unsigned long long high; /* Number of times it's been too deep */ + unsigned long long low; /* Number of times it's been too shallow */ #ifdef CONFIG_SUSPEND unsigned long long s2idle_usage; unsigned long long s2idle_time; /* in US */ Index: linux-pm/drivers/cpuidle/sysfs.c === --- linux-pm.orig/drivers/cpuidle/sysfs.c +++ linux-pm/drivers/cpuidle/sysfs.c @@ -301,6 +301,8 @@ define_show_state_str_function(name) define_show_state_str_function(desc) define_show_state_ull_function(disable) define_store_state_ull_function(disable) +define_show_state_ull_function(high) +define_show_state_ull_function(low) define_one_state_ro(name, show_sta
[PATCH] cpuidle: poll_state: Disregard disable idle states
From: Rafael J. Wysocki When computing the limit of time to spend in the loop in poll_idle(), use the target residency of the first enabled idle state deeper than state 0 instead of always using the target residency of state 1. This helps when state 1 is disabled for diagnostics, for instance. Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/poll_state.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) Index: linux-pm/drivers/cpuidle/poll_state.c === --- linux-pm.orig/drivers/cpuidle/poll_state.c +++ linux-pm/drivers/cpuidle/poll_state.c @@ -20,8 +20,17 @@ static int __cpuidle poll_idle(struct cp local_irq_enable(); if (!current_set_polling_and_test()) { - u64 limit = (u64)drv->states[1].target_residency * NSEC_PER_USEC; unsigned int loop_count = 0; + u64 limit = TICK_USEC; + int i; + + for (i = 1; i < drv->state_count; i++) { + if (drv->states[i].disabled || dev->states_usage[i].disable) + continue; + + limit = (u64)drv->states[i].target_residency * NSEC_PER_USEC; + break; + } while (!need_resched()) { cpu_relax();
[PATCH] cpuidle: Add 'high' and 'low' idle state metrics
From: Rafael J. Wysocki Add two new metrics for CPU idle states, "high" and "low", to count the number of times the given state had been asked for (or entered from the kernel's perspective), but the observed idle duration turned out to be too high or too low for it (respectively). These mertics help to estimat the quality of the CPU idle governor in use. Signed-off-by: Rafael J. Wysocki --- On top of https://patchwork.kernel.org/patch/10705317/ --- Documentation/ABI/testing/sysfs-devices-system-cpu |7 + Documentation/admin-guide/pm/cpuidle.rst |8 + drivers/cpuidle/cpuidle.c | 29 + drivers/cpuidle/sysfs.c|6 include/linux/cpuidle.h|2 + 5 files changed, 52 insertions(+) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -248,6 +248,8 @@ int cpuidle_enter_state(struct cpuidle_d local_irq_enable(); if (entered_state >= 0) { + int i; + /* * Update cpuidle counters * This can be moved to within driver enter routine, @@ -260,6 +262,33 @@ int cpuidle_enter_state(struct cpuidle_d dev->last_residency = (int)diff; dev->states_usage[entered_state].time += dev->last_residency; dev->states_usage[entered_state].usage++; + + if (diff < drv->states[entered_state].target_residency) { + for (i = entered_state - 1; i >= 0; i--) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* Shallower states are enabled, so update. */ + dev->states_usage[entered_state].high++; + break; + } + } else { + for (i = entered_state + 1; i < drv->state_count; i++) { + if (drv->states[i].disabled || + dev->states_usage[i].disable) + continue; + + /* +* Update if a deeper state would have been a +* better match for the observed idle duration. +*/ + if (diff >= drv->states[i].target_residency) + dev->states_usage[entered_state].low++; + + break; + } + } } else { dev->last_residency = 0; } Index: linux-pm/include/linux/cpuidle.h === --- linux-pm.orig/include/linux/cpuidle.h +++ linux-pm/include/linux/cpuidle.h @@ -33,6 +33,8 @@ struct cpuidle_state_usage { unsigned long long disable; unsigned long long usage; unsigned long long time; /* in US */ + unsigned long long high; /* Number of times it's been too deep */ + unsigned long long low; /* Number of times it's been too shallow */ #ifdef CONFIG_SUSPEND unsigned long long s2idle_usage; unsigned long long s2idle_time; /* in US */ Index: linux-pm/drivers/cpuidle/sysfs.c === --- linux-pm.orig/drivers/cpuidle/sysfs.c +++ linux-pm/drivers/cpuidle/sysfs.c @@ -301,6 +301,8 @@ define_show_state_str_function(name) define_show_state_str_function(desc) define_show_state_ull_function(disable) define_store_state_ull_function(disable) +define_show_state_ull_function(high) +define_show_state_ull_function(low) define_one_state_ro(name, show_state_name); define_one_state_ro(desc, show_state_desc); @@ -310,6 +312,8 @@ define_one_state_ro(power, show_state_po define_one_state_ro(usage, show_state_usage); define_one_state_ro(time, show_state_time); define_one_state_rw(disable, show_state_disable, store_state_disable); +define_one_state_ro(high, show_state_high); +define_one_state_ro(low, show_state_low); static struct attribute *cpuidle_state_default_attrs[] = { &attr_name.attr, @@ -320,6 +324,8 @@ static struct attribute *cpuidle_state_d &attr_usage.attr, &attr_time.attr, &attr_disable.attr, + &attr_high.attr, + &attr_low.attr, NULL }; Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst === --- linux-pm.orig/Docume
Re: [PATCH v3] Documentation: admin-guide: PM: Add cpuidle document
On Thu, Nov 29, 2018 at 10:40 PM Jonathan Corbet wrote: > > On Thu, 29 Nov 2018 22:22:04 +0100 > "Rafael J. Wysocki" wrote: > > > Important information is missing from user/admin cpuidle documentation > > available today, so add a new user/admin document for cpuidle containing > > current and comprehensive information to admin-guide and drop the old > > .txt documents it is replacing. > > I've not had a chance to read through it in depth, but a first scan looks > good. Thanks! > I assume this is going up via your tree? Yes, I would like to take it as there may be new patches on top of it. Cheers, Rafael
[PATCH v3] Documentation: admin-guide: PM: Add cpuidle document
From: Rafael J. Wysocki Important information is missing from user/admin cpuidle documentation available today, so add a new user/admin document for cpuidle containing current and comprehensive information to admin-guide and drop the old .txt documents it is replacing. Signed-off-by: Rafael J. Wysocki Reviewed-by: Viresh Kumar Reviewed-by: Ulf Hansson --- v2 -> v3: * Add paragraph covering processors without automatic coordination between different herarchy levels in hardware (Ulf). * Add Reviewed-by tags from Viresh and Ulf. -> v2: * Fix typos. * Improve description of the state0, state1, etc directories to cover the relationship between their numbers and the represented idle states depth (Viresh). --- Documentation/admin-guide/pm/cpuidle.rst | 615 + Documentation/admin-guide/pm/working-state.rst |1 Documentation/cpuidle/core.txt | 23 Documentation/cpuidle/sysfs.txt| 98 --- 4 files changed, 616 insertions(+), 121 deletions(-) Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst === --- /dev/null +++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst @@ -0,0 +1,615 @@ +.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state ` +.. |cpufreq| replace:: :doc:`CPU Performance Scaling ` + + +CPU Idle Time Management + + +:: + + Copyright (c) 2018 Intel Corp., Rafael J. Wysocki + +Concepts + + +Modern processors are generally able to enter states in which the execution of +a program is suspended and instructions belonging to it are not fetched from +memory or executed. Those states are the *idle* states of the processor. + +Since part of the processor hardware is not used in idle states, entering them +generally allows power drawn by the processor to be reduced and, in consequence, +it is an opportunity to save energy. + +CPU idle time management is an energy-efficiency feature concerned about using +the idle states of processors for this purpose. + +Logical CPUs + + +CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that +is the part of the kernel responsible for the distribution of computational +work in the system). In its view, CPUs are *logical* units. That is, they need +not be separate physical entities and may just be interfaces appearing to +software as individual single-core processors. In other words, a CPU is an +entity which appears to be fetching instructions that belong to one sequence +(program) from memory and executing them, but it need not work this way +physically. Generally, three different cases can be consider here. + +First, if the whole processor can only follow one sequence of instructions (one +program) at a time, it is a CPU. In that case, if the hardware is asked to +enter an idle state, that applies to the processor as a whole. + +Second, if the processor is multi-core, each core in it is able to follow at +least one program at a time. The cores need not be entirely independent of each +other (for example, they may share caches), but still most of the time they +work physically in parallel with each other, so if each of them executes only +one program, those programs run mostly independently of each other at the same +time. The entire cores are CPUs in that case and if the hardware is asked to +enter an idle state, that applies to the core that asked for it in the first +place, but it also may apply to a larger unit (say a "package" or a "cluster") +that the core belongs to (in fact, it may apply to an entire hierarchy of larger +units containing the core). Namely, if all of the cores in the larger unit +except for one have been put into idle states at the "core level" and the +remaining core asks the processor to enter an idle state, that may trigger it +to put the whole larger unit into an idle state which also will affect the +other cores in that unit. + +Finally, each core in a multi-core processor may be able to follow more than one +program in the same time frame (that is, each core may be able to fetch +instructions from multiple locations in memory and execute them in the same time +frame, but not necessarily entirely in parallel with each other). In that case +the cores present themselves to software as "bundles" each consisting of +multiple individual single-core "processors", referred to as *hardware threads* +(or hyper-threads specifically on Intel hardware), that each can follow one +sequence of instructions. Then, the hardware threads are CPUs from the CPU idle +time management perspective and if the processor is asked to enter an idle state +by one of them, the hardware thread (or CPU) that asked for it is stopped, but +nothing more happens, unless all of the other hardware threads within the same +core also h
Re: [PATCH v2] Documentation: admin-guide: PM: Add cpuidle document
On Thu, Nov 29, 2018 at 9:07 AM Ulf Hansson wrote: > > On Wed, 28 Nov 2018 at 12:47, Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki > > > > Important information is missing from user/admin cpuidle documentation > > available today, so add a new user/admin document for cpuidle containing > > current and comprehensive information to admin-guide and drop the old > > .txt documents it is replacing. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > Wow! Great work! > > [...] > > > +.. _idle-states-representation: > > + > > +Representation of Idle States > > += > > + > > +For the CPU idle time management purposes all of the physical idle states > > +supported by the processor have to be represented as a one-dimensional > > array of > > +|struct cpuidle_state| objects each allowing an individual (logical) CPU > > to ask > > +the processor hardware to enter an idle state of certain properties. If > > there > > +is a hierarchy of units in the processor, one |struct cpuidle_state| > > object can > > +cover a combination of idle states supported by the units at different > > levels of > > +the hierarchy. In that case, the `target residency and exit latency > > parameters > > +of it `_, must reflect the properties of the idle state at the > > +deepest level (i.e. the idle state of the unit containing all of the other > > +units). > > + > > +For example, take a processor with two cores in a larger unit referred to > > as > > +a "module" and suppose that asking the hardware to enter a specific idle > > state > > +(say "X") at the "core" level by one core will trigger the module to try to > > +enter a specific idle state of its own (say "MX") if the other core is in > > idle > > +state "X" already. In other words, asking for idle state "X" at the "core" > > +level gives the hardware a license to go as deep as to idle state "MX" at > > the > > +"module" level, but there is no guarantee that this is going to happen > > (the core > > +asking for idle state "X" may just end up in that state by itself instead). > > This is a good description of a HW like x86 and ARM64 (using PSCI in > platform coordinated mode). > > However, for "legacy" ARM products, having at least two cores in a > module, this description does in many cases *not* match how the HW > works. I wonder about other architectures. Well, this is an example and it is very clearly presented this way IMO (it says "for example" upfront and then "suppose" etc). Basically, the purpose of it is to explain that the parameters shown in sysfs may be effective. > My point is, I think it would be fair to mention that the above > description is specific to certain HWs. In other words, for HWs that > leaves the entire state synchronization of the last man standing > algorithm to be managed by a FW, the cpuidle framework plays well. But > for others it has limitations. I can add it after this entire paragraph I think. Something like "There are processors without direct coordination between different levels of the hierarchy of units inside them, however. In those cases asking for an idle state at the "core" level does not automatically affect the "module" level, for example, in any way and the CPUIdle driver is responsible for the entire handling of the hierarchy. Then, the definition of the idle state objects is entirely up to the driver, but still the physical properties of the idle state that the processor hardware finally goes into must always follow the parameters used by the governor for idle state selection (for instance, the actual exit latency of that idle state must not exceed the exit latency parameter of the idle state object selected by the governor)." > > > +Then, the target residency of the |struct cpuidle_state| object > > representing > > +idle state "X" must reflect the minimum time to spend in idle state "MX" of > > +the module (including the time needed to enter it), because that is the > > minimum > > +time the CPU needs to be idle to save any energy in case the hardware > > enters > > +that state. Analogously, the exit latency parameter of that object must > > cover > > +the exit time of idle state "MX" of the module (and usually its entry time > > too), > > +because that is the maximum delay between a wakeup signal and the time the > > CPU > > +will start to execute the first new instructi
[PATCH v2] Documentation: admin-guide: PM: Add cpuidle document
From: Rafael J. Wysocki Important information is missing from user/admin cpuidle documentation available today, so add a new user/admin document for cpuidle containing current and comprehensive information to admin-guide and drop the old .txt documents it is replacing. Signed-off-by: Rafael J. Wysocki --- -> v2: * Fix typos. * Improve description of the state0, state1, etc directories to cover the relationship between their numbers and the represented idle states depth (Viresh). --- Documentation/admin-guide/pm/cpuidle.rst | 604 + Documentation/admin-guide/pm/working-state.rst |1 Documentation/cpuidle/core.txt | 23 Documentation/cpuidle/sysfs.txt| 98 4 files changed, 605 insertions(+), 121 deletions(-) Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst === --- /dev/null +++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst @@ -0,0 +1,604 @@ +.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state ` +.. |cpufreq| replace:: :doc:`CPU Performance Scaling ` + + +CPU Idle Time Management + + +:: + + Copyright (c) 2018 Intel Corp., Rafael J. Wysocki + +Concepts + + +Modern processors are generally able to enter states in which the execution of +a program is suspended and instructions belonging to it are not fetched from +memory or executed. Those states are the *idle* states of the processor. + +Since part of the processor hardware is not used in idle states, entering them +generally allows power drawn by the processor to be reduced and, in consequence, +it is an opportunity to save energy. + +CPU idle time management is an energy-efficiency feature concerned about using +the idle states of processors for this purpose. + +Logical CPUs + + +CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that +is the part of the kernel responsible for the distribution of computational +work in the system). In its view, CPUs are *logical* units. That is, they need +not be separate physical entities and may just be interfaces appearing to +software as individual single-core processors. In other words, a CPU is an +entity which appears to be fetching instructions that belong to one sequence +(program) from memory and executing them, but it need not work this way +physically. Generally, three different cases can be consider here. + +First, if the whole processor can only follow one sequence of instructions (one +program) at a time, it is a CPU. In that case, if the hardware is asked to +enter an idle state, that applies to the processor as a whole. + +Second, if the processor is multi-core, each core in it is able to follow at +least one program at a time. The cores need not be entirely independent of each +other (for example, they may share caches), but still most of the time they +work physically in parallel with each other, so if each of them executes only +one program, those programs run mostly independently of each other at the same +time. The entire cores are CPUs in that case and if the hardware is asked to +enter an idle state, that applies to the core that asked for it in the first +place, but it also may apply to a larger unit (say a "package" or a "cluster") +that the core belongs to (in fact, it may apply to an entire hierarchy of larger +units containing the core). Namely, if all of the cores in the larger unit +except for one have been put into idle states at the "core level" and the +remaining core asks the processor to enter an idle state, that may trigger it +to put the whole larger unit into an idle state which also will affect the +other cores in that unit. + +Finally, each core in a multi-core processor may be able to follow more than one +program in the same time frame (that is, each core may be able to fetch +instructions from multiple locations in memory and execute them in the same time +frame, but not necessarily entirely in parallel with each other). In that case +the cores present themselves to software as "bundles" each consisting of +multiple individual single-core "processors", referred to as *hardware threads* +(or hyper-threads specifically on Intel hardware), that each can follow one +sequence of instructions. Then, the hardware threads are CPUs from the CPU idle +time management perspective and if the processor is asked to enter an idle state +by one of them, the hardware thread (or CPU) that asked for it is stopped, but +nothing more happens, unless all of the other hardware threads within the same +core also have asked the processor to enter an idle state. In that situation, +the core may be put into an idle state individually or a larger unit containing +it may be put into an idle state as a whole (if the other cores within the +
Re: [PATCH] Documentation: admin-guide: PM: Add cpuidle document
On Wed, Nov 28, 2018 at 10:31 AM Viresh Kumar wrote: > > On 28-11-18, 10:11, Rafael J. Wysocki wrote: > > On Wed, Nov 28, 2018 at 6:48 AM Viresh Kumar > > wrote: > > > > > > On 26-11-18, 14:11, Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki > > > > > > > > Important information is missing from user/admin cpuidle documentation > > > > available today, so add a new user/admin document for cpuidle containing > > > > current and comprehensive information to admin-guide and drop the old > > > > .txt documents it is replacing. > > > > > > > > Signed-off-by: Rafael J. Wysocki > > > > --- > > > > Documentation/admin-guide/pm/cpuidle.rst | 603 > > > > + > > > > Documentation/admin-guide/pm/working-state.rst |1 > > > > Documentation/cpuidle/core.txt | 23 > > > > Documentation/cpuidle/sysfs.txt| 98 > > > > 4 files changed, 604 insertions(+), 121 deletions(-) > > > > > > Nice work Rafael. Minor nits below.. > > > > Thanks for the typo fixes. The spellchecker I have here evidently doesn't > > work. > > I have this in my .vimrc and I am shown these spelling mistakes somewhat > forcefully :) > > set spell spelllang=en_us Interestingly enough, it appears to work when I turn the automatic spellchecking, which I don't do for code as a rule, because it distracts me. I will need to do that for docs going forward it seems, though. BTW, I didn't respond to the remark about the ladder governor. I have no plans to describe it at this time and that can be done at any time later easily enough if anyone wants to do it. Thanks, Rafael
Re: [PATCH] Documentation: admin-guide: PM: Add cpuidle document
On Wed, Nov 28, 2018 at 6:48 AM Viresh Kumar wrote: > > On 26-11-18, 14:11, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Important information is missing from user/admin cpuidle documentation > > available today, so add a new user/admin document for cpuidle containing > > current and comprehensive information to admin-guide and drop the old > > .txt documents it is replacing. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > Documentation/admin-guide/pm/cpuidle.rst | 603 > > + > > Documentation/admin-guide/pm/working-state.rst |1 > > Documentation/cpuidle/core.txt | 23 > > Documentation/cpuidle/sysfs.txt| 98 > > 4 files changed, 604 insertions(+), 121 deletions(-) > > Nice work Rafael. Minor nits below.. Thanks for the typo fixes. The spellchecker I have here evidently doesn't work. [cut] > Maybe I missed, but I couldn't find any text that says what state 0, 1, ... N > mean. There is a paragraph on that above. > Like which is the deepest idle state and which one is the shallowest. But this part is missing, good catch! Thanks, Rafael
[PATCH] Documentation: admin-guide: PM: Add cpuidle document
From: Rafael J. Wysocki Important information is missing from user/admin cpuidle documentation available today, so add a new user/admin document for cpuidle containing current and comprehensive information to admin-guide and drop the old .txt documents it is replacing. Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/pm/cpuidle.rst | 603 + Documentation/admin-guide/pm/working-state.rst |1 Documentation/cpuidle/core.txt | 23 Documentation/cpuidle/sysfs.txt| 98 4 files changed, 604 insertions(+), 121 deletions(-) Index: linux-pm/Documentation/admin-guide/pm/cpuidle.rst === --- /dev/null +++ linux-pm/Documentation/admin-guide/pm/cpuidle.rst @@ -0,0 +1,603 @@ +.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state ` +.. |cpufreq| replace:: :doc:`CPU Performance Scaling ` + + +CPU Idle Time Management + + +:: + + Copyright (c) 2018 Intel Corp., Rafael J. Wysocki + +Concepts + + +Modern processors are generally able to enter states in which the execution of +a program is suspended and instructions belonging to it are not fetched from +memory or executed. Those states are the *idle* states of the processor. + +Since part of the processor hardware is not used in idle states, entering them +generally allows power drawn by the processor to be reduced and, in consequence, +it is an opportunity to save energy. + +CPU idle time management is an energy-efficiency feature concerned about using +the idle states of processors for this purpose. + +Logical CPUs + + +CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that +is the part of the kernel responsible for the distribution of computational +work in the system). In its view, CPUs are *logical* units. That is, they need +not be separate physical entities and may just be interfaces appearing to +software as individual single-core processors. In other words, a CPU is an +entity which appears to be fetching instructions that belong to one sequence +(program) from memory and executing them, but it need not work this way +physically. Generally, three different cases can be consider here. + +First, if the whole processor can only follow one sequence of instructions (one +program) at a time, it is a CPU. In that case, if the hardware is asked to +enter an idle state, that applies to the processor as a whole. + +Second, if the processor is multi-core, each core in it is able to follow at +least one program at a time. The cores need not be entirely independent of each +other (for example, they may share caches), but still most of the time they +work physically in parallel with each other, so if each of them executes only +one program, those programs run mostly independently of each other at the same +time. The entire cores are CPUs in that case and if the hardware is asked to +enter an idle state, that applies to the core that asked for it in the first +place, but it also may apply to a larger unit (say a "package" or a "cluster") +that the core belongs to (in fact, it may apply to an entire hierarchy of larger +units containing the core). Namely, if all of the cores in the larger unit +except for one have been put into idle states at the "core level" and the +remaining core asks the processor to enter an idle state, that may trigger it +to put the whole larger unit into an idle state which also will affect the +other cores in that unit. + +Finally, each core in a multi-core processor may be able to follow more than one +program in the same time frame (that is, each core may be able to fetch +instructions from multiple locations in memory and execute them in the same time +frame, but not necessarily entirely in parallel with each other). In that case +the cores present themselves to software as "bundles" each consisting of +multiple individual single-core "processors", referred to as *hardware threads* +(or hyper-threads specifically on Intel hardware), that each can follow one +sequence of instructions. Then, the hardware threads are CPUs from the CPU idle +time management perspective and if the processor is asked to enter an idle state +by one of them, the hardware thread (or CPU) that asked for it is stopped, but +nothing more happens, unless all of the other hardware threads within the same +core also have asked the processor to enter an idle state. In that situation, +the core may be put into an idle state individually or a larger unit containing +it may be put into an idle state as a whole (if the other cores within the +larger unit are in idle states already). + +Idle CPUs +- + +Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as +*idle* by the Linux kernel when there are no tasks to run on the
Re: [PATCH v1 7/8] PM / Hibernate: use pfn_to_online_page()
On Monday, November 19, 2018 11:16:15 AM CET David Hildenbrand wrote: > Let's use pfn_to_online_page() instead of pfn_to_page() when checking > for saveable pages to not save/restore offline memory sections. > > Cc: "Rafael J. Wysocki" > Cc: Pavel Machek > Cc: Len Brown > Cc: Andrew Morton > Cc: Matthew Wilcox > Cc: Michal Hocko > Cc: "Michael S. Tsirkin" > Suggested-by: Michal Hocko > Signed-off-by: David Hildenbrand Acked-by: Rafael J. Wysocki > --- > kernel/power/snapshot.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c > index 640b2034edd6..87e6dd57819f 100644 > --- a/kernel/power/snapshot.c > +++ b/kernel/power/snapshot.c > @@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone > *zone, unsigned long pfn) > if (!pfn_valid(pfn)) > return NULL; > > - page = pfn_to_page(pfn); > - if (page_zone(page) != zone) > + page = pfn_to_online_page(pfn); > + if (!page || page_zone(page) != zone) > return NULL; > > BUG_ON(!PageHighMem(page)); > @@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, > unsigned long pfn) > if (!pfn_valid(pfn)) > return NULL; > > - page = pfn_to_page(pfn); > - if (page_zone(page) != zone) > + page = pfn_to_online_page(pfn); > + if (!page || page_zone(page) != zone) > return NULL; > > BUG_ON(PageHighMem(page)); >
Re: [PATCH v1 8/8] PM / Hibernate: exclude all PageOffline() pages
On Monday, November 19, 2018 11:16:16 AM CET David Hildenbrand wrote: > The content of pages that are marked PG_offline is not of interest > (e.g. inflated by a balloon driver), let's skip these pages. > > Cc: "Rafael J. Wysocki" > Cc: Pavel Machek > Cc: Len Brown > Cc: Andrew Morton > Cc: Matthew Wilcox > Cc: Michal Hocko > Cc: "Michael S. Tsirkin" > Acked-by: Pavel Machek > Signed-off-by: David Hildenbrand Acked-by: Rafael J. Wysocki > --- > kernel/power/snapshot.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c > index 87e6dd57819f..8d7b4d458842 100644 > --- a/kernel/power/snapshot.c > +++ b/kernel/power/snapshot.c > @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone > *zone, unsigned long pfn) > BUG_ON(!PageHighMem(page)); > > if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) || > - PageReserved(page)) > + PageReserved(page) || PageOffline(page)) > return NULL; > > if (page_is_guard(page)) > @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, > unsigned long pfn) > if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) > return NULL; > > + if (PageOffline(page)) > + return NULL; > + > if (PageReserved(page) > && (!kernel_page_present(page) || pfn_is_nosave(pfn))) > return NULL; >
[GIT PULL] Power management updates for v4.20-rc3
Hi Linus, Please pull from the tag git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \ pm-4.20-rc3 with top-most commit 97dc6c03c1b0a9af7aa26da7b0b266a762451f5f Merge branch 'pm-cpuidle' on top of commit 651022382c7f8da46cb4872a545ee1da6d097d2a Linux 4.20-rc1 to receive power management updates for 4.20-rc3. These remove a stale DT entry left behind after recent removal of a cpufreq driver without users, fix up error handling in the imx6q cpufreq driver, fix two issues in the cpufreq documentation, and update the ARM cpufreq driver. Specifics: - Drop stale DT binding for the arm_big_little_dt driver removed recently (Sudeep Holla). - Fix up error handling in the imx6q cpufreq driver to make it report voltage scaling failures (Anson Huang). - Fix two issues in the cpufreq documentation (Viresh Kumar, Zhao Wei Liew). - Fix ARM cpuidle driver initialization regression from the 4.19 time frame and rework the driver registration part of it to simplify code (Ulf Hansson). Thanks! --- Anson Huang (1): cpufreq: imx6q: add return value check for voltage scale Sudeep Holla (1): dt-bindings: cpufreq: remove stale arm_big_little_dt entry Ulf Hansson (2): ARM: cpuidle: Don't register the driver when back-end init returns -ENXIO ARM: cpuidle: Convert to use cpuidle_register|unregister() Viresh Kumar (1): Documentation: cpu-freq: Frequencies aren't always sorted Zhao Wei Liew (1): Documentation: cpufreq: Correct a typo --- Documentation/admin-guide/pm/cpufreq.rst | 2 +- Documentation/cpu-freq/cpufreq-stats.txt | 8 ++- .../bindings/cpufreq/arm_big_little_dt.txt | 65 -- drivers/cpufreq/imx6q-cpufreq.c| 7 ++- drivers/cpuidle/cpuidle-arm.c | 40 +++-- 5 files changed, 19 insertions(+), 103 deletions(-)
Re: [PATCH v4 0/3] cpufreq: intel_pstate: Base frequency attribute
On Monday, October 15, 2018 7:37:18 PM CEST Srinivas Pandruvada wrote: > This series presents base frequency to cpufreq sysfs when intel_pstate > is in use in HWP mode. > > Changes: > v4: > - Documentation update only as suggested by Rafael > > v3: > - Update documentation > v2 > - Removed guaranteed attribute addition to acpi_cppc sysfs > - Using the cppc_acpi interface to get base frequency and present > > Srinivas Pandruvada (3): > ACPI / CPPC: Add support for guaranteed performance > cpufreq: intel_pstate: Add base_frequency attribute > Documentation: intel_pstate: Add base_frequency information > > Documentation/admin-guide/pm/intel_pstate.rst | 7 > drivers/acpi/cppc_acpi.c | 8 +++- > drivers/cpufreq/intel_pstate.c| 38 +++ > include/acpi/cppc_acpi.h | 1 + > 4 files changed, 52 insertions(+), 2 deletions(-) > > All three applied, thanks!
Re: [PATCH v3 3/3] Documentation: intel_pstate: Add base_frequency information
On Fri, Oct 12, 2018 at 6:44 PM Srinivas Pandruvada wrote: > > Updated documentation to explain base_frequency attribute. > > Signed-off-by: Srinivas Pandruvada > --- > Documentation/admin-guide/pm/intel_pstate.rst | 4 > 1 file changed, 4 insertions(+) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin-guide/pm/intel_pstate.rst > index 8f1d3de449b5..14a5505e073e 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -465,6 +465,10 @@ Next, the following policy attributes have special > meaning if > policy for the time interval between the last two invocations of the > driver's utilization update callback by the CPU scheduler for that > CPU. > > +``base_frequency`` > + When present, shows the base frequency of the CPU. Any frequency above > + this will be in the turbo frequency range. > + This isn't entirely correct, because base_frequency is not present in the passive mode (and it is not present for the other drivers for that matter). > The meaning of these attributes in the `passive mode `_ is the > same as for other scaling drivers. Instead, I would say: "One more policy attribute is present if the `HWP feature is enabled in the processor `_:" here and I would add the description of the attribute below. Thanks, Rafael
Re: [PATCH] Documentation: fix image_size default value
On Saturday, September 1, 2018 6:51:15 AM CEST Vladimir D. Seleznev wrote: > This commit updates the default value of /sys/power/image_size in > the documentation. > > Since ac5c24ec1e983313ef0015258fba6f630e54e7cn the `image_size' value is > set to about 2/5 of RAM, according to kernel/power/snapshot.c: > > image_size = ((totalram_pages * 2) / 5) * PAGE_SIZE; > > but this change was not reflected everywhere in the documentation. > > Signed-off-by: Vladimir D. Seleznev > --- > Documentation/ABI/testing/sysfs-power | 2 +- > Documentation/power/swsusp.txt| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-power > b/Documentation/ABI/testing/sysfs-power > index 2f813d644c69..18b7dc929234 100644 > --- a/Documentation/ABI/testing/sysfs-power > +++ b/Documentation/ABI/testing/sysfs-power > @@ -99,7 +99,7 @@ Description: > this file, the suspend image will be as small as possible. > > Reading from this file will display the current image size > - limit, which is set to 500 MB by default. > + limit, which is set to around 2/5 of available RAM by default. > > What:/sys/power/pm_trace > Date:August 2006 > diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt > index cc87adf44c0a..236d1fb13640 100644 > --- a/Documentation/power/swsusp.txt > +++ b/Documentation/power/swsusp.txt > @@ -56,7 +56,7 @@ If you want to limit the suspend image size to N bytes, do > > echo N > /sys/power/image_size > > -before suspend (it is limited to 500 MB by default). > +before suspend (it is limited to around 2/5 of available RAM by default). > > . The resume process checks for the presence of the resume device, > if found, it then checks the contents for the hibernation image signature. > Applied, thanks!
Re: [PATCH 4/5] acpi/processor: Fix the return value of acpi_processor_ids_walk()
On Wed, May 23, 2018 at 3:34 AM, Dou Liyang wrote: > At 05/22/2018 09:47 AM, Dou Liyang wrote: >> >> >> >> At 05/19/2018 11:06 PM, Thomas Gleixner wrote: >>> >>> On Tue, 20 Mar 2018, Dou Liyang wrote: >>> ACPI driver should make sure all the processor IDs in their ACPI Namespace are unique for CPU hotplug. the driver performs a depth-first walk of the namespace tree and calls the acpi_processor_ids_walk(). But, the acpi_processor_ids_walk() will return true if one processor is checked, that cause the walk break after walking pass the first processor. Repace the value with AE_OK which is the standard acpi_status value. Fixes 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration") Signed-off-by: Dou Liyang --- drivers/acpi/acpi_processor.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c index 449d86d39965..db5bdb59639c 100644 --- a/drivers/acpi/acpi_processor.c +++ b/drivers/acpi/acpi_processor.c @@ -663,11 +663,11 @@ static acpi_status __init (acpi_handle handle, } processor_validated_ids_update(uid); -return true; +return AE_OK; err: acpi_handle_info(handle, "Invalid processor object\n"); -return false; +return AE_OK; >>> >>> >>> I'm not sure whether this is the right return value here. Rafael? >>> > > +Cc Rafael's common used email address. > > I am sorry, I created the cc list using ./script/get_maintainers.pl ... > and didn't check it. No worries, I saw your messages, but thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] PM: docs: sleep-states: Fix a typo ("includig")
On Wednesday, April 25, 2018 12:07:03 PM CEST Jonathan Neuschäfer wrote: > Signed-off-by: Jonathan Neuschäfer > --- > Documentation/admin-guide/pm/sleep-states.rst | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/pm/sleep-states.rst > b/Documentation/admin-guide/pm/sleep-states.rst > index 1e5c0f00cb2f..dbf5acd49f35 100644 > --- a/Documentation/admin-guide/pm/sleep-states.rst > +++ b/Documentation/admin-guide/pm/sleep-states.rst > @@ -15,7 +15,7 @@ Sleep States That Can Be Supported > == > > Depending on its configuration and the capabilities of the platform it runs > on, > -the Linux kernel can support up to four system sleep states, includig > +the Linux kernel can support up to four system sleep states, including > hibernation and up to three variants of system suspend. The sleep states > that > can be supported by the kernel are listed below. > > Applied and pushed for 4.17-rc5, thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Documentation/admin-guide/pm/intel_pstate: fix Active Mode w/o HWP paragraph
On Tuesday, May 8, 2018 8:36:44 PM CEST Srinivas Pandruvada wrote: > On Tue, 2018-05-08 at 17:12 +0200, Juri Lelli wrote: > > P-state selection algorithm (powersave or performance) is selected by > > echoing the desired choice to scaling_governor sysfs attribute and > > not > > to scaling_cur_freq (as currently stated). > > > > Fix it. > Thanks for the fix. > > > > > Signed-off-by: Juri Lelli > > Cc: Jonathan Corbet > > Cc: "Rafael J. Wysocki" > > Cc: Srinivas Pandruvada > > Cc: linux-doc@vger.kernel.org > > Cc: linux...@vger.kernel.org > Reviewed-by: Srinivas Pandruvada > > > > > > --- > > Documentation/admin-guide/pm/intel_pstate.rst | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > > b/Documentation/admin-guide/pm/intel_pstate.rst > > index d2b6fda3d67b..ab2fe0eda1d7 100644 > > --- a/Documentation/admin-guide/pm/intel_pstate.rst > > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > > @@ -145,7 +145,7 @@ feature enabled.] > > > > In this mode ``intel_pstate`` registers utilization update callbacks > > with the > > CPU scheduler in order to run a P-state selection algorithm, either > > -``powersave`` or ``performance``, depending on the > > ``scaling_cur_freq`` policy > > +``powersave`` or ``performance``, depending on the > > ``scaling_governor`` policy > > setting in ``sysfs``. The current CPU frequency information to be > > made > > available from the ``scaling_cur_freq`` policy attribute in > > ``sysfs`` is > > periodically updated by those utilization update callbacks too. > Applied and pushed for 4.17-rc5, thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] acpi: Fix ACPI GPE mask kernel parameter
On Thursday, November 30, 2017 9:05:59 PM CET Prarit Bhargava wrote: > The acpi_mask_gpe= kernel parameter documentation states that the range > of mask is 128 GPEs (0x00 to 0x7F). The acpi_masked_gpes mask is a u64 so > only 64 GPEs (0x00 to 0x3F) can really be masked. > > Use a bitmap of size 0xFF instead of a u64 for the GPE mask so 256 > GPEs can be masked. > > Fixes: 9c4aa1eecb48 ("ACPI / sysfs: Provide quirk mechanism to prevent GPE > flooding") > Signed-off-by: Prarit Bharava > Cc: Lv Zheng > Cc: Jonathan Corbet > Cc: "Rafael J. Wysocki" > Cc: linux-doc@vger.kernel.org > --- > Documentation/admin-guide/kernel-parameters.txt | 1 - > drivers/acpi/sysfs.c| 26 > - > 2 files changed, 8 insertions(+), 19 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 6571fbfdb2a1..89ba74761180 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -114,7 +114,6 @@ > This facility can be used to prevent such uncontrolled > GPE floodings. > Format: > - Support masking of GPEs numbered from 0x00 to 0x7f. > > acpi_no_auto_serialize [HW,ACPI] > Disable auto-serialization of AML methods > diff --git a/drivers/acpi/sysfs.c b/drivers/acpi/sysfs.c > index 06a150bb35bf..4fc59c3bc673 100644 > --- a/drivers/acpi/sysfs.c > +++ b/drivers/acpi/sysfs.c > @@ -816,14 +816,8 @@ static ssize_t counter_set(struct kobject *kobj, > * interface: > * echo unmask > /sys/firmware/acpi/interrupts/gpe00 > */ > - > -/* > - * Currently, the GPE flooding prevention only supports to mask the GPEs > - * numbered from 00 to 7f. > - */ > -#define ACPI_MASKABLE_GPE_MAX0x80 > - > -static u64 __initdata acpi_masked_gpes; > +#define ACPI_MASKABLE_GPE_MAX0xFF > +static DECLARE_BITMAP(acpi_masked_gpes_map, ACPI_MASKABLE_GPE_MAX) > __initdata; > > static int __init acpi_gpe_set_masked_gpes(char *val) > { > @@ -831,7 +825,7 @@ static int __init acpi_gpe_set_masked_gpes(char *val) > > if (kstrtou8(val, 0, &gpe) || gpe > ACPI_MASKABLE_GPE_MAX) > return -EINVAL; > - acpi_masked_gpes |= ((u64)1< + set_bit(gpe, acpi_masked_gpes_map); > > return 1; > } > @@ -843,15 +837,11 @@ void __init acpi_gpe_apply_masked_gpes(void) > acpi_status status; > u8 gpe; > > - for (gpe = 0; > - gpe < min_t(u8, ACPI_MASKABLE_GPE_MAX, acpi_current_gpe_count); > - gpe++) { > - if (acpi_masked_gpes & ((u64)1< - status = acpi_get_gpe_device(gpe, &handle); > - if (ACPI_SUCCESS(status)) { > - pr_info("Masking GPE 0x%x.\n", gpe); > - (void)acpi_mask_gpe(handle, gpe, TRUE); > - } > + for_each_set_bit(gpe, acpi_masked_gpes_map, ACPI_MASKABLE_GPE_MAX) { > + status = acpi_get_gpe_device(gpe, &handle); > + if (ACPI_SUCCESS(status)) { > + pr_info("Masking GPE 0x%x.\n", gpe); > + (void)acpi_mask_gpe(handle, gpe, TRUE); > } > } > } > Applied, thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] acpi, spcr: Make SPCR avialable to other architectures
On Monday, December 11, 2017 4:50:58 PM CET Prarit Bhargava wrote: > Other architectures can use SPCR to setup an early console or console > but the current code is ARM64 specific. > > Change the name of parse_spcr() to acpi_parse_spcr(). Add a weak > function acpi_arch_setup_console() that can be used for arch-specific > setup. Move flags into ACPI code. Update the Documention on the use of > the SPCR. > > [v2]: Don't return an error in the baud_rate check of acpi_parse_spcr(). > Keep ACPI_SPCR_TABLE selected for ARM64. Fix 8-bit port access width > mmio value. Move baud rate check earlier. > > Signed-off-by: Prarit Bhargava This mostly affects ARM64, so ACKs from that side are requisite for it. > Cc: linux-doc@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux...@vger.kernel.org > Cc: linux-a...@vger.kernel.org > Cc: linux-ser...@vger.kernel.org > Cc: Bhupesh Sharma > Cc: Lv Zheng > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: "H. Peter Anvin" > Cc: x...@kernel.org > Cc: Jonathan Corbet > Cc: Catalin Marinas > Cc: Will Deacon > Cc: "Rafael J. Wysocki" > Cc: Timur Tabi > --- > Documentation/admin-guide/kernel-parameters.txt | 6 +- > arch/arm64/kernel/acpi.c| 128 - > drivers/acpi/Kconfig| 7 +- > drivers/acpi/spcr.c | 175 > ++-- > drivers/tty/serial/earlycon.c | 15 +- > include/linux/acpi.h| 11 +- > include/linux/serial_core.h | 2 - > 7 files changed, 184 insertions(+), 160 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 6571fbfdb2a1..0d173289c67e 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -914,9 +914,9 @@ > > earlycon= [KNL] Output early console device and options. > > - When used with no options, the early console is > - determined by the stdout-path property in device > - tree's chosen node. > + [ARM64] The early console is determined by the > + stdout-path property in device tree's chosen node, > + or determined by the ACPI SPCR table. > > cdns,[,options] > Start an early, polled-mode console on a Cadence > diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c > index b3162715ed78..b3e33bbdf3b7 100644 > --- a/arch/arm64/kernel/acpi.c > +++ b/arch/arm64/kernel/acpi.c > @@ -25,7 +25,6 @@ > #include > #include > #include > -#include > > #include > #include > @@ -177,6 +176,128 @@ static int __init acpi_fadt_sanity_check(void) > return ret; > } > > +/* > + * Erratum 44 for QDF2432v1 and QDF2400v1 SoCs describes the BUSY bit as > + * occasionally getting stuck as 1. To avoid the potential for a hang, check > + * TXFE == 0 instead of BUSY == 1. This may not be suitable for all UART > + * implementations, so only do so if an affected platform is detected in > + * acpi_parse_spcr(). > + */ > +bool qdf2400_e44_present; > +EXPORT_SYMBOL(qdf2400_e44_present); > + > +/* > + * Some Qualcomm Datacenter Technologies SoCs have a defective UART BUSY bit. > + * Detect them by examining the OEM fields in the SPCR header, similar to PCI > + * quirk detection in pci_mcfg.c. > + */ > +static bool qdf2400_erratum_44_present(struct acpi_table_header *h) > +{ > + if (memcmp(h->oem_id, "QCOM ", ACPI_OEM_ID_SIZE)) > + return false; > + > + if (!memcmp(h->oem_table_id, "QDF2432 ", ACPI_OEM_TABLE_ID_SIZE)) > + return true; > + > + if (!memcmp(h->oem_table_id, "QDF2400 ", ACPI_OEM_TABLE_ID_SIZE) && > + h->oem_revision == 1) > + return true; > + > + return false; > +} > + > +/* > + * APM X-Gene v1 and v2 UART hardware is an 16550 like device but has its > + * register aligned to 32-bit. In addition, the BIOS also encoded the > + * access width to be 8 bits. This function detects this errata condition. > + */ > +static bool xgene_8250_erratum_present(struct acpi_table_spcr *tb) > +{ > + bool xgene_8250 = false; > + > + if (tb->interface_type != ACPI_DBG2_16550_COMPATIBLE) > + return false; > + > + if (memcmp(tb->header.oem_id, "APMC0D", ACPI_OEM_ID
Re: [PATCH v2] thinkad_acpi: Support the battery wear control
On Sunday, December 3, 2017 11:56:40 PM CET Ognjen Galic wrote: > Add support for the ACPI batteries on newer thinkpad models > (>Sandy Bridge) that support the setting of start and stop > thresholds. > > The actual interface to the driver is a extension for the > existing ACPI battery driver. This was done so that users > can write transparently to the interface of the ACPI battery > driver and dont have to use some private interface > (for ex. /sys/devices/platform/thinkpad_acpi/). > > Two new interfaces are created: > > /sys/class/power_supply/BAT{0,1}/charge_start_threshold > /sys/class/power_supply/BAT{0,1}/charge_stop_threshold > > The ACPI battery driver won't expose the interface unless > DMI says that the machine is a Lenovo ThinkPad. This was done > so that distributions that distribute thinkpad_acpi don't > expose the API on non-ThinkPad laptops. > > v2: Forgot to signoff, fixed that > > Signed-off-by: Ognjen Galic > --- > Documentation/laptops/thinkpad-acpi.txt | 16 ++ > drivers/acpi/battery.c | 227 +++ > drivers/platform/x86/Kconfig| 11 + > drivers/platform/x86/thinkpad_acpi.c| 376 > > include/linux/thinkpad_acpi.h | 16 ++ > 5 files changed, 646 insertions(+) > > diff --git a/Documentation/laptops/thinkpad-acpi.txt > b/Documentation/laptops/thinkpad-acpi.txt > index 00b6dfe..9aca62d 100644 > --- a/Documentation/laptops/thinkpad-acpi.txt > +++ b/Documentation/laptops/thinkpad-acpi.txt > @@ -46,6 +46,7 @@ detailed description): > - Fan control and monitoring: fan speed, fan enable/disable > - WAN enable and disable > - UWB enable and disable > + - Battery Wear Control (BWC) > > A compatibility table by model and feature is maintained on the web > site, http://ibm-acpi.sf.net/. I appreciate any success or failure > @@ -1376,6 +1377,21 @@ For more details about which buttons will appear > depending on the mode, please > review the laptop's user guide: > http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf > > +Battey Wear Control (BWC) > +- > + > +The driver supports control for the embedded controller ACPI battery > configuration. > +This means that you can set start and stop charge thresholds for batteries in > +ThinkPads that have a processor newer than Sandy Bridge. > + > +The actual sysfs interface is an extension of the standard ACPI battery > interface. > +The interface is usually in: > + > +Start thresholds:/sys/class/power_supply/BATN/charge_start_threshold > +Stop threshold: > /sys/class/power_supply/BATN/charge_stop_threshold > + > +These attributes support reading and writing. > + > Multiple Commands, Module Parameters > > > diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c > index 13e7b56..4852a76 100644 > --- a/drivers/acpi/battery.c > +++ b/drivers/acpi/battery.c > @@ -41,6 +41,7 @@ > > #include > #include > +#include > > #include "battery.h" > > @@ -70,6 +71,7 @@ static async_cookie_t async_cookie; > static bool battery_driver_registered; > static int battery_bix_broken_package; > static int battery_notification_delay_ms; > +static int has_thinkpad_extension = 0; > static unsigned int cache_time = 1000; > module_param(cache_time, uint, 0644); > MODULE_PARM_DESC(cache_time, "cache time in milliseconds"); > @@ -626,6 +628,189 @@ static const struct device_attribute alarm_attr = { > .store = acpi_battery_alarm_store, > }; > > +/* -- > +ThinkPad Battery Wear Control ACPI Extension > + > -- */ > + > +#ifdef CONFIG_THINKPAD_ACPI_BWC Not really. This is generic code, so no thinkpad_acpi-specific stuff in this file, please, even under #ifdefs. > + > +/* > + * Because the thinkpad_acpi module usually is loaded right after > + * the disk is mounted, we will lazy-load it on demand when any of the > + * sysfs methods is read or written if it is not loaded. > + */ > +static int thinkpad_acpi_lazyload(void) > +{ > +void* func; > + > +func = symbol_get(tpacpi_battery_get_functionality); > + > +if (func) { > + // thinkpad_acpi is loaded Please clean up these comments (they should be in C format and formatted as described in the coding style doc). > + return 0; > +} Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 5/6] PM / core: Direct handling of DPM_FLAG_LEAVE_SUSPENDED
On Wednesday, November 22, 2017 2:10:51 AM CET Rafael J. Wysocki wrote: > On Monday, November 20, 2017 2:42:26 PM CET Ulf Hansson wrote: > > On 18 November 2017 at 15:41, Rafael J. Wysocki wrote: > > > From: Rafael J. Wysocki > > > > > > Make the PM core handle DPM_FLAG_LEAVE_SUSPENDED directly for > > > devices whose "noirq", "late" and "early" driver callbacks are > > > invoked directly by it. > > > > This indicates that your target for this particular change isn't > > ACPI/PCI, but instead this aims to be a more generic solution to be > > able to optimize the resume path for devices. > > > > Assuming, this is the case, I don't think this is good enough as I > > pointed out [1] earlier. Simply because it isn't as flexible as is > > required - to really be able cover generic cases. > > I'll go back to that message, but nothing so far has been flexible enough to > cover everything you can imagine. > > The case this and the next patch cover really is to allow drivers that can be > used with or without a PM domain to avoid doing any "are we suspended?" type > of checks in their callbacks. Actually, the [6/6] is more important from that > standpoint, but this one also may play the role because of the dependencies > between devices involved in the handling of LEAVE_SUSPENDED (eg. say a PCI > parent has a child platform or I2C or similar devices without a PM domain > and what happens to the child affects the parent). > > > > > > > Namely, make it skip all of the system-wide resume callbacks for > > > such devices with DPM_FLAG_LEAVE_SUSPENDED set if they are in > > > runtime suspend during the "noirq" phase of system-wide suspend > > > (or analogous) transitions or the system transition under way is > > > a proper suspend (rather than anything related to hibernation) and > > > the device's wakeup settings are compatible with runtime PM (that > > > is, the device cannot generate wakeup signals at all or it is > > > allowed to wake up the system from sleep). > > > > As I pointed out by submitting another patch [2], device_may_wakeup() > > doesn't really tell whether the wakeup is configured as "in-band" or > > "out-of-band". That knowledge is known by the driver and the subsystem > > layer - and for that reason I don't think the PM core shall base > > generic decisions like this on it. > > The "or it is allowed to wake up the system from sleep" case may be overly > optimistic, but the remaining two (runtime-suspended devices and devices > that can't generate wakeup signals at all) are quite straightforward to me. BTW, I'm not sure if the device_may_wakeup() check is really insufficient in this particular case. Say the device was not in runtime suspend before, but device_may_wakeup() returns "true" for it and the system is resuming from suspend. The device's state should be suitable to wake up the system in any case, so the "in-band" vs "out-of-band" difference has had to be taken care of already during system suspend. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 5/6] PM / core: Direct handling of DPM_FLAG_LEAVE_SUSPENDED
On Monday, November 20, 2017 2:42:26 PM CET Ulf Hansson wrote: > On 18 November 2017 at 15:41, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Make the PM core handle DPM_FLAG_LEAVE_SUSPENDED directly for > > devices whose "noirq", "late" and "early" driver callbacks are > > invoked directly by it. > > This indicates that your target for this particular change isn't > ACPI/PCI, but instead this aims to be a more generic solution to be > able to optimize the resume path for devices. > > Assuming, this is the case, I don't think this is good enough as I > pointed out [1] earlier. Simply because it isn't as flexible as is > required - to really be able cover generic cases. I'll go back to that message, but nothing so far has been flexible enough to cover everything you can imagine. The case this and the next patch cover really is to allow drivers that can be used with or without a PM domain to avoid doing any "are we suspended?" type of checks in their callbacks. Actually, the [6/6] is more important from that standpoint, but this one also may play the role because of the dependencies between devices involved in the handling of LEAVE_SUSPENDED (eg. say a PCI parent has a child platform or I2C or similar devices without a PM domain and what happens to the child affects the parent). > > > > Namely, make it skip all of the system-wide resume callbacks for > > such devices with DPM_FLAG_LEAVE_SUSPENDED set if they are in > > runtime suspend during the "noirq" phase of system-wide suspend > > (or analogous) transitions or the system transition under way is > > a proper suspend (rather than anything related to hibernation) and > > the device's wakeup settings are compatible with runtime PM (that > > is, the device cannot generate wakeup signals at all or it is > > allowed to wake up the system from sleep). > > As I pointed out by submitting another patch [2], device_may_wakeup() > doesn't really tell whether the wakeup is configured as "in-band" or > "out-of-band". That knowledge is known by the driver and the subsystem > layer - and for that reason I don't think the PM core shall base > generic decisions like this on it. The "or it is allowed to wake up the system from sleep" case may be overly optimistic, but the remaining two (runtime-suspended devices and devices that can't generate wakeup signals at all) are quite straightforward to me. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Mon, Nov 20, 2017 at 1:25 PM, Ulf Hansson wrote: > On 18 November 2017 at 15:31, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki >> >> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to >> instruct the PM core and middle-layer (bus type, PM domain, etc.) >> code that it is desirable to leave the device in runtime suspend >> after system-wide transitions to the working state (for example, >> the device may be slow to resume and it may be better to avoid >> resuming it right away). >> >> Generally, the middle-layer code involved in the handling of the >> device is expected to indicate to the PM core whether or not the >> device may be left in suspend with the help of the device's >> power.may_skip_resume status bit. That has to happen in the "noirq" >> phase of the preceding system suspend (or analogous) transition. >> The middle layer is then responsible for handling the device as >> appropriate in its "noirq" resume callback which is executed >> regardless of whether or not the device may be left suspended, but >> the other resume callbacks (except for ->complete) will be skipped >> automatically by the core if the device really can be left in >> suspend. >> >> The additional power.must_resume status bit introduced for the >> implementation of this mechanisn is used internally by the PM core >> to track the requirement to resume the device (which may depend on >> its children etc). >> >> Signed-off-by: Rafael J. Wysocki >> Acked-by: Greg Kroah-Hartman > > Reviewed-by: Ulf Hansson Thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 2/6] PCI / PM: Support for LEAVE_SUSPENDED driver flag
From: Rafael J. Wysocki Add support for DPM_FLAG_LEAVE_SUSPENDED to the PCI bus type by making it (a) set the power.may_skip_resume status bit for devices that, from its perspective, may be left in suspend after system wakeup from sleep and (b) return early from pci_pm_resume_noirq() for devices whose remaining resume callbacks during the transition under way are going to be skipped by the PM core. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman Acked-by: Bjorn Helgaas --- v3 -> v4: No changes. v2 -> v3: Add the Acked-by from Bjorn, no changes in the patch. --- Documentation/power/pci.txt | 11 +++ drivers/pci/pci-driver.c| 19 +-- 2 files changed, 28 insertions(+), 2 deletions(-) Index: linux-pm/drivers/pci/pci-driver.c === --- linux-pm.orig/drivers/pci/pci-driver.c +++ linux-pm/drivers/pci/pci-driver.c @@ -699,7 +699,7 @@ static void pci_pm_complete(struct devic pm_generic_complete(dev); /* Resume device if platform firmware has put it in reset-power-on */ - if (dev->power.direct_complete && pm_resume_via_firmware()) { + if (pm_runtime_suspended(dev) && pm_resume_via_firmware()) { pci_power_t pre_sleep_state = pci_dev->current_state; pci_update_current_state(pci_dev, pci_dev->current_state); @@ -783,8 +783,10 @@ static int pci_pm_suspend_noirq(struct d struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.may_skip_resume = true; return 0; + } if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_suspend_late(dev, PMSG_SUSPEND); @@ -838,6 +840,16 @@ static int pci_pm_suspend_noirq(struct d Fixup: pci_fixup_device(pci_fixup_suspend_late, pci_dev); + /* +* If the target system sleep state is suspend-to-idle, it is sufficient +* to check whether or not the device's wakeup settings are good for +* runtime PM. Otherwise, the pm_resume_via_firmware() check will cause +* pci_pm_complete() to take care of fixing up the device's state +* anyway, if need be. +*/ + dev->power.may_skip_resume = device_may_wakeup(dev) || + !device_can_wakeup(dev); + return 0; } @@ -847,6 +859,9 @@ static int pci_pm_resume_noirq(struct de struct device_driver *drv = dev->driver; int error = 0; + if (dev_pm_may_skip_resume(dev)) + return 0; + /* * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend * during system suspend, so update their runtime PM status to "active" Index: linux-pm/Documentation/power/pci.txt === --- linux-pm.orig/Documentation/power/pci.txt +++ linux-pm/Documentation/power/pci.txt @@ -994,6 +994,17 @@ into D0 going forward), but if it is in the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return. +Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +device to be left in suspend after system-wide transitions to the working state. +This flag is checked by the PM core, but the PCI bus type informs the PM core +which devices may be left in suspend from its perspective (that happens during +the "noirq" phase of system-wide suspend and analogous transitions) and next it +uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +pci_pm_resume_noirq() early, as the PM core will skip the remaining resume +callbacks for the device during the transition under way and will set its +runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for +it. + 3.2. Device Runtime Power Management In addition to providing device power management callbacks PCI device drivers -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 4/6] PM / core: Add helpers for subsystem callback selection
From: Rafael J. Wysocki Add helper routines to find and return a suitable subsystem callback during the "noirq" phases of system suspend/resume (or analogous) transitions as well as during the "late" phase of system suspend and the "early" phase of system resume (or analogous) transitions. The helpers will be called from additional sites going forward. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson --- v3 -> v4: Move forward declarations of two functions to subsequent patches, add Reviewed-by from Ulf. v2 -> v3: No changes. --- drivers/base/power/main.c | 188 +++--- 1 file changed, 128 insertions(+), 60 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -539,6 +539,35 @@ bool dev_pm_may_skip_resume(struct devic return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; } +static pm_callback_t dpm_subsys_resume_noirq_cb(struct device *dev, + pm_message_t state, + const char **info_p) +{ + pm_callback_t callback; + const char *info; + + if (dev->pm_domain) { + info = "noirq power domain "; + callback = pm_noirq_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "noirq type "; + callback = pm_noirq_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "noirq class "; + callback = pm_noirq_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "noirq bus "; + callback = pm_noirq_op(dev->bus->pm, state); + } else { + return NULL; + } + + if (info_p) + *info_p = info; + + return callback; +} + /** * device_resume_noirq - Execute a "noirq resume" callback for given device. * @dev: Device to handle. @@ -550,8 +579,8 @@ bool dev_pm_may_skip_resume(struct devic */ static int device_resume_noirq(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback = NULL; - const char *info = NULL; + pm_callback_t callback; + const char *info; int error = 0; TRACE_DEVICE(dev); @@ -565,19 +594,7 @@ static int device_resume_noirq(struct de dpm_wait_for_superior(dev, async); - if (dev->pm_domain) { - info = "noirq power domain "; - callback = pm_noirq_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "noirq type "; - callback = pm_noirq_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "noirq class "; - callback = pm_noirq_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "noirq bus "; - callback = pm_noirq_op(dev->bus->pm, state); - } + callback = dpm_subsys_resume_noirq_cb(dev, state, &info); if (!callback && dev->driver && dev->driver->pm) { info = "noirq driver "; @@ -693,6 +710,35 @@ void dpm_resume_noirq(pm_message_t state dpm_noirq_end(); } +static pm_callback_t dpm_subsys_resume_early_cb(struct device *dev, + pm_message_t state, + const char **info_p) +{ + pm_callback_t callback; + const char *info; + + if (dev->pm_domain) { + info = "early power domain "; + callback = pm_late_early_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "early type "; + callback = pm_late_early_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "early class "; + callback = pm_late_early_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "early bus "; + callback = pm_late_early_op(dev->bus->pm, state); + } else { + return NULL; + } + + if (info_p) + *info_p = info; + + return callback; +} + /** * device_resume_early - Execute an "
Re: [PATCH v4 0/6] PM / sleep: Driver flags for system suspend/resume (part 2)
Hi All, The following still applies: > On Wednesday, November 8, 2017 1:41:35 AM CET Rafael J. Wysocki wrote: > > > > This is a follow-up for the first part of the PM driver flags series > > sent previously some time ago with an intro as follows: > > > > On Saturday, October 28, 2017 12:11:55 AM CET Rafael J. Wysocki wrote: > > > The following part of the original cover letter still applies: > > > > > > On Monday, October 16, 2017 3:12:35 AM CEST Rafael J. Wysocki wrote: > > > > > > > > This work was triggered by attempts to fix and optimize PM in the > > > > i2c-designware-platdev driver that ended up with adding a couple of > > > > flags to the driver's internal data structures for the tracking of > > > > device state (https://marc.info/?l=linux-acpi&m=150629646805636&w=2). > > > > That approach is sort of suboptimal, though, because other drivers will > > > > probably want to do similar things and if all of them need to use > > > > internal > > > > flags for that, quite a bit of code duplication may ensue at least. > > > > > > > > That can be avoided in a couple of ways and one of them is to provide a > > > > means > > > > for drivers to tell the core what to do and to make the core take care > > > > of it > > > > if told to do so. Hence, the idea to use driver flags for system-wide > > > > PM > > > > that was briefly discussed during the LPC in LA last month. > > > > > > [...] > > > > > > > What can work (and this is the only strategy that can work AFAICS) is to > > > > point different callback pointers *in* *a* *driver* to the same routine > > > > if the driver wants to reuse that code. That actually will work for PCI > > > > and USB drivers today, at least most of the time, but unfortunately > > > > there > > > > are problems with it for, say, platform devices. > > > > > > > > The first problem is the requirement to track the status of the device > > > > (suspended vs not suspended) in the callbacks, because the system-wide > > > > PM > > > > code in the PM core doesn't do that. The runtime PM framework does it, > > > > so > > > > this means adding some extra code which isn't necessary for runtime PM > > > > to > > > > the callback routines and that is not particularly nice. > > > > > > > > The second problem is that, if the driver wants to do anything in its > > > > ->suspend callback, it generally has to prevent runtime suspend of the > > > > device from taking place in parallel with that, which is quite > > > > cumbersome. > > > > Usually, that is taken care of by resuming the device from runtime > > > > suspend > > > > upfront, but generally doing that is wasteful (there may be no real > > > > need to > > > > resume the device except for the fact that the code is designed this > > > > way). > > > > > > > > On top of the above, there are optimizations to be made, like leaving > > > > certain > > > > devices in suspend after system resume to avoid wasting time on waiting > > > > for > > > > them to resume before user space can run again and similar. > > > > > > > > This patch series focuses on addressing those problems so as to make it > > > > easier to reuse callback routines by pointing different callback > > > > pointers > > > > to them in device drivers. The flags introduced here are to instruct > > > > the > > > > PM core and middle layers (whatever they are) on how the driver wants > > > > the > > > > device to be handled and then the driver has to provide callbacks to > > > > match > > > > these instructions and the rest should be taken care of by the code > > > > above it. > > > > > > > > The flags are introduced one by one to avoid making too many changes in > > > > one go and to allow things to be explained better (hopefully). They > > > > mostly > > > > are mutually independent with some clearly documented exceptions. > > > > > > but I had to rework the core patches to address the problem pointed with > > > the > > > generic power domains (genpd) framework pointed out by Ulf. > > > > > > Namely, genpd ex
[PATCH v4 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
From: Rafael J. Wysocki Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to instruct the PM core and middle-layer (bus type, PM domain, etc.) code that it is desirable to leave the device in runtime suspend after system-wide transitions to the working state (for example, the device may be slow to resume and it may be better to avoid resuming it right away). Generally, the middle-layer code involved in the handling of the device is expected to indicate to the PM core whether or not the device may be left in suspend with the help of the device's power.may_skip_resume status bit. That has to happen in the "noirq" phase of the preceding system suspend (or analogous) transition. The middle layer is then responsible for handling the device as appropriate in its "noirq" resume callback which is executed regardless of whether or not the device may be left suspended, but the other resume callbacks (except for ->complete) will be skipped automatically by the core if the device really can be left in suspend. The additional power.must_resume status bit introduced for the implementation of this mechanisn is used internally by the PM core to track the requirement to resume the device (which may depend on its children etc). Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- v3 -> v4: Fix the dev->power.usage_count check added in v3, clarify documentation, add comment to explain why the runtime PM status is changed in device_resume_noirq() and make the description of the NEVER_SKIP flag more precise. v2 -> v3: Take dev->power.usage_count when updating power.must_resume in __device_suspend_noirq(). --- Documentation/driver-api/pm/devices.rst | 27 ++- drivers/base/power/main.c | 73 +--- include/linux/pm.h | 16 +-- 3 files changed, 105 insertions(+), 11 deletions(-) Index: linux-pm/include/linux/pm.h === --- linux-pm.orig/include/linux/pm.h +++ linux-pm/include/linux/pm.h @@ -556,9 +556,10 @@ struct pm_subsys_data { * These flags can be set by device drivers at the probe time. They need not be * cleared by the drivers as the driver core will take care of that. * - * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. + * NEVER_SKIP: Do not skip all system suspend/resume callbacks for the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -572,10 +573,14 @@ struct pm_subsys_data { * necessary from the driver's perspective. It also may cause them to skip * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by * the driver if they decide to leave the device in runtime suspend. + * + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the + * driver prefers the device to be left in suspend after system resume. */ -#define DPM_FLAG_NEVER_SKIPBIT(0) -#define DPM_FLAG_SMART_PREPARE BIT(1) -#define DPM_FLAG_SMART_SUSPEND BIT(2) +#define DPM_FLAG_NEVER_SKIPBIT(0) +#define DPM_FLAG_SMART_PREPARE BIT(1) +#define DPM_FLAG_SMART_SUSPEND BIT(2) +#define DPM_FLAG_LEAVE_SUSPENDED BIT(3) struct dev_pm_info { pm_message_tpower_state; @@ -597,6 +602,8 @@ struct dev_pm_info { boolwakeup_path:1; boolsyscore:1; boolno_pm_callbacks:1; /* Owned by the PM core */ + unsigned intmust_resume:1; /* Owned by the PM core */ + unsigned intmay_skip_resume:1; /* Set by subsystems */ #else unsigned intshould_wakeup:1; #endif @@ -765,6 +772,7 @@ extern int pm_generic_poweroff_late(stru extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); +extern bool dev_pm_may_skip_resume(struct device *dev); extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); #else /* !CONFIG_PM_SLEEP */ Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -528,6 +528,18 @@ static void dpm_watchdog_clear(struct dp /*- Resume routines -*/ /** + * dev_pm_may_skip_resume - System-wide device resume optimization check. + * @dev: Target device. + * + * Checks whether or not the device may be left in susp
[PATCH v4 3/6] ACPI / PM: Support for LEAVE_SUSPENDED driver flag in ACPI PM domain
From: Rafael J. Wysocki Add support for DPM_FLAG_LEAVE_SUSPENDED to the ACPI PM domain by making it (a) set the power.may_skip_resume status bit for devices that, from its perspective, may be left in suspend after system wakeup from sleep and (b) return early from acpi_subsys_resume_noirq() for devices whose remaining resume callbacks during the transition under way are going to be skipped by the PM core. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- v2 -> v4: No changes. --- drivers/acpi/device_pm.c | 27 --- 1 file changed, 24 insertions(+), 3 deletions(-) Index: linux-pm/drivers/acpi/device_pm.c === --- linux-pm.orig/drivers/acpi/device_pm.c +++ linux-pm/drivers/acpi/device_pm.c @@ -990,7 +990,7 @@ void acpi_subsys_complete(struct device * the sleep state it is going out of and it has never been resumed till * now, resume it in case the firmware powered it up. */ - if (dev->power.direct_complete && pm_resume_via_firmware()) + if (pm_runtime_suspended(dev) && pm_resume_via_firmware()) pm_request_resume(dev); } EXPORT_SYMBOL_GPL(acpi_subsys_complete); @@ -1039,10 +1039,28 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_la */ int acpi_subsys_suspend_noirq(struct device *dev) { - if (dev_pm_smart_suspend_and_suspended(dev)) + int ret; + + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.may_skip_resume = true; return 0; + } - return pm_generic_suspend_noirq(dev); + ret = pm_generic_suspend_noirq(dev); + if (ret) + return ret; + + /* +* If the target system sleep state is suspend-to-idle, it is sufficient +* to check whether or not the device's wakeup settings are good for +* runtime PM. Otherwise, the pm_resume_via_firmware() check will cause +* acpi_subsys_complete() to take care of fixing up the device's state +* anyway, if need be. +*/ + dev->power.may_skip_resume = device_may_wakeup(dev) || + !device_can_wakeup(dev); + + return 0; } EXPORT_SYMBOL_GPL(acpi_subsys_suspend_noirq); @@ -1052,6 +1070,9 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_no */ int acpi_subsys_resume_noirq(struct device *dev) { + if (dev_pm_may_skip_resume(dev)) + return 0; + /* * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend * during system suspend, so update their runtime PM status to "active" -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 5/6] PM / core: Direct handling of DPM_FLAG_LEAVE_SUSPENDED
From: Rafael J. Wysocki Make the PM core handle DPM_FLAG_LEAVE_SUSPENDED directly for devices whose "noirq", "late" and "early" driver callbacks are invoked directly by it. Namely, make it skip all of the system-wide resume callbacks for such devices with DPM_FLAG_LEAVE_SUSPENDED set if they are in runtime suspend during the "noirq" phase of system-wide suspend (or analogous) transitions or the system transition under way is a proper suspend (rather than anything related to hibernation) and the device's wakeup settings are compatible with runtime PM (that is, the device cannot generate wakeup signals at all or it is allowed to wake up the system from sleep). Signed-off-by: Rafael J. Wysocki --- v3 -> v4: Rebase on the v4 of patch [1/6], add a forward declaration of dpm_subsys_suspend_late_cb() dropped from the [4/6]. v2 -> v3: Rebase on the v3 of patch [1/6]. --- Documentation/driver-api/pm/devices.rst |9 + drivers/base/power/main.c | 51 2 files changed, 55 insertions(+), 5 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -525,6 +525,10 @@ static void dpm_watchdog_clear(struct dp #define dpm_watchdog_clear(x) #endif +static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, + pm_message_t state, + const char **info_p); + /*- Resume routines -*/ /** @@ -581,6 +585,7 @@ static int device_resume_noirq(struct de { pm_callback_t callback; const char *info; + bool skip_resume; int error = 0; TRACE_DEVICE(dev); @@ -594,17 +599,27 @@ static int device_resume_noirq(struct de dpm_wait_for_superior(dev, async); + skip_resume = dev_pm_may_skip_resume(dev); + callback = dpm_subsys_resume_noirq_cb(dev, state, &info); + if (callback) + goto Run; + + if (skip_resume) + goto Skip; if (!callback && dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); + +Skip: dev->power.is_noirq_suspended = false; - if (dev_pm_may_skip_resume(dev)) { + if (skip_resume) { /* * The device is going to be left in suspend, but it might not * have been in runtime suspend before the system suspended, so @@ -617,7 +632,7 @@ static int device_resume_noirq(struct de dev->power.is_suspended = false; } - Out: +Out: complete_all(&dev->power.completion); TRACE_RESUME(error); return error; @@ -1193,6 +1208,7 @@ static int __device_suspend_noirq(struct { pm_callback_t callback; const char *info; + bool direct_cb = false; int error = 0; TRACE_DEVICE(dev); @@ -1212,12 +1228,17 @@ static int __device_suspend_noirq(struct goto Complete; callback = dpm_subsys_suspend_noirq_cb(dev, state, &info); + if (callback) + goto Run; - if (!callback && dev->driver && dev->driver->pm) { + direct_cb = true; + + if (dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); if (error) { async_error = error; @@ -1227,13 +1248,33 @@ static int __device_suspend_noirq(struct dev->power.is_noirq_suspended = true; if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { + pm_message_t resume_msg = resume_event(state); + bool skip_resume; + + if (direct_cb && + !dpm_subsys_suspend_late_cb(dev, state, NULL) && + !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && + !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) { + /* +* If all of the device driver's "noirq", "late" and +* "early" callbacks are invoked directly by the core, +* the decision to allow the device to stay in suspend +* can be based on its current runtime PM status and its +* wakeup settings. +*/ + skip_
[PATCH v4 6/6] PM / core: DPM_FLAG_SMART_SUSPEND optimization
From: Rafael J. Wysocki Make the PM core avoid invoking the "late" and "noirq" system-wide suspend (or analogous) callbacks for devices that are in runtime suspend during the corresponding phases of system-wide suspend (or analogous) transitions. The underlying observation is that runtime PM is disabled for devices during those system-wide suspend phases, so their runtime PM status should not change going forward and if it has not changed so far, their state should be compatible with the target system sleep state. This change really makes it possible for, say, platform device drivers to re-use runtime PM suspend and resume callbacks by pointing ->suspend_late and ->resume_early, respectively (and possibly the analogous hibernation-related callback pointers too), to them without adding any extra "is the device already suspended?" type of checks to the callback routines, as long as they will be invoked directly by the core. Signed-off-by: Rafael J. Wysocki --- v3 -> v4: Add a forward declaration of dpm_subsys_suspend_noirq_cb() dropped from patch [4/6]. v2 -> v3: No changes. --- Documentation/driver-api/pm/devices.rst | 18 drivers/base/power/main.c | 66 +--- 2 files changed, 70 insertions(+), 14 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -525,6 +525,10 @@ static void dpm_watchdog_clear(struct dp #define dpm_watchdog_clear(x) #endif +static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, +pm_message_t state, +const char **info_p); + static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, pm_message_t state, const char **info_p); @@ -532,6 +536,24 @@ static pm_callback_t dpm_subsys_suspend_ /*- Resume routines -*/ /** + * suspend_event - Return a "suspend" message for given "resume" one. + * @resume_msg: PM message representing a system-wide resume transition. + */ +static pm_message_t suspend_event(pm_message_t resume_msg) +{ + switch (resume_msg.event) { + case PM_EVENT_RESUME: + return PMSG_SUSPEND; + case PM_EVENT_THAW: + case PM_EVENT_RESTORE: + return PMSG_FREEZE; + case PM_EVENT_RECOVER: + return PMSG_HIBERNATE; + } + return PMSG_ON; +} + +/** * dev_pm_may_skip_resume - System-wide device resume optimization check. * @dev: Target device. * @@ -605,6 +627,25 @@ static int device_resume_noirq(struct de if (callback) goto Run; + if (dev_pm_smart_suspend_and_suspended(dev)) { + pm_message_t suspend_msg = suspend_event(state); + + /* +* If "freeze" callbacks have been skipped during a transition +* related to hibernation, the subsequent "thaw" callbacks must +* be skipped too or bad things may happen. Otherwise, if the +* device is to be resumed, its runtime PM status must be +* changed to reflect the new configuration. +*/ + if (!dpm_subsys_suspend_late_cb(dev, suspend_msg, NULL) && + !dpm_subsys_suspend_noirq_cb(dev, suspend_msg, NULL)) { + if (state.event == PM_EVENT_THAW) + skip_resume = true; + else if (!skip_resume) + pm_runtime_set_active(dev); + } + } + if (skip_resume) goto Skip; @@ -1231,7 +1272,10 @@ static int __device_suspend_noirq(struct if (callback) goto Run; - direct_cb = true; + direct_cb = !dpm_subsys_suspend_late_cb(dev, state, NULL); + + if (dev_pm_smart_suspend_and_suspended(dev) && direct_cb) + goto Skip; if (dev->driver && dev->driver->pm) { info = "noirq driver "; @@ -1245,6 +1289,7 @@ Run: goto Complete; } +Skip: dev->power.is_noirq_suspended = true; if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { @@ -1252,7 +1297,6 @@ Run: bool skip_resume; if (direct_cb && - !dpm_subsys_suspend_late_cb(dev, state, NULL) && !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) { /* @@ -1449,17 +1493,27 @@ static
Re: [PATCH v3 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Fri, Nov 17, 2017 at 2:49 PM, Ulf Hansson wrote: > [...] > >>> > Second, have you considered setting the default value of > dev->power.may_skip_resume to true? Yes. > That would means the subsystem > instead need to implement an opt-out method. I am thinking that it may > not be an issue, since we anyway at this point, don't have drivers > using the LEAVE_SUSPENDED flag. Opt-out doesn't work because of the need to invoke the "noirq" callbacks. >>> >>> I am not sure I follow that. >>> >>> Whatever needs to be fixed on the subsystem level, that could be done >>> before the driver starts using the LEAVE_SUSPENDED flag. No? >> >> That requires a bit of explanation, sorry for being overly concise. >> >> The core calls ->resume_noirq from the middle layer regardless of >> whether or not the device will be left suspended, so the >> ->resume_noirq cannot do arbitrary things to it. Setting >> may_skip_resume by the middle layer tells the core that the middle >> layer is ready for that and is going to cooperate. If may_skip_resume >> had been set by default, that piece of information would have been >> missing. > > Huh, I still don't get that. Sorry. > > If the "may_skip_resume" is default set to true by the PM core, > wouldn't that just mean that the middle-layer needs to implement an > opt-out method, rather than opt-in. In principle to opt-out the > middle-layer needs to set may_skip_resume to false in suspend_noirq > phase, no? Yes, but if the middle-layer doesn't clear it, that may mean two things. First, the middle layer is ready and so on. Good. Second, the middle layer is not aware of the whole thing. Not good. The core cannot tell. In the opt-in case, however, all is clear. :-) > Then we only need to make sure drivers don't starts use > LEAVE_SUSPENDED, before we make sure the middle layers is adopted. But > that should not be a problem. > > The benefit would be that those middle layers that can cope with > LEAVE_SUSPENDED as of today don't need to change. I'm not sure if that's the case. The middle layer has to evaluate dev_pm_may_skip_resume() in ->resume_noirq() to check if the device can be left in suspend, as it cannot determine that in ->suspend_noirq() yet. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Fri, Nov 17, 2017 at 7:11 AM, Ulf Hansson wrote: > [...] > >>> > +++ linux-pm/include/linux/pm.h >>> > @@ -559,6 +559,7 @@ struct pm_subsys_data { >>> > * NEVER_SKIP: Do not skip system suspend/resume callbacks for the >>> > device. >>> > * SMART_PREPARE: Check the return value of the driver's ->prepare >>> > callback. >>> > * SMART_SUSPEND: No need to resume the device from runtime suspend. >>> > + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if >>> > possible. >>> > * >>> > * Setting SMART_PREPARE instructs bus types and PM domains which may >>> > want >>> > * system suspend/resume callbacks to be skipped for the device to >>> > return 0 from >>> > @@ -572,10 +573,14 @@ struct pm_subsys_data { >>> > * necessary from the driver's perspective. It also may cause them to >>> > skip >>> > * invocations of the ->suspend_late and ->suspend_noirq callbacks >>> > provided by >>> > * the driver if they decide to leave the device in runtime suspend. >>> > + * >>> > + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code >>> > that the >>> > + * driver prefers the device to be left in runtime suspend after system >>> > resume. >>> > */ >>> >>> Question: Can LEAVE_SUSPENDED and NEVER_SKIP be valid combination? I >>> guess not!? Should we validate for wrong combinations? >> >> Why not? There's no real overlap between them. > > Except that NEVER_SKIP, documentation wise, tells you that your > suspend and resume callbacks will never be skipped. :-) You mean the comment in pm.h I suppose? Yes, it isn't precise enough. The proper documentation in devices.rst is less ambiguous, though. :-) > [...] > >>> Second, have you considered setting the default value of >>> dev->power.may_skip_resume to true? >> >> Yes. >> >>> That would means the subsystem >>> instead need to implement an opt-out method. I am thinking that it may >>> not be an issue, since we anyway at this point, don't have drivers >>> using the LEAVE_SUSPENDED flag. >> >> Opt-out doesn't work because of the need to invoke the "noirq" callbacks. > > I am not sure I follow that. > > Whatever needs to be fixed on the subsystem level, that could be done > before the driver starts using the LEAVE_SUSPENDED flag. No? That requires a bit of explanation, sorry for being overly concise. The core calls ->resume_noirq from the middle layer regardless of whether or not the device will be left suspended, so the ->resume_noirq cannot do arbitrary things to it. Setting may_skip_resume by the middle layer tells the core that the middle layer is ready for that and is going to cooperate. If may_skip_resume had been set by default, that piece of information would have been missing. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Fri, Nov 17, 2017 at 12:07 AM, Rafael J. Wysocki wrote: > On Thursday, November 16, 2017 4:10:16 PM CET Ulf Hansson wrote: >> On 12 November 2017 at 01:37, Rafael J. Wysocki wrote: >> > From: Rafael J. Wysocki >> > >> > Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to >> > instruct the PM core and middle-layer (bus type, PM domain, etc.) >> > code that it is desirable to leave the device in runtime suspend >> > after system-wide transitions to the working state (for example, >> > the device may be slow to resume and it may be better to avoid >> > resuming it right away). >> > >> > Generally, the middle-layer code involved in the handling of the >> > device is expected to indicate to the PM core whether or not the >> > device may be left in suspend with the help of the device's >> > power.may_skip_resume status bit. That has to happen in the "noirq" >> > phase of the preceding system suspend (or analogous) transition. >> > The middle layer is then responsible for handling the device as >> > appropriate in its "noirq" resume callback which is executed >> > regardless of whether or not the device may be left suspended, but >> > the other resume callbacks (except for ->complete) will be skipped >> > automatically by the core if the device really can be left in >> > suspend. >> > >> > The additional power.must_resume status bit introduced for the >> > implementation of this mechanisn is used internally by the PM core >> > to track the requirement to resume the device (which may depend on >> > its children etc). >> > >> > Signed-off-by: Rafael J. Wysocki >> > Acked-by: Greg Kroah-Hartman >> > --- >> > >> > v2 -> v3: Take dev->power.usage_count when updating power.must_resume in >> > __device_suspend_noirq(). >> > >> > --- [...] >> > + } else { >> > + dev->power.must_resume = true; >> > + } >> > + >> > + if (dev->power.must_resume) >> > + dpm_superior_set_must_resume(dev); >> > >> > Complete: >> > complete_all(&dev->power.completion); >> > @@ -1487,6 +1539,9 @@ static int __device_suspend(struct devic >> > dev->power.direct_complete = false; >> > } >> > >> > + dev->power.may_skip_resume = false; >> > + dev->power.must_resume = false; >> > + >> >> First, these assignment could be bypassed if the direct_complete path >> is used. Perhaps it's more robust to reset these flags already in >> device_prepare(). > > In the direct-complete case may_skip_resume doesn't matter. > > must_resume should be set to "false", however, so that parents of > direct-complete devices may be left in suspend (in case they don't > fall under direct-complete themselves), so good catch. Actually, not really. must_resume for parents/suppliers is not updated if the device has direct_complete set and the device's own must_resume doesn't matter then. So this part is good as is AFAICS. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Thursday, November 16, 2017 4:10:16 PM CET Ulf Hansson wrote: > On 12 November 2017 at 01:37, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to > > instruct the PM core and middle-layer (bus type, PM domain, etc.) > > code that it is desirable to leave the device in runtime suspend > > after system-wide transitions to the working state (for example, > > the device may be slow to resume and it may be better to avoid > > resuming it right away). > > > > Generally, the middle-layer code involved in the handling of the > > device is expected to indicate to the PM core whether or not the > > device may be left in suspend with the help of the device's > > power.may_skip_resume status bit. That has to happen in the "noirq" > > phase of the preceding system suspend (or analogous) transition. > > The middle layer is then responsible for handling the device as > > appropriate in its "noirq" resume callback which is executed > > regardless of whether or not the device may be left suspended, but > > the other resume callbacks (except for ->complete) will be skipped > > automatically by the core if the device really can be left in > > suspend. > > > > The additional power.must_resume status bit introduced for the > > implementation of this mechanisn is used internally by the PM core > > to track the requirement to resume the device (which may depend on > > its children etc). > > > > Signed-off-by: Rafael J. Wysocki > > Acked-by: Greg Kroah-Hartman > > --- > > > > v2 -> v3: Take dev->power.usage_count when updating power.must_resume in > > __device_suspend_noirq(). > > > > --- > > Documentation/driver-api/pm/devices.rst | 24 ++- > > drivers/base/power/main.c | 66 > > +--- > > drivers/base/power/runtime.c|9 ++-- > > include/linux/pm.h | 14 +- > > include/linux/pm_runtime.h |9 ++-- > > 5 files changed, 104 insertions(+), 18 deletions(-) > > > > Index: linux-pm/include/linux/pm.h > > === > > --- linux-pm.orig/include/linux/pm.h > > +++ linux-pm/include/linux/pm.h > > @@ -559,6 +559,7 @@ struct pm_subsys_data { > > * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. > > * SMART_PREPARE: Check the return value of the driver's ->prepare > > callback. > > * SMART_SUSPEND: No need to resume the device from runtime suspend. > > + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if > > possible. > > * > > * Setting SMART_PREPARE instructs bus types and PM domains which may want > > * system suspend/resume callbacks to be skipped for the device to return > > 0 from > > @@ -572,10 +573,14 @@ struct pm_subsys_data { > > * necessary from the driver's perspective. It also may cause them to skip > > * invocations of the ->suspend_late and ->suspend_noirq callbacks > > provided by > > * the driver if they decide to leave the device in runtime suspend. > > + * > > + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that > > the > > + * driver prefers the device to be left in runtime suspend after system > > resume. > > */ > > Question: Can LEAVE_SUSPENDED and NEVER_SKIP be valid combination? I > guess not!? Should we validate for wrong combinations? Why not? There's no real overlap between them. > > [...] > > > /** > > * __device_suspend_noirq - Execute a "noirq suspend" callback for given > > device. > > * @dev: Device to handle. > > @@ -1127,10 +1161,28 @@ static int __device_suspend_noirq(struct > > } > > > > error = dpm_run_callback(callback, dev, state, info); > > - if (!error) > > - dev->power.is_noirq_suspended = true; > > - else > > + if (error) { > > async_error = error; > > + goto Complete; > > + } > > + > > + dev->power.is_noirq_suspended = true; > > + > > + if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { > > + /* > > +* The only safe strategy here is to require that if the > > device > > +* may not be left in suspend, resume callbacks must be > > invoked > >
Re: [PATCH v3 4/6] PM / core: Add helpers for subsystem callback selection
On Wed, Nov 15, 2017 at 8:43 AM, Ulf Hansson wrote: > On 12 November 2017 at 01:42, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki >> >> Add helper routines to find and return a suitable subsystem callback >> during the "noirq" phases of system suspend/resume (or analogous) >> transitions as well as during the "late" phase of system suspend and >> the "early" phase of system resume (or analogous) transitions. >> >> The helpers will be called from additional sites going forward. >> >> Signed-off-by: Rafael J. Wysocki > > With a minor nitpick, see below, feel free to add: > > Reviewed-by: Ulf Hansson > >> --- >> >> v2 -> v3: No changes. >> >> --- >> drivers/base/power/main.c | 196 >> +++--- >> 1 file changed, 136 insertions(+), 60 deletions(-) >> >> Index: linux-pm/drivers/base/power/main.c >> === >> --- linux-pm.orig/drivers/base/power/main.c >> +++ linux-pm/drivers/base/power/main.c >> @@ -525,6 +525,14 @@ static void dpm_watchdog_clear(struct dp >> #define dpm_watchdog_clear(x) >> #endif >> >> +static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, >> +pm_message_t state, >> +const char **info_p); >> + >> +static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, >> + pm_message_t state, >> + const char **info_p); >> + > > There is no need to declare these functions. > > Perhaps a following patch in the series need them, but then that > change should add these or even better (in my opinion) just move the > implementations and avoid the declarations all together. Well, all of the changes in this patch are for the benefit of the subsequent patches. :-) I just wanted to move additional code churn noise from those patches. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Tuesday, November 14, 2017 5:07:59 PM CET Ulf Hansson wrote: > On 11 November 2017 at 00:45, Rafael J. Wysocki wrote: > > On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson > > wrote: > >> On 8 November 2017 at 14:25, Rafael J. Wysocki wrote: > >>> From: Rafael J. Wysocki > >>> > >>> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to > >>> instruct the PM core and middle-layer (bus type, PM domain, etc.) > >>> code that it is desirable to leave the device in runtime suspend > >>> after system-wide transitions to the working state (for example, > >>> the device may be slow to resume and it may be better to avoid > >>> resuming it right away). > >>> > >>> Generally, the middle-layer code involved in the handling of the > >>> device is expected to indicate to the PM core whether or not the > >>> device may be left in suspend with the help of the device's > >>> power.may_skip_resume status bit. That has to happen in the "noirq" > >>> phase of the preceding system suspend (or analogous) transition. > >>> The middle layer is then responsible for handling the device as > >>> appropriate in its "noirq" resume callback which is executed > >>> regardless of whether or not the device may be left suspended, but > >>> the other resume callbacks (except for ->complete) will be skipped > >>> automatically by the core if the device really can be left in > >>> suspend. > >> > >> I don't understand the reason to why you need to skip invoking resume > >> callbacks to achieve this behavior, could you elaborate on that? > > > > The reason why it is done this way is because that takes less code and > > is easier (or at least less error-prone, because it avoids repeating > > patterns in middle layers). > > > > Note that the callbacks only may be skipped by the core if the middle > > layer has set power.skip_resume for the device (or if the core is > > handling it in patch [5/6], but that's one more step ahead still). > > > >> Couldn't the PM domain or the middle-layer instead decide what to do? > > > > They still can, the whole thing is a total opt-in. > > > > But to be constructive, do you have any specific examples in mind? > > See more below. > > > > >> To me it sounds a bit prone to errors by skipping callbacks from the > >> PM core, and I wonder if the general driver author will be able to > >> understand how to use this flag properly. > > > > This has nothing to do with general driver authors and I'm not sure > > what you mean here and where you are going with this. > > Let me elaborate. > > My general goal is that I want to make it easier (or as easy as > possible) for the general driver author to deploy runtime PM and > system-wide PM support - in an optimized manner. Therefore, I am > pondering over the solution you picked in this series, trying to > understand how it fits into those aspects. > > Particular I am a bit worried from a complexity point of view, about > the part with skipping callbacks from the PM core. We have observed > some difficulties with the direct_complete path (i2c dw driver), which > is based on a similar approach as this one. These are resume callbacks, not suspend callbacks. Also not all of them are skipped. That is quite a bit different from skipping *all* callbacks. Moreover, at the point the core decides to skip the callbacks, the device *has* *to* be left suspended and there simply is no point in running them no matter what. That part of code can be trivially moved to middle layers, but then each of them will have to do exactly the same thing. I don't see any reason to do that and I'm not finding one in your comments. Sorry. > Additionally, in this case, to trigger skipping of callbacks to > happen, first, drivers needs to inform the middle-layer, second, the > middle layer acts on that information and then informs the PM core, > then in the third step, the PM core can decide what to do. It doesn't > sound straight-forward. It really doesn't work like that. First, the driver sets the LEAVE_SUSPENDED flag for the core to consume. The middle layers don't have to look at it at all. Second, each middle layer sets power.may_skip_resume for devices whose state after system suspend should match the runtime suspend state. The middle layer must know that this is the case to set that bit. [The core effectively does that part for devices handled by it directly in patch [5/6].] The core then takes the LEAV
[PATCH v3 4/6] PM / core: Add helpers for subsystem callback selection
From: Rafael J. Wysocki Add helper routines to find and return a suitable subsystem callback during the "noirq" phases of system suspend/resume (or analogous) transitions as well as during the "late" phase of system suspend and the "early" phase of system resume (or analogous) transitions. The helpers will be called from additional sites going forward. Signed-off-by: Rafael J. Wysocki --- v2 -> v3: No changes. --- drivers/base/power/main.c | 196 +++--- 1 file changed, 136 insertions(+), 60 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -525,6 +525,14 @@ static void dpm_watchdog_clear(struct dp #define dpm_watchdog_clear(x) #endif +static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, +pm_message_t state, +const char **info_p); + +static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, + pm_message_t state, + const char **info_p); + /*- Resume routines -*/ /** @@ -539,6 +547,35 @@ bool dev_pm_may_skip_resume(struct devic return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; } +static pm_callback_t dpm_subsys_resume_noirq_cb(struct device *dev, + pm_message_t state, + const char **info_p) +{ + pm_callback_t callback; + const char *info; + + if (dev->pm_domain) { + info = "noirq power domain "; + callback = pm_noirq_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "noirq type "; + callback = pm_noirq_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "noirq class "; + callback = pm_noirq_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "noirq bus "; + callback = pm_noirq_op(dev->bus->pm, state); + } else { + return NULL; + } + + if (info_p) + *info_p = info; + + return callback; +} + /** * device_resume_noirq - Execute a "noirq resume" callback for given device. * @dev: Device to handle. @@ -550,8 +587,8 @@ bool dev_pm_may_skip_resume(struct devic */ static int device_resume_noirq(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback = NULL; - const char *info = NULL; + pm_callback_t callback; + const char *info; int error = 0; TRACE_DEVICE(dev); @@ -565,19 +602,7 @@ static int device_resume_noirq(struct de dpm_wait_for_superior(dev, async); - if (dev->pm_domain) { - info = "noirq power domain "; - callback = pm_noirq_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "noirq type "; - callback = pm_noirq_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "noirq class "; - callback = pm_noirq_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "noirq bus "; - callback = pm_noirq_op(dev->bus->pm, state); - } + callback = dpm_subsys_resume_noirq_cb(dev, state, &info); if (!callback && dev->driver && dev->driver->pm) { info = "noirq driver "; @@ -686,6 +711,35 @@ void dpm_resume_noirq(pm_message_t state dpm_noirq_end(); } +static pm_callback_t dpm_subsys_resume_early_cb(struct device *dev, + pm_message_t state, + const char **info_p) +{ + pm_callback_t callback; + const char *info; + + if (dev->pm_domain) { + info = "early power domain "; + callback = pm_late_early_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "early type "; + callback = pm_late_early_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "early cl
[PATCH v3 3/6] ACPI / PM: Support for LEAVE_SUSPENDED driver flag in ACPI PM domain
From: Rafael J. Wysocki Add support for DPM_FLAG_LEAVE_SUSPENDED to the ACPI PM domain by making it (a) set the power.may_skip_resume status bit for devices that, from its perspective, may be left in suspend after system wakeup from sleep and (b) return early from acpi_subsys_resume_noirq() for devices whose remaining resume callbacks during the transition under way are going to be skipped by the PM core. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- v2 -> v3: No changes. --- drivers/acpi/device_pm.c | 27 --- 1 file changed, 24 insertions(+), 3 deletions(-) Index: linux-pm/drivers/acpi/device_pm.c === --- linux-pm.orig/drivers/acpi/device_pm.c +++ linux-pm/drivers/acpi/device_pm.c @@ -987,7 +987,7 @@ void acpi_subsys_complete(struct device * the sleep state it is going out of and it has never been resumed till * now, resume it in case the firmware powered it up. */ - if (dev->power.direct_complete && pm_resume_via_firmware()) + if (pm_runtime_suspended(dev) && pm_resume_via_firmware()) pm_request_resume(dev); } EXPORT_SYMBOL_GPL(acpi_subsys_complete); @@ -1036,10 +1036,28 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_la */ int acpi_subsys_suspend_noirq(struct device *dev) { - if (dev_pm_smart_suspend_and_suspended(dev)) + int ret; + + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.may_skip_resume = true; return 0; + } - return pm_generic_suspend_noirq(dev); + ret = pm_generic_suspend_noirq(dev); + if (ret) + return ret; + + /* +* If the target system sleep state is suspend-to-idle, it is sufficient +* to check whether or not the device's wakeup settings are good for +* runtime PM. Otherwise, the pm_resume_via_firmware() check will cause +* acpi_subsys_complete() to take care of fixing up the device's state +* anyway, if need be. +*/ + dev->power.may_skip_resume = device_may_wakeup(dev) || + !device_can_wakeup(dev); + + return 0; } EXPORT_SYMBOL_GPL(acpi_subsys_suspend_noirq); @@ -1049,6 +1067,9 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_no */ int acpi_subsys_resume_noirq(struct device *dev) { + if (dev_pm_may_skip_resume(dev)) + return 0; + /* * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend * during system suspend, so update their runtime PM status to "active" -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 6/6] PM / core: DPM_FLAG_SMART_SUSPEND optimization
From: Rafael J. Wysocki Make the PM core avoid invoking the "late" and "noirq" system-wide suspend (or analogous) callbacks for devices that are in runtime suspend during the corresponding phases of system-wide suspend (or analogous) transitions. The underlying observation is that runtime PM is disabled for devices during those system-wide suspend phases, so their runtime PM status should not change going forward and if it has not changed so far, their state should be compatible with the target system sleep state. This change really makes it possible for, say, platform device drivers to re-use runtime PM suspend and resume callbacks by pointing ->suspend_late and ->resume_early, respectively (and possibly the analogous hibernation-related callback pointers too), to them without adding any extra "is the device already suspended?" type of checks to the callback routines, as long as they will be invoked directly by the core. Signed-off-by: Rafael J. Wysocki --- v2 -> v3: No changes. --- Documentation/driver-api/pm/devices.rst | 18 + drivers/base/power/main.c | 62 2 files changed, 66 insertions(+), 14 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -536,6 +536,24 @@ static pm_callback_t dpm_subsys_suspend_ /*- Resume routines -*/ /** + * suspend_event - Return a "suspend" message for given "resume" one. + * @resume_msg: PM message representing a system-wide resume transition. + */ +static pm_message_t suspend_event(pm_message_t resume_msg) +{ + switch (resume_msg.event) { + case PM_EVENT_RESUME: + return PMSG_SUSPEND; + case PM_EVENT_THAW: + case PM_EVENT_RESTORE: + return PMSG_FREEZE; + case PM_EVENT_RECOVER: + return PMSG_HIBERNATE; + } + return PMSG_ON; +} + +/** * dev_pm_may_skip_resume - System-wide device resume optimization check. * @dev: Target device. * @@ -609,6 +627,25 @@ static int device_resume_noirq(struct de if (callback) goto Run; + if (dev_pm_smart_suspend_and_suspended(dev)) { + pm_message_t suspend_msg = suspend_event(state); + + /* +* If "freeze" callbacks have been skipped during a transition +* related to hibernation, the subsequent "thaw" callbacks must +* be skipped too or bad things may happen. Otherwise, if the +* device is to be resumed, its runtime PM status must be +* changed to reflect the new configuration. +*/ + if (!dpm_subsys_suspend_late_cb(dev, suspend_msg, NULL) && + !dpm_subsys_suspend_noirq_cb(dev, suspend_msg, NULL)) { + if (state.event == PM_EVENT_THAW) + skip_resume = true; + else if (!skip_resume) + pm_runtime_set_active(dev); + } + } + if (skip_resume) goto Skip; @@ -1228,7 +1265,10 @@ static int __device_suspend_noirq(struct if (callback) goto Run; - direct_cb = true; + direct_cb = !dpm_subsys_suspend_late_cb(dev, state, NULL); + + if (dev_pm_smart_suspend_and_suspended(dev) && direct_cb) + goto Skip; if (dev->driver && dev->driver->pm) { info = "noirq driver "; @@ -1242,6 +1282,7 @@ Run: goto Complete; } +Skip: dev->power.is_noirq_suspended = true; if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { @@ -1249,7 +1290,6 @@ Run: bool skip_resume; if (direct_cb && - !dpm_subsys_suspend_late_cb(dev, state, NULL) && !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) { /* @@ -1446,17 +1486,27 @@ static int __device_suspend_late(struct goto Complete; callback = dpm_subsys_suspend_late_cb(dev, state, &info); + if (callback) + goto Run; - if (!callback && dev->driver && dev->driver->pm) { + if (dev_pm_smart_suspend_and_suspended(dev) && + !dpm_subsys_suspend_noirq_cb(dev, state, NULL)) + goto Skip; + + if (dev->driver && dev->driver->pm) { info = "late driver "; callback = pm_late_early_op(dev->driver->pm, state);
[PATCH v3 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
From: Rafael J. Wysocki Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to instruct the PM core and middle-layer (bus type, PM domain, etc.) code that it is desirable to leave the device in runtime suspend after system-wide transitions to the working state (for example, the device may be slow to resume and it may be better to avoid resuming it right away). Generally, the middle-layer code involved in the handling of the device is expected to indicate to the PM core whether or not the device may be left in suspend with the help of the device's power.may_skip_resume status bit. That has to happen in the "noirq" phase of the preceding system suspend (or analogous) transition. The middle layer is then responsible for handling the device as appropriate in its "noirq" resume callback which is executed regardless of whether or not the device may be left suspended, but the other resume callbacks (except for ->complete) will be skipped automatically by the core if the device really can be left in suspend. The additional power.must_resume status bit introduced for the implementation of this mechanisn is used internally by the PM core to track the requirement to resume the device (which may depend on its children etc). Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- v2 -> v3: Take dev->power.usage_count when updating power.must_resume in __device_suspend_noirq(). --- Documentation/driver-api/pm/devices.rst | 24 ++- drivers/base/power/main.c | 66 +--- drivers/base/power/runtime.c|9 ++-- include/linux/pm.h | 14 +- include/linux/pm_runtime.h |9 ++-- 5 files changed, 104 insertions(+), 18 deletions(-) Index: linux-pm/include/linux/pm.h === --- linux-pm.orig/include/linux/pm.h +++ linux-pm/include/linux/pm.h @@ -559,6 +559,7 @@ struct pm_subsys_data { * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -572,10 +573,14 @@ struct pm_subsys_data { * necessary from the driver's perspective. It also may cause them to skip * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by * the driver if they decide to leave the device in runtime suspend. + * + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the + * driver prefers the device to be left in runtime suspend after system resume. */ -#define DPM_FLAG_NEVER_SKIPBIT(0) -#define DPM_FLAG_SMART_PREPARE BIT(1) -#define DPM_FLAG_SMART_SUSPEND BIT(2) +#define DPM_FLAG_NEVER_SKIPBIT(0) +#define DPM_FLAG_SMART_PREPARE BIT(1) +#define DPM_FLAG_SMART_SUSPEND BIT(2) +#define DPM_FLAG_LEAVE_SUSPENDED BIT(3) struct dev_pm_info { pm_message_tpower_state; @@ -597,6 +602,8 @@ struct dev_pm_info { boolwakeup_path:1; boolsyscore:1; boolno_pm_callbacks:1; /* Owned by the PM core */ + unsigned intmust_resume:1; /* Owned by the PM core */ + unsigned intmay_skip_resume:1; /* Set by subsystems */ #else unsigned intshould_wakeup:1; #endif @@ -765,6 +772,7 @@ extern int pm_generic_poweroff_late(stru extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); +extern bool dev_pm_may_skip_resume(struct device *dev); extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); #else /* !CONFIG_PM_SLEEP */ Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -528,6 +528,18 @@ static void dpm_watchdog_clear(struct dp /*- Resume routines -*/ /** + * dev_pm_may_skip_resume - System-wide device resume optimization check. + * @dev: Target device. + * + * Checks whether or not the device may be left in suspend after a system-wide + * transition to the working state. + */ +bool dev_pm_may_skip_resume(struct device *dev) +{ + return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; +} + +/** * device_resume_noirq - Execute a "noirq resume" callback for given device. * @dev: Device to handle. * @state: PM transition of the system being
[PATCH v3 2/6] PCI / PM: Support for LEAVE_SUSPENDED driver flag
From: Rafael J. Wysocki Add support for DPM_FLAG_LEAVE_SUSPENDED to the PCI bus type by making it (a) set the power.may_skip_resume status bit for devices that, from its perspective, may be left in suspend after system wakeup from sleep and (b) return early from pci_pm_resume_noirq() for devices whose remaining resume callbacks during the transition under way are going to be skipped by the PM core. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman Acked-by: Bjorn Helgaas --- v2 -> v3: Add the Acked-by from Bjorn, no changes in the patch. --- Documentation/power/pci.txt | 11 +++ drivers/pci/pci-driver.c| 19 +-- 2 files changed, 28 insertions(+), 2 deletions(-) Index: linux-pm/drivers/pci/pci-driver.c === --- linux-pm.orig/drivers/pci/pci-driver.c +++ linux-pm/drivers/pci/pci-driver.c @@ -699,7 +699,7 @@ static void pci_pm_complete(struct devic pm_generic_complete(dev); /* Resume device if platform firmware has put it in reset-power-on */ - if (dev->power.direct_complete && pm_resume_via_firmware()) { + if (pm_runtime_suspended(dev) && pm_resume_via_firmware()) { pci_power_t pre_sleep_state = pci_dev->current_state; pci_update_current_state(pci_dev, pci_dev->current_state); @@ -783,8 +783,10 @@ static int pci_pm_suspend_noirq(struct d struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.may_skip_resume = true; return 0; + } if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_suspend_late(dev, PMSG_SUSPEND); @@ -838,6 +840,16 @@ static int pci_pm_suspend_noirq(struct d Fixup: pci_fixup_device(pci_fixup_suspend_late, pci_dev); + /* +* If the target system sleep state is suspend-to-idle, it is sufficient +* to check whether or not the device's wakeup settings are good for +* runtime PM. Otherwise, the pm_resume_via_firmware() check will cause +* pci_pm_complete() to take care of fixing up the device's state +* anyway, if need be. +*/ + dev->power.may_skip_resume = device_may_wakeup(dev) || + !device_can_wakeup(dev); + return 0; } @@ -847,6 +859,9 @@ static int pci_pm_resume_noirq(struct de struct device_driver *drv = dev->driver; int error = 0; + if (dev_pm_may_skip_resume(dev)) + return 0; + /* * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend * during system suspend, so update their runtime PM status to "active" Index: linux-pm/Documentation/power/pci.txt === --- linux-pm.orig/Documentation/power/pci.txt +++ linux-pm/Documentation/power/pci.txt @@ -994,6 +994,17 @@ into D0 going forward), but if it is in the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return. +Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +device to be left in suspend after system-wide transitions to the working state. +This flag is checked by the PM core, but the PCI bus type informs the PM core +which devices may be left in suspend from its perspective (that happens during +the "noirq" phase of system-wide suspend and analogous transitions) and next it +uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +pci_pm_resume_noirq() early, as the PM core will skip the remaining resume +callbacks for the device during the transition under way and will set its +runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for +it. + 3.2. Device Runtime Power Management In addition to providing device power management callbacks PCI device drivers -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 5/6] PM / core: Direct handling of DPM_FLAG_LEAVE_SUSPENDED
From: Rafael J. Wysocki Make the PM core handle DPM_FLAG_LEAVE_SUSPENDED directly for devices whose "noirq", "late" and "early" driver callbacks are invoked directly by it. Namely, make it skip all of the system-wide resume callbacks for such devices with DPM_FLAG_LEAVE_SUSPENDED set if they are in runtime suspend during the "noirq" phase of system-wide suspend (or analogous) transitions or the system transition under way is a proper suspend (rather than anything related to hibernation) and the device's wakeup settings are compatible with runtime PM (that is, the device cannot generate wakeup signals at all or it is allowed to wake up the system from sleep). Signed-off-by: Rafael J. Wysocki --- v2 -> v3: Rebase on the v3 of patch [1/6]. --- Documentation/driver-api/pm/devices.rst |9 ++ drivers/base/power/main.c | 47 2 files changed, 51 insertions(+), 5 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -589,6 +589,7 @@ static int device_resume_noirq(struct de { pm_callback_t callback; const char *info; + bool skip_resume; int error = 0; TRACE_DEVICE(dev); @@ -602,23 +603,33 @@ static int device_resume_noirq(struct de dpm_wait_for_superior(dev, async); + skip_resume = dev_pm_may_skip_resume(dev); + callback = dpm_subsys_resume_noirq_cb(dev, state, &info); + if (callback) + goto Run; + + if (skip_resume) + goto Skip; if (!callback && dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); + +Skip: dev->power.is_noirq_suspended = false; - if (dev_pm_may_skip_resume(dev)) { + if (skip_resume) { pm_runtime_set_suspended(dev); dev->power.is_late_suspended = false; dev->power.is_suspended = false; } - Out: +Out: complete_all(&dev->power.completion); TRACE_RESUME(error); return error; @@ -1194,6 +1205,7 @@ static int __device_suspend_noirq(struct { pm_callback_t callback; const char *info; + bool direct_cb = false; int error = 0; TRACE_DEVICE(dev); @@ -1213,12 +1225,17 @@ static int __device_suspend_noirq(struct goto Complete; callback = dpm_subsys_suspend_noirq_cb(dev, state, &info); + if (callback) + goto Run; - if (!callback && dev->driver && dev->driver->pm) { + direct_cb = true; + + if (dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); if (error) { async_error = error; @@ -1228,13 +1245,33 @@ static int __device_suspend_noirq(struct dev->power.is_noirq_suspended = true; if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { + pm_message_t resume_msg = resume_event(state); + bool skip_resume; + + if (direct_cb && + !dpm_subsys_suspend_late_cb(dev, state, NULL) && + !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && + !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) { + /* +* If all of the device driver's "noirq", "late" and +* "early" callbacks are invoked directly by the core, +* the decision to allow the device to stay in suspend +* can be based on its current runtime PM status and its +* wakeup settings. +*/ + skip_resume = pm_runtime_status_suspended(dev) || + (resume_msg.event == PM_EVENT_RESUME && +(!device_can_wakeup(dev) || + device_may_wakeup(dev))); + } else { + skip_resume = dev->power.may_skip_resume; + } /* * The only safe strategy here is to require that if the device * may not be left in suspend, resume callbacks must be invoked * for it. */ - dev->power.must_resume = dev->power.must_resume
[PATCH v3 0/6] PM / sleep: Driver flags for system suspend/resume (part 2)
Hi All, The following still applies: On Wednesday, November 8, 2017 1:41:35 AM CET Rafael J. Wysocki wrote: > > This is a follow-up for the first part of the PM driver flags series > sent previously some time ago with an intro as follows: > > On Saturday, October 28, 2017 12:11:55 AM CET Rafael J. Wysocki wrote: > > The following part of the original cover letter still applies: > > > > On Monday, October 16, 2017 3:12:35 AM CEST Rafael J. Wysocki wrote: > > > > > > This work was triggered by attempts to fix and optimize PM in the > > > i2c-designware-platdev driver that ended up with adding a couple of > > > flags to the driver's internal data structures for the tracking of > > > device state (https://marc.info/?l=linux-acpi&m=150629646805636&w=2). > > > That approach is sort of suboptimal, though, because other drivers will > > > probably want to do similar things and if all of them need to use internal > > > flags for that, quite a bit of code duplication may ensue at least. > > > > > > That can be avoided in a couple of ways and one of them is to provide a > > > means > > > for drivers to tell the core what to do and to make the core take care of > > > it > > > if told to do so. Hence, the idea to use driver flags for system-wide PM > > > that was briefly discussed during the LPC in LA last month. > > > > [...] > > > > > What can work (and this is the only strategy that can work AFAICS) is to > > > point different callback pointers *in* *a* *driver* to the same routine > > > if the driver wants to reuse that code. That actually will work for PCI > > > and USB drivers today, at least most of the time, but unfortunately there > > > are problems with it for, say, platform devices. > > > > > > The first problem is the requirement to track the status of the device > > > (suspended vs not suspended) in the callbacks, because the system-wide PM > > > code in the PM core doesn't do that. The runtime PM framework does it, so > > > this means adding some extra code which isn't necessary for runtime PM to > > > the callback routines and that is not particularly nice. > > > > > > The second problem is that, if the driver wants to do anything in its > > > ->suspend callback, it generally has to prevent runtime suspend of the > > > device from taking place in parallel with that, which is quite cumbersome. > > > Usually, that is taken care of by resuming the device from runtime suspend > > > upfront, but generally doing that is wasteful (there may be no real need > > > to > > > resume the device except for the fact that the code is designed this way). > > > > > > On top of the above, there are optimizations to be made, like leaving > > > certain > > > devices in suspend after system resume to avoid wasting time on waiting > > > for > > > them to resume before user space can run again and similar. > > > > > > This patch series focuses on addressing those problems so as to make it > > > easier to reuse callback routines by pointing different callback pointers > > > to them in device drivers. The flags introduced here are to instruct the > > > PM core and middle layers (whatever they are) on how the driver wants the > > > device to be handled and then the driver has to provide callbacks to match > > > these instructions and the rest should be taken care of by the code above > > > it. > > > > > > The flags are introduced one by one to avoid making too many changes in > > > one go and to allow things to be explained better (hopefully). They > > > mostly > > > are mutually independent with some clearly documented exceptions. > > > > but I had to rework the core patches to address the problem pointed with the > > generic power domains (genpd) framework pointed out by Ulf. > > > > Namely, genpd expects its "noirq" callbacks to be invoked for devices in > > runtime suspend too and it has valid reasons for that, so its "noirq" > > callbacks can never be skipped, even for devices with the SMART_SUSPEND > > flag set. For this reason, the logic related to DPM_FLAG_SMART_SUSPEND > > had to be moved from the core to the PCI bus type and the ACPI PM domain > > which are mostly affected by it anyway. The code after the changes looks > > more straightforward to me, but it generally is more code and some patterns > > had to be repeated in a few places. > > I promised
Re: [PATCH v2 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Sat, Nov 11, 2017 at 12:45 AM, Rafael J. Wysocki wrote: > On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson wrote: >> On 8 November 2017 at 14:25, Rafael J. Wysocki wrote: [cut] >> Moreover, you should check the return value from >> pm_runtime_set_suspended(). > > This is in "noirq", so failures of that are meaningless here. They *should* be meaningless, but __pm_runtime_set_status() is sort of buggy and checks child_count regardless of whether or not runtime PM is enabled for the children (but when changing the status to "active" it actually checks if runtime PM is enabled for the parent before returning -EBUSY, so it is not event consistent internally). Oh well. >> Then I wonder, what should you do when it fails here? >> >> Perhaps a better idea is to do this in the noirq suspend phase, >> because it allows you to bail out in case pm_runtime_set_suspended() >> fails. > > This doesn't make sense, sorry. Not for the above reason, but that would allow the bug in __pm_runtime_set_status() to be sort of worked around by setting the status to "suspended" for children before doing that for their parents. Moreover, stuff with nonzero usage_counts cannot be left in suspend regardless. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Sat, Nov 11, 2017 at 12:45 AM, Rafael J. Wysocki wrote: > On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson wrote: >> On 8 November 2017 at 14:25, Rafael J. Wysocki wrote: >>> From: Rafael J. Wysocki >>> >>> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to >>> instruct the PM core and middle-layer (bus type, PM domain, etc.) >>> code that it is desirable to leave the device in runtime suspend >>> after system-wide transitions to the working state (for example, >>> the device may be slow to resume and it may be better to avoid >>> resuming it right away). >>> >>> Generally, the middle-layer code involved in the handling of the >>> device is expected to indicate to the PM core whether or not the >>> device may be left in suspend with the help of the device's >>> power.may_skip_resume status bit. That has to happen in the "noirq" >>> phase of the preceding system suspend (or analogous) transition. >>> The middle layer is then responsible for handling the device as >>> appropriate in its "noirq" resume callback which is executed >>> regardless of whether or not the device may be left suspended, but >>> the other resume callbacks (except for ->complete) will be skipped >>> automatically by the core if the device really can be left in >>> suspend. >> >> I don't understand the reason to why you need to skip invoking resume >> callbacks to achieve this behavior, could you elaborate on that? > > The reason why it is done this way is because that takes less code and > is easier (or at least less error-prone, because it avoids repeating > patterns in middle layers). Actually, it also is a matter of correctness, at least to some extent. Namely, if the parent or any supplier of the device has power.must_resume clear in dpm_noirq_resume_devices(), then the device should not be touched during the whole system resume transition (because the access may very well go through the suspended parent or supplier) and the most straightforward way to make that happen is to avoid running the code that may touch the device. [Arguably, if middle layers were made responsible for handling that, they would need to do pretty much the same thing and so there is no reason for not doing it in the core.] Allowing the "noirq" callback from middle layers to run in that case is a stretch already, but since genpd needs that, well, tough nuggets. All of that said, if there is a middle layer wanting to set power.skip_resume and needing to do something different for the resume callbacks, then this piece can be moved from the core to the middle layers at any time later. So far there's none, though. At least not in this patch series. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
On Fri, Nov 10, 2017 at 10:09 AM, Ulf Hansson wrote: > On 8 November 2017 at 14:25, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki >> >> Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to >> instruct the PM core and middle-layer (bus type, PM domain, etc.) >> code that it is desirable to leave the device in runtime suspend >> after system-wide transitions to the working state (for example, >> the device may be slow to resume and it may be better to avoid >> resuming it right away). >> >> Generally, the middle-layer code involved in the handling of the >> device is expected to indicate to the PM core whether or not the >> device may be left in suspend with the help of the device's >> power.may_skip_resume status bit. That has to happen in the "noirq" >> phase of the preceding system suspend (or analogous) transition. >> The middle layer is then responsible for handling the device as >> appropriate in its "noirq" resume callback which is executed >> regardless of whether or not the device may be left suspended, but >> the other resume callbacks (except for ->complete) will be skipped >> automatically by the core if the device really can be left in >> suspend. > > I don't understand the reason to why you need to skip invoking resume > callbacks to achieve this behavior, could you elaborate on that? The reason why it is done this way is because that takes less code and is easier (or at least less error-prone, because it avoids repeating patterns in middle layers). Note that the callbacks only may be skipped by the core if the middle layer has set power.skip_resume for the device (or if the core is handling it in patch [5/6], but that's one more step ahead still). > Couldn't the PM domain or the middle-layer instead decide what to do? They still can, the whole thing is a total opt-in. But to be constructive, do you have any specific examples in mind? > To me it sounds a bit prone to errors by skipping callbacks from the > PM core, and I wonder if the general driver author will be able to > understand how to use this flag properly. This has nothing to do with general driver authors and I'm not sure what you mean here and where you are going with this. > That said, as the series don't include any changes for drivers making > use of the flag, could please fold in such change as it would provide > a more complete picture? I've already done so, see https://patchwork.kernel.org/patch/10007349/ IMHO it's not really useful to drag this stuff (which doesn't change BTW) along with every iteration of the core patches. >> >> The additional power.must_resume status bit introduced for the >> implementation of this mechanisn is used internally by the PM core >> to track the requirement to resume the device (which may depend on >> its children etc). > > Yeah, clearly the PM core needs to be involved, because of the need of > dealing with parent/child relations, however as kind of indicate > above, couldn't the PM core just set some flag/status bits, which > instructs the middle-layer and PM domain on what to do? That sounds > like an easier approach. No, it is not easier. And it is backwards. >> >> Signed-off-by: Rafael J. Wysocki >> Acked-by: Greg Kroah-Hartman >> --- >> Documentation/driver-api/pm/devices.rst | 24 ++- >> drivers/base/power/main.c | 65 >> +--- >> include/linux/pm.h | 14 +- >> 3 files changed, 93 insertions(+), 10 deletions(-) >> >> Index: linux-pm/include/linux/pm.h >> === >> --- linux-pm.orig/include/linux/pm.h >> +++ linux-pm/include/linux/pm.h >> @@ -559,6 +559,7 @@ struct pm_subsys_data { >> * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. >> * SMART_PREPARE: Check the return value of the driver's ->prepare callback. >> * SMART_SUSPEND: No need to resume the device from runtime suspend. >> + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if >> possible. >> * >> * Setting SMART_PREPARE instructs bus types and PM domains which may want >> * system suspend/resume callbacks to be skipped for the device to return 0 >> from >> @@ -572,10 +573,14 @@ struct pm_subsys_data { >> * necessary from the driver's perspective. It also may cause them to skip >> * invocations of the ->suspend_late and ->suspend_noirq callbacks provided >> by >> * the driver if they decide to leave the device in runtime suspe
Re: [PATCH v2 2/6] PCI / PM: Support for LEAVE_SUSPENDED driver flag
On Wed, Nov 8, 2017 at 9:38 PM, Bjorn Helgaas wrote: > On Wed, Nov 08, 2017 at 02:28:18PM +0100, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki >> >> Add support for DPM_FLAG_LEAVE_SUSPENDED to the PCI bus type by >> making it (a) set the power.may_skip_resume status bit for devices >> that, from its perspective, may be left in suspend after system >> wakeup from sleep and (b) return early from pci_pm_resume_noirq() >> for devices whose remaining resume callbacks during the transition >> under way are going to be skipped by the PM core. >> >> Signed-off-by: Rafael J. Wysocki >> Acked-by: Greg Kroah-Hartman > > Acked-by: Bjorn Helgaas Thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/6] ACPI / PM: Support for LEAVE_SUSPENDED driver flag in ACPI PM domain
From: Rafael J. Wysocki Add support for DPM_FLAG_LEAVE_SUSPENDED to the ACPI PM domain by making it (a) set the power.may_skip_resume status bit for devices that, from its perspective, may be left in suspend after system wakeup from sleep and (b) return early from acpi_subsys_resume_noirq() for devices whose remaining resume callbacks during the transition under way are going to be skipped by the PM core. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- drivers/acpi/device_pm.c | 27 --- 1 file changed, 24 insertions(+), 3 deletions(-) Index: linux-pm/drivers/acpi/device_pm.c === --- linux-pm.orig/drivers/acpi/device_pm.c +++ linux-pm/drivers/acpi/device_pm.c @@ -987,7 +987,7 @@ void acpi_subsys_complete(struct device * the sleep state it is going out of and it has never been resumed till * now, resume it in case the firmware powered it up. */ - if (dev->power.direct_complete && pm_resume_via_firmware()) + if (pm_runtime_suspended(dev) && pm_resume_via_firmware()) pm_request_resume(dev); } EXPORT_SYMBOL_GPL(acpi_subsys_complete); @@ -1036,10 +1036,28 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_la */ int acpi_subsys_suspend_noirq(struct device *dev) { - if (dev_pm_smart_suspend_and_suspended(dev)) + int ret; + + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.may_skip_resume = true; return 0; + } - return pm_generic_suspend_noirq(dev); + ret = pm_generic_suspend_noirq(dev); + if (ret) + return ret; + + /* +* If the target system sleep state is suspend-to-idle, it is sufficient +* to check whether or not the device's wakeup settings are good for +* runtime PM. Otherwise, the pm_resume_via_firmware() check will cause +* acpi_subsys_complete() to take care of fixing up the device's state +* anyway, if need be. +*/ + dev->power.may_skip_resume = device_may_wakeup(dev) || + !device_can_wakeup(dev); + + return 0; } EXPORT_SYMBOL_GPL(acpi_subsys_suspend_noirq); @@ -1049,6 +1067,9 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_no */ int acpi_subsys_resume_noirq(struct device *dev) { + if (dev_pm_may_skip_resume(dev)) + return 0; + /* * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend * during system suspend, so update their runtime PM status to "active" -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/6] PCI / PM: Support for LEAVE_SUSPENDED driver flag
From: Rafael J. Wysocki Add support for DPM_FLAG_LEAVE_SUSPENDED to the PCI bus type by making it (a) set the power.may_skip_resume status bit for devices that, from its perspective, may be left in suspend after system wakeup from sleep and (b) return early from pci_pm_resume_noirq() for devices whose remaining resume callbacks during the transition under way are going to be skipped by the PM core. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- Documentation/power/pci.txt | 11 +++ drivers/pci/pci-driver.c| 19 +-- 2 files changed, 28 insertions(+), 2 deletions(-) Index: linux-pm/drivers/pci/pci-driver.c === --- linux-pm.orig/drivers/pci/pci-driver.c +++ linux-pm/drivers/pci/pci-driver.c @@ -699,7 +699,7 @@ static void pci_pm_complete(struct devic pm_generic_complete(dev); /* Resume device if platform firmware has put it in reset-power-on */ - if (dev->power.direct_complete && pm_resume_via_firmware()) { + if (pm_runtime_suspended(dev) && pm_resume_via_firmware()) { pci_power_t pre_sleep_state = pci_dev->current_state; pci_update_current_state(pci_dev, pci_dev->current_state); @@ -783,8 +783,10 @@ static int pci_pm_suspend_noirq(struct d struct pci_dev *pci_dev = to_pci_dev(dev); const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; - if (dev_pm_smart_suspend_and_suspended(dev)) + if (dev_pm_smart_suspend_and_suspended(dev)) { + dev->power.may_skip_resume = true; return 0; + } if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_suspend_late(dev, PMSG_SUSPEND); @@ -838,6 +840,16 @@ static int pci_pm_suspend_noirq(struct d Fixup: pci_fixup_device(pci_fixup_suspend_late, pci_dev); + /* +* If the target system sleep state is suspend-to-idle, it is sufficient +* to check whether or not the device's wakeup settings are good for +* runtime PM. Otherwise, the pm_resume_via_firmware() check will cause +* pci_pm_complete() to take care of fixing up the device's state +* anyway, if need be. +*/ + dev->power.may_skip_resume = device_may_wakeup(dev) || + !device_can_wakeup(dev); + return 0; } @@ -847,6 +859,9 @@ static int pci_pm_resume_noirq(struct de struct device_driver *drv = dev->driver; int error = 0; + if (dev_pm_may_skip_resume(dev)) + return 0; + /* * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend * during system suspend, so update their runtime PM status to "active" Index: linux-pm/Documentation/power/pci.txt === --- linux-pm.orig/Documentation/power/pci.txt +++ linux-pm/Documentation/power/pci.txt @@ -994,6 +994,17 @@ into D0 going forward), but if it is in the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return. +Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +device to be left in suspend after system-wide transitions to the working state. +This flag is checked by the PM core, but the PCI bus type informs the PM core +which devices may be left in suspend from its perspective (that happens during +the "noirq" phase of system-wide suspend and analogous transitions) and next it +uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +pci_pm_resume_noirq() early, as the PM core will skip the remaining resume +callbacks for the device during the transition under way and will set its +runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for +it. + 3.2. Device Runtime Power Management In addition to providing device power management callbacks PCI device drivers -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 6/6] PM / core: DPM_FLAG_SMART_SUSPEND optimization
From: Rafael J. Wysocki Make the PM core avoid invoking the "late" and "noirq" system-wide suspend (or analogous) callbacks for devices that are in runtime suspend during the corresponding phases of system-wide suspend (or analogous) transitions. The underlying observation is that runtime PM is disabled for devices during those system-wide suspend phases, so their runtime PM status should not change going forward and if it has not changed so far, their state should be compatible with the target system sleep state. This change really makes it possible for, say, platform device drivers to re-use runtime PM suspend and resume callbacks by pointing ->suspend_late and ->resume_early, respectively (and possibly the analogous hibernation-related callback pointers too), to them without adding any extra "is the device already suspended?" type of checks to the callback routines, as long as they will be invoked directly by the core. Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/pm/devices.rst | 18 + drivers/base/power/main.c | 62 2 files changed, 66 insertions(+), 14 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -536,6 +536,24 @@ static pm_callback_t dpm_subsys_suspend_ /*- Resume routines -*/ /** + * suspend_event - Return a "suspend" message for given "resume" one. + * @resume_msg: PM message representing a system-wide resume transition. + */ +static pm_message_t suspend_event(pm_message_t resume_msg) +{ + switch (resume_msg.event) { + case PM_EVENT_RESUME: + return PMSG_SUSPEND; + case PM_EVENT_THAW: + case PM_EVENT_RESTORE: + return PMSG_FREEZE; + case PM_EVENT_RECOVER: + return PMSG_HIBERNATE; + } + return PMSG_ON; +} + +/** * dev_pm_may_skip_resume - System-wide device resume optimization check. * @dev: Target device. * @@ -609,6 +627,25 @@ static int device_resume_noirq(struct de if (callback) goto Run; + if (dev_pm_smart_suspend_and_suspended(dev)) { + pm_message_t suspend_msg = suspend_event(state); + + /* +* If "freeze" callbacks have been skipped during a transition +* related to hibernation, the subsequent "thaw" callbacks must +* be skipped too or bad things may happen. Otherwise, if the +* device is to be resumed, its runtime PM status must be +* changed to reflect the new configuration. +*/ + if (!dpm_subsys_suspend_late_cb(dev, suspend_msg, NULL) && + !dpm_subsys_suspend_noirq_cb(dev, suspend_msg, NULL)) { + if (state.event == PM_EVENT_THAW) + skip_resume = true; + else if (!skip_resume) + pm_runtime_set_active(dev); + } + } + if (skip_resume) goto Skip; @@ -1228,7 +1265,10 @@ static int __device_suspend_noirq(struct if (callback) goto Run; - direct_cb = true; + direct_cb = !dpm_subsys_suspend_late_cb(dev, state, NULL); + + if (dev_pm_smart_suspend_and_suspended(dev) && direct_cb) + goto Skip; if (dev->driver && dev->driver->pm) { info = "noirq driver "; @@ -1242,6 +1282,7 @@ Run: goto Complete; } +Skip: dev->power.is_noirq_suspended = true; if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { @@ -1249,7 +1290,6 @@ Run: bool skip_resume; if (direct_cb && - !dpm_subsys_suspend_late_cb(dev, state, NULL) && !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) { /* @@ -1445,17 +1485,27 @@ static int __device_suspend_late(struct goto Complete; callback = dpm_subsys_suspend_late_cb(dev, state, &info); + if (callback) + goto Run; - if (!callback && dev->driver && dev->driver->pm) { + if (dev_pm_smart_suspend_and_suspended(dev) && + !dpm_subsys_suspend_noirq_cb(dev, state, NULL)) + goto Skip; + + if (dev->driver && dev->driver->pm) { info = "late driver "; callback = pm_late_early_op(dev->driver->pm, state); } +Run: error = dpm
[PATCH v2 5/6] PM / core: Direct handling of DPM_FLAG_LEAVE_SUSPENDED
From: Rafael J. Wysocki Make the PM core handle DPM_FLAG_LEAVE_SUSPENDED directly for devices whose "noirq", "late" and "early" driver callbacks are invoked directly by it. Namely, make it skip all of the system-wide resume callbacks for such devices with DPM_FLAG_LEAVE_SUSPENDED set if they are in runtime suspend during the "noirq" phase of system-wide suspend (or analogous) transitions or the system transition under way is a proper suspend (rather than anything related to hibernation) and the device's wakeup settings are compatible with runtime PM (that is, the device cannot generate wakeup signals at all or it is allowed to wake up the system from sleep). Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/pm/devices.rst |9 ++ drivers/base/power/main.c | 47 2 files changed, 51 insertions(+), 5 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -589,6 +589,7 @@ static int device_resume_noirq(struct de { pm_callback_t callback; const char *info; + bool skip_resume; int error = 0; TRACE_DEVICE(dev); @@ -602,23 +603,33 @@ static int device_resume_noirq(struct de dpm_wait_for_superior(dev, async); + skip_resume = dev_pm_may_skip_resume(dev); + callback = dpm_subsys_resume_noirq_cb(dev, state, &info); + if (callback) + goto Run; + + if (skip_resume) + goto Skip; if (!callback && dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); + +Skip: dev->power.is_noirq_suspended = false; - if (dev_pm_may_skip_resume(dev)) { + if (skip_resume) { pm_runtime_set_suspended(dev); dev->power.is_late_suspended = false; dev->power.is_suspended = false; } - Out: +Out: complete_all(&dev->power.completion); TRACE_RESUME(error); return error; @@ -1194,6 +1205,7 @@ static int __device_suspend_noirq(struct { pm_callback_t callback; const char *info; + bool direct_cb = false; int error = 0; TRACE_DEVICE(dev); @@ -1213,12 +1225,17 @@ static int __device_suspend_noirq(struct goto Complete; callback = dpm_subsys_suspend_noirq_cb(dev, state, &info); + if (callback) + goto Run; - if (!callback && dev->driver && dev->driver->pm) { + direct_cb = true; + + if (dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); if (error) { async_error = error; @@ -1228,13 +1245,33 @@ static int __device_suspend_noirq(struct dev->power.is_noirq_suspended = true; if (dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED)) { + pm_message_t resume_msg = resume_event(state); + bool skip_resume; + + if (direct_cb && + !dpm_subsys_suspend_late_cb(dev, state, NULL) && + !dpm_subsys_resume_early_cb(dev, resume_msg, NULL) && + !dpm_subsys_resume_noirq_cb(dev, resume_msg, NULL)) { + /* +* If all of the device driver's "noirq", "late" and +* "early" callbacks are invoked directly by the core, +* the decision to allow the device to stay in suspend +* can be based on its current runtime PM status and its +* wakeup settings. +*/ + skip_resume = pm_runtime_status_suspended(dev) || + (resume_msg.event == PM_EVENT_RESUME && +(!device_can_wakeup(dev) || + device_may_wakeup(dev))); + } else { + skip_resume = dev->power.may_skip_resume; + } /* * The only safe strategy here is to require that if the device * may not be left in suspend, resume callbacks must be invoked * for it. */ - dev->power.must_resume = dev->power.must_resume || -
[PATCH v2 1/6] PM / core: Add LEAVE_SUSPENDED driver flag
From: Rafael J. Wysocki Define and document a new driver flag, DPM_FLAG_LEAVE_SUSPENDED, to instruct the PM core and middle-layer (bus type, PM domain, etc.) code that it is desirable to leave the device in runtime suspend after system-wide transitions to the working state (for example, the device may be slow to resume and it may be better to avoid resuming it right away). Generally, the middle-layer code involved in the handling of the device is expected to indicate to the PM core whether or not the device may be left in suspend with the help of the device's power.may_skip_resume status bit. That has to happen in the "noirq" phase of the preceding system suspend (or analogous) transition. The middle layer is then responsible for handling the device as appropriate in its "noirq" resume callback which is executed regardless of whether or not the device may be left suspended, but the other resume callbacks (except for ->complete) will be skipped automatically by the core if the device really can be left in suspend. The additional power.must_resume status bit introduced for the implementation of this mechanisn is used internally by the PM core to track the requirement to resume the device (which may depend on its children etc). Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- Documentation/driver-api/pm/devices.rst | 24 ++- drivers/base/power/main.c | 65 +--- include/linux/pm.h | 14 +- 3 files changed, 93 insertions(+), 10 deletions(-) Index: linux-pm/include/linux/pm.h === --- linux-pm.orig/include/linux/pm.h +++ linux-pm/include/linux/pm.h @@ -559,6 +559,7 @@ struct pm_subsys_data { * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. + * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -572,10 +573,14 @@ struct pm_subsys_data { * necessary from the driver's perspective. It also may cause them to skip * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by * the driver if they decide to leave the device in runtime suspend. + * + * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the + * driver prefers the device to be left in runtime suspend after system resume. */ -#define DPM_FLAG_NEVER_SKIPBIT(0) -#define DPM_FLAG_SMART_PREPARE BIT(1) -#define DPM_FLAG_SMART_SUSPEND BIT(2) +#define DPM_FLAG_NEVER_SKIPBIT(0) +#define DPM_FLAG_SMART_PREPARE BIT(1) +#define DPM_FLAG_SMART_SUSPEND BIT(2) +#define DPM_FLAG_LEAVE_SUSPENDED BIT(3) struct dev_pm_info { pm_message_tpower_state; @@ -597,6 +602,8 @@ struct dev_pm_info { boolwakeup_path:1; boolsyscore:1; boolno_pm_callbacks:1; /* Owned by the PM core */ + unsigned intmust_resume:1; /* Owned by the PM core */ + unsigned intmay_skip_resume:1; /* Set by subsystems */ #else unsigned intshould_wakeup:1; #endif @@ -765,6 +772,7 @@ extern int pm_generic_poweroff_late(stru extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); +extern bool dev_pm_may_skip_resume(struct device *dev); extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); #else /* !CONFIG_PM_SLEEP */ Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -528,6 +528,18 @@ static void dpm_watchdog_clear(struct dp /*- Resume routines -*/ /** + * dev_pm_may_skip_resume - System-wide device resume optimization check. + * @dev: Target device. + * + * Checks whether or not the device may be left in suspend after a system-wide + * transition to the working state. + */ +bool dev_pm_may_skip_resume(struct device *dev) +{ + return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; +} + +/** * device_resume_noirq - Execute a "noirq resume" callback for given device. * @dev: Device to handle. * @state: PM transition of the system being carried out. @@ -575,6 +587,12 @@ static int device_resume_noirq(struct de error = dpm_run_callback(callback, dev, state, info); dev->power.is_noirq_suspended = false; + if (dev_pm_may_skip_resume(dev
[PATCH v2 0/6] PM / sleep: Driver flags for system suspend/resume (part 2)
Hi All, This is a follow-up for the first part of the PM driver flags series sent previously some time ago with an intro as follows: On Saturday, October 28, 2017 12:11:55 AM CET Rafael J. Wysocki wrote: > The following part of the original cover letter still applies: > > On Monday, October 16, 2017 3:12:35 AM CEST Rafael J. Wysocki wrote: > > > > This work was triggered by attempts to fix and optimize PM in the > > i2c-designware-platdev driver that ended up with adding a couple of > > flags to the driver's internal data structures for the tracking of > > device state (https://marc.info/?l=linux-acpi&m=150629646805636&w=2). > > That approach is sort of suboptimal, though, because other drivers will > > probably want to do similar things and if all of them need to use internal > > flags for that, quite a bit of code duplication may ensue at least. > > > > That can be avoided in a couple of ways and one of them is to provide a > > means > > for drivers to tell the core what to do and to make the core take care of it > > if told to do so. Hence, the idea to use driver flags for system-wide PM > > that was briefly discussed during the LPC in LA last month. > > [...] > > > What can work (and this is the only strategy that can work AFAICS) is to > > point different callback pointers *in* *a* *driver* to the same routine > > if the driver wants to reuse that code. That actually will work for PCI > > and USB drivers today, at least most of the time, but unfortunately there > > are problems with it for, say, platform devices. > > > > The first problem is the requirement to track the status of the device > > (suspended vs not suspended) in the callbacks, because the system-wide PM > > code in the PM core doesn't do that. The runtime PM framework does it, so > > this means adding some extra code which isn't necessary for runtime PM to > > the callback routines and that is not particularly nice. > > > > The second problem is that, if the driver wants to do anything in its > > ->suspend callback, it generally has to prevent runtime suspend of the > > device from taking place in parallel with that, which is quite cumbersome. > > Usually, that is taken care of by resuming the device from runtime suspend > > upfront, but generally doing that is wasteful (there may be no real need to > > resume the device except for the fact that the code is designed this way). > > > > On top of the above, there are optimizations to be made, like leaving > > certain > > devices in suspend after system resume to avoid wasting time on waiting for > > them to resume before user space can run again and similar. > > > > This patch series focuses on addressing those problems so as to make it > > easier to reuse callback routines by pointing different callback pointers > > to them in device drivers. The flags introduced here are to instruct the > > PM core and middle layers (whatever they are) on how the driver wants the > > device to be handled and then the driver has to provide callbacks to match > > these instructions and the rest should be taken care of by the code above > > it. > > > > The flags are introduced one by one to avoid making too many changes in > > one go and to allow things to be explained better (hopefully). They mostly > > are mutually independent with some clearly documented exceptions. > > but I had to rework the core patches to address the problem pointed with the > generic power domains (genpd) framework pointed out by Ulf. > > Namely, genpd expects its "noirq" callbacks to be invoked for devices in > runtime suspend too and it has valid reasons for that, so its "noirq" > callbacks can never be skipped, even for devices with the SMART_SUSPEND > flag set. For this reason, the logic related to DPM_FLAG_SMART_SUSPEND > had to be moved from the core to the PCI bus type and the ACPI PM domain > which are mostly affected by it anyway. The code after the changes looks > more straightforward to me, but it generally is more code and some patterns > had to be repeated in a few places. I promised to send the rest of the series then: > I will send the core patches for the remaining two flags introduced by the > original series separately and the intel-lpss and i2c-designware ones will > be posted when the core patches have been reviewed and agreed on. and here it goes. It actually only adds support for one additional flag, namely for DPM_FLAG_LEAVE_SUSPENDED, to the PM core (basic bits), PCI bus type and the ACPI PM domain. That part of the series (patches [1-3/6]) is rather straightforward and, as PCI and
[PATCH v2 4/6] PM / core: Add helpers for subsystem callback selection
From: Rafael J. Wysocki Add helper routines to find and return a suitable subsystem callback during the "noirq" phases of system suspend/resume (or analogous) transitions as well as during the "late" phase of system suspend and the "early" phase of system resume (or analogous) transitions. The helpers will be called from additional sites going forward. Signed-off-by: Rafael J. Wysocki --- drivers/base/power/main.c | 196 +++--- 1 file changed, 136 insertions(+), 60 deletions(-) Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -525,6 +525,14 @@ static void dpm_watchdog_clear(struct dp #define dpm_watchdog_clear(x) #endif +static pm_callback_t dpm_subsys_suspend_noirq_cb(struct device *dev, +pm_message_t state, +const char **info_p); + +static pm_callback_t dpm_subsys_suspend_late_cb(struct device *dev, + pm_message_t state, + const char **info_p); + /*- Resume routines -*/ /** @@ -539,6 +547,35 @@ bool dev_pm_may_skip_resume(struct devic return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; } +static pm_callback_t dpm_subsys_resume_noirq_cb(struct device *dev, + pm_message_t state, + const char **info_p) +{ + pm_callback_t callback; + const char *info; + + if (dev->pm_domain) { + info = "noirq power domain "; + callback = pm_noirq_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "noirq type "; + callback = pm_noirq_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "noirq class "; + callback = pm_noirq_op(dev->class->pm, state); + } else if (dev->bus && dev->bus->pm) { + info = "noirq bus "; + callback = pm_noirq_op(dev->bus->pm, state); + } else { + return NULL; + } + + if (info_p) + *info_p = info; + + return callback; +} + /** * device_resume_noirq - Execute a "noirq resume" callback for given device. * @dev: Device to handle. @@ -550,8 +587,8 @@ bool dev_pm_may_skip_resume(struct devic */ static int device_resume_noirq(struct device *dev, pm_message_t state, bool async) { - pm_callback_t callback = NULL; - const char *info = NULL; + pm_callback_t callback; + const char *info; int error = 0; TRACE_DEVICE(dev); @@ -565,19 +602,7 @@ static int device_resume_noirq(struct de dpm_wait_for_superior(dev, async); - if (dev->pm_domain) { - info = "noirq power domain "; - callback = pm_noirq_op(&dev->pm_domain->ops, state); - } else if (dev->type && dev->type->pm) { - info = "noirq type "; - callback = pm_noirq_op(dev->type->pm, state); - } else if (dev->class && dev->class->pm) { - info = "noirq class "; - callback = pm_noirq_op(dev->class->pm, state); - } else if (dev->bus && dev->bus->pm) { - info = "noirq bus "; - callback = pm_noirq_op(dev->bus->pm, state); - } + callback = dpm_subsys_resume_noirq_cb(dev, state, &info); if (!callback && dev->driver && dev->driver->pm) { info = "noirq driver "; @@ -686,6 +711,35 @@ void dpm_resume_noirq(pm_message_t state dpm_noirq_end(); } +static pm_callback_t dpm_subsys_resume_early_cb(struct device *dev, + pm_message_t state, + const char **info_p) +{ + pm_callback_t callback; + const char *info; + + if (dev->pm_domain) { + info = "early power domain "; + callback = pm_late_early_op(&dev->pm_domain->ops, state); + } else if (dev->type && dev->type->pm) { + info = "early type "; + callback = pm_late_early_op(dev->type->pm, state); + } else if (dev->class && dev->class->pm) { + info = "early class "; + ca
Re: [PATCH v2 3/6] PM / core: Add SMART_SUSPEND driver flag
On Mon, Nov 6, 2017 at 9:09 AM, Ulf Hansson wrote: > On 28 October 2017 at 00:22, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki >> >> Define and document a SMART_SUSPEND flag to instruct bus types and PM >> domains that the system suspend callbacks provided by the driver can >> cope with runtime-suspended devices, so from the driver's perspective >> it should be safe to leave devices in runtime suspend during system >> suspend. >> >> Setting that flag may also cause middle-layer code (bus types, >> PM domains etc.) to skip invocations of the ->suspend_late and >> ->suspend_noirq callbacks provided by the driver if the device >> is in runtime suspend at the beginning of the "late" phase of >> the system-wide suspend transition, in which case the driver's >> system-wide resume callbacks may be invoked back-to-back with >> its ->runtime_suspend callback, so the driver has to be able to >> cope with that too. >> >> Signed-off-by: Rafael J. Wysocki >> Acked-by: Greg Kroah-Hartman > > If not too late: > > Reviewed-by: Ulf Hansson Not too late, thanks! I'll be sending the next batch of this shortly. :-) -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/12] PM / mfd: intel-lpss: Use DPM_FLAG_SMART_SUSPEND
On Wed, Nov 1, 2017 at 10:28 AM, Lee Jones wrote: > On Tue, 31 Oct 2017, Rafael J. Wysocki wrote: > >> On Tue, Oct 31, 2017 at 4:09 PM, Lee Jones wrote: >> > On Mon, 16 Oct 2017, Rafael J. Wysocki wrote: >> > >> >> From: Rafael J. Wysocki >> >> >> >> Make the intel-lpss driver set DPM_FLAG_SMART_SUSPEND for its >> >> devices which will allow them to stay in runtime suspend during >> >> system suspend unless they need to be reconfigured for some reason. >> >> >> >> Also make it avoid resuming its child devices if they have >> >> DPM_FLAG_SMART_SUSPEND set to allow them to remain in runtime >> >> suspend during system suspend. >> >> >> >> Signed-off-by: Rafael J. Wysocki >> >> --- >> >> drivers/mfd/intel-lpss.c |6 +- >> >> 1 file changed, 5 insertions(+), 1 deletion(-) >> > >> > Is this patch independent? >> >> It depends on the flag definition at least, but functionally it also >> depends on the PCI support for the flag. > > No problem. Which tree to you propose this goes through? linux-pm.git if that's not a problem as the patches it depends on will go through it too. That said I'll resend it when the core patches it depends on are ready. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/12] PM / mfd: intel-lpss: Use DPM_FLAG_SMART_SUSPEND
On Tue, Oct 31, 2017 at 4:09 PM, Lee Jones wrote: > On Mon, 16 Oct 2017, Rafael J. Wysocki wrote: > >> From: Rafael J. Wysocki >> >> Make the intel-lpss driver set DPM_FLAG_SMART_SUSPEND for its >> devices which will allow them to stay in runtime suspend during >> system suspend unless they need to be reconfigured for some reason. >> >> Also make it avoid resuming its child devices if they have >> DPM_FLAG_SMART_SUSPEND set to allow them to remain in runtime >> suspend during system suspend. >> >> Signed-off-by: Rafael J. Wysocki >> --- >> drivers/mfd/intel-lpss.c |6 +- >> 1 file changed, 5 insertions(+), 1 deletion(-) > > Is this patch independent? It depends on the flag definition at least, but functionally it also depends on the PCI support for the flag. > For my own reference: > Acked-for-MFD-by: Lee Jones Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/6] PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags
From: Rafael J. Wysocki The motivation for this change is to provide a way to work around a problem with the direct-complete mechanism used for avoiding system suspend/resume handling for devices in runtime suspend. The problem is that some middle layer code (the PCI bus type and the ACPI PM domain in particular) returns positive values from its system suspend ->prepare callbacks regardless of whether the driver's ->prepare returns a positive value or 0, which effectively prevents drivers from being able to control the direct-complete feature. Some drivers need that control, however, and the PCI bus type has grown its own flag to deal with this issue, but since it is not limited to PCI, it is better to address it by adding driver flags at the core level. To that end, add a driver_flags field to struct dev_pm_info for flags that can be set by device drivers at the probe time to inform the PM core and/or bus types, PM domains and so on on the capabilities and/or preferences of device drivers. Also add two static inline helpers for setting that field and testing it against a given set of flags and make the driver core clear it automatically on driver remove and probe failures. Define and document two PM driver flags related to the direct- complete feature: NEVER_SKIP and SMART_PREPARE that can be used, respectively, to indicate to the PM core that the direct-complete mechanism should never be used for the device and to inform the middle layer code (bus types, PM domains etc) that it can only request the PM core to use the direct-complete mechanism for the device (by returning a positive value from its ->prepare callback) if it also has been requested by the driver. While at it, make the core check pm_runtime_suspended() when setting power.direct_complete so that it doesn't need to be checked by ->prepare callbacks. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman Acked-by: Bjorn Helgaas --- -> v2: Do not use pm_generic_prepare() in acpi_subsys_prepare() as the latter has to distinguish between the lack of the driver's ->prepare callback and the situation in which that callback has returned 0 if DPM_FLAG_SMART_PREPARE is set. --- Documentation/driver-api/pm/devices.rst | 14 ++ Documentation/power/pci.txt | 19 +++ drivers/acpi/device_pm.c| 13 + drivers/base/dd.c |2 ++ drivers/base/power/main.c |4 +++- drivers/pci/pci-driver.c|5 - include/linux/device.h | 10 ++ include/linux/pm.h | 20 8 files changed, 81 insertions(+), 6 deletions(-) Index: linux-pm/include/linux/device.h === --- linux-pm.orig/include/linux/device.h +++ linux-pm/include/linux/device.h @@ -1070,6 +1070,16 @@ static inline void dev_pm_syscore_device #endif } +static inline void dev_pm_set_driver_flags(struct device *dev, u32 flags) +{ + dev->power.driver_flags = flags; +} + +static inline bool dev_pm_test_driver_flags(struct device *dev, u32 flags) +{ + return !!(dev->power.driver_flags & flags); +} + static inline void device_lock(struct device *dev) { mutex_lock(&dev->mutex); Index: linux-pm/include/linux/pm.h === --- linux-pm.orig/include/linux/pm.h +++ linux-pm/include/linux/pm.h @@ -550,6 +550,25 @@ struct pm_subsys_data { #endif }; +/* + * Driver flags to control system suspend/resume behavior. + * + * These flags can be set by device drivers at the probe time. They need not be + * cleared by the drivers as the driver core will take care of that. + * + * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. + * SMART_PREPARE: Check the return value of the driver's ->prepare callback. + * + * Setting SMART_PREPARE instructs bus types and PM domains which may want + * system suspend/resume callbacks to be skipped for the device to return 0 from + * their ->prepare callbacks if the driver's ->prepare callback returns 0 (in + * other words, the system suspend/resume callbacks can only be skipped for the + * device if its driver doesn't object against that). This flag has no effect + * if NEVER_SKIP is set. + */ +#define DPM_FLAG_NEVER_SKIPBIT(0) +#define DPM_FLAG_SMART_PREPARE BIT(1) + struct dev_pm_info { pm_message_tpower_state; unsigned intcan_wakeup:1; @@ -561,6 +580,7 @@ struct dev_pm_info { boolis_late_suspended:1; boolearly_init:1; /* Owned by the PM core */ booldirect_complete:1; /* Owned by the PM core */ + u32 driver_flags; spinlock_t
[PATCH v2 3/6] PM / core: Add SMART_SUSPEND driver flag
From: Rafael J. Wysocki Define and document a SMART_SUSPEND flag to instruct bus types and PM domains that the system suspend callbacks provided by the driver can cope with runtime-suspended devices, so from the driver's perspective it should be safe to leave devices in runtime suspend during system suspend. Setting that flag may also cause middle-layer code (bus types, PM domains etc.) to skip invocations of the ->suspend_late and ->suspend_noirq callbacks provided by the driver if the device is in runtime suspend at the beginning of the "late" phase of the system-wide suspend transition, in which case the driver's system-wide resume callbacks may be invoked back-to-back with its ->runtime_suspend callback, so the driver has to be able to cope with that too. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman --- -> v2: Drop the changes in main.c, as the logic implemented by them previously is now going to be implemented in the PCI bus type and the ACPI PM domain directly. --- Documentation/driver-api/pm/devices.rst | 20 drivers/base/power/main.c |3 +++ include/linux/pm.h |8 3 files changed, 31 insertions(+) Index: linux-pm/Documentation/driver-api/pm/devices.rst === --- linux-pm.orig/Documentation/driver-api/pm/devices.rst +++ linux-pm/Documentation/driver-api/pm/devices.rst @@ -766,6 +766,26 @@ the state of devices (possibly except fo from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* invoking device drivers' ``->suspend`` callbacks (or equivalent). +Some bus types and PM domains have a policy to resume all devices from runtime +suspend upfront in their ``->suspend`` callbacks, but that may not be really +necessary if the driver of the device can cope with runtime-suspended devices. +The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in +:c:member:`power.driver_flags` at the probe time, by passing it to the +:c:func:`dev_pm_set_driver_flags` helper. That also may cause middle-layer code +(bus types, PM domains etc.) to skip the ``->suspend_late`` and +``->suspend_noirq`` callbacks provided by the driver if the device remains in +runtime suspend at the beginning of the ``suspend_late`` phase of system-wide +suspend (or in the ``poweroff_late`` phase of hibernation), when runtime PM +has been disabled for it, under the assumption that its state should not change +after that point until the system-wide transition is over. If that happens, the +driver's system-wide resume callbacks, if present, may still be invoked during +the subsequent system-wide resume transition and the device's runtime power +management status may be set to "active" before enabling runtime PM for it, +so the driver must be prepared to cope with the invocation of its system-wide +resume callbacks back-to-back with its ``->runtime_suspend`` one (without the +intervening ``->runtime_resume`` and so on) and the final state of the device +must reflect the "active" status for runtime PM in that case. + During system-wide resume from a sleep state it's easiest to put devices into the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. Refer to that document for more information regarding this particular issue as Index: linux-pm/drivers/base/power/main.c === --- linux-pm.orig/drivers/base/power/main.c +++ linux-pm/drivers/base/power/main.c @@ -1652,6 +1652,9 @@ static int device_prepare(struct device if (dev->power.syscore) return 0; + WARN_ON(dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND) && + !pm_runtime_enabled(dev)); + /* * If a device's parent goes into runtime suspend at the wrong time, * it won't be possible to resume the device. To prevent this we Index: linux-pm/include/linux/pm.h === --- linux-pm.orig/include/linux/pm.h +++ linux-pm/include/linux/pm.h @@ -558,6 +558,7 @@ struct pm_subsys_data { * * NEVER_SKIP: Do not skip system suspend/resume callbacks for the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. + * SMART_SUSPEND: No need to resume the device from runtime suspend. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -565,9 +566,16 @@ struct pm_subsys_data { * other words, the system suspend/resume callbacks can only be skipped for the * device if its driver doesn't object against that). This flag has no effect * if NEVER_SKIP is set. + * + * Setting SMART_SUSPEND ins