On Tue, 2020-12-01 at 02:04 +0800, Kai-Heng Feng wrote: > > On Dec 1, 2020, at 00:19, Srinivas Pandruvada < > > srinivas.pandruv...@linux.intel.com> wrote: > > > > On Mon, 2020-11-30 at 16:23 +0800, Kai-Heng Feng wrote: > > > > On Nov 30, 2020, at 15:57, Daniel Lezcano < > > > > daniel.lezc...@linaro.org> wrote: > > > > > > > > > > > > [Added Srinivas] > > > > > > > > On 28/11/2020 18:54, Kai-Heng Feng wrote: > > > > > We are seeing thermal shutdown on Intel based mobile > > > > > workstations, the > > > > > shutdown happens during the first trip handle in > > > > > thermal_zone_device_register(): > > > > > kernel: thermal thermal_zone15: critical temperature reached > > > > > (101 > > > > > C), shutting down > > > > > > > > > > However, we shouldn't do a thermal shutdown here, since > > > > > 1) We may want to use a dedicated daemon, Intel's thermald in > > > > > this case, > > > > > to handle thermal shutdown. > > > > > > > > > > 2) For ACPI based system, _CRT doesn't mean shutdown unless > > > > > it's > > > > > inside > > > > > ThermalZone. ACPI Spec, 11.4.4 _CRT (Critical Temperature): > > > > > "... If this object it present under a device, the device’s > > > > > driver > > > > > evaluates this object to determine the device’s critical > > > > > cooling > > > > > temperature trip point. This value may then be used by the > > > > > device’s > > > > > driver to program an internal device temperature sensor trip > > > > > point." > > > > > > > > > > So a "critical trip" here merely means we should take a more > > > > > aggressive > > > > > cooling method. > > > > > > > > Well, actually it is stated before: > > > > > > > > "This object, when defined under a thermal zone, returns the > > > > critical > > > > temperature at which OSPM must shutdown the system". > > > > > > This means specifically for the ACPI ThermalZone in AML, e.g.: > > > > > > ThermalZone (TZ0) { > > > .... > > > Method(_CRT) { ... } > > > } // end of TZ0 > > > > > > However the device is not under any ACPI ThermalZone. > > > > > > > That is what does the thermal subsystem, no ? > > > > > > > > > So add an indication to let thermal core know it should leave > > > > > thermal > > > > > device to userspace to handle. > > > > > > > > You may want to check the 'HOT' trip point and then use the > > > > notification > > > > mechanism to get notified in userspace and take action from > > > > there > > > > (eg. > > > > offline some CPUs). > > > > > > For this particular issue we are facing, the thermal shutdown > > > happens > > > in thermal_zone_device_register() and userspace isn't up yet. > > > > What about creating an new callback > > > > enum thermal_trip_status { > > THERMAL_TRIP_DISABLED = 0, > > THERMAL_TRIP_ENABLED, > > }; > > > > int get_trip_status(struct thermal_zone_device *, int trip, enum > > thermal_trip_status *state); > > > > Then in > > static void handle_thermal_trip(struct thermal_zone_device *tz, int > > trip) > > { > > > > /* before tz->ops->get_trip_temp(tz, trip, &trip_temp); */ > > if (tz->ops->get_trip_status) { > > enum thermal_trip_status *status; > > > > if (!tz->ops->get_trip_status(tz, trip, &status)) { > > if (status == THERMAL_TRIP_DISABLED) > > return; > > } > > } > > ... > > ... > > > > } > > > > > > This callback will help the cases: > > - Allows drivers to selectively disable certain trips during init > > state > > or system resume where there can be spikes or always. int340x > > drivers > > can disable always. > > This sounds really great. This is indeed can happen on system resume, > before userspace process thaw. > > > - Still give options for drivers to handle critical trip even if > > they > > are bound to user space governors. User space process may be dead, > > so > > still allow kernel to process graceful shutdown > > To make the scenario happen, do we need a new sysfs to let usespace > enable it with THERMAL_TRIP_ENABLED? This should be drivers call not user space.
Thanks, Srinivas > > Kai-Heng > > > Thanks, > > Srinivas > > > > > Kai-Heng > > > > > > > > Signed-off-by: Kai-Heng Feng <kai.heng.f...@canonical.com> > > > > > --- > > > > > drivers/thermal/thermal_core.c | 3 +++ > > > > > include/linux/thermal.h | 2 ++ > > > > > 2 files changed, 5 insertions(+) > > > > > > > > > > diff --git a/drivers/thermal/thermal_core.c > > > > > b/drivers/thermal/thermal_core.c > > > > > index c6d74bc1c90b..6561e3767529 100644 > > > > > --- a/drivers/thermal/thermal_core.c > > > > > +++ b/drivers/thermal/thermal_core.c > > > > > @@ -1477,6 +1477,9 @@ thermal_zone_device_register(const char > > > > > *type, int trips, int mask, > > > > > goto unregister; > > > > > } > > > > > > > > > > + if (tz->tzp && tz->tzp->userspace) > > > > > + thermal_zone_device_disable(tz); > > > > > + > > > > > mutex_lock(&thermal_list_lock); > > > > > list_add_tail(&tz->node, &thermal_tz_list); > > > > > mutex_unlock(&thermal_list_lock); > > > > > diff --git a/include/linux/thermal.h > > > > > b/include/linux/thermal.h > > > > > index d07ea27e72a9..e8e8fac78fc8 100644 > > > > > --- a/include/linux/thermal.h > > > > > +++ b/include/linux/thermal.h > > > > > @@ -247,6 +247,8 @@ struct thermal_zone_params { > > > > > */ > > > > > bool no_hwmon; > > > > > > > > > > + bool userspace; > > > > > + > > > > > int num_tbps; /* Number of tbp entries */ > > > > > struct thermal_bind_params *tbp; > > > > > > > > > > > > > > > > > > -- > > > > <http://www.linaro.org/> Linaro.org │ Open source software for > > > > ARM > > > > SoCs > > > > > > > > Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook > > > > | > > > > <http://twitter.com/#!/linaroorg> Twitter | > > > > <http://www.linaro.org/linaro-blog/> Blog