Re: [v8 0/4] cgroup-aware OOM killer
On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> On Mon 11-09-17 13:44:39, David Rientjes wrote:
> > On Mon, 11 Sep 2017, Roman Gushchin wrote:
> >
> > > This patchset makes the OOM killer cgroup-aware.
> > >
> > > v8:
> > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > >   - Make the whole thing opt-in with cgroup mount option control
> > >   - Drop oom_priority for further discussions
> >
> > Nack, we specifically require oom_priority for this to function correctly,
> > otherwise we cannot prefer to kill from low priority leaf memcgs as
> > required.
>
> While I understand that your usecase might require priorities I do not
> think this part missing is a reason to nack the cgroup based selection
> and kill-all parts. This can be done on top. The only important part
> right now is the current selection semantic - only leaf memcgs vs. size
> of the hierarchy).

I agree.

> I strongly believe that comparing only leaf memcgs
> is more straightforward and it doesn't lead to unexpected results as
> mentioned before (kill a small memcg which is a part of the larger
> sub-hierarchy).

One of two main goals of this patchset is to introduce cgroup-level
fairness: bigger cgroups should be affected more than smaller,
despite the size of tasks inside. I believe the same principle
should be used for cgroups.

Also, the opposite will make oom_semantics more weird: it will mean
kill all tasks, but also treat memcg as a leaf cgroup.

>
> I didn't get to read the new version of this series yet and hope to get
> to it soon.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/3] led: ledtrig-transient: add support for hrtimer
On Wed 2017-09-13 14:20:58, David Lin wrote:
> On Wed, Sep 13, 2017 at 1:20 PM, Pavel Machek wrote:
> >
> > Hi!
> >
> > > These patch series add the LED_BRIGHTNESS_FAST flag support for
> > > ledtrig-transient to use hrtimer so that platforms with high-resolution timer
> > > support can have better accuracy in the trigger duration timing. The need for
> > > this support is driven by the fact that Android has removed the timed_output [1]
> > > and is now using led-trigger for handling vibrator control which requires the
> > > timer to be accurate up to a millisecond. However, this flag support would also
> > > allow hrtimer to co-exist with the ktimer without causing warning to the
> > > existing drivers [2].
> >
> > NAK.
> >
> > LEDs do not need extra overhead, and vibrator control should not go
> > through LED subsystem.
> >
> > Input subsystem includes support for vibrations and force
> > feedback. Please use that instead.
>
> I thought we are already over this discussion. As of now, the support
> of vibrator through ledtrig is documented
> (Documentation/leds/ledtrig-transient.txt) and there are users using
> it.

I also thought we are over with that discussion. Yes, I'm working on
fixing the docs.

What mainline users are doing that?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH v2 0/3] led: ledtrig-transient: add support for hrtimer
On Wed, Sep 13, 2017 at 1:20 PM, Pavel Machek wrote:
>
> Hi!
>
> > These patch series add the LED_BRIGHTNESS_FAST flag support for
> > ledtrig-transient to use hrtimer so that platforms with high-resolution timer
> > support can have better accuracy in the trigger duration timing. The need for
> > this support is driven by the fact that Android has removed the timed_output [1]
> > and is now using led-trigger for handling vibrator control which requires the
> > timer to be accurate up to a millisecond. However, this flag support would also
> > allow hrtimer to co-exist with the ktimer without causing warning to the
> > existing drivers [2].
>
> NAK.
>
> LEDs do not need extra overhead, and vibrator control should not go
> through LED subsystem.
>
> Input subsystem includes support for vibrations and force
> feedback. Please use that instead.

I thought we are already over this discussion. As of now, the support
of vibrator through ledtrig is documented
(Documentation/leds/ledtrig-transient.txt) and there are users using
it.
Re: [v8 2/4] mm, oom: cgroup-aware OOM killer
On Mon, 11 Sep 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 15af3da5af02..da2b12ea4667 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2661,6 +2661,231 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	/*
> +	 * We don't have necessary stats for the root memcg,
> +	 * so we define it's oom_score as the maximum oom_score
> +	 * of the belonging tasks.
> +	 */
> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score, max_score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			score = oom_badness(task, memcg, nodemask,
> +					    totalpages);
> +			if (max_score > score)

score > max_score

> +				max_score = score;
> +		}
> +		css_task_iter_end(&it);
> +
> +		return max_score;
> +	}
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}
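The comparison flagged in the review above can be checked in isolation. What follows is only a userspace sketch: it assumes a plain array of scores in place of the css task iterator, and the name max_task_score() is hypothetical, not something from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the root-memcg branch of memcg_oom_badness():
 * return the largest score in the array. The result is 0 when the array
 * is empty or when no score is greater than 0, because max_score starts
 * at 0 just like in the quoted hunk.
 */
long max_task_score(const long *scores, size_t n)
{
	long max_score = 0;
	size_t i;

	assert(scores != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* update only when the new score is the larger one */
		if (scores[i] > max_score)
			max_score = scores[i];
	}

	return max_score;
}
```

With the operands the other way round (max_score > score), max_score could never rise above its initial 0 for positive scores, which is presumably why the reviewer suggested swapping them.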
Re: [v8 0/4] cgroup-aware OOM killer
On Wed, 13 Sep 2017, Michal Hocko wrote:

> > > This patchset makes the OOM killer cgroup-aware.
> > >
> > > v8:
> > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > >   - Make the whole thing opt-in with cgroup mount option control
> > >   - Drop oom_priority for further discussions
> >
> > Nack, we specifically require oom_priority for this to function correctly,
> > otherwise we cannot prefer to kill from low priority leaf memcgs as
> > required.
>
> While I understand that your usecase might require priorities I do not
> think this part missing is a reason to nack the cgroup based selection
> and kill-all parts. This can be done on top. The only important part
> right now is the current selection semantic - only leaf memcgs vs. size
> of the hierarchy). I strongly believe that comparing only leaf memcgs
> is more straightforward and it doesn't lead to unexpected results as
> mentioned before (kill a small memcg which is a part of the larger
> sub-hierarchy).
>

The problem is that we cannot enable the cgroup-aware oom killer and
oom_group behavior because, without oom priorities, we have no ability to
influence the cgroup that it chooses. It is doing two things: providing
more fairness amongst cgroups by selecting based on cumulative usage
rather than single large process (good!), and effectively is removing all
userspace control of oom selection (bad). We want the former, but it
needs to be coupled with support so that we can protect vital cgroups,
regardless of their usage.

It is certainly possible to add oom priorities on top before it is merged,
but I don't see why it isn't part of the patchset. We need it before it's
merged to avoid users playing with /proc/pid/oom_score_adj to prevent any
killing in the most preferable memcg when they could have simply changed
the oom priority.
Re: [PATCH v2 0/3] led: ledtrig-transient: add support for hrtimer
Hi!

> These patch series add the LED_BRIGHTNESS_FAST flag support for
> ledtrig-transient to use hrtimer so that platforms with high-resolution timer
> support can have better accuracy in the trigger duration timing. The need for
> this support is driven by the fact that Android has removed the timed_output [1]
> and is now using led-trigger for handling vibrator control which requires the
> timer to be accurate up to a millisecond. However, this flag support would also
> allow hrtimer to co-exist with the ktimer without causing warning to the
> existing drivers [2].

NAK.

LEDs do not need extra overhead, and vibrator control should not go
through LED subsystem.

Input subsystem includes support for vibrations and force
feedback. Please use that instead.

									Pavel

> David Lin (3):
>   leds: Replace flags bit shift with BIT() macros
>   leds: Add the LED_BRIGHTNESS_FAST flag
>   led: ledtrig-transient: add support for hrtimer
>
>  Documentation/leds/leds-class.txt        |  5 +++
>  drivers/leds/trigger/ledtrig-transient.c | 59 +---
>  include/linux/leds.h                     | 19 +-
>  3 files changed, 69 insertions(+), 14 deletions(-)
>

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH 3/3 v11] printk: Add monotonic, boottime, and realtime timestamps
On 09/05/2017 05:06 AM, Prarit Bhargava wrote:
> printk.time=1/CONFIG_PRINTK_TIME=1 adds an unmodified local hardware clock
> timestamp to printk messages. The local hardware clock loses time each
> day making it difficult to determine exactly when an issue has occurred in
> the kernel log, and making it difficult to determine how kernel and
> hardware issues relate to each other in real time.
>
> Make printk output different timestamps by adding options for no
> timestamp, the local hardware clock, the monotonic clock, the boottime
> clock, and the real clock. Allow a user to pick one of the clocks by
> using the printk.time kernel parameter. Output the type of clock in
> /sys/module/printk/parameters/time so userspace programs can interpret the
> timestamp.
>
> v2: Use peterz's suggested Kconfig options. Merge patchset together.
> Fix i386 !CONFIG_PRINTK builds.
>
> v3: Fixed x86_64_defconfig. Added printk_time_type enum and
> printk_time_str for better output. Added BOOTTIME clock functionality.
>
> v4: Fix messages, add additional printk.time options, and fix configs.
>
> v5: Renaming of structures, and allow printk_time_set() to
> evaluate substrings of entries (eg: allow 'r', 'real', 'realtime'). From
> peterz, make fast functions return 0 until timekeeping is initialized
> (removes timekeeping_active & ktime_get_boot|real_log_ts() suggested by
> tglx and adds ktime_get_real_offset()). Switch to a function pointer
> for printk_get_ts() and reference fast functions. Make timestamp_sources
> enum match choice options for CONFIG_PRINTK_TIME (adds
> PRINTK_TIME_UNDEFINED).
>
> v6: Define PRINTK_TIME_UNDEFINED for !CONFIG_PRINTK builds. Separate
> timekeeping changes into separate patch. Minor include file cleanup.
>
> v7: Add default case to printk_set_timestamp() and add PRINTK_TIME_DEBUG
> for users that want to set timestamp to different values during runtime.
> Add jstultz' Kconfig to avoid defconfig churn.
>
> v8: Add CONFIG_PRINTK_TIME_DEBUG to allow timestamp runtime switching.
> Rename PRINTK_TIME_DISABLE to PRINTK_TIME_DISABLED. Rename
> printk_set_timestamp() to printk_set_ts_func(). Separate
> printk_set_ts_func() and printk_get_first_ts() portions. Rename param
> functions. Adjust configs, enum, and timestamp_sources_str to be 0-4.
> Add mention realtime clock is UTC in Documentation.
>
> v9: Fix typo. Add __ktime_get_real_fast_ns_unsafe().
>
> v10: Remove time parameter restrictions.

Ack and unit tested on Backport to 4.9.

(you are missing the v11 respin comment, NBD)

-- Mark
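The v5 note about accepting substrings ('r', 'real', 'realtime') amounts to prefix matching against a table of clock names. A minimal userspace sketch of that idea follows. The table contents and their order are assumptions (the changelog only says the enum runs 0-4), and match_time_source() is a hypothetical name, not the function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed names and order; only the 0-4 range comes from the changelog. */
static const char *const time_sources[] = {
	"disabled", "local", "boottime", "monotonic", "realtime",
};

/*
 * Return the index of the first source whose name starts with arg,
 * or -1 if arg is empty or is not a prefix of any name.
 */
int match_time_source(const char *arg)
{
	size_t len, i;

	assert(arg != NULL);

	len = strlen(arg);
	if (len == 0)
		return -1;

	for (i = 0; i < sizeof(time_sources) / sizeof(time_sources[0]); i++) {
		/* compare only the characters the caller supplied */
		if (strncmp(arg, time_sources[i], len) == 0)
			return (int)i;
	}

	return -1;
}
```

An argument longer than every name (for example "realtimex") is rejected because strncmp() also compares against the terminating NUL of the table entry.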
[PATCH] arm64: fix documentation on kernel pages mappings to HYP VA
The Documentation/arm64/memory.txt says:
When using KVM, the hypervisor maps kernel pages in EL2, at a fixed
offset from the kernel VA (top 24bits of the kernel VA set to zero):

In fact, kernel addresses are translated to HYP with kern_hyp_va macro,
which has more options, and none of them assumes clearing of top 24bits
of the kernel VA.

Signed-off-by: Yury Norov
---
 Documentation/arm64/memory.txt | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/Documentation/arm64/memory.txt b/Documentation/arm64/memory.txt
index d7273a5f6456..c39895d7e3a2 100644
--- a/Documentation/arm64/memory.txt
+++ b/Documentation/arm64/memory.txt
@@ -86,9 +86,12 @@ Translation table lookup with 64KB pages:
  +-> [63] TTBR0/1
 
 
-When using KVM, the hypervisor maps kernel pages in EL2, at a fixed
-offset from the kernel VA (top 24bits of the kernel VA set to zero):
-
-Start			End			Size		Use
-0040	007f	 256GB		kernel objects mapped in HYP
+When using KVM without Virtualization Host Extensions, the hypervisor maps
+kernel pages in EL2, at a fixed offset from the kernel VA. Namely, top 16
+or 25 bits of the kernel VA set to zero depending on ARM64_VA_BITS_48 or
+ARM64_VA_BITS_39 config option selected; or top 17 or 26 bits of the kernel
+VA set to zero if CPU has Reduced HYP mapping offset capability. See
+kern_hyp_va macro.
+
+When using KVM with Virtualization Host Extensions, no additional mappings
+created as host kernel already operates in EL2.
-- 
2.11.0
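The bit counts in the proposed text follow from simple mask arithmetic. The sketch below is illustrative userspace C, not the kernel's kern_hyp_va macro; the function names are hypothetical, and it only assumes what the patch text states: the top (64 - VA_BITS) bits are cleared, or one bit more with the reduced offset.

```c
#include <assert.h>
#include <stdint.h>

/* Mask keeping the low va_bits bits: clears 16 bits for 48, 25 for 39. */
uint64_t hyp_va_high_mask(unsigned int va_bits)
{
	assert(va_bits > 1 && va_bits < 64);
	return (UINT64_C(1) << va_bits) - 1;
}

/* Reduced-offset variant: one more bit cleared (17 or 26 bits). */
uint64_t hyp_va_low_mask(unsigned int va_bits)
{
	assert(va_bits > 1 && va_bits < 64);
	return (UINT64_C(1) << (va_bits - 1)) - 1;
}

/* Apply the chosen mask to a kernel VA; 'reduced' selects the low mask. */
uint64_t sketch_kern_hyp_va(uint64_t kva, unsigned int va_bits, int reduced)
{
	uint64_t mask = reduced ? hyp_va_low_mask(va_bits)
				: hyp_va_high_mask(va_bits);

	return kva & mask;
}
```

For VA_BITS = 48 the high mask is 0x0000ffffffffffff (top 16 bits cleared), and for VA_BITS = 39 it is 0x7fffffffff (top 25 bits cleared), matching the numbers in the patch.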
[PATCH v2 0/3] led: ledtrig-transient: add support for hrtimer
Hi,

These patch series add the LED_BRIGHTNESS_FAST flag support for
ledtrig-transient to use hrtimer so that platforms with high-resolution timer
support can have better accuracy in the trigger duration timing. The need for
this support is driven by the fact that Android has removed the timed_output [1]
and is now using led-trigger for handling vibrator control which requires the
timer to be accurate up to a millisecond. However, this flag support would also
allow hrtimer to co-exist with the ktimer without causing warning to the
existing drivers [2].

David

[1] https://patchwork.kernel.org/patch/8664831/
[2] https://lkml.org/lkml/2015/4/28/260

Changes from v1 to v2:
- Convert all the bit shifting flag in leds.h to use the BIT macro.
- Removed inline modifiers for the timer helper function.

David Lin (3):
  leds: Replace flags bit shift with BIT() macros
  leds: Add the LED_BRIGHTNESS_FAST flag
  led: ledtrig-transient: add support for hrtimer

 Documentation/leds/leds-class.txt        |  5 +++
 drivers/leds/trigger/ledtrig-transient.c | 59 +---
 include/linux/leds.h                     | 19 +-
 3 files changed, 69 insertions(+), 14 deletions(-)

-- 
2.14.1.581.gf28d330327-goog
[PATCH v2 1/3] leds: Replace flags bit shift with BIT() macros
This is for readability as well as to avoid checkpatch warnings when
adding new bit flag information in the future.

Signed-off-by: David Lin
---
 include/linux/leds.h | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/leds.h b/include/linux/leds.h
index bf6db4fe895b..5579c64c8fd6 100644
--- a/include/linux/leds.h
+++ b/include/linux/leds.h
@@ -40,16 +40,16 @@ struct led_classdev {
 	int		 flags;
 
 	/* Lower 16 bits reflect status */
-#define LED_SUSPENDED		(1 << 0)
-#define LED_UNREGISTERING	(1 << 1)
+#define LED_SUSPENDED		BIT(0)
+#define LED_UNREGISTERING	BIT(1)
 	/* Upper 16 bits reflect control information */
-#define LED_CORE_SUSPENDRESUME	(1 << 16)
-#define LED_SYSFS_DISABLE	(1 << 17)
-#define LED_DEV_CAP_FLASH	(1 << 18)
-#define LED_HW_PLUGGABLE	(1 << 19)
-#define LED_PANIC_INDICATOR	(1 << 20)
-#define LED_BRIGHT_HW_CHANGED	(1 << 21)
-#define LED_RETAIN_AT_SHUTDOWN	(1 << 22)
+#define LED_CORE_SUSPENDRESUME	BIT(16)
+#define LED_SYSFS_DISABLE	BIT(17)
+#define LED_DEV_CAP_FLASH	BIT(18)
+#define LED_HW_PLUGGABLE	BIT(19)
+#define LED_PANIC_INDICATOR	BIT(20)
+#define LED_BRIGHT_HW_CHANGED	BIT(21)
+#define LED_RETAIN_AT_SHUTDOWN	BIT(22)
 
 	/* set_brightness_work / blink_timer flags, atomic, private. */
 	unsigned long		work_flags;
-- 
2.14.1.581.gf28d330327-goog
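For readers outside the kernel tree, the conversion above is value-preserving. The sketch below redefines BIT() for userspace with the same shape as the kernel helper (an unsigned long shifted left by nr). The helper led_flag_is_set() is hypothetical and exists only to exercise the macros.

```c
#include <assert.h>

/* Userspace stand-in with the same shape as the kernel's BIT() helper. */
#define BIT(nr)			(1UL << (nr))

/* Two of the flags from the patch, reproduced for illustration. */
#define LED_SUSPENDED		BIT(0)
#define LED_SYSFS_DISABLE	BIT(17)

/* Hypothetical helper: non-zero when 'flag' is set in 'flags'. */
int led_flag_is_set(unsigned long flags, unsigned long flag)
{
	assert(flag != 0);
	return (flags & flag) != 0;
}
```

One visible difference is the type: (1 << 17) is an int, while BIT(17) is an unsigned long, which only matters where the flag value is used in wider arithmetic or format strings.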
[PATCH v2 2/3] leds: Add the LED_BRIGHTNESS_FAST flag
This patch adds the LED_BRIGHTNESS_FAST flag to allow the driver to
indicate that the brightness_set() callback is implemented on a fastpath
so that the LED core may choose to for example use a hrtimer to
implement the duration of a trigger for better timing accuracy.

Suggested-by: Jacek Anaszewski
Signed-off-by: David Lin
---
 Documentation/leds/leds-class.txt | 5 +
 include/linux/leds.h              | 1 +
 2 files changed, 6 insertions(+)

diff --git a/Documentation/leds/leds-class.txt b/Documentation/leds/leds-class.txt
index 836cb16d6f09..70d7a3dca621 100644
--- a/Documentation/leds/leds-class.txt
+++ b/Documentation/leds/leds-class.txt
@@ -80,6 +80,11 @@ flag must be set in flags before registering. Calling
 led_classdev_notify_brightness_hw_changed on a classdev not registered with
 the LED_BRIGHT_HW_CHANGED flag is a bug and will trigger a WARN_ON.
 
+Optionally, the driver may choose to register with the LED_BRIGHTNESS_FAST flag.
+This flag indicates that the driver implements the brightness_set() callback
+function using a fastpath so the LED core can use hrtimer if the driver requires
+high precision for the trigger timing.
+
 Hardware accelerated blink of LEDs
 ==================================
 
diff --git a/include/linux/leds.h b/include/linux/leds.h
index 5579c64c8fd6..ccfa0a1799fe 100644
--- a/include/linux/leds.h
+++ b/include/linux/leds.h
@@ -50,6 +50,7 @@ struct led_classdev {
 #define LED_PANIC_INDICATOR	BIT(20)
 #define LED_BRIGHT_HW_CHANGED	BIT(21)
 #define LED_RETAIN_AT_SHUTDOWN	BIT(22)
+#define LED_BRIGHTNESS_FAST	BIT(23)
 
 	/* set_brightness_work / blink_timer flags, atomic, private. */
 	unsigned long		work_flags;
-- 
2.14.1.581.gf28d330327-goog
[PATCH v2 3/3] led: ledtrig-transient: add support for hrtimer
This patch adds a hrtimer to ledtrig-transient so that when driver is
registered with LED_BRIGHTNESS_FAST, the hrtimer is used for the better
time accuracy in handling the duration.

Signed-off-by: David Lin
---
 drivers/leds/trigger/ledtrig-transient.c | 59 +---
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/drivers/leds/trigger/ledtrig-transient.c b/drivers/leds/trigger/ledtrig-transient.c
index 7e6011bd3646..7d2ce757b39d 100644
--- a/drivers/leds/trigger/ledtrig-transient.c
+++ b/drivers/leds/trigger/ledtrig-transient.c
@@ -24,15 +24,18 @@
 #include
 #include
 #include
+#include
 #include
 #include "../leds.h"
 
 struct transient_trig_data {
+	struct led_classdev *led_cdev;
 	int activate;
 	int state;
 	int restore_state;
 	unsigned long duration;
 	struct timer_list timer;
+	struct hrtimer hrtimer;
 };
 
 static void transient_timer_function(unsigned long data)
@@ -44,6 +47,54 @@ static void transient_timer_function(unsigned long data)
 	led_set_brightness_nosleep(led_cdev, transient_data->restore_state);
 }
 
+static enum hrtimer_restart transient_hrtimer_function(struct hrtimer *timer)
+{
+	struct transient_trig_data *transient_data =
+		container_of(timer, struct transient_trig_data, hrtimer);
+
+	transient_timer_function((unsigned long)transient_data->led_cdev);
+
+	return HRTIMER_NORESTART;
+}
+
+static void transient_timer_setup(struct led_classdev *led_cdev)
+{
+	struct transient_trig_data *tdata = led_cdev->trigger_data;
+
+	if (led_cdev->flags & LED_BRIGHTNESS_FAST) {
+		tdata->led_cdev = led_cdev;
+		hrtimer_init(&tdata->hrtimer, CLOCK_MONOTONIC,
+			     HRTIMER_MODE_REL);
+		tdata->hrtimer.function = transient_hrtimer_function;
+	} else {
+		setup_timer(&tdata->timer, transient_timer_function,
+			    (unsigned long)led_cdev);
+	}
+}
+
+static void transient_timer_start(struct led_classdev *led_cdev)
+{
+	struct transient_trig_data *tdata = led_cdev->trigger_data;
+
+	if (led_cdev->flags & LED_BRIGHTNESS_FAST) {
+		hrtimer_start(&tdata->hrtimer, ms_to_ktime(tdata->duration),
+			      HRTIMER_MODE_REL);
+	} else {
+		mod_timer(&tdata->timer,
+			  jiffies + msecs_to_jiffies(tdata->duration));
+	}
+}
+
+static void transient_timer_cancel(struct led_classdev *led_cdev)
+{
+	struct transient_trig_data *tdata = led_cdev->trigger_data;
+
+	if (led_cdev->flags & LED_BRIGHTNESS_FAST)
+		hrtimer_cancel(&tdata->hrtimer);
+	else
+		del_timer_sync(&tdata->timer);
+}
+
 static ssize_t transient_activate_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -70,7 +121,7 @@ static ssize_t transient_activate_store(struct device *dev,
 
 	/* cancel the running timer */
 	if (state == 0 && transient_data->activate == 1) {
-		del_timer(&transient_data->timer);
+		transient_timer_cancel(led_cdev);
 		transient_data->activate = state;
 		led_set_brightness_nosleep(led_cdev,
 					transient_data->restore_state);
@@ -84,8 +135,7 @@ static ssize_t transient_activate_store(struct device *dev,
 		led_set_brightness_nosleep(led_cdev, transient_data->state);
 		transient_data->restore_state =
 		    (transient_data->state == LED_FULL) ? LED_OFF : LED_FULL;
-		mod_timer(&transient_data->timer,
-			  jiffies + msecs_to_jiffies(transient_data->duration));
+		transient_timer_start(led_cdev);
 	}
 
 	/* state == 0 && transient_data->activate == 0
@@ -182,8 +232,7 @@ static void transient_trig_activate(struct led_classdev *led_cdev)
 	if (rc)
 		goto err_out_state;
 
-	setup_timer(&tdata->timer, transient_timer_function,
-		    (unsigned long) led_cdev);
+	transient_timer_setup(led_cdev);
 	led_cdev->activated = true;
 
 	return;
-- 
2.14.1.581.gf28d330327-goog
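As background for the accuracy argument, here is a rough userspace sketch of how a jiffies-based timeout coarsens a millisecond request. It rests on two assumptions that are not stated in this thread: the request is rounded up to whole ticks, and the tick rate divides 1000 evenly. tick_rounded_ms() is a hypothetical helper, not kernel code, and the tick rates used with it are only examples.

```c
#include <assert.h>

/*
 * Hypothetical helper: the duration, in milliseconds, that a tick-based
 * timer can represent for a request of 'ms' milliseconds at 'hz' ticks
 * per second, assuming the request is rounded up to whole ticks.
 */
unsigned long tick_rounded_ms(unsigned long ms, unsigned long hz)
{
	unsigned long ms_per_tick;

	assert(hz > 0 && hz <= 1000 && 1000 % hz == 0);
	ms_per_tick = 1000 / hz;

	/* round up to the next multiple of one tick */
	return ((ms + ms_per_tick - 1) / ms_per_tick) * ms_per_tick;
}
```

Under these assumptions a 15 ms request becomes 20 ms at 100 ticks per second and stays 15 ms at 1000, which is the kind of gap the hrtimer path in the patch is meant to close.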
Re: [v8 0/4] cgroup-aware OOM killer
On Mon 11-09-17 13:44:39, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
>
> > This patchset makes the OOM killer cgroup-aware.
> >
> > v8:
> >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> >   - Make the whole thing opt-in with cgroup mount option control
> >   - Drop oom_priority for further discussions
>
> Nack, we specifically require oom_priority for this to function correctly,
> otherwise we cannot prefer to kill from low priority leaf memcgs as
> required.

While I understand that your usecase might require priorities I do not
think this part missing is a reason to nack the cgroup based selection
and kill-all parts. This can be done on top. The only important part
right now is the current selection semantic - only leaf memcgs vs. size
of the hierarchy). I strongly believe that comparing only leaf memcgs
is more straightforward and it doesn't lead to unexpected results as
mentioned before (kill a small memcg which is a part of the larger
sub-hierarchy).

I didn't get to read the new version of this series yet and hope to get
to it soon.
-- 
Michal Hocko
SUSE Labs
Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Tue 12-09-17 21:01:15, Roman Gushchin wrote:
> On Mon, Sep 11, 2017 at 01:48:39PM -0700, David Rientjes wrote:
> > On Mon, 11 Sep 2017, Roman Gushchin wrote:
> >
> > > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > > OOM killer. If not set, the OOM selection is performed in
> > > a "traditional" per-process way.
> > >
> > > The behavior can be changed dynamically by remounting the cgroupfs.
> >
> > I can't imagine that Tejun would be happy with a new mount option,
> > especially when it's not required.
> >
> > OOM behavior does not need to be defined at mount time and for the entire
> > hierarchy. It's possible to very easily implement a tunable as part of
> > mem cgroup that is propagated to descendants and controls the oom scoring
> > behavior for that hierarchy. It does not need to be system wide and
> > affect scoring of all processes based on which mem cgroup they are
> > attached to at any given time.
>
> No, I don't think that mixing per-cgroup and per-process OOM selection
> algorithms is a good idea.
>
> So, there are 3 reasonable options:
> 1) boot option
> 2) sysctl
> 3) cgroup mount option
>
> I believe, 3) is better, because it allows changing the behavior dynamically,
> and explicitly depends on v2 (what sysctl lacks).

I see your argument here. I would just be worried that we end up really
needing more oom strategies in future and those wouldn't fit into memcg
mount option scope. So 1/2 sounds more extensible to me long term. Boot
time would be easier because we do not have to bother dynamic selection
in that case.

> So, the only question is should it be opt-in or opt-out option.
> Personally, I would prefer opt-out, but Michal has a very strong opinion here.

Yes I still strongly believe this has to be opt-in.

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
On 2017/8/29 22:56, Michael S. Tsirkin wrote:
> On Tue, Aug 29, 2017 at 11:46:34AM +, Yang Zhang wrote:
>> Some latency-intensive workload will see obviously performance
>> drop when running inside VM.
>
> But are we trading a lot of CPU for a bit of lower latency?
>
>> The main reason is that the overhead
>> is amplified when running inside VM. The most cost i have seen is
>> inside idle path.
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
>
> Isn't it the job of an idle driver to find the best way to
> halt the CPU?
>
> It looks like just by adding a cstate we can make it
> halt at higher latencies only. And at lower latencies,
> if it's doing a good job we can hopefully use mwait to
> stop the CPU.
>
> In fact I have been experimenting with exactly that.
> Some initial results are encouraging but I could use help
> with testing and especially tuning. If you can help
> pls let me know!

Quan, Can you help to test it and give result? Thanks.

-- 
Yang
Alibaba Cloud Computing