Re: [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded

2017-03-01 Thread Thomas Gleixner
On Sat, 18 Feb 2017, Arun Raghavan wrote:

> This dumps some information in logs when a process exceeds its CPU or RT
> limits (soft and hard). Makes debugging easier when userspace triggers
> these limits.

Sigh. This changelog sucks. "dumps some information" is pretty useless and
it does not explain WHY you want to do that. Please structure the changelog
in a way which makes it easy to understand.

1) Problem description

2) Solution

> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
> index e9e8c10..6dbcf84 100644
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -860,6 +860,9 @@ static void check_thread_timers(struct task_struct *tsk,
>  			 * At the hard limit, we just die.
>  			 * No need to calculate anything else now.
>  			 */
> +			printk(KERN_INFO

pr_info("CPU Watchdog Timeout (hard): %s[%d]\n",

and no artificial line breaks, please.

Thanks,

tglx


Re: [RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded

2017-02-18 Thread Arun Raghavan


On Sat, 18 Feb 2017, at 02:07 PM, Arun Raghavan wrote:
> This dumps some information in logs when a process exceeds its CPU or RT
> limits (soft and hard). Makes debugging easier when userspace triggers
> these limits.
> 
> Signed-off-by: Arun Raghavan 
> ---
>  kernel/time/posix-cpu-timers.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> Hello,
> This has come up a couple of times in the past, but we haven't been able to
> resolve whatever issues were pointed out.
> 
> In the mean time, we have frustrated users who don't know where they're getting
> a SIGKILL from, and I'd really like to have a way for people to not have to go
> through this.
> 
> The issues that came up the last time were:
> 
>  1. SIGXCPU messages shouldn't be needed since they can be caught: it's still
>     useful to have the log because it isn't always possible to pin down the
>     thread causing the problem in userspace.
> 
>  2. SIGKILL logging should be centralised: there seem to be multiple paths that
>     trigger a SIGKILL -- and it seemed a bit ugly to try to add a reason
>     parameter on all of them for the KILL case. Any other suggestions on how to
>     deal with this?
> 
> I'm happy to fix this up to actually make it this time, but if there aren't
> none, just pushing this out will make our lives a little less painful.

That was meant to read -- "... if there aren't blocking objections to
this, just pushing this out will make our lives a little less painful."

-- Arun


[RESEND 2] [PATCH] rlimits: Print more information when limits are exceeded

2017-02-18 Thread Arun Raghavan
This dumps some information in logs when a process exceeds its CPU or RT
limits (soft and hard). Makes debugging easier when userspace triggers
these limits.

Signed-off-by: Arun Raghavan 
---
 kernel/time/posix-cpu-timers.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Hello,
This has come up a couple of times in the past, but we haven't been able to
resolve whatever issues were pointed out.

In the mean time, we have frustrated users who don't know where they're getting
a SIGKILL from, and I'd really like to have a way for people to not have to go
through this.

The issues that came up the last time were:

 1. SIGXCPU messages shouldn't be needed since they can be caught: it's still
    useful to have the log because it isn't always possible to pin down the
    thread causing the problem in userspace.

 2. SIGKILL logging should be centralised: there seem to be multiple paths that
    trigger a SIGKILL -- and it seemed a bit ugly to try to add a reason
    parameter on all of them for the KILL case. Any other suggestions on how to
    deal with this?

I'm happy to fix this up to actually make it this time, but if there aren't
none, just pushing this out will make our lives a little less painful.

Thanks,
Arun

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index e9e8c10..6dbcf84 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -860,6 +860,9 @@ static void check_thread_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -872,7 +875,7 @@ static void check_thread_timers(struct task_struct *tsk,
 				sig->rlim[RLIMIT_RTTIME].rlim_cur = soft;
 			}
 			printk(KERN_INFO
-				"RT Watchdog Timeout: %s[%d]\n",
+				"RT Watchdog Timeout (soft): %s[%d]\n",
 				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 		}
@@ -980,6 +983,9 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
+			printk(KERN_INFO
+				"RT Watchdog Timeout (hard): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
 			return;
 		}
@@ -987,6 +993,9 @@ static void check_process_timers(struct task_struct *tsk,
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
+			printk(KERN_INFO
+				"CPU Watchdog Timeout (soft): %s[%d]\n",
+				tsk->comm, task_pid_nr(tsk));
 			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
 			if (soft < hard) {
 				soft++;
-- 
2.9.3


