[PATCH net] tcp: fix keepalive when data remain undelivered

2021-02-19 Thread Enke Chen

From: Enke Chen 

TCP keepalive does not time out when the network connection is lost and
data remain undelivered (including retransmits). A very simple scenario
of the failure is to write data to a tcp socket after the network
connection is lost.
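
To try the scenario above, the socket first needs keepalive enabled. The following is a minimal user-space sketch, assuming a Linux host; the helper name enable_keepalive() and the values passed to it are hypothetical and are not part of the patch. One would call it on a connected socket, cut the network path, and then write() to the socket.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the patch): enable TCP keepalive on fd.
 * idle_s and intvl_s are in seconds, probes is the probe count.
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
static int enable_keepalive(int fd, int idle_s, int intvl_s, int probes)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	/* Idle time before the first keepalive probe. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
		return -1;
	/* Interval between successive probes. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
		return -1;
	/* Number of unanswered probes before the connection is dropped. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
		return -1;
	return 0;
}
```

With, say, enable_keepalive(fd, 60, 30, 4), the connection would be expected to time out roughly 60 + 4 * 30 seconds after the last activity, which is the behaviour that fails when data remain undelivered.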

Under the specified condition the keepalive timeout is not evaluated in
the keepalive timer. That is the primary cause of the failure. In addition,
the keepalive probe is not sent out in the keepalive timer. Although packet
retransmit or 0-window probe can serve a similar purpose, they have their
own timers and backoffs that are generally not aligned with the keepalive
parameters for probes and timeout.

As the timing and conditions of the events involved are random, the tcp
keepalive can fail randomly. Given the randomness of the failures, fixing
the issue would not cause any backward compatibility issues. As was well
said, "Determinism is a special case of randomness".

The fix in this patch consists of the following:

  a. Always evaluate the keepalive timeout in the keepalive timer.

  b. Always send out the keepalive probe in the keepalive timer (post the
     keepalive idle time). Given that the keepalive intervals are usually
     in the range of 30 - 60 seconds, there is no need for an optimization
     to further reduce the number of keepalive probes in the presence of
     packet retransmit.

  c. Use the elapsed time (instead of the 0-window probe counter) in
     evaluating tcp keepalive timeout.
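
The elapsed-time rule in (c) can be sketched in user space as follows. This is only an illustration under stated assumptions: the function names are hypothetical, all values are in milliseconds rather than jiffies, and a user timeout of 0 means the option is unset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the deadline used by the keepalive timer.
 * If a user timeout is set it takes precedence; otherwise the deadline
 * is the idle time plus one interval per allowed probe.
 */
static uint32_t keepalive_deadline_ms(uint32_t user_timeout_ms,
				      uint32_t idle_ms, uint32_t intvl_ms,
				      uint32_t probes)
{
	if (user_timeout_ms)
		return user_timeout_ms;
	return idle_ms + intvl_ms * probes;
}

/* The connection is considered dead once the time elapsed since the
 * last activity reaches the deadline, regardless of any probe counter.
 */
static bool keepalive_expired(uint32_t elapsed_ms, uint32_t deadline_ms)
{
	return elapsed_ms >= deadline_ms;
}
```

For example, with a 60 s idle time, 30 s interval and 4 probes, keepalive_deadline_ms(0, 60000, 30000, 4) gives 180000 ms.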

Thanks to Eric Dumazet, Neal Cardwell, and Yuchung Cheng for helpful
discussions about the issue and options for fixing it.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2 Initial git repository build")
Signed-off-by: Enke Chen 
---
 net/ipv4/tcp_timer.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4ef08079ccfa..16a044da20db 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -708,29 +708,23 @@ static void tcp_keepalive_timer (struct timer_list *t)
 	    ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
 		goto out;
 
-	elapsed = keepalive_time_when(tp);
-
-	/* It is alive without keepalive 8) */
-	if (tp->packets_out || !tcp_write_queue_empty(sk))
-		goto resched;
-
 	elapsed = keepalive_time_elapsed(tp);
 
 	if (elapsed >= keepalive_time_when(tp)) {
 		/* If the TCP_USER_TIMEOUT option is enabled, use that
 		 * to determine when to timeout instead.
 		 */
-		if ((icsk->icsk_user_timeout != 0 &&
-		    elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
-		    icsk->icsk_probes_out > 0) ||
-		    (icsk->icsk_user_timeout == 0 &&
-		    icsk->icsk_probes_out >= keepalive_probes(tp))) {
+		u32 timeout = icsk->icsk_user_timeout ?
+		  msecs_to_jiffies(icsk->icsk_user_timeout) :
+		  keepalive_intvl_when(tp) * keepalive_probes(tp) +
+		  keepalive_time_when(tp);
+
+		if (elapsed >= timeout) {
 			tcp_send_active_reset(sk, GFP_ATOMIC);
 			tcp_write_err(sk);
 			goto out;
 		}
 		if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
-			icsk->icsk_probes_out++;
 			elapsed = keepalive_intvl_when(tp);
 		} else {
 			/* If keepalive was lost due to local congestion,
@@ -744,8 +738,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
 	}
 
 	sk_mem_reclaim(sk);
-
-resched:
 	inet_csk_reset_keepalive_timer (sk, elapsed);
 	goto out;
 
 
-- 
2.29.2

Re: [PATCH net] tcp: make TCP_USER_TIMEOUT accurate for zero window probes

2021-01-23 Thread Enke Chen

Hi, Neal:

What you described is more accurate, and is correct.

Thanks.  -- Enke

On Sat, Jan 23, 2021 at 07:19:13PM -0500, Neal Cardwell wrote:
> On Fri, Jan 22, 2021 at 9:45 PM Enke Chen  wrote:
> >
> > Hi, Jakub:
> >
> > On Fri, Jan 22, 2021 at 06:34:24PM -0800, Jakub Kicinski wrote:
> > > On Fri, 22 Jan 2021 18:28:23 -0800 Enke Chen wrote:
> > > > Hi, Jakub:
> > > >
> > > > In terms of backporting, this patch should go together with:
> > > >
> > > > 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> > >
> > > As in it:
> > >
> > > Fixes: 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> > >
> > > or does it further fix the same issue, so:
> > >
> > > Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> > >
> > > ?
> >
> > Let me clarify:
> >
> > 1) 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> >
> >    fixes the bug and makes it work.
> >
> > 2) The current patch makes the TCP_USER_TIMEOUT accurate for 0-window probes.
> >    It's independent.
> 
> Patch (2) ("tcp: make TCP_USER_TIMEOUT accurate for zero window
> probes") is indeed conceptually independent of (1) but its
> implementation depends on the icsk_probes_tstamp field defined in (1),
> so AFAICT (2) cannot be backported further back than (1).
> 
> Patch (1) fixes a bug in 5.1:
> Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> 
> So probably (1) and (2) should be backported as a pair, and only back
> as far as 5.1. (That covers 2 LTS kernels, 5.4 and 5.10, so hopefully
> that is good enough.)
> 
> neal

Re: [PATCH net] tcp: make TCP_USER_TIMEOUT accurate for zero window probes

2021-01-22 Thread Enke Chen

Hi, Jakub:

On Fri, Jan 22, 2021 at 06:34:24PM -0800, Jakub Kicinski wrote:
> On Fri, 22 Jan 2021 18:28:23 -0800 Enke Chen wrote:
> > Hi, Jakub:
> > 
> > In terms of backporting, this patch should go together with:
> > 
> > 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> 
> As in it:
> 
> Fixes: 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window
> 
> or does it further fix the same issue, so:
> 
> Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
>
> ?

Let me clarify:

1) 9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window

   fixes the bug and makes it work.

2) The current patch makes the TCP_USER_TIMEOUT accurate for 0-window probes.
   It's independent.

With 1) and 2), the known issues with TCP_USER_TIMEOUT for 0-window probes
would be resolved.

Thanks.   -- Enke

Re: [PATCH net] tcp: make TCP_USER_TIMEOUT accurate for zero window probes

2021-01-22 Thread Enke Chen

Hi, Jakub:

In terms of backporting, this patch should go together with:

9d9b1ee0b2d1 tcp: fix TCP_USER_TIMEOUT with zero window

Thanks.  -- Enke

On Fri, Jan 22, 2021 at 05:43:25PM -0800, Jakub Kicinski wrote:
> On Fri, 22 Jan 2021 11:13:06 -0800 Enke Chen wrote:
> > From: Enke Chen 
> > 
> > The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
> > timer has backoff with a max interval of about two minutes, the
> > actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.
> > 
> > In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
> > into account when computing the timer value for the 0-window probes.
> > 
> > This patch is similar to the one that made TCP_USER_TIMEOUT accurate for
> > RTOs in commit b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout()
> > helper to improve accuracy").
> > 
> > Signed-off-by: Enke Chen 
> > Reviewed-by: Neal Cardwell 
> 
> This is targeting net, any guidance on Fixes / backporting?

Re: [PATCH] tcp: keepalive fixes

2021-01-22 Thread Enke Chen

Hi, Folks:

Please ignore this patch. I will split it into separate ones as suggested
off-list by Neal Cardwell.

Thanks.  -- Enke

On Tue, Jan 12, 2021 at 11:25:44AM -0800, Enke Chen wrote:
> From: Enke Chen 
> 
> In this patch two issues with TCP keepalives are fixed:
> 
> 1) TCP keepalive does not timeout when there are data waiting to be
>    delivered and then the connection got broken. The TCP keepalive
>    timeout is not evaluated in that condition.
> 
>    The fix is to remove the code that prevents TCP keepalive from
>    being evaluated for timeout.
> 
> 2) With the fix for #1, TCP keepalive can erroneously timeout after
>    the 0-window probe kicks in. The 0-window probe counter is wrongly
>    applied to TCP keepalives.
> 
>    The fix is to use the elapsed time instead of the 0-window probe
>    counter in evaluating TCP keepalive timeout.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Enke Chen 
> ---
>  net/ipv4/tcp_timer.c | 15 +++
>  1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 6c62b9ea1320..40953aa40d53 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -696,12 +696,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
>   ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
>   goto out;
>  
> - elapsed = keepalive_time_when(tp);
> -
> - /* It is alive without keepalive 8) */
> - if (tp->packets_out || !tcp_write_queue_empty(sk))
> - goto resched;
> -
>   elapsed = keepalive_time_elapsed(tp);
>  
>   if (elapsed >= keepalive_time_when(tp)) {
> @@ -709,16 +703,15 @@ static void tcp_keepalive_timer (struct timer_list *t)
>* to determine when to timeout instead.
>*/
>   if ((icsk->icsk_user_timeout != 0 &&
> - elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
> - icsk->icsk_probes_out > 0) ||
> +  elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout)) ||
>   (icsk->icsk_user_timeout == 0 &&
> - icsk->icsk_probes_out >= keepalive_probes(tp))) {
> +  (elapsed >= keepalive_time_when(tp) +
> +   keepalive_intvl_when(tp) * keepalive_probes(tp)))) {
>   tcp_send_active_reset(sk, GFP_ATOMIC);
>   tcp_write_err(sk);
>   goto out;
>   }
>   if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
> - icsk->icsk_probes_out++;
>   elapsed = keepalive_intvl_when(tp);
>   } else {
>   /* If keepalive was lost due to local congestion,
> @@ -732,8 +725,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
>   }
>  
>   sk_mem_reclaim(sk);
> -
> -resched:
>   inet_csk_reset_keepalive_timer (sk, elapsed);
>   goto out;
>  
> -- 
> 2.29.2
>

[PATCH net] tcp: make TCP_USER_TIMEOUT accurate for zero window probes

2021-01-22 Thread Enke Chen

From: Enke Chen 

The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.

In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.

This patch is similar to the one that made TCP_USER_TIMEOUT accurate for
RTOs in commit b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout()
helper to improve accuracy").
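
The clamping idea described above can be sketched outside the kernel as follows. This is a simplified model, not the patch itself: the function name is hypothetical, times are in milliseconds, MIN_TIMER_MS is an assumed floor standing in for the kernel's minimum timer value, and the remaining time is saturated at zero by choice of this sketch.

```c
#include <stdint.h>

#define MIN_TIMER_MS 2u	/* assumed floor for the timer, illustration only */

/* Hypothetical model: limit the next probe interval (when_ms) so that the
 * timer fires no later than the moment the user timeout expires.
 * elapsed_ms is the time since the first unanswered zero-window probe.
 */
static uint32_t clamp_probe0_ms(uint32_t user_timeout_ms, uint32_t elapsed_ms,
				uint32_t when_ms)
{
	uint32_t remaining;

	if (!user_timeout_ms)		/* option not set: nothing to clamp */
		return when_ms;

	remaining = elapsed_ms < user_timeout_ms ?
		    user_timeout_ms - elapsed_ms : 0;
	if (remaining < MIN_TIMER_MS)
		remaining = MIN_TIMER_MS;

	return remaining < when_ms ? remaining : when_ms;
}
```

With a 30 s user timeout, 10 s already elapsed and a backed-off interval of 120 s, the sketch returns 20000, so the timer would fire when the user timeout is due rather than up to two minutes later.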

Signed-off-by: Enke Chen 
Reviewed-by: Neal Cardwell 
---
 include/net/tcp.h |  1 +
 net/ipv4/tcp_input.c  |  4 ++--
 net/ipv4/tcp_output.c |  2 ++
 net/ipv4/tcp_timer.c  | 18 ++
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 78d13c88720f..ca7e2c6cc663 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -630,6 +630,7 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)
 
 unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu);
 unsigned int tcp_current_mss(struct sock *sk);
+u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when);
 
 /* Bound MSS / TSO packet size with the half of the window */
 static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bafcab75f425..4923cdbea95a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3392,8 +3392,8 @@ static void tcp_ack_probe(struct sock *sk)
 	} else {
 		unsigned long when = tcp_probe0_when(sk, TCP_RTO_MAX);
 
-		tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
-				     when, TCP_RTO_MAX);
+		when = tcp_clamp_probe0_to_user_timeout(sk, when);
+		tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0, when, TCP_RTO_MAX);
 	}
 }
 
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ab458697881e..8478cf749821 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4099,6 +4099,8 @@ void tcp_send_probe0(struct sock *sk)
 		 */
 		timeout = TCP_RESOURCE_PROBE_INTERVAL;
 	}
+
+	timeout = tcp_clamp_probe0_to_user_timeout(sk, timeout);
 	tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0, timeout, TCP_RTO_MAX);
 }
 
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 454732ecc8f3..90722e30ad90 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -40,6 +40,24 @@ static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 	return min_t(u32, icsk->icsk_rto, msecs_to_jiffies(remaining));
 }
 
+u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	u32 remaining;
+	s32 elapsed;
+
+	if (!icsk->icsk_user_timeout || !icsk->icsk_probes_tstamp)
+		return when;
+
+	elapsed = tcp_jiffies32 - icsk->icsk_probes_tstamp;
+	if (unlikely(elapsed < 0))
+		elapsed = 0;
+	remaining = msecs_to_jiffies(icsk->icsk_user_timeout) - elapsed;
+	remaining = max_t(u32, remaining, TCP_TIMEOUT_MIN);
+
+	return min_t(u32, remaining, when);
+}
+
 /**
  *  tcp_write_err() - close socket and save error info
  *  @sk:  The socket the error has appeared on.
-- 
2.29.2

Re: [PATCH net v2] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-18 Thread Enke Chen

On Mon, Jan 18, 2021 at 08:02:21PM -0800, Jakub Kicinski wrote:
> On Fri, 15 Jan 2021 14:30:58 -0800 Enke Chen wrote:
> > From: Enke Chen 
> > 
> > The TCP session does not terminate with TCP_USER_TIMEOUT when data
> > remain untransmitted due to zero window.
> > 
> > The number of unanswered zero-window probes (tcp_probes_out) is
> > reset to zero with incoming acks irrespective of the window size,
> > as described in tcp_probe_timer():
> > 
> > RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > as long as the receiver continues to respond probes. We support
> > this by default and reset icsk_probes_out with incoming ACKs.
> > 
> > This counter, however, is the wrong one to be used in calculating the
> > duration that the window remains closed and data remain untransmitted.
> > Thanks to Jonathan Maxwell  for diagnosing the
> > actual issue.
> > 
> > In this patch a new timestamp is introduced for the socket in order to
> > track the elapsed time for the zero-window probes that have not been
> > answered with any non-zero window ack.
> > 
> > Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> > Reported-by: William McCall 
> > Co-developed-by: Neal Cardwell 
> > Signed-off-by: Neal Cardwell 
> > Signed-off-by: Enke Chen 
> > Reviewed-by: Yuchung Cheng 
> > Reviewed-by: Eric Dumazet 
> 
> I take it you got all these tags off-list? I don't see them on the v1
> discussion.

Yes, the tags have been approved off-list by those named.

> 
> Applied to net, thanks!

Thanks.  -- Enke

[PATCH net v2] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-15 Thread Enke Chen

From: Enke Chen 

The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.

The number of unanswered zero-window probes (icsk_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
as long as the receiver continues to respond probes. We support
this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell  for diagnosing the
actual issue.

In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.
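
The timestamp approach can be illustrated with a small user-space model. It is a sketch under assumptions: the struct and function names are hypothetical, times are in milliseconds, and the point at which the timestamp is first set (when the first probe of an episode is sent) is assumed here rather than taken from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; 0 means no zero-window episode
 * is currently being timed.
 */
struct probe_state {
	uint32_t probes_tstamp_ms;
};

/* Assumed hook: a zero-window probe is sent at now_ms. Only the first
 * probe of an episode records the start time.
 */
static void probe_sent(struct probe_state *st, uint32_t now_ms)
{
	if (!st->probes_tstamp_ms)
		st->probes_tstamp_ms = now_ms ? now_ms : 1; /* keep 0 reserved */
}

/* An ACK that opens the window ends the episode. ACKs that still
 * advertise a zero window do not call this, so the clock keeps running.
 */
static void nonzero_window_ack(struct probe_state *st)
{
	st->probes_tstamp_ms = 0;
}

/* True once the window has stayed closed for at least the user timeout. */
static bool user_timeout_hit(const struct probe_state *st, uint32_t now_ms,
			     uint32_t user_timeout_ms)
{
	if (!user_timeout_ms || !st->probes_tstamp_ms)
		return false;
	return (uint32_t)(now_ms - st->probes_tstamp_ms) >= user_timeout_ms;
}
```

Because the decision depends only on the recorded start time, it does not matter how many probes were sent or how they were spaced in between.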

Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall 
Co-developed-by: Neal Cardwell 
Signed-off-by: Neal Cardwell 
Signed-off-by: Enke Chen 
Reviewed-by: Yuchung Cheng 
Reviewed-by: Eric Dumazet 
---
 include/net/inet_connection_sock.h |  3 +++
 net/ipv4/inet_connection_sock.c|  1 +
 net/ipv4/tcp.c |  1 +
 net/ipv4/tcp_input.c   |  1 +
 net/ipv4/tcp_output.c  |  1 +
 net/ipv4/tcp_timer.c   | 14 +++---
 6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..111d7771b208 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -76,6 +76,8 @@ struct inet_connection_sock_af_ops {
  * @icsk_ext_hdr_len: Network protocol overhead (IP/IPv6 options)
  * @icsk_ack: Delayed ACK control data
  * @icsk_mtup;MTU probing control data
+ * @icsk_probes_tstamp:Probe timestamp (cleared by non-zero window ack)
+ * @icsk_user_timeout:TCP_USER_TIMEOUT value
  */
 struct inet_connection_sock {
/* inet_sock has to be the first member! */
@@ -129,6 +131,7 @@ struct inet_connection_sock {
 
u32   probe_timestamp;
} icsk_mtup;
+   u32   icsk_probes_tstamp;
u32   icsk_user_timeout;
 
u64   icsk_ca_priv[104 / sizeof(u64)];
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fd8b8800a2c3..6bd7ca09af03 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -851,6 +851,7 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
newicsk->icsk_retransmits = 0;
newicsk->icsk_backoff = 0;
newicsk->icsk_probes_out  = 0;
+   newicsk->icsk_probes_tstamp = 0;
 
/* Deinitialize accept_queue to trap illegal accesses. */
memset(&newicsk->icsk_accept_queue, 0, sizeof(newicsk->icsk_accept_queue));
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ed42d2193c5c..32545ecf2ab1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2937,6 +2937,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 
icsk->icsk_backoff = 0;
icsk->icsk_probes_out = 0;
+   icsk->icsk_probes_tstamp = 0;
icsk->icsk_rto = TCP_TIMEOUT_INIT;
icsk->icsk_rto_min = TCP_RTO_MIN;
icsk->icsk_delack_max = TCP_DELACK_MAX;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c7e16b0ed791..bafcab75f425 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3384,6 +3384,7 @@ static void tcp_ack_probe(struct sock *sk)
return;
if (!after(TCP_SKB_CB(head)->end_seq, tcp_wnd_end(tp))) {
icsk->icsk_backoff = 0;
+   icsk->icsk_probes_tstamp = 0;
inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0);
/* Socket must be waked up by subsequent tcp_data_snd_check().
 * This function is not for random using!
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f322e798a351..ab458697881e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4084,6 +4084,7 @@ void tcp_send_probe0(struct sock *sk)
/* Cancel probe timer, if it is not required. */
icsk->icsk_probes_out = 0;
icsk->icsk_backoff = 0;
+   icsk->icsk_probes_tstamp = 0;
return;
}
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6c62b9ea1320..454732ecc8f3 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -349,6 +349,7 @@ static void tcp_probe_timer(struct sock *sk)
 
if (tp->packets_out || !skb) {
icsk->icsk_probes_out = 0;
+   icsk->icsk_probes_tstamp = 0;
return;
}
 
@@ -360,13 +361,12 @@ static void t

Re: [PATCH] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-13 Thread Enke Chen

Hi, Eric:

Yes, that is a good point! I have been discussing with Neal and Yuchung also
and will work on revising the patch.

Thanks.   -- Enke

On Wed, Jan 13, 2021 at 09:44:11PM +0100, Eric Dumazet wrote:
> On Wed, Jan 13, 2021 at 9:12 PM Enke Chen  wrote:
> >
> > From: Enke Chen 
> >
> > The TCP session does not terminate with TCP_USER_TIMEOUT when data
> > remain untransmitted due to zero window.
> >
> > The number of unanswered zero-window probes (tcp_probes_out) is
> > reset to zero with incoming acks irrespective of the window size,
> > as described in tcp_probe_timer():
> >
> > RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > as long as the receiver continues to respond probes. We support
> > this by default and reset icsk_probes_out with incoming ACKs.
> >
> > This counter, however, is the wrong one to be used in calculating the
> > duration that the window remains closed and data remain untransmitted.
> > Thanks to Jonathan Maxwell  for diagnosing the
> > actual issue.
> >
> > In this patch a separate counter is introduced to track the number of
> > zero-window probes that are not answered with any non-zero window ack.
> > This new counter is used in determining when to abort the session with
> > TCP_USER_TIMEOUT.
> >
> 
> I think one possible issue would be that local congestion (full qdisc)
> would abort early,
> because tcp_model_timeout() assumes linear backoff.
> 
> Neal or Yuchung can further comment on that, it is late for me in France.
> 
> packetdrill test would be :
> 
>0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>+0 bind(3, ..., ...) = 0
>+0 listen(3, 1) = 0
> 
> 
>+0 < S 0:0(0) win 0 
>+0 > S. 0:0(0) ack 1 
> 
>   +.1 < . 1:1(0) ack 1 win 65530
>+0 accept(3, ..., ...) = 4
> 
>+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
>+0 write(4, ..., 24) = 24
>+0 > P. 1:25(24) ack 1
>+.1 < . 1:1(0) ack 25 win 65530
>+0 %{ assert tcpi_probes == 0, tcpi_probes; \
>  assert tcpi_backoff == 0, tcpi_backoff }%
> 
> // install a qdisc dropping all packets
>+0 `tc qdisc delete dev tun0 root 2>/dev/null ; tc qdisc add dev
> tun0 root pfifo limit 0`
>+0 write(4, ..., 24) = 24
>// When qdisc is congested we retry every 500ms therefore in theory
>// we'd retry 6 times before hitting 3s timeout. However, since we
>// estimate the elapsed time based on exp backoff of actual RTO (300ms),
>// we'd bail earlier with only 3 probes.
>+2.1 write(4, ..., 24) = -1
>+0 %{ assert tcpi_probes == 3, tcpi_probes; \
>  assert tcpi_backoff == 0, tcpi_backoff }%
>+0 close(4) = 0
> 
> > Cc: sta...@vger.kernel.org
> > Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> > Reported-by: William McCall 
> > Signed-off-by: Enke Chen 
> > ---
> >  include/linux/tcp.h   | 5 +
> >  net/ipv4/tcp.c| 1 +
> >  net/ipv4/tcp_input.c  | 3 ++-
> >  net/ipv4/tcp_output.c | 2 ++
> >  net/ipv4/tcp_timer.c  | 5 +++--
> >  5 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 2f87377e9af7..c9415b30fa67 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -352,6 +352,11 @@ struct tcp_sock {
> >
> > int linger2;
> >
> > +   /* While icsk_probes_out is for unanswered 0 window probes, this
> > +* counter is for 0-window probes that are not answered with any
> > +* non-zero window (nzw) acks.
> > +*/
> > +   u8  probes_nzw;
> >
> >  /* Sock_ops bpf program related variables */
> >  #ifdef CONFIG_BPF
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index ed42d2193c5c..af6a41a5a5ac 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -2940,6 +2940,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> > icsk->icsk_rto = TCP_TIMEOUT_INIT;
> > icsk->icsk_rto_min = TCP_RTO_MIN;
> > icsk->icsk_delack_max = TCP_DELACK_MAX;
> > +   tp->probes_nzw = 0;
> > tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
> > tp->snd_cwnd = TCP_INIT_CWND;
> > tp->snd_cwnd_cnt = 0;
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index c7e16b0ed791..4812a969c18a 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@

Re: [PATCH] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-13 Thread Enke Chen

Yes, I am convinced :-) Thanks to Eric, Neal and Yuchung for their help.

-- Enke

On Wed, Jan 13, 2021 at 01:20:55PM -0800, Yuchung Cheng wrote:
> On Wed, Jan 13, 2021 at 12:49 PM Eric Dumazet  wrote:
> >
> > On Wed, Jan 13, 2021 at 9:12 PM Enke Chen  wrote:
> > >
> > > From: Enke Chen 
> > >
> > > The TCP session does not terminate with TCP_USER_TIMEOUT when data
> > > remain untransmitted due to zero window.
> > >
> > > The number of unanswered zero-window probes (tcp_probes_out) is
> > > reset to zero with incoming acks irrespective of the window size,
> > > as described in tcp_probe_timer():
> > >
> > > RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > > as long as the receiver continues to respond probes. We support
> > > this by default and reset icsk_probes_out with incoming ACKs.
> > >
> > > This counter, however, is the wrong one to be used in calculating the
> > > duration that the window remains closed and data remain untransmitted.
> > > Thanks to Jonathan Maxwell  for diagnosing the
> > > actual issue.
> > >
> > > In this patch a separate counter is introduced to track the number of
> > > zero-window probes that are not answered with any non-zero window ack.
> > > This new counter is used in determining when to abort the session with
> > > TCP_USER_TIMEOUT.
> > >
> >
> > I think one possible issue would be that local congestion (full qdisc)
> > would abort early,
> > because tcp_model_timeout() assumes linear backoff.
> Yes exactly. if ZWPs are dropped due to local congestion, the
> model_timeout computes incorrectly. Therefore having a starting
> timestamp is the surest way b/c it does not assume any specific
> backoff behavior.
> 
> >
> > Neal or Yuchung can further comment on that, it is late for me in France.
> >
> > packetdrill test would be :
> >
> >0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >+0 bind(3, ..., ...) = 0
> >+0 listen(3, 1) = 0
> >
> >
> >+0 < S 0:0(0) win 0 
> >+0 > S. 0:0(0) ack 1 
> >
> >   +.1 < . 1:1(0) ack 1 win 65530
> >+0 accept(3, ..., ...) = 4
> >
> >+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
> >+0 write(4, ..., 24) = 24
> >+0 > P. 1:25(24) ack 1
> >+.1 < . 1:1(0) ack 25 win 65530
> >+0 %{ assert tcpi_probes == 0, tcpi_probes; \
> >  assert tcpi_backoff == 0, tcpi_backoff }%
> >
> > // install a qdisc dropping all packets
> >+0 `tc qdisc delete dev tun0 root 2>/dev/null ; tc qdisc add dev
> > tun0 root pfifo limit 0`
> >+0 write(4, ..., 24) = 24
> >// When qdisc is congested we retry every 500ms therefore in theory
> >// we'd retry 6 times before hitting 3s timeout. However, since we
> >// estimate the elapsed time based on exp backoff of actual RTO (300ms),
> >// we'd bail earlier with only 3 probes.
> >+2.1 write(4, ..., 24) = -1
> >+0 %{ assert tcpi_probes == 3, tcpi_probes; \
> >  assert tcpi_backoff == 0, tcpi_backoff }%
> >+0 close(4) = 0
> >

Re: [PATCH] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-13 Thread Enke Chen

Hi, Neal:

Thank you for your detailed analysis and your help in coming up with the
right fix. After going through multiple iterations of fixes and discussions,
we are converging to using the timestamp for measuring the elapsed time.

-- Enke
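
The gap between a modelled and a measured elapsed time, as discussed in the quoted messages below, can be shown with a little arithmetic. This is a rough sketch under assumptions: the function names are hypothetical, the model simply takes n exponentially backed-off intervals starting at base_ms to span ((2 << n) - 1) * base_ms, and any cap on the interval is ignored.

```c
#include <stdint.h>

/* Assumed exponential model of the time covered by `probes` probes.
 * Simplification for illustration; requires base_ms > 0.
 */
static uint32_t modelled_elapsed_ms(unsigned int probes, uint32_t base_ms)
{
	return ((2u << probes) - 1) * base_ms;
}

/* Probe count at which the modelled time first reaches the timeout. */
static unsigned int probes_until_model_abort(uint32_t timeout_ms,
					     uint32_t base_ms)
{
	unsigned int n = 0;

	while (modelled_elapsed_ms(n, base_ms) < timeout_ms)
		n++;
	return n;
}

/* Probe count needed when probes are really sent at a fixed interval
 * and the elapsed time is measured rather than modelled.
 */
static unsigned int probes_until_measured_abort(uint32_t timeout_ms,
						uint32_t interval_ms)
{
	return (timeout_ms + interval_ms - 1) / interval_ms;
}
```

With the numbers from the packetdrill comments (3000 ms timeout, 300 ms base, 500 ms retry interval), the model reaches the timeout after 3 probes, whereas measuring real time would take 6 probes.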

On Wed, Jan 13, 2021 at 04:07:00PM -0500, Neal Cardwell wrote:
> Hi Enke,
> 
> Sorry I was not clear. :-) I'm trying to convey that there is a functional
> difference between the probes_nzw and  icsk_probes_start versions of the
> patch.
> 
> The functional difference is the one Eric just mentioned, and for which
> Eric provided the script that we see behaving in a way that illustrates the
> functional difference. The script Eric provided misbehaves for both the
> probes_nzw patch and the patch that reverts 9721e709fa68 ("tcp: simplify
> window probe aborting on USER_TIMEOUT"). Here is a summary of how the
> various approaches behave when run with this test:
> 
> o for the probes_nzw patch, the connection times out due to USER_TIMEOUT
> too soon (e.g. with a TCP_USER_TIMEOUT of 30 secs the connection times out
> after roughly 4 secs)
> 
> o for the revert of 9721e709fa68 ("tcp: simplify window probe aborting on
> USER_TIMEOUT") the connection never times out due to USER_TIMEOUT
> 
> o for the icsk_probes_start version the connection times out due
> to USER_TIMEOUT at the appropriate time
> 
> As Eric noted, the issue in the probes_nzw case is that the probes_nzw
> patch relies on tcp_model_timeout(), which assumes exponential backoff, and
> exponential backoff does not happen in the tcp_send_probe0() code path that
> sets timeout = TCP_RESOURCE_PROBE_INTERVAL.
> 
> best,
> neal
> 
> 
> On Wed, Jan 13, 2021 at 3:49 PM Eric Dumazet  wrote:
> 
> > On Wed, Jan 13, 2021 at 9:12 PM Enke Chen  wrote:
> > >
> > > From: Enke Chen 
> > >
> > > The TCP session does not terminate with TCP_USER_TIMEOUT when data
> > > remain untransmitted due to zero window.
> > >
> > > The number of unanswered zero-window probes (tcp_probes_out) is
> > > reset to zero with incoming acks irrespective of the window size,
> > > as described in tcp_probe_timer():
> > >
> > > RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > > as long as the receiver continues to respond probes. We support
> > > this by default and reset icsk_probes_out with incoming ACKs.
> > >
> > > This counter, however, is the wrong one to be used in calculating the
> > > duration that the window remains closed and data remain untransmitted.
> > > Thanks to Jonathan Maxwell  for diagnosing the
> > > actual issue.
> > >
> > > In this patch a separate counter is introduced to track the number of
> > > zero-window probes that are not answered with any non-zero window ack.
> > > This new counter is used in determining when to abort the session with
> > > TCP_USER_TIMEOUT.
> > >
> >
> > I think one possible issue would be that local congestion (full qdisc)
> > would abort early,
> > because tcp_model_timeout() assumes linear backoff.
> >
> > Neal or Yuchung can further comment on that, it is late for me in France.
> >
> > packetdrill test would be :
> >
> >0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >+0 bind(3, ..., ...) = 0
> >+0 listen(3, 1) = 0
> >
> >
> >+0 < S 0:0(0) win 0 
> >+0 > S. 0:0(0) ack 1 
> >
> >   +.1 < . 1:1(0) ack 1 win 65530
> >+0 accept(3, ..., ...) = 4
> >
> >+0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
> >+0 write(4, ..., 24) = 24
> >+0 > P. 1:25(24) ack 1
> >+.1 < . 1:1(0) ack 25 win 65530
> >+0 %{ assert tcpi_probes == 0, tcpi_probes; \
> >  assert tcpi_backoff == 0, tcpi_backoff }%
> >
> > // install a qdisc dropping all packets
> >+0 `tc qdisc delete dev tun0 root 2>/dev/null ; tc qdisc add dev
> > tun0 root pfifo limit 0`
> >+0 write(4, ..., 24) = 24
> >// When qdisc is congested we retry every 500ms therefore in theory
> >// we'd retry 6 times before hitting 3s timeout. However, since we
> >// estimate the elapsed time based on exp backoff of actual RTO (300ms),
> >// we'd bail earlier with only 3 probes.
> >+2.1 write(4, ..., 24) = -1
> >+0 %{ assert tcpi_probes == 3, tcpi_probes; \
> >  assert tcpi_backoff == 0, tcpi_backoff }%
> >+0 close(4) = 0
> >
> > > Cc: sta...@vger.kernel.org
> >

Re: [PATCH] tcp: keepalive fixes

2021-01-13 Thread Enke Chen

On Wed, Jan 13, 2021 at 12:06:27PM -0800, Enke Chen wrote:
> Hi, Eric:
> 
> Just to clarify: the issues for tcp keepalive and TCP_USER_TIMEOUT are
> separate issues, and the fixes would not conflict afaik.
> 
> Thanks.  -- Enke

I have posted patches for both issues, and there is no conflict between
the patches.

Thanks.  -- Enke

> 
> On Tue, Jan 12, 2021 at 11:52:43PM +0100, Eric Dumazet wrote:
> > On Tue, Jan 12, 2021 at 11:48 PM Yuchung Cheng  wrote:
> > >
> > > On Tue, Jan 12, 2021 at 2:31 PM Enke Chen  wrote:
> > > >
> > > > From: Enke Chen 
> > > >
> > > > In this patch two issues with TCP keepalives are fixed:
> > > >
> > > > > 1) TCP keepalive does not timeout when there are data waiting to be
> > > > >    delivered and then the connection got broken. The TCP keepalive
> > > > >    timeout is not evaluated in that condition.
> > > hi enke
> > > Do you have an example to demonstrate this issue -- in theory when
> > > there is data inflight, an RTO timer should be pending (which
> > > considers user-timeout setting). based on the user-timeout description
> > > (man tcp), the user timeout should abort the socket per the specified
> > > time after data commences. some data would help to understand the
> > > issue.
> > >
> > 
> > +1
> > 
> > A packetdrill test would be ideal.
> > 
> > > Also, given that there is this ongoing issue with TCP_USER_TIMEOUT,
> > > let's not mix things or risk added work for backports to stable versions.

[PATCH] tcp: fix TCP_USER_TIMEOUT with zero window

2021-01-13 Thread Enke Chen

From: Enke Chen 

The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.

The number of unanswered zero-window probes (icsk_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
as long as the receiver continues to respond probes. We support
this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell  for diagnosing the
actual issue.

In this patch a separate counter is introduced to track the number of
zero-window probes that are not answered with any non-zero window ack.
This new counter is used in determining when to abort the session with
TCP_USER_TIMEOUT.

Cc: sta...@vger.kernel.org
Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall 
Signed-off-by: Enke Chen 
---
 include/linux/tcp.h   | 5 +
 net/ipv4/tcp.c| 1 +
 net/ipv4/tcp_input.c  | 3 ++-
 net/ipv4/tcp_output.c | 2 ++
 net/ipv4/tcp_timer.c  | 5 +++--
 5 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 2f87377e9af7..c9415b30fa67 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -352,6 +352,11 @@ struct tcp_sock {
 
int linger2;
 
+   /* While icsk_probes_out is for unanswered 0 window probes, this
+* counter is for 0-window probes that are not answered with any
+* non-zero window (nzw) acks.
+*/
+   u8  probes_nzw;
 
 /* Sock_ops bpf program related variables */
 #ifdef CONFIG_BPF
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ed42d2193c5c..af6a41a5a5ac 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2940,6 +2940,7 @@ int tcp_disconnect(struct sock *sk, int flags)
icsk->icsk_rto = TCP_TIMEOUT_INIT;
icsk->icsk_rto_min = TCP_RTO_MIN;
icsk->icsk_delack_max = TCP_DELACK_MAX;
+   tp->probes_nzw = 0;
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd = TCP_INIT_CWND;
tp->snd_cwnd_cnt = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c7e16b0ed791..4812a969c18a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3377,13 +3377,14 @@ static void tcp_ack_probe(struct sock *sk)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct sk_buff *head = tcp_send_head(sk);
-   const struct tcp_sock *tp = tcp_sk(sk);
+   struct tcp_sock *tp = tcp_sk(sk);
 
/* Was it a usable window open? */
if (!head)
return;
if (!after(TCP_SKB_CB(head)->end_seq, tcp_wnd_end(tp))) {
icsk->icsk_backoff = 0;
+   tp->probes_nzw = 0;
inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0);
/* Socket must be waked up by subsequent tcp_data_snd_check().
 * This function is not for random using!
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f322e798a351..1b64cdabc299 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4084,10 +4084,12 @@ void tcp_send_probe0(struct sock *sk)
/* Cancel probe timer, if it is not required. */
icsk->icsk_probes_out = 0;
icsk->icsk_backoff = 0;
+   tp->probes_nzw = 0;
return;
}
 
icsk->icsk_probes_out++;
+   tp->probes_nzw++;
if (err <= 0) {
if (icsk->icsk_backoff < net->ipv4.sysctl_tcp_retries2)
icsk->icsk_backoff++;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6c62b9ea1320..87e9f5998b8e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -349,6 +349,7 @@ static void tcp_probe_timer(struct sock *sk)
 
if (tp->packets_out || !skb) {
icsk->icsk_probes_out = 0;
+   tp->probes_nzw = 0;
return;
}
 
@@ -360,8 +361,8 @@ static void tcp_probe_timer(struct sock *sk)
 * corresponding system limit. We also implement similar policy when
 * we use RTO to probe window in tcp_retransmit_timer().
 */
-   if (icsk->icsk_user_timeout) {
-   u32 elapsed = tcp_model_timeout(sk, icsk->icsk_probes_out,
+   if (icsk->icsk_user_timeout && tp->probes_nzw) {
+   u32 elapsed = tcp_model_timeout(sk, tp->probes_nzw,
tcp_probe0_base(sk));
 
if (elapsed >= icsk->icsk_user_timeout)
-- 
2.29.2
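
The difference between the two counters can be illustrated with a toy simulation. It is a sketch under stated assumptions, not kernel code: the receiver is assumed to answer every probe with a zero-window ACK, model_timeout_ms() is a Python approximation of tcp_model_timeout(), the 200 ms probe base and 120 s RTO cap are assumed values, and firings_until_abort() is a hypothetical helper name.

```python
TCP_RTO_MAX_MS = 120_000  # assumption: the RTO is capped at 120 s


def model_timeout_ms(probes, rto_base_ms):
    """Estimate the time consumed by `probes` exponentially backed-off probes."""
    linear_thresh = (TCP_RTO_MAX_MS // rto_base_ms).bit_length() - 1  # ilog2
    if probes <= linear_thresh:
        return ((2 << probes) - 1) * rto_base_ms
    return (((2 << linear_thresh) - 1) * rto_base_ms
            + (probes - linear_thresh) * TCP_RTO_MAX_MS)


def firings_until_abort(user_timeout_ms, rto_base_ms, use_nzw, max_firings=50):
    """Return the probe-timer firing at which the session aborts, or None."""
    probes_out = 0  # reset by any incoming ACK
    probes_nzw = 0  # reset only by an ACK that opens the window
    for firing in range(max_firings):
        if use_nzw:
            # Patched check: skip it until at least one probe is outstanding.
            expired = (probes_nzw > 0 and
                       model_timeout_ms(probes_nzw, rto_base_ms) >= user_timeout_ms)
        else:
            # Original check: based on the counter that every ACK resets.
            expired = model_timeout_ms(probes_out, rto_base_ms) >= user_timeout_ms
        if expired:
            return firing
        # A zero-window probe is sent ...
        probes_out += 1
        probes_nzw += 1
        # ... and answered by an ACK that still advertises a zero window.
        probes_out = 0
    return None


if __name__ == "__main__":
    print("icsk_probes_out:", firings_until_abort(3000, 200, use_nzw=False))
    print("probes_nzw:     ", firings_until_abort(3000, 200, use_nzw=True))
```

In this model the check based on icsk_probes_out never fires for a 3 s user timeout because the counter is back at zero each time the timer runs, while the probes_nzw variant aborts on the fourth firing.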

Re: [PATCH] tcp: keepalive fixes

2021-01-13 Thread Enke Chen

Hi, Eric:

Just to clarify: the issues for tcp keepalive and TCP_USER_TIMEOUT are
separate issues, and the fixes would not conflict afaik.

Thanks.  -- Enke

On Tue, Jan 12, 2021 at 11:52:43PM +0100, Eric Dumazet wrote:
> On Tue, Jan 12, 2021 at 11:48 PM Yuchung Cheng  wrote:
> >
> > On Tue, Jan 12, 2021 at 2:31 PM Enke Chen  wrote:
> > >
> > > From: Enke Chen 
> > >
> > > In this patch two issues with TCP keepalives are fixed:
> > >
> > > 1) TCP keepalive does not timeout when there are data waiting to be
> > >delivered and then the connection got broken. The TCP keepalive
> > >timeout is not evaluated in that condition.
> > hi enke
> > Do you have an example to demonstrate this issue -- in theory when
> > there is data inflight, an RTO timer should be pending (which
> > considers user-timeout setting). based on the user-timeout description
> > (man tcp), the user timeout should abort the socket per the specified
> > time after data commences. some data would help to understand the
> > issue.
> >
> 
> +1
> 
> A packetdrill test would be ideal.
> 
> Also, given that there is this ongoing issue with TCP_USER_TIMEOUT,
> let's not mix things or risk added work for backports to stable versions.

Re: [PATCH] tcp: keepalive fixes

2021-01-12 Thread Enke Chen

Hi, Yuchung:

I have attached the python script that reproduces the keepalive issues.
The script is a slight modification of the one written by Marek Majkowski:

https://github.com/cloudflare/cloudflare-blog/blob/master/2019-09-tcp-keepalives/test-zero.py

Please note that only the TCP keepalive is configured, and not the user timeout.

Thanks.  -- Enke

On Tue, Jan 12, 2021 at 02:48:01PM -0800, Yuchung Cheng wrote:
> On Tue, Jan 12, 2021 at 2:31 PM Enke Chen  wrote:
> >
> > From: Enke Chen 
> >
> > In this patch two issues with TCP keepalives are fixed:
> >
> > 1) TCP keepalive does not timeout when there are data waiting to be
> >delivered and then the connection got broken. The TCP keepalive
> >timeout is not evaluated in that condition.
> hi enke
> Do you have an example to demonstrate this issue -- in theory when
> there is data inflight, an RTO timer should be pending (which
> considers user-timeout setting). based on the user-timeout description
> (man tcp), the user timeout should abort the socket per the specified
> time after data commences. some data would help to understand the
> issue.
> 

--
#! /usr/bin/python

import io
import os
import select
import socket
import time
import utils
import ctypes

utils.new_ns()

port = 1

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
s.bind(('127.0.0.1', port))
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024)

s.listen(16)

tcpdump = utils.tcpdump_start(port)
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
c.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024)
c.connect(('127.0.0.1', port))

x, _ = s.accept()

if False:
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 90*1000)

if True:
    c.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)

time.sleep(0.2)
print("[ ] c.send()")
import fcntl
TIOCOUTQ=0x5411
c.setblocking(False)
while True:
    bytes_avail = ctypes.c_int()
    fcntl.ioctl(c.fileno(), TIOCOUTQ, bytes_avail)
    if bytes_avail.value > 64*1024:
        break
    try:
        c.send(b"A" * 16384 * 4)
    except io.BlockingIOError:
        break
c.setblocking(True)
time.sleep(0.2)
utils.ss(port)
utils.check_buffer(c)

t0 = time.time()

if True:
    utils.drop_start(dport=port)
    utils.drop_start(sport=port)

poll = select.poll()
poll.register(c, select.POLLIN)
poll.poll()

utils.ss(port)


e = c.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print("[ ] SO_ERROR = %s" % (e,))

t1 = time.time()
print("[ ] took: %f seconds" % (t1-t0,))

[PATCH] tcp: keepalive fixes

2021-01-12 Thread Enke Chen

From: Enke Chen 

In this patch two issues with TCP keepalives are fixed:

1) TCP keepalive does not timeout when there are data waiting to be
   delivered and then the connection got broken. The TCP keepalive
   timeout is not evaluated in that condition.

   The fix is to remove the code that prevents TCP keepalive from
   being evaluated for timeout.

2) With the fix for #1, TCP keepalive can erroneously timeout after
   the 0-window probe kicks in. The 0-window probe counter is wrongly
   applied to TCP keepalives.

   The fix is to use the elapsed time instead of the 0-window probe
   counter in evaluating TCP keepalive timeout.

Cc: sta...@vger.kernel.org
Signed-off-by: Enke Chen 
---
 net/ipv4/tcp_timer.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6c62b9ea1320..40953aa40d53 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -696,12 +696,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_SYN_SENT)))
goto out;
 
-   elapsed = keepalive_time_when(tp);
-
-   /* It is alive without keepalive 8) */
-   if (tp->packets_out || !tcp_write_queue_empty(sk))
-   goto resched;
-
elapsed = keepalive_time_elapsed(tp);
 
if (elapsed >= keepalive_time_when(tp)) {
@@ -709,16 +703,15 @@ static void tcp_keepalive_timer (struct timer_list *t)
 * to determine when to timeout instead.
 */
if ((icsk->icsk_user_timeout != 0 &&
-   elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
-   icsk->icsk_probes_out > 0) ||
+elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout)) ||
(icsk->icsk_user_timeout == 0 &&
-   icsk->icsk_probes_out >= keepalive_probes(tp))) {
+(elapsed >= keepalive_time_when(tp) +
+ keepalive_intvl_when(tp) * keepalive_probes(tp)))) {
tcp_send_active_reset(sk, GFP_ATOMIC);
tcp_write_err(sk);
goto out;
}
if (tcp_write_wakeup(sk, LINUX_MIB_TCPKEEPALIVE) <= 0) {
-   icsk->icsk_probes_out++;
elapsed = keepalive_intvl_when(tp);
} else {
/* If keepalive was lost due to local congestion,
@@ -732,8 +725,6 @@ static void tcp_keepalive_timer (struct timer_list *t)
}
 
sk_mem_reclaim(sk);
-
-resched:
inet_csk_reset_keepalive_timer (sk, elapsed);
goto out;
 
-- 
2.29.2
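
The elapsed-time check introduced by this patch can be written out as a small sketch. It mirrors the condition in the diff above, in seconds rather than jiffies; the function names are hypothetical, and the sample values are the keepalive settings from the reproducer script posted in this thread (idle 10 s, interval 10 s, 5 probes).

```python
def keepalive_deadline_s(idle_s, intvl_s, probes):
    """Idle time plus one interval per allowed probe."""
    return idle_s + intvl_s * probes


def keepalive_should_abort(elapsed_s, idle_s, intvl_s, probes, user_timeout_s=0):
    """Decide whether the keepalive timer should reset the connection."""
    if elapsed_s < idle_s:
        # Still within the idle period: nothing to evaluate yet.
        return False
    if user_timeout_s:
        # With TCP_USER_TIMEOUT set, that value alone decides.
        return elapsed_s >= user_timeout_s
    return elapsed_s >= keepalive_deadline_s(idle_s, intvl_s, probes)


if __name__ == "__main__":
    print(keepalive_deadline_s(10, 10, 5))        # deadline for the script's settings
    print(keepalive_should_abort(59, 10, 10, 5))
    print(keepalive_should_abort(60, 10, 10, 5))
```

For the reproducer's settings the deadline works out to 10 + 10 * 5 = 60 seconds after the last received packet, independent of how many zero-window probes or retransmits happened in between.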

Re: [PATCH] Revert "tcp: simplify window probe aborting on USER_TIMEOUT"

2021-01-11 Thread Enke Chen

Hi, Neal:

Thank you for testing the reverted patch, and providing the detailed analysis
of the underlying issue with the original patch.

Let me go back to the simple and clean approach using a separate counter, as
we were discussing before.

-- Enke

On Mon, Jan 11, 2021 at 09:58:33AM -0500, Neal Cardwell wrote:
> On Fri, Jan 8, 2021 at 11:38 PM Enke Chen  wrote:
> >
> > From: Enke Chen 
> >
> > This reverts commit 9721e709fa68ef9b860c322b474cfbd1f8285b0f.
> >
> > With the commit 9721e709fa68 ("tcp: simplify window probe aborting
> > on USER_TIMEOUT"), the TCP session does not terminate with
> > TCP_USER_TIMEOUT when data remain untransmitted due to zero window.
> >
> > The number of unanswered zero-window probes (icsk_probes_out) is
> > reset to zero with incoming acks irrespective of the window size,
> > as described in tcp_probe_timer():
> >
> > RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> > as long as the receiver continues to respond probes. We support
> > this by default and reset icsk_probes_out with incoming ACKs.
> >
> > This counter, however, is the wrong one to be used in calculating the
> > duration that the window remains closed and data remain untransmitted.
> > Thanks to Jonathan Maxwell  for diagnosing the
> > actual issue.
> >
> > Cc: sta...@vger.kernel.org
> > Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> > Reported-by: William McCall 
> > Signed-off-by: Enke Chen 
> > ---
> 
> I ran this revert commit through our packetdrill TCP tests, and it's
> causing failures in a ZWP/USER_TIMEOUT test due to interactions with
> this Jan 2019 patch:
> 
> 7f12422c4873e9b274bc151ea59cb0cdf9415cf1
> tcp: always timestamp on every skb transmission
> 
> The issue seems to be that after 7f12422c4873 the skb->skb_mstamp_ns
> is set on every transmit attempt. That means that even skbs that are
> not successfully transmitted have a non-zero skb_mstamp_ns. That means
> that if ZWPs are repeatedly failing to be sent due to severe local
> qdisc congestion, then at this point in the code the start_ts is
> always only 500ms in the past (from TCP_RESOURCE_PROBE_INTERVAL =
> 500ms). That means that if there is severe local qdisc congestion a
> USER_TIMEOUT above 500ms is a NOP, and the socket can live far past
> the USER_TIMEOUT.
> 
> It seems we need a slightly different approach than the revert in this commit.
> 
> neal
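
The failure mode Neal describes can be reproduced with a few lines of simulation. This is a simplified sketch, not the kernel logic: it assumes the probe timer fires every 500 ms under local congestion, that each failed transmit attempt re-stamps the skb, and that the abort check is "now - start_ts > user_timeout" as in the reverted code. The helper name abort_time_ms() is hypothetical.

```python
def abort_time_ms(user_timeout_ms, horizon_ms, restamp_on_failed_send,
                  interval_ms=500):
    """Return the time at which the socket aborts, or None within the horizon."""
    start_ts = 0  # timestamp carried by the skb at the head of the send queue
    now = 0
    while now < horizon_ms:
        now += interval_ms               # probe timer fires
        if now - start_ts > user_timeout_ms:
            return now                   # USER_TIMEOUT exceeded: abort
        if restamp_on_failed_send:
            # The transmit attempt fails locally but still updates the stamp.
            start_ts = now
    return None


if __name__ == "__main__":
    print(abort_time_ms(3000, 60_000, restamp_on_failed_send=True))
    print(abort_time_ms(3000, 60_000, restamp_on_failed_send=False))
```

When every attempt re-stamps the skb, the measured age never exceeds one interval, so any timeout above 500 ms is never reached in this model. Without re-stamping, the same 3 s timeout triggers at the first firing after 3 s.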

[PATCH] Revert "tcp: simplify window probe aborting on USER_TIMEOUT"

2021-01-08 Thread Enke Chen

From: Enke Chen 

This reverts commit 9721e709fa68ef9b860c322b474cfbd1f8285b0f.

With the commit 9721e709fa68 ("tcp: simplify window probe aborting
on USER_TIMEOUT"), the TCP session does not terminate with
TCP_USER_TIMEOUT when data remain untransmitted due to zero window.

The number of unanswered zero-window probes (icsk_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
as long as the receiver continues to respond probes. We support
this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell  for diagnosing the
actual issue.

Cc: sta...@vger.kernel.org
Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall 
Signed-off-by: Enke Chen 
---
 net/ipv4/tcp_timer.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6c62b9ea1320..ad98f2ea89f1 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -346,6 +346,7 @@ static void tcp_probe_timer(struct sock *sk)
struct sk_buff *skb = tcp_send_head(sk);
struct tcp_sock *tp = tcp_sk(sk);
int max_probes;
+   u32 start_ts;
 
if (tp->packets_out || !skb) {
icsk->icsk_probes_out = 0;
@@ -360,13 +361,12 @@ static void tcp_probe_timer(struct sock *sk)
 * corresponding system limit. We also implement similar policy when
 * we use RTO to probe window in tcp_retransmit_timer().
 */
-   if (icsk->icsk_user_timeout) {
-   u32 elapsed = tcp_model_timeout(sk, icsk->icsk_probes_out,
-   tcp_probe0_base(sk));
-
-   if (elapsed >= icsk->icsk_user_timeout)
-   goto abort;
-   }
+   start_ts = tcp_skb_timestamp(skb);
+   if (!start_ts)
+   skb->skb_mstamp_ns = tp->tcp_clock_cache;
+   else if (icsk->icsk_user_timeout &&
+(s32)(tcp_time_stamp(tp) - start_ts) > icsk->icsk_user_timeout)
+   goto abort;
 
max_probes = sock_net(sk)->ipv4.sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
-- 
2.29.2

Re: [PATCH] net: Remove the source address setting in connect() for UDP

2019-09-10 Thread Enke Chen (enkechen)

Hi, David:

Do you still have concerns about backward compatibility of the fix?

I really do not see how existing, working applications would be negatively
impacted by the fix.

Thanks.   -- Enke

-Original Message-
From: "Enke Chen (enkechen)" 
Date: Friday, September 6, 2019 at 12:23 AM
To: David Miller 
Cc: "kuz...@ms2.inr.ac.ru" , "yoshf...@linux-ipv6.org" 
, "net...@vger.kernel.org" , 
"linux-kernel@vger.kernel.org" , 
"xe-linux-external(mailer list)" , "Enke Chen 
(enkechen)" 
Subject: Re: [PATCH] net: Remove the source address setting in connect() for UDP

Hi, David:

Yes, I understand the code has been there for a long time.  But the issues are
real, and it's really nasty when you run into them.  As I described in the
patch log, there is no backward compatibility issue in fixing it.

---
There is no backward compatibility issue here as the source address setting
in connect() is not needed anyway.

  - No impact on the source address selection when the source address
is explicitly specified by "bind()", or by the "IP_PKTINFO" option.

  - In the case that the source address is not explicitly specified,
the selection of the source address would be more accurate and
reliable based on the up-to-date routing table.
---

Thanks.  -- Enke

-Original Message-
From:  on behalf of David Miller 

Date: Friday, September 6, 2019 at 12:14 AM
To: "Enke Chen (enkechen)" 
Cc: "kuz...@ms2.inr.ac.ru" , "yoshf...@linux-ipv6.org" 
, "net...@vger.kernel.org" , 
"linux-kernel@vger.kernel.org" , 
"xe-linux-external(mailer list)" 
Subject: Re: [PATCH] net: Remove the source address setting in connect() for UDP

From: Enke Chen 
Date: Thu,  5 Sep 2019 19:54:37 -0700

> The connect() system call for a UDP socket is for setting the destination
> address and port. But the current code mistakenly sets the source address
> for the socket as well. Remove the source address setting in connect() for
> UDP in this patch.

Do you have any idea how many decades of precedence this behavior has and
therefore how much you potentially will break userspace?

This boat has sailed a long time ago I'm afraid.

Re: [PATCH] net: Remove the source address setting in connect() for UDP

2019-09-06 Thread Enke Chen (enkechen)

Hi, David:

Yes, I understand the code has been there for a long time.  But the issues are
real, and it's really nasty when you run into them.  As I described in the
patch log, there is no backward compatibility issue in fixing it.

---
There is no backward compatibility issue here as the source address setting
in connect() is not needed anyway.

  - No impact on the source address selection when the source address
is explicitly specified by "bind()", or by the "IP_PKTINFO" option.

  - In the case that the source address is not explicitly specified,
the selection of the source address would be more accurate and
reliable based on the up-to-date routing table.
---

Thanks.  -- Enke

-Original Message-
From:  on behalf of David Miller 

Date: Friday, September 6, 2019 at 12:14 AM
To: "Enke Chen (enkechen)" 
Cc: "kuz...@ms2.inr.ac.ru" , "yoshf...@linux-ipv6.org" 
, "net...@vger.kernel.org" , 
"linux-kernel@vger.kernel.org" , 
"xe-linux-external(mailer list)" 
Subject: Re: [PATCH] net: Remove the source address setting in connect() for UDP

From: Enke Chen 
Date: Thu,  5 Sep 2019 19:54:37 -0700

> The connect() system call for a UDP socket is for setting the destination
> address and port. But the current code mistakenly sets the source address
> for the socket as well. Remove the source address setting in connect() for
> UDP in this patch.

Do you have any idea how many decades of precedence this behavior has and
therefore how much you potentially will break userspace?

This boat has sailed a long time ago I'm afraid.

[PATCH] net: Remove the source address setting in connect() for UDP

2019-09-05 Thread Enke Chen

The connect() system call for a UDP socket is for setting the destination
address and port. But the current code mistakenly sets the source address
for the socket as well. Remove the source address setting in connect() for
UDP in this patch.

Implications of the bug:

  - Packet drop:

On a multi-homed device, an address assigned to any interface may
qualify as a source address when originating a packet. If needed, the
IP_PKTINFO option can be used to explicitly specify the source address.
But with the source address being mistakenly set for the socket in
connect(), a return packet (for the socket) destined to an interface
address different from that source address would be wrongly dropped
due to address mismatch.

This can be reproduced easily. The dropped packets are shown in the
following output by "netstat -s" for UDP:

  xxx packets to unknown port received

  - Source address selection:

The source address, if unspecified via "bind()" or IP_PKTINFO, should
be determined by routing at the time of packet origination, and not at
the time when the connect() call is made. The difference matters as
routing can change, e.g., by interface down/up events, and using a
source address of a "down" interface is known to be problematic.

There is no backward compatibility issue here as the source address setting
in connect() is not needed anyway.

  - No impact on the source address selection when the source address
is explicitly specified by "bind()", or by the "IP_PKTINFO" option.

  - In the case that the source address is not explicitly specified,
the selection of the source address would be more accurate and
reliable based on the up-to-date routing table.

Signed-off-by: Enke Chen 
---
 net/ipv4/datagram.c |  7 ---
 net/ipv6/datagram.c | 15 +--
 2 files changed, 1 insertion(+), 21 deletions(-)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abff1350..4065808ec6c1 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -64,13 +64,6 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
err = -EACCES;
goto out;
}
-   if (!inet->inet_saddr)
-   inet->inet_saddr = fl4->saddr;  /* Update source address */
-   if (!inet->inet_rcv_saddr) {
-   inet->inet_rcv_saddr = fl4->saddr;
-   if (sk->sk_prot->rehash)
-   sk->sk_prot->rehash(sk);
-   }
inet->inet_daddr = fl4->daddr;
inet->inet_dport = usin->sin_port;
sk->sk_state = TCP_ESTABLISHED;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index ecf440a4f593..80388cd50dc3 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -197,19 +197,6 @@ int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
goto out;
 
ipv6_addr_set_v4mapped(inet->inet_daddr, &sk->sk_v6_daddr);
-
-   if (ipv6_addr_any(&np->saddr) ||
-   ipv6_mapped_addr_any(&np->saddr))
-   ipv6_addr_set_v4mapped(inet->inet_saddr, &np->saddr);
-
-   if (ipv6_addr_any(&sk->sk_v6_rcv_saddr) ||
-   ipv6_mapped_addr_any(&sk->sk_v6_rcv_saddr)) {
-   ipv6_addr_set_v4mapped(inet->inet_rcv_saddr,
-  &sk->sk_v6_rcv_saddr);
-   if (sk->sk_prot->rehash)
-   sk->sk_prot->rehash(sk);
-   }
-
goto out;
}
 
@@ -247,7 +234,7 @@ int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
 *  destination cache for it.
 */
 
-   err = ip6_datagram_dst_update(sk, true);
+   err = ip6_datagram_dst_update(sk, false);
if (err) {
/* Restore the socket peer info, to keep it consistent with
 * the old socket state
-- 
2.19.1
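
The behavior this patch proposes to remove can be observed from userspace. The following is a minimal sketch, assuming a Linux host with the loopback interface up; the destination 127.0.0.1:9 is an arbitrary choice and the helper name is hypothetical. No packet is sent, since connect() on a UDP socket only records addresses.

```python
import socket


def udp_source_before_and_after_connect(dest=("127.0.0.1", 9)):
    """Report the socket's local address before and after connect()."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        before = s.getsockname()   # unbound socket: wildcard address, port 0
        s.connect(dest)            # sets the destination for the socket
        after = s.getsockname()    # the kernel has also chosen a source
    return before, after


if __name__ == "__main__":
    before, after = udp_source_before_and_after_connect()
    print("before connect():", before)
    print("after connect(): ", after)
```

On current kernels the local address changes from the wildcard to a concrete address once connect() returns, which is the source address setting discussed in the patch description.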

Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

2018-11-29 Thread Enke Chen

Hi, Dave:

On 11/29/18 3:55 AM, Dave Martin wrote:
>> Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it
>> was removed after discussion:
>>
>> v3 --> v4:
>>
>> Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
>> including:
>> o remove the definition CLD_PREDUMP.
>> ---
>>
>> You can look at the discussions in the email thread, in particular several
>> issues pointed out by Eric Biederman, and my reply to Eric.
> 
> Ah, right.
> 
>> There are two models: 1:1 (one process manager with one child process), and
>> 1:N (one process manager with many child processes). A legacy signal like
>> SIGCHLD would not work in the 1:N model due to compression/loss of legacy
>> signals. One needs to use an RT signal in that case.
> 
> SIGCHLD can be redirected to an RT signal via clone().  Are you saying
> the signal is still not queued in that case?  I had assumed that things
> like pthreads rely on this working.
> 
> However, one detail I had missed is that only child exits are reported
> via the exit signal set by clone().  Other child status changes seem
> to be reported via SIGCHLD always.
> 
> Making your supervised processes into clone-children might interact
> badly with pthreads if it uses wait(__WCLONE) internally.  I've not
> looked into that.

As Oleg commented before:

   And once again, SIGCHLD/SIGUSR do not queue, this means that
   PR_SET_PREDUMP_SIG is pointless if you have 2 or more children.

In addition, there is really no need to introduce new semantics to SIGCHLD.
There are enough signals available for one to be designated in the parent
process for the pre-coredump notification.

> 
>> One more point in my reply:
>>
>> When an application chooses a signal for pre-coredump notification, it is
>> much simpler and more robust for the signal to be dedicated to that purpose
>> (in the parent) and not be mixed with other semantics. The "signo + pid"
>> should be sufficient for the parent process in both 1:1 and 1:N models.
> 
> What if the signal queue overflows?  sigqueue() returns EAGAIN, but I
> think that signals queued by the kernel would simply be lost.  This
> probably won't happen in any non-pathological scenario, but the process
> manager may just silently go wrong instead of failing cleanly when/if
> this happens.

As pointed out by Oleg:

   see the legacy_queue() check. Any signal < SIGRTMIN do not queue. IOW, if
   SIGCHLD is already pending, then next SIGCHLD is simply ignored.

I went through the code and confirmed it.
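
The queueing rule Oleg refers to can be modelled in a few lines. This is a toy model of the rule as described above, not the kernel's implementation: the PendingSignals class and the pids are made up for illustration, and SIGRTMIN is taken from Python's signal module on Linux.

```python
import signal


class PendingSignals:
    """Toy model: signals below SIGRTMIN coalesce, real-time signals queue."""

    def __init__(self):
        self.queue = []  # (signo, sender_pid) entries awaiting delivery

    def send(self, signo, sender_pid):
        """Queue a signal; return False if it is dropped as a duplicate."""
        already_pending = any(s == signo for s, _ in self.queue)
        if signo < signal.SIGRTMIN and already_pending:
            return False  # a pending legacy signal absorbs the new one
        self.queue.append((signo, sender_pid))
        return True


if __name__ == "__main__":
    legacy, realtime = PendingSignals(), PendingSignals()
    for pid in (101, 102, 103):          # three children dumping core at once
        legacy.send(signal.SIGCHLD, pid)
        realtime.send(signal.SIGRTMIN, pid)
    print("legacy:  ", legacy.queue)     # only the first child's pid survives
    print("realtime:", realtime.queue)   # one entry per child
```

In the 1:N case the parent would learn about only one of the three children with a legacy signal, which is why a real-time signal is suggested for that model.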

> 
> SIGCHLD + wait() is immune to this problem for other child status
> notifications (albeit with higher overhead).
> 
> Unless I've missed something fundamental, signals simply aren't a
> reliable data transport: if you need 100% reliability, you need to be
> using another mechanism, either in combination with a signal, or by
> itself.

Given the right signo, e.g., an RT signal for both models, or SIGUSR1/SIGUSR2
for the 1:1 model, the pre-coredump signal notification is 100% reliable, and
it is the simplest solution.

When there are many child processes for the 1:N model, if needed, there is an
API for enlarging the queue limit:

setrlimit(RLIMIT_SIGPENDING, xxx);
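
For a process manager written in Python, the same limit can be adjusted through the standard resource module. This is a sketch assuming Linux, where RLIMIT_SIGPENDING is available; the helper name and the target of 4096 are arbitrary illustration values, and the soft limit is never raised beyond the hard limit.

```python
import resource


def raise_soft_sigpending(target):
    """Raise the soft RLIMIT_SIGPENDING towards `target`; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_SIGPENDING)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)       # an unprivileged process cannot exceed hard
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft                      # already large enough, leave it alone
    resource.setrlimit(resource.RLIMIT_SIGPENDING, (target, hard))
    return target


if __name__ == "__main__":
    print("soft limit now:", raise_soft_sigpending(4096))
```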

> 
>>>>
>>>> Signed-off-by: Enke Chen 
>>>> Reviewed-by: Oleg Nesterov 
>>>> ---
>>>> v4 -> v5:
>>>> Addressed review comments from Oleg Nesterov:
>>>> o use rcu_read_lock instead.
>>>> o revert back to notify the real_parent.
>>>>
>>>>  fs/coredump.c| 23 +++
>>>>  fs/exec.c|  3 +++
>>>>  include/linux/sched/signal.h |  3 +++
>>>>  include/uapi/linux/prctl.h   |  4 
>>>>  kernel/sys.c | 13 +
>>>>  5 files changed, 46 insertions(+)
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index e42e17e..740b1bb 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>>>>return err;
>>>>  }
>>>>  
>>>> +/*
>>>> + * While do_notify_parent() notifies the parent of a child's death post
>>>> + * its coredump, this function lets the parent (if so desired) know about
>>>> + * the imminent death of a child just prior to its coredump.
>>>> + */
>>>> +static void do_notify_parent_predump(void)
>>>> +{
>>>> +

Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

2018-11-28 Thread Enke Chen

Hi, Dave:

Thanks for your comments. You have indeed missed some of the prior reviews
and discussions. But that is OK.

Please see my replies inline.

On 11/28/18 7:19 AM, Dave Martin wrote:
> On Tue, Nov 27, 2018 at 10:54:41PM +0000, Enke Chen wrote:
>> [Repost as a series, as suggested by Andrew Morton]
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal for such a notification.
>>
>> Changes to prctl(2):
>>
>>    PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>           Set the child pre-coredump signal of the calling process to
>>           arg2 (either a signal value in the range 1..maxsig, or 0 to
>>           clear). This is the signal that the calling process will get
>>           prior to the coredump of a child process. This value is
>>           cleared across execve(2), or for the child of a fork(2).
>>
>>    PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>>           Return the current value of the child pre-coredump signal,
>>           in the location pointed to by (int *) arg2.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>>
>> One application is BFD. The early fault notification is a critical
>> component for maintaining BFD sessions (with a timeout value of
>> 50 msec or 100 msec) across a control-plane failure.
>>
>> Currently there are two ways for a parent process to be notified of a
>> child process's state change. One is to use the POSIX signal, and
>> another is to use the kernel connector module. The specific events and
>> actions are summarized as follows:
>>
>> Process Event    POSIX Signal                Connector-based
>> ----------------------------------------------------------------------
>> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>                  SIGCHLD / CLD_STOPPED
>>
>> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>                  SIGCHLD / CLD_CONTINUED
>>
>> pre_coredump/    N/A                         proc_coredump_connector()
>> get_signal()
>>
>> post_coredump/   do_notify_parent()          proc_exit_connector()
>> do_exit()        SIGCHLD / exit_signal
>> ----------------------------------------------------------------------
>>
>> As shown in the table, the signal-based pre-coredump notification is not
>> currently available. In some cases using a connector-based notification
>> can be quite complicated (e.g., when a process manager is written in shell
>> scripts and thus is subject to certain inherent limitations), and a
>> signal-based notification would be simpler and better suited.
> 
> Since this is a notification of a change of process status, would it be
> more natural to send it through SIGCHLD?
> 
> As with other supplementary child status events, a flag could be added
> for wait and sigaction.sa_flags to indicate whether the parent wants
> this event to be reported or not.
> 
> Then a suitable CLD_XXX could be defined for this, and we could
> piggyback on PR_{SET,GET}_PDEATHSIG rather than having to have something
> new.
> 

> (I hadn't been watching this thread closely, so apologies if this has
> been discussed already.)

Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it
was removed after discussion:

v3 --> v4:

Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
including:
o remove the definition CLD_PREDUMP.
---

You can look at the discussions in the email thread, in particular several
issues pointed out by Eric Biederman, and my reply to Eric.

There are two models: 1:1 (one process manager with one child process), and 1:N
(one process manager with many child processes). A legacy signal like SIGCHLD
would not work in the 1:N model due to compression/loss of legacy signals. One
needs to use an RT signal in that case.

One more point in my reply:

When an application chooses a signal for pre-coredump notification, it is much
simpler and more robust for the signal to be dedicated to that purpose (in the
parent) and not be mixed with other semantics. The "signo + pid" should be
sufficient
f

[PATCH v5 2/2] selftests/prctl: selftest for pre-coredump signal notification

2018-11-27 Thread Enke Chen

[Repost as a series, as suggested by Andrew Morton]

Selftest for the pre-coredump signal notification

Signed-off-by: Enke Chen 
---
 tools/testing/selftests/prctl/Makefile   |   2 +-
 tools/testing/selftests/prctl/predump-sig-test.c | 160 +++
 2 files changed, 161 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index c7923b2..f8d60d5 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 ifeq ($(ARCH),x86)
 TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
-   disable-tsc-test
+   disable-tsc-test predump-sig-test
 all: $(TEST_PROGS)
 
 include ../lib.mk
diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
new file mode 100644
index 000..15d62691
--- /dev/null
+++ b/tools/testing/selftests/prctl/predump-sig-test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
+ *
+ * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
+ *
+ * When set with prctl(), the specified signal is sent to the parent process
+ * prior to the coredump of a child process.
+ *
+ * Usage: ./predump-sig-test {SIGUSR1 | SIGRT2}
+ */
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#ifndef PR_SET_PREDUMP_SIG
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+#endif
+
+#define SIGRT2 (SIGRTMIN + 1)
+
+#define handle_error(msg) \
+	do { perror(msg); exit(EXIT_FAILURE); } while (0)
+
+static int sig_idx;
+static siginfo_t siginfo_rcv[2];
+
+static void sigaction_func(int sig, siginfo_t *siginfo, void *arg)
+{
+	memcpy(&siginfo_rcv[sig_idx], siginfo, sizeof(siginfo_t));
+	sig_idx++;
+}
+
+static int set_sigaction(int sig)
+{
+	struct sigaction new_action;
+	int rc;
+
+	memset(&new_action, 0, sizeof(struct sigaction));
+	new_action.sa_sigaction = sigaction_func;
+	new_action.sa_flags = SA_SIGINFO;
+	sigemptyset(&new_action.sa_mask);
+
+	return sigaction(sig, &new_action, NULL);
+}
+
+static int test_prctl(int sig)
+{
+	int sig2, rc;
+
+	rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
+	if (rc < 0)
+		handle_error("prctl: setting");
+
+	rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
+	if (rc < 0)
+		handle_error("prctl: getting");
+
+	if (sig2 != sig) {
+		printf("prctl: sig %d, post %d\n", sig, sig2);
+		return -1;
+	}
+	return 0;
+}
+
+static void child_fn(void)
+{
+	int rc, sig;
+
+	printf("\nChild pid: %ld\n", (long)getpid());
+
+	/* Test: Child should not inherit the predump_signal */
+	rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
+	if (rc < 0)
+		handle_error("prctl: child");
+
+	printf("child: predump_signal %d\n", sig);
+
+	/* Force coredump here */
+	printf("child: calling abort()\n");
+	fflush(stdout);
+	abort();
+}
+
+static int parent_fn(pid_t child_pid)
+{
+	int i, status, count;
+	siginfo_t *si;
+	pid_t w;
+
+	for (count = 0; count < 2; count++) {
+		w = waitpid(child_pid, &status, 0);
+		printf("\nwaitpid: %d\n", w);
+		if (w < 0)
+			perror("waitpid");
+
+		si = &siginfo_rcv[count];
+		printf("signal: si_signo %d, si_pid %ld, si_uid %d\n",
+		       si->si_signo, (long)si->si_pid, si->si_uid);
+		printf("siginfo: si_errno %d, si_code %d, si_status %d\n",
+		       si->si_errno, si->si_code, si->si_status);
+	}
+	fflush(stdout);
+}
+
+int main(int argc, char *argv[])
+{
+	pid_t child_pid;
+	int rc, signo;
+
+	if (argc != 2) {
+		printf("invalid number of arguments\n");
+		exit(EXIT_FAILURE);
+	}
+
+	if (strcmp(argv[1], "SIGUSR1") == 0)
+		signo = SIGUSR1;
+	else if (strcmp(argv[1], "SIGRT2") == 0)
+		signo = SIGRT2;
+	else {
+		printf("invalid argument for signal\n");
+		fflush(stdout);
+		exit(EXIT_FAILURE);
+	}
+
+	rc = set_sigaction(SIGCHLD);
+	if (rc < 0)
+		handle_error("set_sigaction: SIGCHLD");
+
+	if (signo != SIGCHLD) {
+		rc = set_sigaction(signo);
+		if (rc < 0)
+			handle_error("set_sigaction: S

[PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

2018-11-27 Thread Enke Chen

[Repost as a series, as suggested by Andrew Morton]

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification.

Changes to prctl(2):

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
          Set the child pre-coredump signal of the calling process to
          arg2 (either a signal value in the range 1..maxsig, or 0 to
          clear). This is the signal that the calling process will get
          prior to the coredump of a child process. This value is
          cleared across execve(2), or for the child of a fork(2).

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
          Return the current value of the child pre-coredump signal,
          in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

One application is BFD. The early fault notification is a critical
component for maintaining BFD sessions (with a timeout value of
50 msec or 100 msec) across a control-plane failure.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen 
Reviewed-by: Oleg Nesterov 
---
v4 -> v5:
Addressed review comments from Oleg Nesterov:
o use rcu_read_lock instead.
o revert back to notify the real_parent.

 fs/coredump.c                | 23 +++++++++++++++++++++++
 fs/exec.c                    |  3 +++
 include/linux/sched/signal.h |  3 +++
 include/uapi/linux/prctl.h   |  4 ++++
 kernel/sys.c                 | 13 +++++++++++++
 5 files changed, 46 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..740b1bb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+static void do_notify_parent_predump(void)
+{
+	struct task_struct *parent;
+	int sig;
+
+	rcu_read_lock();
+	parent = rcu_dereference(current->real_parent);
+	sig = parent->signal->predump_signal;
+	if (sig != 0)
+		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+	rcu_read_unlock();
+}
+
 void do_coredump(const kernel_siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	if (retval < 0)
 		goto fail_creds;
 
+	/*
+	 * Send the pre-coredump signal to the parent if requested.
+	 */
+	do_notify_parent_predump();
+
 	old_cred = override_creds(cred);
 
 	ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
+	/* Clear the pre-coredump signal before loading a new binary */
+	sig->predump_signal = 0;
+
 #ifdef CONFIG_POSIX_TIMERS
 	exit_itimers(sig);
 	flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..728ef68 100644
--- a/include/linux/sched/signal.h
+++ b/include

Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

2018-11-21 Thread Enke Chen

Hi, Andrew:

As suggested, I will post them as a patch series (with the same version v5):

[PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification
[PATCH v5 2/2] selftests/prctl: selftest for pre-coredump signal notification

I have a diff for the manpage as well. I guess that it should be submitted
separately from the code.

Thanks.  -- Enke

On 11/21/18 5:33 PM, Andrew Morton wrote:
> On Wed, 21 Nov 2018 17:09:50 -0800 Enke Chen  wrote:
> 
>> Hi, Andrew:
>>
>> On 11/21/18 4:37 PM, Andrew Morton wrote:
>>> On Tue, 30 Oct 2018 17:46:29 +0100 Oleg Nesterov  wrote:
>>>
>>>> On 10/29, Enke Chen wrote:
>>>>>
>>>>> Reviewed-by: Oleg Nesterov 
>>>>
>>>> Hmm. I didn't say this ;)
>>>>
>>>> But OK, feel free to keep this tag.
>>>>
>>>> I do not like this feature.
>>>
>>> Why is that?
>>>
>>>> But I see no technical problems in this version
>>>> and I never pretended I understand the user-space needs, so I won't argue.
>>>
>>> The changelog appears to spell this all out quite well?  Unusually
>>> well, in my experience ;)
>>
>> I also followed up with a little more explanation in the email thread on
>> 10/30/2018:
>>
>> ---
>> As I explained earlier, the primary application is in the area of network
>> high-availability / non-stop-forwarding where early fault notification and
>> early action can help maintain BFD sessions and thus avoid unnecessary
>> disruption to forwarding while the control-plane is recovering.
>> ---
>>
>> BTW, I probably should have pointed this out earlier:
>>
>> BFD stands for "RFC 5880: Bi-directional forwarding detection".
> 
> I saw that.  My point is that your above followup wasn't necessary -
> the changelog is clear!
> 
>>>
>>> - As it's a linux-specific feature, a test under
>>>   tools/testing/selftests would be appropriate.  I don't know how much
>>>   that work will be. 
>>
>> The selftest code was submitted on 10/25/2018:
>>
>>[PATCH] selftests/prctl: selftest for pre-coredump signal notification
> 
> OK, please prepare these as a patch series.
> 
>>> Do we have other linux-specific signal extensions which could piggyback
>>> onto that?
>>
>> No. There are enough existing signals that an application can choose for this
>> purpose, such as SIGUSR1, SIGUSR2, and any of the RT signals.
>>
> 
> My point is that if we have previously added any linux-specific signal
> extensions then your selftest patch would be an appropriate place where
> we could add tests for those features.  I'm not saying that you should
> add such tests at this time, but please do prepare the selftest as a
> thing which tests linux-specific signal extensions in general, not as a
> thing which tests pre-coredump signals only.
>

Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

2018-11-21 Thread Enke Chen

Hi, Andrew:

On 11/21/18 5:09 PM, Enke Chen wrote:
> Hi, Andrew:
> 
> On 11/21/18 4:37 PM, Andrew Morton wrote:

>> Do we have other linux-specific signal extensions which could piggyback onto
>> that?
> 
> No. There are enough existing signals that an application can choose for this
> purpose, such as SIGUSR1, SIGUSR2, and any of the RT signals.
> 
> Thanks.  -- Enke
> 

Let me clarify: I meant that this feature is complete by itself. But you seem
to be asking a somewhat different question, whose answer I am not aware of.

Thanks.  -- Enke

Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

2018-11-21 Thread Enke Chen

Hi, Andrew:

On 11/21/18 4:37 PM, Andrew Morton wrote:
> On Tue, 30 Oct 2018 17:46:29 +0100 Oleg Nesterov  wrote:
> 
>> On 10/29, Enke Chen wrote:
>>>
>>> Reviewed-by: Oleg Nesterov 
>>
>> Hmm. I didn't say this ;)
>>
>> But OK, feel free to keep this tag.
>>
>> I do not like this feature.
> 
> Why is that?
> 
>> But I see no technical problems in this version
>> and I never pretended I understand the user-space needs, so I won't argue.
> 
> The changelog appears to spell this all out quite well?  Unusually
> well, in my experience ;)

I also followed up with a little more explanation in the email thread on
10/30/2018:

---
As I explained earlier, the primary application is in the area of network
high-availability / non-stop-forwarding where early fault notification and
early action can help maintain BFD sessions and thus avoid unnecessary
disruption to forwarding while the control-plane is recovering.
---

BTW, I probably should have pointed this out earlier:

BFD stands for "Bidirectional Forwarding Detection" (RFC 5880).

> 
> A couple of things...
> 
> - We'll be looking for a manpage update please.  (Search MAINTAINERS
>   for "manpage")

Yes, I will submit a manpage update.  Most of the text is already
written in the patch description.

> 
> - As it's a linux-specific feature, a test under
>   tools/testing/selftests would be appropriate.  I don't know how much
>   that work will be. 

The selftest code was submitted on 10/25/2018:

   [PATCH] selftests/prctl: selftest for pre-coredump signal notification

> Do we have other linux-specific signal extensions which could piggyback onto 
> that?

No. There are enough existing signals that an application can choose for this
purpose, such as SIGUSR1, SIGUSR2, and any of the RT signals.

Thanks.  -- Enke
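
To make the point about signal choice concrete, here is a minimal user-space sketch of reserving a real-time signal and catching it with a SA_SIGINFO handler. It is an illustration only and not part of the thread: the helper names rt_signal(), install_handler() and note_signal() are hypothetical, and the sketch merely assumes a POSIX system that provides SIGRTMIN and SIGRTMAX.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Written by the handler; sig_atomic_t is safe to store from a handler. */
static volatile sig_atomic_t last_signo;

/* Hypothetical handler: only records which signal arrived. */
static void note_signal(int sig, siginfo_t *info, void *ctx)
{
	(void)info;
	(void)ctx;
	last_signo = sig;
}

/* Install note_signal() for sig. Returns 0 on success, -1 on error. */
static int install_handler(int sig)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = note_signal;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(sig, &sa, NULL);
}

/*
 * Hypothetical helper: return the n-th real-time signal (n = 0 is
 * SIGRTMIN), or -1 if n falls outside SIGRTMIN..SIGRTMAX.
 */
static int rt_signal(int n)
{
	int sig = SIGRTMIN + n;

	return (n >= 0 && sig <= SIGRTMAX) ? sig : -1;
}
```

A process manager could call install_handler(rt_signal(1)) for the same signal the selftest calls SIGRT2, or install_handler(SIGUSR2) if it prefers a classic signal.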

Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

2018-11-12 Thread Enke Chen

Hi, Folks:

Could you please take care of this patch, [PATCH v5]?

Thanks.  -- Enke

On 10/29/18 3:31 PM, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal for such a notification.
> 
> Changes to prctl(2):
> 
>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>   Set the child pre-coredump signal of the calling process to
>   arg2 (either a signal value in the range 1..maxsig, or 0 to
>   clear). This is the signal that the calling process will get
>   prior to the coredump of a child process. This value is
>   cleared across execve(2), or for the child of a fork(2).
> 
>PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>   Return the current value of the child pre-coredump signal,
>   in the location pointed to by (int *) arg2.
> 
> Background:
> 
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
> 
> One application is BFD. The early fault notification is a critical
> component for maintaining BFD sessions (with a timeout value of
> 50 msec or 100 msec) across a control-plane failure.
> 
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
> 
> Process Event    POSIX Signal                Connector-based
> ----------------------------------------------------------------------
> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>                  SIGCHLD / CLD_STOPPED
> 
> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>                  SIGCHLD / CLD_CONTINUED
> 
> pre_coredump/    N/A                         proc_coredump_connector()
> get_signal()
> 
> post_coredump/   do_notify_parent()          proc_exit_connector()
> do_exit()        SIGCHLD / exit_signal
> ----------------------------------------------------------------------
> 
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
> 
> Signed-off-by: Enke Chen 
> Reviewed-by: Oleg Nesterov 
> ---
> v4 -> v5:
> Addressed review comments from Oleg Nesterov:
> o use rcu_read_lock instead.
> o revert back to notify the real_parent.
> 
>  fs/coredump.c                | 23 +++++++++++++++++++++++
>  fs/exec.c                    |  3 +++
>  include/linux/sched/signal.h |  3 +++
>  include/uapi/linux/prctl.h   |  4 ++++
>  kernel/sys.c                 | 13 +++++++++++++
>  5 files changed, 46 insertions(+)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e42e17e..740b1bb 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>  	return err;
>  }
>  
> +/*
> + * While do_notify_parent() notifies the parent of a child's death post
> + * its coredump, this function lets the parent (if so desired) know about
> + * the imminent death of a child just prior to its coredump.
> + */
> +static void do_notify_parent_predump(void)
> +{
> +	struct task_struct *parent;
> +	int sig;
> +
> +	rcu_read_lock();
> +	parent = rcu_dereference(current->real_parent);
> +	sig = parent->signal->predump_signal;
> +	if (sig != 0)
> +		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
> +	rcu_read_unlock();
> +}
> +
>  void do_coredump(const kernel_siginfo_t *siginfo)
>  {
>  	struct core_state core_state;
> @@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	if (retval < 0)
>  		goto fail_creds;
>  
> +	/*
> +	 * Send the pre-coredump signal to the parent if requested.
> +	 */
> +	do_notify_parent_predump();
> +
>  	old_cred = override_creds(cred);
>  
>   ispipe = format_co

Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

2018-10-30 Thread Enke Chen

Hi, Oleg:

On 10/30/18 9:46 AM, Oleg Nesterov wrote:
> On 10/29, Enke Chen wrote:
>>
>> Reviewed-by: Oleg Nesterov 
> 
> Hmm. I didn't say this ;)
> 
> But OK, feel free to keep this tag.

Thanks.

> 
> I do not like this feature. But I see no technical problems in this version
> and I never pretended I understand the user-space needs, so I won't argue.

As I explained earlier, the primary application is in the area of network
high-availability / non-stop-forwarding where early fault notification and
early action can help maintain BFD sessions and thus avoid unnecessary
disruption to forwarding while the control-plane is recovering.

-- Enke

[PATCH v5] kernel/signal: Signal-based pre-coredump notification

2018-10-29 Thread Enke Chen

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification.

Changes to prctl(2):

    PR_SET_PREDUMP_SIG (since Linux 4.20.x)
           Set the child pre-coredump signal of the calling process to
           arg2 (either a signal value in the range 1..maxsig, or 0 to
           clear). This is the signal that the calling process will get
           prior to the coredump of a child process. This value is
           cleared across execve(2), or for the child of a fork(2).

    PR_GET_PREDUMP_SIG (since Linux 4.20.x)
           Return the current value of the child pre-coredump signal,
           in the location pointed to by (int *) arg2.
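
As an illustration only (not part of the patch), a caller might wrap the two prctl commands described above roughly as follows. The sketch assumes the headers come from a kernel tree that carries this patch: PR_SET_PREDUMP_SIG and PR_GET_PREDUMP_SIG are not defined otherwise, and in that case the hypothetical helpers set_predump_signal() and get_predump_signal() fail with ENOSYS rather than guess a command number. The local range check against SIGRTMAX is also an assumption about what "maxsig" means when seen from user space.

```c
#include <errno.h>
#include <signal.h>
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: ask for `sig` to be delivered to the caller
 * before one of its children dumps core; sig == 0 clears the setting.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_predump_signal(int sig)
{
	/* Assumption: treat SIGRTMAX as the user-space view of "maxsig". */
	if (sig < 0 || sig > SIGRTMAX) {
		errno = EINVAL;
		return -1;
	}
#ifdef PR_SET_PREDUMP_SIG
	return prctl(PR_SET_PREDUMP_SIG, (unsigned long)sig, 0, 0, 0);
#else
	/* Headers without the patch: report "not supported". */
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical wrapper: read the current setting back into *sig.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int get_predump_signal(int *sig)
{
#ifdef PR_GET_PREDUMP_SIG
	return prctl(PR_GET_PREDUMP_SIG, sig, 0, 0, 0);
#else
	(void)sig;
	errno = ENOSYS;
	return -1;
#endif
}
```

Because the value is cleared across execve(2) and is not inherited by the child of a fork(2), a process manager built this way would call set_predump_signal() once in each process that wants the notification.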

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

One application is BFD. The early fault notification is a critical
component for maintaining BFD sessions (with a timeout value of
50 msec or 100 msec) across a control-plane failure.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.
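
For comparison, the existing post-coredump path in the last row of the table can be observed from user space with nothing more than fork(2) and waitpid(2). The sketch below is an illustration only, not code from this patch; the helper name wait_for_crashing_child() is hypothetical. It shows that the parent learns about the crash only once the child has finished dying, which is the delay the pre-coredump signal is meant to avoid.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for WCOREDUMP() on glibc */
#endif
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that calls abort(), then wait for it to die.
 * Returns the terminating signal number, or -1 on error or if the
 * child did not die from a signal. If dumped is non-NULL, *dumped is
 * set to non-zero when the kernel reports that a core was written.
 */
static int wait_for_crashing_child(int *dumped)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		abort();	/* child: terminates with SIGABRT */

	/* Parent: this returns only after any coredump has completed. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (!WIFSIGNALED(status))
		return -1;
	if (dumped)
		*dumped = WCOREDUMP(status);
	return WTERMSIG(status);
}
```

Whether a core file is actually written depends on the core size limit and core_pattern of the system running the sketch, so the value stored in *dumped may differ from machine to machine.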

Signed-off-by: Enke Chen 
Reviewed-by: Oleg Nesterov 
---
v4 -> v5:
Addressed review comments from Oleg Nesterov:
o use rcu_read_lock instead.
o revert back to notify the real_parent.

 fs/coredump.c                | 23 +++++++++++++++++++++++
 fs/exec.c                    |  3 +++
 include/linux/sched/signal.h |  3 +++
 include/uapi/linux/prctl.h   |  4 ++++
 kernel/sys.c                 | 13 +++++++++++++
 5 files changed, 46 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..740b1bb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+static void do_notify_parent_predump(void)
+{
+	struct task_struct *parent;
+	int sig;
+
+	rcu_read_lock();
+	parent = rcu_dereference(current->real_parent);
+	sig = parent->signal->predump_signal;
+	if (sig != 0)
+		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+	rcu_read_unlock();
+}
+
 void do_coredump(const kernel_siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	if (retval < 0)
 		goto fail_creds;
 
+	/*
+	 * Send the pre-coredump signal to the parent if requested.
+	 */
+	do_notify_parent_predump();
+
 	old_cred = override_creds(cred);
 
 	ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
+	/* Clear the pre-coredump signal before loading a new binary */
+	sig->predump_signal = 0;
+
 #ifdef CONFIG_POSIX_TIMERS
 	exit_itimers(sig);
 	flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..728ef68 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@ struct signal

Re: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

2018-10-29 Thread Enke Chen

Hi, Oleg:

Yes, it should be the "real_parent" that is more interested in the notification.
Will revert back.

+static void do_notify_parent_predump(void)
+{
+	struct task_struct *parent;
+	int sig;
+
+	rcu_read_lock();
+	parent = rcu_dereference(current->real_parent);
+	sig = parent->signal->predump_signal;
+	if (sig != 0)
+		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+	rcu_read_unlock();
+}

Thanks.  -- Enke

On 10/29/18 4:18 AM, Oleg Nesterov wrote:
> Hi,
> 
> On 10/26, Enke Chen wrote:
>>
>> This is really a good idea given that "parent" is declared as RCU-protected.
>> Just a bit odd, though, that the "parent" has not been accessed this way in
>> the code base.
> 
> It is accessed when possible,
> 
>> So just to confirm: the revised code would look like the following:
>>
>> static void do_notify_parent_predump(void)
>> {
>> 	struct task_struct *parent;
>> 	int sig;
>>
>> 	rcu_read_lock();
>> 	parent = rcu_dereference(current->parent);
>> 	sig = parent->signal->predump_signal;
>> 	if (sig != 0)
>> 		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
>> 	rcu_read_unlock();
>> }
> 
> Yes, this is what I meant.
> 
> But I still think do_notify_parent_predump() should notify ->real_parent,
> not ->parent.
> 
> Oleg.
>

Re: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

2018-10-26 Thread Enke Chen

Hi, Oleg:

This is really a good idea given that "parent" is declared as RCU-protected.
Just a bit odd, though, that the "parent" has not been accessed this way in
the code base.

So just to confirm: the revised code would look like the following:

static void do_notify_parent_predump(void)
{
	struct task_struct *parent;
	int sig;

	rcu_read_lock();
	parent = rcu_dereference(current->parent);
	sig = parent->signal->predump_signal;
	if (sig != 0)
		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
	rcu_read_unlock();
}

Thank you so much for your help during this review. I would like to ack your
contribution in the "Reviewed-by:" field.

-- Enke

On 10/26/18 1:28 AM, Oleg Nesterov wrote:
> On 10/25, Enke Chen wrote:
>>
>> +static void do_notify_parent_predump(void)
>> +{
>> +	struct task_struct *parent;
>> +	int sig;
>> +
>> +	read_lock(&tasklist_lock);
>> +	parent = current->parent;
>> +	sig = parent->signal->predump_signal;
>> +	if (sig != 0)
>> +		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
>> +	read_unlock(&tasklist_lock);
> 
> Ah. It is strange I didn't think about this before... Please, do not take
> tasklist_lock, use rcu_read_lock() instead. do_send_sig_info() uses the
> rcu-friendly lock_task_sighand(), so rcu_dereference(parent) should work
> fine.
> 
> Oleg.
>

[PATCH] selftests/prctl: selftest for pre-coredump signal notification

2018-10-25 Thread Enke Chen



Dependency: [PATCH] kernel/signal: Signal-based pre-coredump notification

Signed-off-by: Enke Chen 
---
 tools/testing/selftests/prctl/Makefile   |   2 +-
 tools/testing/selftests/prctl/predump-sig-test.c | 160 +++
 2 files changed, 161 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index c7923b2..f8d60d5 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 ifeq ($(ARCH),x86)
 TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
-   disable-tsc-test
+   disable-tsc-test predump-sig-test
 all: $(TEST_PROGS)
 
 include ../lib.mk
diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
new file mode 100644
index 000..15d62691
--- /dev/null
+++ b/tools/testing/selftests/prctl/predump-sig-test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
+ *
+ * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
+ *
+ * When set with prctl(), the specified signal is sent to the parent process
+ * prior to the coredump of a child process.
+ *
+ * Usage: ./predump-sig-test {SIGUSR1 | SIGRT2}
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+
+#ifndef PR_SET_PREDUMP_SIG
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+#endif
+
+#define SIGRT2	(SIGRTMIN + 1)
+
+#define handle_error(msg) \
+   do { perror(msg); exit(EXIT_FAILURE); } while (0)
+
+static int sig_idx;
+static siginfo_t siginfo_rcv[2];
+
+static void sigaction_func(int sig, siginfo_t *siginfo, void *arg)
+{
+	memcpy(&siginfo_rcv[sig_idx], siginfo, sizeof(siginfo_t));
+	sig_idx++;
+}
+
+static int set_sigaction(int sig)
+{
+	struct sigaction new_action;
+	int rc;
+
+	memset(&new_action, 0, sizeof(struct sigaction));
+	new_action.sa_sigaction = sigaction_func;
+	new_action.sa_flags = SA_SIGINFO;
+	sigemptyset(&new_action.sa_mask);
+
+	return sigaction(sig, &new_action, NULL);
+}
+
+static int test_prctl(int sig)
+{
+	int sig2, rc;
+
+	rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
+	if (rc < 0)
+		handle_error("prctl: setting");
+
+	rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
+	if (rc < 0)
+		handle_error("prctl: getting");
+
+	if (sig2 != sig) {
+		printf("prctl: sig %d, post %d\n", sig, sig2);
+		return -1;
+	}
+	return 0;
+}
+
+static void child_fn(void)
+{
+	int rc, sig;
+
+	printf("\nChild pid: %ld\n", (long)getpid());
+
+	/* Test: Child should not inherit the predump_signal */
+	rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
+	if (rc < 0)
+		handle_error("prctl: child");
+
+	printf("child: predump_signal %d\n", sig);
+
+	/* Force coredump here */
+	printf("child: calling abort()\n");
+	fflush(stdout);
+	abort();
+}
+
+static int parent_fn(pid_t child_pid)
+{
+	int i, status, count;
+	siginfo_t *si;
+	pid_t w;
+
+	for (count = 0; count < 2; count++) {
+		w = waitpid(child_pid, &status, 0);
+		printf("\nwaitpid: %d\n", w);
+		if (w < 0)
+			perror("waitpid");
+
+		si = &siginfo_rcv[count];
+		printf("signal: si_signo %d, si_pid %ld, si_uid %d\n",
+		       si->si_signo, si->si_pid, si->si_uid);
+		printf("siginfo: si_errno %d, si_code %d, si_status %d\n",
+		       si->si_errno, si->si_code, si->si_status);
+	}
+	fflush(stdout);
+}
+
+int main(int argc, char *argv[])
+{
+	pid_t child_pid;
+	int rc, signo;
+
+	if (argc != 2) {
+		printf("invalid number of arguments\n");
+		exit(EXIT_FAILURE);
+	}
+
+	if (strcmp(argv[1], "SIGUSR1") == 0)
+		signo = SIGUSR1;
+	else if (strcmp(argv[1], "SIGRT2") == 0)
+		signo = SIGRT2;
+	else {
+		printf("invalid argument for signal\n");
+		fflush(stdout);
+		exit(EXIT_FAILURE);
+	}
+
+	rc = set_sigaction(SIGCHLD);
+	if (rc < 0)
+		handle_error("set_sigaction: SIGCHLD");
+
+	if (signo != SIGCHLD) {
+		rc = set_sigaction(signo);
+		if (rc < 0)
+			handle_error("set_sigaction: SIGCHLD");
+	}
+
+


[PATCH v4] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Enke Chen

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification.

Changes to prctl(2):

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
  Set the child pre-coredump signal of the calling process to
  arg2 (either a signal value in the range 1..maxsig, or 0 to
  clear). This is the signal that the calling process will get
  prior to the coredump of a child process. This value is
  cleared across execve(2), or for the child of a fork(2).

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
  Return the current value of the child pre-coredump signal,
  in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

One application is BFD. The early fault notification is a critical
component for maintaining BFD sessions (with a timeout value of
50 msec or 100 msec) across a control-plane failure.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.
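
For comparison, the existing post-coredump row of the table (do_notify_parent() with SIGCHLD) can be observed from user space today. The sketch below is not part of the patch; the names on_sigchld and run_abort_child are made up for illustration. It installs an SA_SIGINFO handler for SIGCHLD, forks a child that calls abort(), and reaps it with waitpid(). Whether si_code reads CLD_DUMPED or CLD_KILLED depends on whether a core was actually produced.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Written by the handler, read by the caller after waitpid() returns. */
static volatile sig_atomic_t chld_code;
static volatile sig_atomic_t chld_pid;

/* Hypothetical handler name: records what the kernel reported. */
static void on_sigchld(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	chld_code = si->si_code;	/* CLD_DUMPED or CLD_KILLED after abort() */
	chld_pid = si->si_pid;		/* pid of the child that died */
}

/*
 * Fork a child that aborts and wait for it.
 * Returns the child's terminating signal, or -1 on error.
 */
static int run_abort_child(void)
{
	struct sigaction sa;
	int status;
	pid_t pid;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigchld;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, NULL) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		abort();	/* child: die with SIGABRT, possibly dumping core */

	/* The SIGCHLD handler may interrupt waitpid(); retry on EINTR. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The point of the sketch is the timing: this SIGCHLD only arrives once the child has finished dumping core and exited, which is the delay the pre-coredump signal is meant to avoid.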

Signed-off-by: Enke Chen 
---
v3 -> v4:

Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
including:
o remove the definition CLD_PREDUMP.
o code simplification.
o split out the selftest code.

 fs/coredump.c                | 23 +++++++++++++++++++++++
 fs/exec.c                    |  3 +++
 include/linux/sched/signal.h |  3 +++
 include/uapi/linux/prctl.h   |  4 ++++
 kernel/sys.c                 | 13 +++++++++++++
 5 files changed, 46 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..22c40dc 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+static void do_notify_parent_predump(void)
+{
+	struct task_struct *parent;
+	int sig;
+
+	read_lock(&tasklist_lock);
+	parent = current->parent;
+	sig = parent->signal->predump_signal;
+	if (sig != 0)
+		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+	read_unlock(&tasklist_lock);
+}
+
 void do_coredump(const kernel_siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	if (retval < 0)
 		goto fail_creds;
 
+	/*
+	 * Send the pre-coredump signal to the parent if requested.
+	 */
+	do_notify_parent_predump();
+
 	old_cred = override_creds(cred);
 
 	ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
+	/* Clear the pre-coredump signal before loading a new binary */
+	sig->predump_signal = 0;
+
 #ifdef CONFIG_POSIX_TIMERS
 	exit_itimers(sig);
 	flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..728ef68 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Enke Chen

Hi, Eric:

Please see my replies inline.

On 10/25/18 5:23 AM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> Hi, Eric:
>>
>> Thanks for your comments. Please see my replies inline.
>>
>> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>>> Enke Chen  writes:
>>>
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>>   Set the child pre-coredump signal of the calling process to
>>>>   arg2 (either SIGUSR1, or SIGUSR2, or SIGCHLD, or 0 to clear).
>>>>   This is the signal that the calling process will get prior to
>>>>   the coredump of a child process. This value is cleared across
>>>>   execve(2), or for the child of a fork(2).
>>>>
>>>>   When SIGCHLD is specified, the signal code will be set to
>>>>   CLD_PREDUMP in such an SIGCHLD signal.
>>>
>>> Your signal handling is still not right.  Please read and comprehend
>>> siginfo_layout.
>>>
>>> You have not filled in all of the required fields for the SIGCHLD case.
>>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>>> very wrong.  This is not a user generated signal.
>>>
>>> Let me say this slowly.  The pair si_signo si_code determines the union
>>> member of struct siginfo.  That needs to be handled consistently. You
>>> aren't.  I just finished fixing this up in the entire kernel and now you
>>> are trying to add a usage that is worse than most of the bugs I have
>>> fixed.  I really don't appreciate having to deal with new bugs.
>>>
>>
>> My apologies. I will investigate and make them consistent.
>>
>>>
>>>
>>> Further siginfo can be dropped.  Multiple signals with the same signal
>>> number can be consolidated.  What is your plan for dealing with that?
>>
>> The primary application for the early notification involves a process
>> manager which is responsible for re-spawning processes or initiating
>> the control-plane fail-over. There are two models:
>>
>> One model is to have 1:1 relationship between a process manager and
>> application process. There can only be one predump-signal (say, SIGUSR1)
>> from the child to the parent, and will unlikely be dropped or consolidated.
>>
>> Another model is to have 1:N where there is only one process manager with
>> multiple application processes. One of the RT signals can be used to help
>> make it more reliable.
> 
> Which suggests you want one of the negative si_codes, and to use the _rt
> siginfo member like sigqueue.

It seems that we do not need to touch the si_codes. A dedicated signal
for the pre-coredump notification is simpler and more robust. There are
enough RT signal numbers available.
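
The difference between standard and RT signals that matters for the 1:N model can be shown with a small user-space sketch. It is an illustration only, not code from the patch, and the names count_signal and raise_blocked are hypothetical. Both SIGUSR1 and an RT signal are raised several times while blocked; on unblocking, Linux delivers the standard signal once (the pending instances are merged) and the RT signal once per raise (each instance is queued).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_count;
static volatile sig_atomic_t rt_count;

/* Count deliveries per signal type. */
static void count_signal(int sig)
{
	if (sig == SIGUSR1)
		usr1_count++;
	else
		rt_count++;
}

/*
 * Raise SIGUSR1 and one RT signal n times each while both are blocked,
 * then unblock them so the pending instances are delivered.
 * Returns 0 on success, -1 on error.
 */
static int raise_blocked(int n)
{
	struct sigaction sa;
	sigset_t block, old;
	int rtsig = SIGRTMIN + 1;	/* same choice as SIGRT2 in the selftest */
	int i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = count_signal;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) < 0 ||
	    sigaction(rtsig, &sa, NULL) < 0)
		return -1;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	sigaddset(&block, rtsig);
	if (sigprocmask(SIG_BLOCK, &block, &old) < 0)
		return -1;

	for (i = 0; i < n; i++) {
		raise(SIGUSR1);		/* standard: only one instance stays pending */
		raise(rtsig);		/* real-time: every instance is queued */
	}

	/* Restoring the old mask lets the pending signals run their handler. */
	return sigprocmask(SIG_SETMASK, &old, NULL);
}
```

After raise_blocked(3), usr1_count is expected to be 1 and rt_count 3, which is why an RT signal is the more reliable choice when several children may dump core at about the same time.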

> 
>>> Other code paths pair with wait to get the information out.  There
>>> is no equivalent of wait in your code.
>>
>> I was not aware of that before.  Let me investigate.
>>
>>>
>>> Signals can be delayed by quite a bit, scheduling delays etc.  They can
>>> not provide any meaningful kind of real time notification.
>>>
>>
>> The timing requirement is about 50-100 msecs for BFD.  Not sure if that
>> qualifies as "real time".  This mechanism has worked well in deployment
>> over the years.
> 
> It would help if those numbers were put into the patch description so
> people can tell if the mechanism is quick enough.

I will do as suggested, but at the risk of making the patch description
longer than the patch itself :-)

> 
>>> So between delays and loss of information signals appear to be a very
>>> poor fit for this usecase.
>>>
>>> I am concerned about code that does not fit the usecase well because
>>> such code winds up as code that no one cares about that must be
>>> maintained indefinitely, because somewhere out there there is one use
>>> that would break if the interface was removed.  This does not feel like
>>> an interface people wil

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Enke Chen

Hi, Eric:

I have a couple of comments inline.

>> On Wed, Oct 24, 2018 at 3:30 PM Eric W. Biederman wrote:
>>> Enke Chen  writes:
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>>   Set the child pre-coredump signal of the calling process to
>>>>   arg2 (either SIGUSR1, or SIGUSR2, or SIGCHLD, or 0 to clear).
>>>>   This is the signal that the calling process will get prior to
>>>>   the coredump of a child process. This value is cleared across
>>>>   execve(2), or for the child of a fork(2).
>>>>
>>>>   When SIGCHLD is specified, the signal code will be set to
>>>>   CLD_PREDUMP in such an SIGCHLD signal.
>> [...]
>>> Ugh.  Your test case is even using signalfd.  So you don't even want
>>> this signal to be delivered as a signal.
>>
>> Just to make sure everyone's on the same page: You're suggesting that
>> it might make sense to deliver the pre-dump notification via a new
>> type of file instead (along the lines of signalfd, timerfd, eventfd
>> and so on)?
> 
> My real complaint was that the API was not being tested in the way it
> is expected to be used.  Which makes a test pretty much useless as some
> aspect userspace could regress and the test would not notice because it
> is testing something different.
> 
> 

As I stated in a prior email, I have test code for both sigaction/waitpid(),
and signalfd(). As the sigaction/waitpid is more widely used and that is
what you prefer, I will change the selftest code to reflect that in the
next version.  Actually I should separate out the selftest code.
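
For readers unfamiliar with the signalfd() style mentioned above, here is a rough sketch of how a parent can read a child's SIGCHLD from a file descriptor instead of a handler. It is not the test code referred to in this thread; the function name wait_child_via_signalfd is hypothetical, and plain SIGCHLD stands in for the proposed pre-coredump signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, read the resulting SIGCHLD from a
 * signalfd, then reap the child. Returns the child's pid if the pid in
 * the signalfd record matches, or -1 on error.
 */
static pid_t wait_child_via_signalfd(int code, int *status_out)
{
	struct signalfd_siginfo ssi;
	sigset_t mask;
	pid_t pid;
	int fd;

	/* The signal must be blocked so it is only consumed through the fd. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return -1;

	fd = signalfd(-1, &mask, SFD_CLOEXEC);
	if (fd < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0)
		_exit(code);

	/* Blocks until the kernel queues SIGCHLD for this process. */
	if (read(fd, &ssi, sizeof(ssi)) != (ssize_t)sizeof(ssi)) {
		close(fd);
		return -1;
	}
	close(fd);

	if (waitpid(pid, status_out, 0) != pid)
		return -1;

	return (pid_t)ssi.ssi_pid == pid ? pid : -1;
}
```

The trade-off discussed in the thread is visible here: the signal is never delivered as a signal at all, so a test written this way does not exercise the sigaction()/waitpid() path that most users would rely on.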

> 
> I do think that a file descriptor based API might be a good alternative
> to a signal based API.  The proc connector and signals are not the only
> API solution.
> 
> The common solution to this problem is that distributions default the
> rlimit core file size to 0.

We do need coredumps in order to have the bugs fixed.
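
As a side note on the rlimit point: where a distribution ships with a core file size limit of 0, a process that does want coredumps can raise its own soft limit up to the hard limit. A minimal sketch follows, with the hypothetical helper names core_dumps_disabled and enable_core_dumps; it is not part of the patch.

```c
#include <sys/resource.h>

/*
 * Returns 1 if the soft RLIMIT_CORE is 0 (no core file is written),
 * 0 if core files are allowed, -1 on error.
 */
static int core_dumps_disabled(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_CORE, &rl) < 0)
		return -1;
	return rl.rlim_cur == 0;
}

/*
 * Raise the soft core limit to the hard limit. An unprivileged process
 * may always do this; if the hard limit itself is 0, core files stay
 * disabled. Returns 0 on success, -1 on error.
 */
static int enable_core_dumps(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_CORE, &rl) < 0)
		return -1;
	rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_CORE, &rl);
}
```

A process manager would typically call something like enable_core_dumps() before fork() so that its children inherit the raised limit.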

Thanks.  -- Enke

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Enke Chen

Hi, Eric:

It turns out that the definition of CLD_PREDUMP could well be considered
another instance of "over specification", and is completely unnecessary. When
an application chooses a signal for pre-coredump notification, it is much
simpler and more robust for the signal to be dedicated to that purpose (in the
parent) and not be mixed with other semantics. The "signo + pid" should be
sufficient for the parent process in both 1:1 and 1:N models.

So I will remove CLD_PREDUMP and the related definitions, and the code can then
be simplified as follows:

+static void do_notify_parent_predump(void)
+{
+	struct task_struct *parent;
+	int sig;
+
+	read_lock(&tasklist_lock);
+	parent = current->parent;
+	sig = parent->signal->predump_signal;
+	if (sig != 0)
+		do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+	read_unlock(&tasklist_lock);
+}
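
To illustrate what "signo + pid" looks like on the receiving side, the following user-space sketch emulates the 1:N model without the patch: each child sends a dedicated RT signal to its parent with kill(), and the parent uses si_pid to tell the senders apart. The function name collect_child_signals and the constant NCHILD are hypothetical; in the proposed kernel code the signal would come from do_notify_parent_predump() rather than from kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NCHILD 3

/*
 * Fork NCHILD children that each signal the parent, then collect the
 * signals and match si_pid against the known children.
 * Returns the number of signals attributed to a child, or -1 on error.
 */
static int collect_child_signals(void)
{
	int sig = SIGRTMIN + 1;		/* dedicated notification signal */
	pid_t kids[NCHILD];
	sigset_t mask;
	siginfo_t si;
	int i, j, matched = 0;

	/* Block the signal so it is queued until sigwaitinfo() fetches it. */
	sigemptyset(&mask);
	sigaddset(&mask, sig);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return -1;

	for (i = 0; i < NCHILD; i++) {
		kids[i] = fork();
		if (kids[i] < 0)
			return -1;
		if (kids[i] == 0) {
			kill(getppid(), sig);	/* stands in for the kernel's notification */
			_exit(0);
		}
	}

	/* RT signals are queued, so one record per child is expected. */
	for (i = 0; i < NCHILD; i++) {
		if (sigwaitinfo(&mask, &si) < 0)
			return -1;
		for (j = 0; j < NCHILD; j++)
			if (si.si_pid == kids[j])
				matched++;
	}

	for (i = 0; i < NCHILD; i++)
		waitpid(kids[i], NULL, 0);

	return matched;
}
```

With a dedicated signal number the parent needs no si_code at all: the signal number says what happened and si_pid says to whom.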

I will follow up with your other comments.

Thanks. -- Enke

On 10/25/18 5:23 AM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> Hi, Eric:
>>
>> Thanks for your comments. Please see my replies inline.
>>
>> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>>> Enke Chen  writes:
>>>
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>>   Set the child pre-coredump signal of the calling process to
>>>>   arg2 (either SIGUSR1, or SIGUSR2, or SIGCHLD, or 0 to clear).
>>>>   This is the signal that the calling process will get prior to
>>>>   the coredump of a child process. This value is cleared across
>>>>   execve(2), or for the child of a fork(2).
>>>>
>>>>   When SIGCHLD is specified, the signal code will be set to
>>>>   CLD_PREDUMP in such an SIGCHLD signal.
>>>
>>> Your signal handling is still not right.  Please read and comprehend
>>> siginfo_layout.
>>>
>>> You have not filled in all of the required fields for the SIGCHLD case.
>>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>>> very wrong.  This is not a user generated signal.
>>>
>>> Let me say this slowly.  The pair si_signo si_code determines the union
>>> member of struct siginfo.  That needs to be handled consistently. You
>>> aren't.  I just finished fixing this up in the entire kernel and now you
>>> are trying to add a usage that is worse than most of the bugs I have
>>> fixed.  I really don't appreciate having to deal with new bugs.
>>>
>>
>> My apologies. I will investigate and make them consistent.
>>
>>>
>>>
>>> Further siginfo can be dropped.  Multiple signals with the same signal
>>> number can be consolidated.  What is your plan for dealing with that?
>>
>> The primary application for the early notification involves a process
>> manager which is responsible for re-spawning processes or initiating
>> the control-plane fail-over. There are two models:
>>
>> One model is to have 1:1 relationship between a process manager and
>> application process. There can only be one predump-signal (say, SIGUSR1)
>> from the child to the parent, and will unlikely be dropped or consolidated.
>>
>> Another model is to have 1:N where there is only one process manager with
>> multiple application processes. One of the RT signals can be used to help
>> make it more reliable.
> 
> Which suggests you want one of the negative si_codes, and to use the _rt
> siginfo member like sigqueue.
> 
>>> Other code paths pair with wait to get the information out.  There
>>> is no equivalent of wait in your code.
>>
>> I was not aware of that before.  Let me investigate.
>>
>>>
>>> Signals can be delayed by quite a bit, scheduling delays etc.  They can
>>> not provide any meaningful kind of real time notification.
>>>
>>
>> The timing requirement is about 50-100 msecs for BFD.  Not sure if that
>> qualifies as "real time".  This mechanism has worked well in deployment
>> over

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-25 Thread Enke Chen

Hi, Eric:

It turns out that the definition of CLD_PREDUMP could well be considered
another instance of "over-specification", and is completely unnecessary. When
an application chooses a signal for pre-coredump notification, it is much
simpler and more robust for the signal to be dedicated to that purpose (in the
parent) and not be mixed with other semantics. The "signo + pid" should be
sufficient for the parent process in both the 1:1 and 1:N models.

So I will remove CLD_PREDUMP and the related definitions, and the code can then
be simplified as follows:

+static void do_notify_parent_predump(void)
+{
+   struct task_struct *parent;
+   int sig;
+
+   read_lock(&tasklist_lock);
+   parent = current->parent;
+   sig = parent->signal->predump_signal;
+   if (sig != 0)
+       do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+   read_unlock(&tasklist_lock);
+}

I will follow up with your other comments.

Thanks. -- Enke

On 10/25/18 5:23 AM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> Hi, Eric:
>>
>> Thanks for your comments. Please see my replies inline.
>>
>> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>>> Enke Chen  writes:
>>>
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>>    PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>>           Set the child pre-coredump signal of the calling process to
>>>>           arg2 (either SIGUSR1, or SIGUSR2, or SIGCHLD, or 0 to clear).
>>>>           This is the signal that the calling process will get prior to
>>>>           the coredump of a child process. This value is cleared across
>>>>           execve(2), or for the child of a fork(2).
>>>>
>>>>           When SIGCHLD is specified, the signal code will be set to
>>>>           CLD_PREDUMP in such a SIGCHLD signal.
>>>
>>> Your signal handling is still not right.  Please read and comprehend
>>> siginfo_layout.
>>>
>>> You have not filled in all of the required fields for the SIGCHLD case.
>>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>>> very wrong.  This is not a user generated signal.
>>>
>>> Let me say this slowly.  The pair si_signo si_code determines the union
>>> member of struct siginfo.  That needs to be handled consistently. You
>>> aren't.  I just finished fixing this up in the entire kernel and now you
>>> are trying to add a usage that is worse than most of the bugs I have
>>> fixed.  I really don't appreciate having to deal with new bugs.
>>>
>>
>> My apologies. I will investigate and make them consistent.
>>
>>>
>>>
>>> Further siginfo can be dropped.  Multiple signals with the same signal
>>> number can be consolidated.  What is your plan for dealing with that?
>>
>> The primary application for the early notification involves a process
>> manager which is responsible for re-spawning processes or initiating
>> the control-plane fail-over. There are two models:
>>
>> One model is to have a 1:1 relationship between a process manager and
>> an application process. There can only be one predump-signal (say, SIGUSR1)
>> from the child to the parent, and it is unlikely to be dropped or
>> consolidated.
>>
>> Another model is 1:N, where there is only one process manager with
>> multiple application processes. One of the RT signals can be used to help
>> make it more reliable.
> 
> Which suggests you want one of the negative si_codes, and to use the _rt
> siginfo member like sigqueue.
> 
>>> Other code paths pair with wait to get the information out.  There
>>> is no equivalent of wait in your code.
>>
>> I was not aware of that before.  Let me investigate.
>>
>>>
>>> Signals can be delayed by quite a bit, scheduling delays etc.  They can
>>> not provide any meaningful kind of real time notification.
>>>
>>
>> The timing requirement is about 50-100 msecs for BFD.  Not sure if that
>> qualifies as "real time".  This mechanism has worked well in deployment
>> over

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-24 Thread Enke Chen

Hi, Eric:

Thanks for your comments. Please see my replies inline.

On 10/24/18 6:29 AM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Changes to prctl(2):
>>
>>    PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>           Set the child pre-coredump signal of the calling process to
>>           arg2 (either SIGUSR1, or SIGUSR2, or SIGCHLD, or 0 to clear).
>>           This is the signal that the calling process will get prior to
>>           the coredump of a child process. This value is cleared across
>>           execve(2), or for the child of a fork(2).
>>
>>           When SIGCHLD is specified, the signal code will be set to
>>           CLD_PREDUMP in such a SIGCHLD signal.
> 
> Your signal handling is still not right.  Please read and comprehend
> siginfo_layout.
> 
> You have not filled in all of the required fields for the SIGCHLD case.
> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
> very wrong.  This is not a user generated signal.
> 
> Let me say this slowly.  The pair si_signo si_code determines the union
> member of struct siginfo.  That needs to be handled consistently. You
> aren't.  I just finished fixing this up in the entire kernel and now you
> are trying to add a usage that is worse than most of the bugs I have
> fixed.  I really don't appreciate having to deal with new bugs.
> 

My apologies. I will investigate and make them consistent.

> 
> 
> Further siginfo can be dropped.  Multiple signals with the same signal
> number can be consolidated.  What is your plan for dealing with that?

The primary application for the early notification involves a process
manager which is responsible for re-spawning processes or initiating
the control-plane fail-over. There are two models:

One model is to have a 1:1 relationship between a process manager and
an application process. There can only be one predump-signal (say, SIGUSR1)
from the child to the parent, and it is unlikely to be dropped or
consolidated.

Another model is 1:N, where there is only one process manager with
multiple application processes. One of the RT signals can be used to help
make it more reliable.
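
As an illustration only (a userspace sketch, not code from the patch): when a
signal is queued the way sigqueue(3) queues it, the receiver gets the sender's
pid in si_pid and a negative si_code (SI_QUEUE), which is the information a
1:N manager would need to tell its children apart. The helper name
queue_and_collect() and the payload value are made up for this example, and
the process simply signals itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Queue `signo` to ourselves with an integer payload and collect it
 * synchronously. Returns 0 and fills `out` on success, -1 on error.
 * (Hypothetical helper; it only demonstrates what the receiver sees.)
 */
static int queue_and_collect(int signo, int payload, siginfo_t *out)
{
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
    sigset_t set, old_mask;
    union sigval value;
    int rc;

    assert(out != NULL);

    sigemptyset(&set);
    sigaddset(&set, signo);
    /* The signal must be blocked so that sigtimedwait() can collect it. */
    if (sigprocmask(SIG_BLOCK, &set, &old_mask) < 0)
        return -1;

    memset(out, 0, sizeof(*out));
    value.sival_int = payload;
    rc = sigqueue(getpid(), signo, value);
    if (rc == 0)
        rc = (sigtimedwait(&set, out, &timeout) == signo) ? 0 : -1;

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    return rc;
}
```

After a successful call with SIGRTMIN, out->si_code is SI_QUEUE and
out->si_pid is the pid of the sender, here the process itself.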

> Other code paths pair with wait to get the information out.  There
> is no equivalent of wait in your code.

I was not aware of that before.  Let me investigate.

> 
> Signals can be delayed by quite a bit, scheduling delays etc.  They can
> not provide any meaningful kind of real time notification.
> 

The timing requirement is about 50-100 msecs for BFD.  Not sure if that
qualifies as "real time".  This mechanism has worked well in deployment
over the years.

> So between delays and loss of information signals appear to be a very
> poor fit for this usecase.
> 
> I am concerned about code that does not fit the usecase well because
> such code winds up as code that no one cares about that must be
> maintained indefinitely, because somewhere out there there is one use
> that would break if the interface was removed.  This does not feel like
> an interface people will want to use and maintain in proper working
> order forever.
> 
> Ugh.  Your test case is even using signalfd.  So you don't even want
> this signal to be delivered as a signal.

I actually tested sigaction()/waitpid() as well. If there is a preference,
I can check in the sigaction()/waitpid() version instead.
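
For reference, a minimal sketch of the signalfd style of consumption, assuming
the process signals itself instead of receiving a real pre-coredump
notification. read_sender_via_signalfd() is a hypothetical helper written for
this mail; it is not the selftest from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Block `signo`, open a signalfd for it, send the signal to ourselves and
 * read it back from the descriptor. Returns the sender pid reported by the
 * kernel, or -1 on error.
 */
static pid_t read_sender_via_signalfd(int signo)
{
    struct signalfd_siginfo ssi;
    sigset_t set, old_mask;
    pid_t sender = -1;
    int fd;

    assert(signo > 0);

    sigemptyset(&set);
    sigaddset(&set, signo);
    /* A signal read through signalfd must be blocked for normal delivery. */
    if (sigprocmask(SIG_BLOCK, &set, &old_mask) < 0)
        return -1;

    fd = signalfd(-1, &set, SFD_CLOEXEC);
    if (fd >= 0) {
        if (kill(getpid(), signo) == 0 &&
            read(fd, &ssi, sizeof(ssi)) == (ssize_t)sizeof(ssi) &&
            ssi.ssi_signo == (uint32_t)signo)
            sender = (pid_t)ssi.ssi_pid;
        close(fd);
    }

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    return sender;
}
```

The read consumes the pending signal, so no handler runs; the manager gets
the signal number and the sender pid from the same record.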

> 
> You add an interface that takes a pointer and you don't add a compat
> interface.  See Oleg's point of just returning the signal number in the
> return code.

This is what Oleg said "but I won't insist, this is subjective and cosmetic".

It is no big deal either way. It just seems less work if we do not keep
adding exceptions to the prctl(2) manpage:

prctl(2):

   On success, PR_GET_DUMPABLE, PR_GET_KEEPCAPS, PR_GET_NO_NEW_PRIVS,
   PR_CAPBSET_READ, PR_GET_TIMING, PR_GET_SECUREBITS, PR_MCE_KILL_GET,
   PR_CAP_AMBIENT+PR_CAP_AMBIENT_IS_SET, and (if it returns) PR_GET_SECCOMP
   return the nonnegative values described above.  All other option values
   return 0 on success.  On error, -1 is returned, and errno is set
   appropriately.
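
To make the two conventions concrete, here is a small sketch using options
that already exist in prctl(2). PR_GET_PDEATHSIG stores its result through
(int *) arg2, which is the shape proposed for PR_GET_PREDUMP_SIG, while
PR_GET_DUMPABLE returns the value as the return code. The wrapper names are
invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>

/* Style 1: the value comes back through a pointer passed as arg2. */
static int get_pdeathsig(void)
{
    int sig = -1;

    if (prctl(PR_GET_PDEATHSIG, &sig, 0UL, 0UL, 0UL) < 0)
        return -1;
    return sig;
}

/* Style 2: the value is the nonnegative return code of prctl() itself. */
static int get_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}

/* Setter used to give style 1 something to read back; 0 clears it. */
static int set_pdeathsig(int sig)
{
    assert(sig >= 0);
    return prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0UL, 0UL, 0UL);
}
```

With the pointer style the return code stays 0 on success and no new
exception is needed in the manpage text quoted above; with the return-code
style the option has to be added to that list.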

> 
> Now I am wondering how well prctl works from a 32bit process on a 64bit
> kernel.  At first glance it looks like it probably does not work.
>

I am not sure which part would be problemat

Re: [PATCH v3] kernel/signal: Signal-based pre-coredump notification

2018-10-24 Thread Enke Chen

Hi, Oleg:

On 10/24/18 7:02 AM, Oleg Nesterov wrote:
> On 10/23, Enke Chen wrote:
>>
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -590,6 +590,12 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  if (retval < 0)
>>  goto fail_creds;
>>
>> +/*
>> + * Send the pre-coredump signal to the parent if requested.
>> + */
>> +do_notify_parent_predump();
>> +cond_resched();
> 
> I am still not sure cond_resched() makes any sense here...
> 
>> @@ -1553,6 +1553,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>>  tty_audit_fork(sig);
>>  sched_autogroup_fork(sig);
>>
>> +/* Clear the pre-coredump signal for the child */
>> +sig->predump_signal = 0;
> 
> No need, copy_signal() does zalloc().
> 

Removed.

> 
>> +void do_notify_parent_predump(void)
>> +{
>> +    struct sighand_struct *sighand;
>> +    struct kernel_siginfo info;
>> +    struct task_struct *parent;
>> +    unsigned long flags;
>> +    int sig;
>> +
>> +    read_lock(&tasklist_lock);
>> +    parent = current->parent;
>> +    sig = parent->signal->predump_signal;
>> +    if (sig != 0) {
>> +        clear_siginfo(&info);
>> +        info.si_pid = task_tgid_vnr(current);
>> +        info.si_signo = sig;
>> +        if (sig == SIGCHLD)
>> +            info.si_code = CLD_PREDUMP;
>> +
>> +        sighand = parent->sighand;
>> +        spin_lock_irqsave(&sighand->siglock, flags);
>> +        __group_send_sig_info(sig, &info, parent);
>> +        spin_unlock_irqrestore(&sighand->siglock, flags);
> 
> You can just use do_send_sig_info() and remove 
> sighand/flags/spin_lock_irqsave.

Ok.

> 
> Perhaps the "likely" predump_signal==0 check at the start makes sense to avoid
> read_lock(tasklist).

I am not sure if we should/need to deviate from the convention (locking before
accessing the parent). In any case it may not matter, as the coredump is in an
exceptional code flow.

> 
> And I'd suggest to move it into coredump.c and make it static. It won't have
> another user.

Ok.

Thanks.  -- Enke

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-24 Thread Enke Chen

Hi, Oleg:

On 10/24/18 6:52 AM, Oleg Nesterov wrote:
> On 10/23, Enke Chen wrote:
>>
>>>> +  /*
>>>> +   * Send the pre-coredump signal to the parent if requested.
>>>> +   */
>>>> +  read_lock(&tasklist_lock);
>>>> +  notify = do_notify_parent_predump(current);
>>>> +  read_unlock(&tasklist_lock);
>>>> +  if (notify)
>>>> +      cond_resched();
>>>
>>> Hmm. I do not understand why do we need cond_resched(). And even if we
>>> need it, why we can't call it unconditionally?
>>
>> Remember the goal is to allow the parent (e.g., a process manager) to
>> take early action. The "yield" before doing coredump will help.
> 
> I don't see how can it actually help...
> 
> cond_resched() is nop if CONFIG_PREEMPT or should_resched() == 0.
> 
> and the coredumping thread will certainly need to sleep/wait anyway.

I am really surprised by this: cond_resched() is used in many places, and it
actually does nothing with CONFIG_PREEMPT.

Will remove.

> 
>>> And once again, SIGCHLD/SIGUSR do not queue, this means that
>>> PR_SET_PREDUMP_SIG is pointless if you have 2 or more children.
>>
>> Hmm, could you point me to the code where SIGCHLD/SIGUSR is treated
>> differently w.r.t. queuing?  That does not sound right to me.
> 
> see the legacy_queue() check. Any signal < SIGRTMIN do not queue. IOW, if
> SIGCHLD is already pending, then next SIGCHLD is simply ignored.

Got it. This means that a distinct signal (in particular an RT signal) would
be preferable. This is what is done in our application. Your earlier
suggestion about removing the signal limitation makes a lot of sense to me now.

Given that a distinct signal is preferable, I am wondering if I should just
remove CLD_PREDUMP from the patch.
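
The difference is easy to observe from userspace. The following is a sketch
only, assuming a single-threaded process that signals itself;
count_deliveries() is a name made up for this mail. It blocks a signal, sends
it several times, unblocks it, and counts how often the handler runs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t deliveries;

static void count_delivery(int signo)
{
    (void)signo;
    deliveries++;
}

/*
 * Block `signo`, send it to ourselves `sends` times, then unblock it.
 * Returns how many times the handler ran, or -1 on error.
 * Assumes a single-threaded caller.
 */
static int count_deliveries(int signo, int sends)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask;
    int i, seen;

    assert(sends > 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = count_delivery;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old_sa) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) < 0)
        return -1;

    deliveries = 0;
    for (i = 0; i < sends; i++) {
        if (kill(getpid(), signo) < 0)
            break;
    }

    /* Restoring the mask lets every pending instance be delivered. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    seen = deliveries;

    sigaction(signo, &old_sa, NULL);
    return i == sends ? seen : -1;
}
```

On Linux, three sends of SIGUSR1 collapse into one delivery, while three
sends of SIGRTMIN are delivered three times.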

Thanks.  -- Enke

[PATCH v3] kernel/signal: Signal-based pre-coredump notification

2018-10-23 Thread Enke Chen

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification. A new signal code CLD_PREDUMP is also
defined for SIGCHLD.

Changes to prctl(2):

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
          Set the child pre-coredump signal of the calling process to
          arg2 (either a signal value in the range 1..maxsig, or 0 to
          clear). This is the signal that the calling process will get
          prior to the coredump of a child process. This value is
          cleared across execve(2), or for the child of a fork(2).

          When SIGCHLD is specified, the signal code will be set to
          CLD_PREDUMP in such a SIGCHLD signal.

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
          Return the current value of the child pre-coredump signal,
          in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.
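
For comparison, here is a userspace sketch of the existing post-coredump
path in the last row of the table: the parent only learns of the child's
death once waitpid() can reap it, that is, after any core file has been
written. The helper name reap_child_killed_by() is made up for this
description, and the child disables its core file so that the example leaves
nothing behind.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that dies from `signo` and return the terminating signal
 * the parent observes through waitpid(), or -1 on error.
 */
static int reap_child_killed_by(int signo)
{
    int status;
    pid_t pid;

    assert(signo > 0);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: suppress the core file, then die from the signal. */
        struct rlimit no_core = { 0, 0 };

        setrlimit(RLIMIT_CORE, &no_core);
        signal(signo, SIG_DFL);
        raise(signo);
        _exit(127);     /* only reached if signo was not fatal */
    }

    /* Parent: this returns only after the child has fully exited. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

With a large core file, the delay between the fault and the return of
waitpid() is the window that the pre-coredump signal is meant to close.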

Signed-off-by: Enke Chen 
---
v2 -> v3:

Addressed review comments from Oleg Nesterov, including:
o remove the restriction on signal for PR_SET_PREDUMP_SIG.
o code simplification

 arch/x86/kernel/signal_compat.c                  |   2 +-
 fs/coredump.c                                    |   6 +
 fs/exec.c                                        |   3 +
 include/linux/sched/signal.h                     |   4 +
 include/uapi/asm-generic/siginfo.h               |   3 +-
 include/uapi/linux/prctl.h                       |   4 +
 kernel/fork.c                                    |   3 +
 kernel/signal.c                                  |  31 +
 kernel/sys.c                                     |  13 ++
 tools/testing/selftests/prctl/Makefile           |   2 +-
 tools/testing/selftests/prctl/predump-sig-test.c | 169 +++
 11 files changed, 237 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGSEGV != 7);
BUILD_BUG_ON(NSIGBUS  != 5);
BUILD_BUG_ON(NSIGTRAP != 5);
-   BUILD_BUG_ON(NSIGCHLD != 6);
+   BUILD_BUG_ON(NSIGCHLD != 7);
BUILD_BUG_ON(NSIGSYS  != 1);
 
/* This is part of the ABI and can never change in size: */
diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..d6ca1a3 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -590,6 +590,12 @@ void do_coredump(const kernel_siginfo_t *siginfo)
    if (retval < 0)
        goto fail_creds;
 
+   /*
+    * Send the pre-coredump signal to the parent if requested.
+    */
+   do_notify_parent_predump();
+   cond_resched();
+
    old_cred = override_creds(cred);
 
    ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;
 
+   /* Clear the pre-coredump signal before loading a new binary */
+   sig->predump_signal =

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-23 Thread Enke Chen

Hi, Oleg:

On 10/23/18 12:43 PM, Enke Chen wrote:

>>
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>> struct cred *cred;
>>> int retval = 0;
>>> int ispipe;
>>> +   bool notify;
>>> struct files_struct *displaced;
>>> /* require nonrelative corefile path and be extra careful */
>>> bool need_suid_safe = false;
>>> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>> if (retval < 0)
>>> goto fail_creds;
>>>
>>> +   /*
>>> +    * Send the pre-coredump signal to the parent if requested.
>>> +    */
>>> +   read_lock(&tasklist_lock);
>>> +   notify = do_notify_parent_predump(current);
>>> +   read_unlock(&tasklist_lock);
>>> +   if (notify)
>>> +       cond_resched();
>>
>> Hmm. I do not understand why do we need cond_resched(). And even if we
>> need it, why we can't call it unconditionally?
> 
> Remember the goal is to allow the parent (e.g., a process manager) to take
> early action. The "yield" before doing coredump will help.
> 
> The yield is made conditional because the notification is conditional.
> Is that ok?

Given this is in do_coredump(), it is ok to make it unconditional for
simplicity.

>>
>>> +bool do_notify_parent_predump(struct task_struct *tsk)
>>> +{
>>> +   struct sighand_struct *sighand;
>>> +   struct kernel_siginfo info;
>>> +   struct task_struct *parent;
>>> +   unsigned long flags;
>>> +   pid_t pid;
>>> +   int sig;
>>> +
>>> +   parent = tsk->parent;
>>> +   sighand = parent->sighand;
>>> +   pid = task_tgid_vnr(tsk);
>>> +
>>> +   spin_lock_irqsave(&sighand->siglock, flags);
>>> +   sig = parent->signal->predump_signal;
>>> +   if (!valid_predump_signal(sig)) {
>>> +   spin_unlock_irqrestore(&sighand->siglock, flags);
>>> +   return false;
>>> +   }
>>
>> Why do we need to check parent->signal->predump_signal under ->siglock?
>> This complicates the code for no reason, afaics.

Will simplify.

Thanks.  -- Enke

Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-23 Thread Enke Chen

Hi, Oleg:

Thanks for your review. Please see my replies inline.

On 10/23/18 2:23 AM, Oleg Nesterov wrote:
> On 10/22, Enke Chen wrote:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
> 
> Personally I still do not like this feature, but I won't argue.
> 
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  struct cred *cred;
>>  int retval = 0;
>>  int ispipe;
>> +bool notify;
>>  struct files_struct *displaced;
>>  /* require nonrelative corefile path and be extra careful */
>>  bool need_suid_safe = false;
>> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>  if (retval < 0)
>>  goto fail_creds;
>>
>> +/*
>> + * Send the pre-coredump signal to the parent if requested.
>> + */
>> +read_lock(&tasklist_lock);
>> +notify = do_notify_parent_predump(current);
>> +read_unlock(&tasklist_lock);
>> +if (notify)
>> +cond_resched();
> 
> Hmm. I do not understand why do we need cond_resched(). And even if we need it,
> why we can't call it unconditionally?

Remember the goal is to allow the parent (e.g., a process manager) to take early
action. The "yield" before doing coredump will help.

The yield is made conditional because the notification is conditional.
Is that ok?

> 
> I'd also suggest to move read_lock/unlock(tasklist) into do_notify_parent_predump()
> and remove the "task_struct *tsk" argument, tsk is always current.
> 
> Yes, do_notify_parent() and do_notify_parent_cldstop() are called with tasklist_lock
> held, but there are good reasons for that.

Sure I will make the suggested changes. This function is only called in one place.

> 
> 
>> +static inline int valid_predump_signal(int sig)
>> +{
>> +return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
>> +}
> 
> I still do not understand why do we need to restrict predump_signal.
> 
> PR_SET_PREDUMP_SIG can only change the caller's ->predump_signal, so to me
> even PR_SET_PREDUMP_SIG(SIGKILL) is fine.

I will remove it to reduce the code size and give more flexibility to the application.

> 
> And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG
> is pointless if you have 2 or more children.

Hmm, could you point me to the code where SIGCHLD/SIGUSR is treated differently
w.r.t. queuing?  That does not sound right to me.
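
For readers following the queuing question: standard signals (those numbered below SIGRTMIN, which includes SIGCHLD, SIGUSR1 and SIGUSR2) are not queued, so a second instance sent while one is already pending is dropped, whereas real-time signals are queued. In the kernel this distinction is believed to be made by legacy_queue() in kernel/signal.c. The following is a minimal userspace sketch, not part of the patch, that makes the behavior observable; the helper name count_deliveries() is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Incremented by the handler; sig_atomic_t keeps the update signal-safe. */
static volatile sig_atomic_t deliveries;

static void on_signal(int sig)
{
	(void)sig;
	deliveries++;
}

/*
 * Block `sig`, raise it `times` times while it is blocked, then restore
 * the mask and report how many times the handler actually ran.
 * Returns -1 if the handler or the mask could not be set up.
 */
static int count_deliveries(int sig, int times)
{
	struct sigaction sa, old_sa;
	sigset_t block, old_mask;
	int i;

	sa.sa_handler = on_signal;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	if (sigaction(sig, &sa, &old_sa) < 0)
		return -1;

	sigemptyset(&block);
	sigaddset(&block, sig);
	if (sigprocmask(SIG_BLOCK, &block, &old_mask) < 0)
		return -1;

	deliveries = 0;
	for (i = 0; i < times; i++)
		raise(sig);

	/* Pending instances are delivered once the signal is unblocked. */
	sigprocmask(SIG_SETMASK, &old_mask, NULL);
	sigaction(sig, &old_sa, NULL);
	return deliveries;
}
```

On Linux, calling count_deliveries(SIGUSR1, 2) is expected to report a single delivery, while count_deliveries(SIGRTMIN, 2) reports two. This is the situation Oleg describes: two children crashing close together can collapse into one notification for the parent.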

> 
>> +bool do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct kernel_siginfo info;
>> +struct task_struct *parent;
>> +unsigned long flags;
>> +pid_t pid;
>> +int sig;
>> +
>> +parent = tsk->parent;
>> +sighand = parent->sighand;
>> +pid = task_tgid_vnr(tsk);
>> +
>> +spin_lock_irqsave(&sighand->siglock, flags);
>> +sig = parent->signal->predump_signal;
>> +if (!valid_predump_signal(sig)) {
>> +spin_unlock_irqrestore(&sighand->siglock, flags);
>> +return false;
>> +}
> 
> Why do we need to check parent->signal->predump_signal under ->siglock?
> This complicates the code for no reason, afaics.
> 
>> +clear_siginfo(&info);
>> +info.si_pid = pid;
>> +info.si_signo = sig;
>> +if (sig == SIGCHLD)
>> +info.si_code = CLD_PREDUMP;
>> +
>> +__group_send_sig_info(sig, &info, parent);
>> +__wake_up_parent(tsk, parent);
> 
> Why __wake_up_parent() ?

It is not needed; I will remove it.

> 
> do_notify_parent() does this to wake up the parent sleeping in do_wait(), to
> report the event. But predump_signal has nothing to do with wait().
> 
> Now. This version sends the signal to ->parent, not ->real_parent. OK, but this
> means that real_parent won't be notified if its child is traced.
> > 
>> +case PR_SET_PREDUMP_SIG:
>> +if (arg3 || arg4 || arg5)
>> +return -EINVAL;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (arg2 && !valid_predump_signal((int)arg2))
>> +return -EINVAL;
>> +me->

[PATCH v2] kernel/signal: Signal-based pre-coredump notification

2018-10-22 Thread Enke Chen

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Changes to prctl(2):

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
  Set the child pre-coredump signal of the calling process to
  arg2 (either SIGUSR1, or SIGUSR2, or SIGCHLD, or 0 to clear).
  This is the signal that the calling process will get prior to
  the coredump of a child process. This value is cleared across
  execve(2), or for the child of a fork(2).

  When SIGCHLD is specified, the signal code will be set to
  CLD_PREDUMP in such a SIGCHLD signal.

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
  Return the current value of the child pre-coredump signal,
  in the location pointed to by (int *) arg2.
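
As an illustration of how a parent might use the interface described above, here is a minimal sketch. It is not taken from the patch or its selftest. The numeric value of PR_SET_PREDUMP_SIG below is a placeholder assumption (the real constant would come from the patched include/uapi/linux/prctl.h), and the helper names predump_signal_ok() and request_predump_signal() are hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: placeholder value for illustration only. It is an
 * arbitrary number that is not expected to match any existing prctl
 * option; a kernel carrying this patch would define the real one.
 */
#ifndef PR_SET_PREDUMP_SIG
#define PR_SET_PREDUMP_SIG 0x7fff0001
#endif

/* Mirrors the rule in the text above: SIGUSR1, SIGUSR2, SIGCHLD, or 0. */
static int predump_signal_ok(int sig)
{
	return sig == 0 || sig == SIGUSR1 || sig == SIGUSR2 || sig == SIGCHLD;
}

/*
 * Ask to be signalled before a child dumps core (0 clears the setting).
 * Returns 0 on success or a negative errno value.
 */
static int request_predump_signal(int sig)
{
	if (!predump_signal_ok(sig))
		return -EINVAL;

	/* The unused arguments are passed as zero because the patch rejects
	 * nonzero arg3..arg5. */
	if (prctl(PR_SET_PREDUMP_SIG, (unsigned long)sig, 0UL, 0UL, 0UL) < 0)
		return -errno;

	return 0;
}
```

On a kernel without this patch the prctl() call is expected to fail, most likely with EINVAL, so a process manager would need to treat a negative return as "feature unavailable" and fall back to the ordinary exit notification.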

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen 
---
v1 -> v2:

o remove the setting/getting on others in prctl().
o move the notification from get_signal() to do_coredump().
o notify the "parent" instead of "real_parent".
o move the "predump_signal" from "task_struct" to "signal_struct".
o clear the signal setting across execve(2) as well.
o add validation for unused prctl() parameters.
o add selftests for the new prctl() API.

 arch/x86/kernel/signal_compat.c                  |   2 +-
 fs/coredump.c                                    |  10 ++
 fs/exec.c                                        |   3 +
 include/linux/sched/signal.h                     |   4 +
 include/linux/signal.h                           |   5 +
 include/uapi/asm-generic/siginfo.h               |   3 +-
 include/uapi/linux/prctl.h                       |   4 +
 kernel/fork.c                                    |   3 +
 kernel/signal.c                                  |  39 ++
 kernel/sys.c                                     |  15 ++
 tools/testing/selftests/prctl/Makefile           |   2 +-
 tools/testing/selftests/prctl/predump-sig-test.c | 171 +++
 12 files changed, 258 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGSEGV != 7);
BUILD_BUG_ON(NSIGBUS  != 5);
BUILD_BUG_ON(NSIGTRAP != 5);
-   BUILD_BUG_ON(NSIGCHLD != 6);
+   BUILD_BUG_ON(NSIGCHLD != 7);
BUILD_BUG_ON(NSIGSYS  != 1);
 
/* This is part of the ABI and can never change in size: */
diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..f11e31f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
struct cred *cred;
int retval = 0;
int ispipe;
+   bool notify;
struct files_struct *displaced;
/* require nonrelative corefile path and be extra careful */
bool need_suid_safe = false;
@@ -590,6 +591,15 @@ void do_coredump(co

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-22 Thread Enke Chen

Jann,

Thanks for the feedback. I will post a revised patch shortly.

On the related topic of "pdeath_signal", preserving the flag across execve(2)
leads to several inconsistencies: the flag is cleared under several
conditions in different places. I will start a separate thread to see if
it can still be cleaned up.

 PR_SET_PDEATHSIG (since Linux 2.1.57)
  Set the parent death signal of the calling process to arg2
  (either a signal value in the range 1..maxsig, or 0 to clear).
  This is the signal that the calling process will get when its
  parent dies.  This value is cleared for the child of a fork(2)
  and (since Linux 2.4.36 / 2.6.23) when executing a set-user-ID
  or set-group-ID binary, or a binary that has associated
  capabilities (see capabilities(7)).  This value is preserved
  across execve(2).

-- Enke

On 10/22/18 8:40 AM, Jann Horn wrote:
> On Sat, Oct 20, 2018 at 1:01 AM Enke Chen  wrote:
>> Regarding the security considerations, it seems simpler and more secure to
>> just clear the "pre-coredump signal" across execve(2), and let the new program
>> decide for itself.  What do you think?
> 
> I don't have a problem with these semantics.
> 
> I could imagine someone being unhappy about the theoretical race
> window if they want to perform an in-place reexecution of a running
> service, but I don't know whether anyone actually cares about that.
> 
>> Changes to prctl(2):
>>
>> DESCRIPTION
>>
>>PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>   This allows the calling process to receive a signal (arg2,
>>   if nonzero) from a child process prior to the coredump of
>>   the child process. arg2 must be SIGUSR1, or SIGUSR2, or
>>   SIGCHLD, or 0 (for clear).
>>
>>   When SIGCHLD is specified, the signal code is set to
>>   CLD_PREDUMP in such a SIGCHLD signal.
>>
>>   The value of the pre-coredump signal is cleared across
>>   execve(2), or for the child of a fork(2).
>>
>>PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>>   Return the current value of the pre-coredump signal for the
>>   calling process, in the location pointed to by (int *) arg2.

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-19 Thread Enke Chen

Hi, Jann:

Regarding the security considerations, it seems simpler and more secure to
just clear the "pre-coredump signal" across execve(2), and let the new program
decide for itself.  What do you think?

---
Changes to prctl(2):

DESCRIPTION

   PR_SET_PREDUMP_SIG (since Linux 4.20.x)
  This allows the calling process to receive a signal (arg2,
  if nonzero) from a child process prior to the coredump of
  the child process. arg2 must be SIGUSR1, or SIGUSR2, or
  SIGCHLD, or 0 (for clear).

  When SIGCHLD is specified, the signal code is set to
  CLD_PREDUMP in such a SIGCHLD signal.

  The value of the pre-coredump signal is cleared across
  execve(2), or for the child of a fork(2).

   PR_GET_PREDUMP_SIG (since Linux 4.20.x)
  Return the current value of the pre-coredump signal for the
  calling process, in the location pointed to by (int *) arg2.
---

Thanks.  -- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>>  - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for receiving
>> such a notification.  If it is set, the "uid" should not matter.
> 
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
> 
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
>  * Once here, prepare_binrpm() will not be called any more, so
>  * the final state of setuid/setgid/fscaps can be merged into the
>  * secureexec flag.
>  */
> bprm->secureexec |= bprm->cap_elevated;
> 
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
> 
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
> 
> AppArmor and SELinux also do related changes:
> 
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
> 
> aa_inherit_files(bprm->cred, current->files);
> 
> current->pdeath_signal = 0;
> [...]
> }
> 
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
> 
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
> 
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
> 
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
>
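
The helper Jann suggests at the end of this message could look roughly like the sketch below. It is a simplified userspace model, not kernel code: struct task_sig_model stands in for the relevant task state, and the name reset_inherited_signals() is hypothetical. The point is only that each privilege-changing path would call one function instead of clearing the fields separately.

```c
/* Simplified stand-in for the two per-task settings under discussion. */
struct task_sig_model {
	int pdeath_signal;	/* set via PR_SET_PDEATHSIG */
	int predump_signal;	/* set via the proposed PR_SET_PREDUMP_SIG */
};

/*
 * Hypothetical helper: a single place that drops every setting which
 * would otherwise let a less privileged process cause a signal to be
 * sent to a more privileged one.
 */
static void reset_inherited_signals(struct task_sig_model *t)
{
	t->pdeath_signal = 0;
	t->predump_signal = 0;
}
```

With such a helper, the call sites quoted above (setup_new_exec(), commit_creds(), and the AppArmor and SELinux hooks) would each make one call in place of the direct "pdeath_signal = 0" assignment.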

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-16 Thread Enke Chen

Hi, Oleg:

On 10/16/18 7:14 AM, Oleg Nesterov wrote:
> On 10/15, Enke Chen wrote:
>>
>>> I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
> 
> you are going to change the rules anyway.
> 
>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
> 
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.
> 
>>>>    if (sig_kernel_coredump(signr)) {
>>>> +      /*
>>>> +       * Notify the parent prior to the coredump if the
>>>> +       * parent is interested in such a notification.
>>>> +       */
>>>> +      int p_sig = current->real_parent->predump_signal;
>>>> +
>>>> +      if (valid_predump_signal(p_sig)) {
>>>> +          read_lock(&tasklist_lock);
>>>> +          do_notify_parent_predump(current);
>>>> +          read_unlock(&tasklist_lock);
>>>> +          cond_resched();
>>>
>>> perhaps this should be called by do_coredump() after coredump_wait() kills
>>> all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
> 
> Why?
> 
> Once again, other threads are still alive. So if the parent restarts the
> service after it receives ->predump_signal, the new process can "race" with
> the old thread.

Yes, it is a good idea to do the signal notification in do_coredump() after
coredump_wait(). Will make the change as suggested.

Thanks.  -- Enke
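
Oleg's remark that SIGCHLD, SIGUSR1 and SIGUSR2 "do not queue" can be observed from user space. The following is only an illustrative sketch and not part of the patch; the helper name deliveries_after_blocked_raises() is made up for this example. It blocks SIGUSR1, raises it several times, and counts how often the handler runs once the signal is unblocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t deliveries;

static void on_usr1(int signo)
{
	(void)signo;
	deliveries++;
}

/*
 * Raise SIGUSR1 `times` times while it is blocked, then unblock it.
 * Returns how many times the handler ran, or -1 on setup failure.
 */
int deliveries_after_blocked_raises(int times)
{
	struct sigaction sa, old_sa;
	sigset_t block, old_mask;
	int i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_usr1;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
		return -1;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, &old_mask) != 0)
		return -1;

	deliveries = 0;
	for (i = 0; i < times; i++)
		raise(SIGUSR1);	/* stays pending while blocked */

	/* Pending signals are delivered before sigprocmask() returns. */
	sigprocmask(SIG_SETMASK, &old_mask, NULL);
	sigaction(SIGUSR1, &old_sa, NULL);
	return (int)deliveries;
}
```

On Linux a standard signal that is already pending is not queued a second time, so the count stays at one regardless of `times`. A parent relying on such a signal could likewise see a single notification when two children crash together.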

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Eric:

On 10/15/18 4:28 PM, Eric W. Biederman wrote:
> Enke Chen  writes:
> 
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
> 
> You talk about time-sensitive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

It's certainly not my preference to have a process manager (or one for each
application) written in bash scripts. But they do work, and are deployed.

> 
> With that said I think the best solution would be to figure out how to
> allow the coredump to run in parallel with the usual exit signal, and
> exit code reaping of the process.
>
> That would solve the problem for everyone, and would not introduce any
> new complicated APIs.

That would certainly help. But given the huge deployment of Linux, I don't
think it would be feasible to change this fundamental behavior (the parent
is signaled only after the coredump).

> 
> Short of that, having the prctl in the process that receives the signals,
> as you are doing, is the right way to go.

Thanks for the encouragement.

> 
> You are however calling do_notify_parent_predump from the wrong
> function, and frankly with the wrong locking.  There are multiple paths
> to the do_coredump function so you really want this notification from
> do_coredump.

That makes two: Oleg also suggested doing it in do_coredump().
I will look into it, and perhaps also relocate proc_coredump_connector().

> 
> But I still think it would be better to solve the root cause problem and
> change the coredump logic to be able to run in parallel with the normal
> exit notification and zombie reaping logic.  Then the problem you are
> trying to solve goes away and everyone's code gets simpler.
> 
> Eric
> 

Thanks.  -- Enke
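
For context on the behavior being discussed, the sketch below shows how a parent learns about a crashing child today, using only existing interfaces (fork, waitid). It is an illustration and not part of the patch; observe_child_crash() is a made-up name, and the child disables core files so the example leaves nothing behind.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that dies from SIGABRT and report what the parent sees.
 * Returns 0 on success and fills in si_code and si_status.
 */
int observe_child_crash(int *code, int *status)
{
	siginfo_t info;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Keep the example from writing a core file. */
		struct rlimit no_core = { 0, 0 };

		setrlimit(RLIMIT_CORE, &no_core);
		abort();
	}

	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
		return -1;

	*code = info.si_code;		/* CLD_KILLED, or CLD_DUMPED if a core was written */
	*status = info.si_status;	/* the terminating signal number */
	return 0;
}
```

waitid() returns only once the child has finished dying, which includes any coredump. That delay is what the proposed pre-coredump signal, or Eric's alternative of running the dump in parallel with exit notification, is meant to remove.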

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Alan:

As I replied earlier, I will remove the logic that allows setting the signal
on other processes. The function set_predump_signal_perm() will be gone too.

Thanks.  -- Enke

On 10/15/18 2:21 PM, Alan Cox wrote:
>> +/*
>> + * Returns true if current's euid is same as p's uid or euid,
>> + * or has CAP_SYS_ADMIN.
>> + *
>> + * Called with rcu_read_lock, creds are safe.
>> + *
>> + * Adapted from set_one_prio_perm().
>> + */
>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> +	const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> +	return uid_eq(pcred->uid, cred->euid) ||
>> +	       uid_eq(pcred->euid, cred->euid) ||
>> +	       capable(CAP_SYS_ADMIN);
>> +}
> 
> This makes absolutely no security sense whatsoever. The uid and euid of
> the parent and child can both change between the test and the signal
> delivery.
> 
> There are reasons that the child signal control code is incredibly
> careful about either the parent or child using execve or doing a
> privilege change that might pose a risk.
> 
> Until this code gets the same protections I don't believe it's safe.
> 
> Alan
>
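
Alan's objection is a time-of-check versus time-of-use problem: the permission test runs when the signal is set, but credentials can change before the signal is delivered. The following user-space model is only a sketch of that gap. The types and functions (struct model_cred, struct model_task, model_perm_ok, model_set_predump_signal) are hypothetical stand-ins, not kernel APIs; only the comparison mirrors the quoted set_predump_signal_perm().

```c
#include <stdbool.h>

/* Hypothetical user-space stand-ins for kernel credentials and tasks. */
struct model_cred {
	unsigned int uid;
	unsigned int euid;
	bool cap_sys_admin;
};

struct model_task {
	struct model_cred cred;
	int predump_signal;
};

/* Same comparison as the quoted set_predump_signal_perm(), on model types. */
bool model_perm_ok(const struct model_cred *cur, const struct model_cred *target)
{
	return target->uid == cur->euid ||
	       target->euid == cur->euid ||
	       cur->cap_sys_admin;
}

/* Store the signal if the check passes right now; nothing re-checks later. */
bool model_set_predump_signal(const struct model_cred *cur,
			      struct model_task *target, int sig)
{
	if (!model_perm_ok(cur, &target->cred))
		return false;
	target->predump_signal = sig;
	return true;
}
```

If the target later changes its uid and euid, model_perm_ok() would refuse the same caller, yet the stored predump_signal is still in place. Closing that gap requires clearing the setting on privilege changes, as is done for the parent death signal.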

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Oleg:

>> probably ->predump_signal should be cleared on exec?

As I replied to Jann, will do.

Thanks. -- Enke

On 10/15/18 12:17 PM, Enke Chen wrote:
> Hi, Oleg:
> 
> I missed some of your comments in my previous reply.
> 
> On 10/15/18 5:05 AM, Oleg Nesterov wrote:
>> On 10/12, Enke Chen wrote:
>>>
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> To be honest, I can't say I like this new feature...
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -696,6 +696,10 @@ struct task_struct {
>>> int exit_signal;
>>> /* The signal sent when the parent dies: */
>>> int pdeath_signal;
>>> +
>>> +   /* The signal sent prior to a child's coredump: */
>>> +   int predump_signal;
>>> +
>>
>> At least, I think predump_signal should live in signal_struct, not
>> task_struct.
>>
>> (pdeath_signal too, but it is too late to change (fix) this awkward API).
>>
>>> +static void do_notify_parent_predump(struct task_struct *tsk)
>>> +{
>>> +   struct sighand_struct *sighand;
>>> +   struct task_struct *parent;
>>> +   struct kernel_siginfo info;
>>> +   unsigned long flags;
>>> +   int sig;
>>> +
>>> +   parent = tsk->real_parent;
>>
>> So, debugger won't be notified, only real_parent...
>>
>>> +   sig = parent->predump_signal;
>>
>> probably ->predump_signal should be cleared on exec?
> 
> 
> Is this not enough in "copy_process()"?
> 
> @@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
>   p->dirty_paused_when = 0;
>  
>   p->pdeath_signal = 0;
> + p->predump_signal = 0;
> 
>>
>>> +   /* Check again with "tasklist_lock" locked by the caller */
>>> +   if (!valid_predump_signal(sig))
>>> +   return;
>>
>> I don't understand why we need valid_predump_signal() at all.
> 
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, 
> SIGUSR2.
> 
>>
>>>  bool get_signal(struct ksignal *ksig)
>>>  {
>>> struct sighand_struct *sighand = current->sighand;
>>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>> current->flags |= PF_SIGNALED;
>>>  
>>> if (sig_kernel_coredump(signr)) {
>>> +   /*
>>> +    * Notify the parent prior to the coredump if the
>>> +    * parent is interested in such a notification.
>>> +    */
>>> +   int p_sig = current->real_parent->predump_signal;
>>> +
>>> +   if (valid_predump_signal(p_sig)) {
>>> +       read_lock(&tasklist_lock);
>>> +       do_notify_parent_predump(current);
>>> +       read_unlock(&tasklist_lock);
>>> +       cond_resched();
>>
>> perhaps this should be called by do_coredump() after coredump_wait() kills
>> all the sub-threads?
> 
> proc_coredump_connector(current) is located here, they should stay together.
> 
> Thanks.  -- Enke
> 
>>
>>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>>> +{
>>> +   struct task_struct *p;
>>> +   int error;
>>> +
>>> +   /* 0 is valid for disabling the feature */
>>> +   if (sig && !valid_predump_signal(sig))
>>> +   return -EINVAL;
>>> +
>>> +   /* For the current task, the common case */
>>> +   if (pid == 0) {
>>> +   tsk->predump_signal = sig;
>>> +   return 0;
>>> +   }
>>> +
>>> +   error = -ESRCH;
>>> +   rcu_read_lock();
>>> +   p = find_task_by_vpid(pid);
>>> +   if (p) {
>>> +   if (!set_predump_signal_perm(p))
>>> +   error = -EPERM;
>>> +   else {
>>> +   error = 0;
>>> +   p->predump_signal = sig;
>>> +   }
>>> +   }
>>> +   rcu_read_unlock();
>>> +   return error;
>>> +}
>>
>> Why? I mean, why do we really want to support the pid != 0 case?
>>
>> Oleg.
>>
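
The argument check at the top of the quoted prctl_set_predump_signal() can be tried in user space. This is a sketch only: the body of valid_predump_signal() is not shown in the thread and is assumed here from the statement that the signal is limited to SIGCHLD, SIGUSR1 and SIGUSR2, and check_predump_signal_arg() is a made-up wrapper name.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Assumed body: the thread only states which three signals are allowed. */
bool valid_predump_signal(int sig)
{
	return sig == SIGCHLD || sig == SIGUSR1 || sig == SIGUSR2;
}

/* Hypothetical wrapper around the check quoted from the patch. */
int check_predump_signal_arg(int sig)
{
	/* 0 is valid for disabling the feature */
	if (sig && !valid_predump_signal(sig))
		return -EINVAL;
	return 0;
}
```

With this check, 0 and the three listed signals are accepted, while anything else, for example SIGTERM, is rejected with -EINVAL.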

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Jann:

Thanks for your detailed explanation. I will take care of it.

-- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen  wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>>  - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for
>> receiving such a notification.  If it is set, the "uid" should not matter.
> 
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
> 
> void setup_new_exec(struct linux_binprm * bprm)
> {
> 	/*
> 	 * Once here, prepare_binrpm() will not be called any more, so
> 	 * the final state of setuid/setgid/fscaps can be merged into the
> 	 * secureexec flag.
> 	 */
> 	bprm->secureexec |= bprm->cap_elevated;
> 
> 	if (bprm->secureexec) {
> 		/* Make sure parent cannot signal privileged process. */
> 		current->pdeath_signal = 0;
> 		[...]
> 	}
> 	[...]
> }
> 
> int commit_creds(struct cred *new)
> {
> 	[...]
> 	/* dumpability changes */
> 	if (!uid_eq(old->euid, new->euid) ||
> 	    !gid_eq(old->egid, new->egid) ||
> 	    !uid_eq(old->fsuid, new->fsuid) ||
> 	    !gid_eq(old->fsgid, new->fsgid) ||
> 	    !cred_cap_issubset(old, new)) {
> 		if (task->mm)
> 			set_dumpable(task->mm, suid_dumpable);
> 		task->pdeath_signal = 0;
> 		smp_wmb();
> 	}
> 	[...]
> }
> 
> AppArmor and SELinux also do related changes:
> 
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> 	[...]
> 	/* bail out if unconfined or not changing profile */
> 	if ((new_label->proxy == label->proxy) ||
> 	    (unconfined(new_label)))
> 		return;
> 
> 	aa_inherit_files(bprm->cred, current->files);
> 
> 	current->pdeath_signal = 0;
> 	[...]
> }
> 
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> 	[...]
> 	new_tsec = bprm->cred->security;
> 	if (new_tsec->sid == new_tsec->osid)
> 		return;
> 
> 	/* Close files for which the new task SID is not authorized. */
> 	flush_unauthorized_files(bprm->cred, current->files);
> 
> 	/* Always clear parent death signal on SID transitions. */
> 	current->pdeath_signal = 0;
> 	[...]
> }
> 
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
>
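
Jann's closing suggestion, a single helper that resets the parent death signal and the coredump signal together, could look roughly like the user-space model below. This is a sketch under stated assumptions, not kernel code: struct model_task, model_reset_parent_signals() and model_exec() are hypothetical names, and model_exec() merely stands in for the privilege-transition paths quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two per-task fields discussed in the thread. */
struct model_task {
	int pdeath_signal;
	int predump_signal;
};

/* One place that clears every signal setting an unprivileged party chose. */
void model_reset_parent_signals(struct model_task *t)
{
	t->pdeath_signal = 0;
	t->predump_signal = 0;
}

/* Stand-in for a privilege transition such as a setuid exec. */
void model_exec(struct model_task *t, bool secureexec)
{
	if (secureexec)
		model_reset_parent_signals(t);
}
```

Calling the one helper from each transition path keeps the two fields from drifting apart when another reset site is added later.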

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Oleg:

I missed some of your comments in my previous reply.

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...
> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.
> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debugger won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?


Is this not enough in "copy_process()"?

@@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->dirty_paused_when = 0;
 
 	p->pdeath_signal = 0;
+	p->predump_signal = 0;

> 
>> +/* Check again with "tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.

Most of the signals have well-defined semantics, and would not be appropriate
for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +		/*
>> +		 * Notify the parent prior to the coredump if the
>> +		 * parent is interested in such a notification.
>> +		 */
>> +		int p_sig = current->real_parent->predump_signal;
>> +
>> +		if (valid_predump_signal(p_sig)) {
>> +			read_lock(&tasklist_lock);
>> +			do_notify_parent_predump(current);
>> +			read_unlock(&tasklist_lock);
>> +			cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?

proc_coredump_connector(current) is located here, they should stay together.

Thanks.  -- Enke

> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?
> 
> Oleg.
>
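
The copy_process() hunk clears the field in a newly forked child; it does not touch the task that later calls exec. The existing PR_SET_PDEATHSIG setting, which is cleared at the same spot, can be used to observe the fork half of this from user space. The sketch below uses only existing interfaces; probe_pdeathsig() is a made-up name, and the proposed PR_SET_PREDUMP_SIG is deliberately not used since it is not an existing prctl.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set a parent death signal, fork, and report the setting seen by the
 * parent and by the child. Returns 0 on success.
 */
int probe_pdeathsig(int sig, int *parent_sig, int *child_sig)
{
	int status;
	pid_t pid;

	if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) != 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		int seen = -1;

		/* Report the child's setting through its exit status. */
		if (prctl(PR_GET_PDEATHSIG, &seen) != 0)
			_exit(255);
		_exit(seen);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	*child_sig = WEXITSTATUS(status);

	if (prctl(PR_GET_PDEATHSIG, parent_sig) != 0)
		return -1;

	/* Clear the setting again so the caller is left unchanged. */
	return prctl(PR_SET_PDEATHSIG, 0UL);
}
```

The child reports 0 while the parent keeps its value, so clearing at fork alone says nothing about what happens when the task that set the value goes through execve(). That is the case Oleg's question about exec is aimed at.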

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Oleg:

I missed some of your comments in my previous reply.

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...
> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.
> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debugger won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?


Is this not enough in "copy_process()"?

@@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->dirty_paused_when = 0;
 
 	p->pdeath_signal = 0;
+	p->predump_signal = 0;

> 
>> +/* Check again with "tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.

Most of the signals have well-defined semantics, and would not be appropriate
for this purpose.  That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +/*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notification.
>> + */
>> +int p_sig = current->real_parent->predump_signal;
>> +
>> +if (valid_predump_signal(p_sig)) {
>> +read_lock(&tasklist_lock);
>> +do_notify_parent_predump(current);
>> +read_unlock(&tasklist_lock);
>> +cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?

proc_coredump_connector(current) is called here; the two notifications should
stay together.

Thanks.  -- Enke

> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?
> 
> Oleg.
>

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Oleg:

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> To be honest, I can't say I like this new feature...

The requirement for predump notification is real. IMO signal notification
is simpler than "connector" or "signal + connector".

> 
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>>  int exit_signal;
>>  /* The signal sent when the parent dies: */
>>  int pdeath_signal;
>> +
>> +/* The signal sent prior to a child's coredump: */
>> +int predump_signal;
>> +
> 
> At least, I think predump_signal should live in signal_struct, not
> task_struct.

It makes sense as "signal handling" must be consistent in a process.
I was following the wrong example. I will make the change.

> 
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
> 
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +struct sighand_struct *sighand;
>> +struct task_struct *parent;
>> +struct kernel_siginfo info;
>> +unsigned long flags;
>> +int sig;
>> +
>> +parent = tsk->real_parent;
> 
> So, debugger won't be notified, only real_parent...
> 
>> +sig = parent->predump_signal;
> 
> probably ->predump_signal should be cleared on exec?
> 
>> +/* Check again with "tasklist_lock" locked by the caller */
>> +if (!valid_predump_signal(sig))
>> +return;
> 
> I don't understand why we need valid_predump_signal() at all.
> 
>>  bool get_signal(struct ksignal *ksig)
>>  {
>>  struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>  current->flags |= PF_SIGNALED;
>>  
>>  if (sig_kernel_coredump(signr)) {
>> +/*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notification.
>> + */
>> +int p_sig = current->real_parent->predump_signal;
>> +
>> +if (valid_predump_signal(p_sig)) {
>> +read_lock(&tasklist_lock);
>> +do_notify_parent_predump(current);
>> +read_unlock(&tasklist_lock);
>> +cond_resched();
> 
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?
> 
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>> +{
>> +struct task_struct *p;
>> +int error;
>> +
>> +/* 0 is valid for disabling the feature */
>> +if (sig && !valid_predump_signal(sig))
>> +return -EINVAL;
>> +
>> +/* For the current task, the common case */
>> +if (pid == 0) {
>> +tsk->predump_signal = sig;
>> +return 0;
>> +}
>> +
>> +error = -ESRCH;
>> +rcu_read_lock();
>> +p = find_task_by_vpid(pid);
>> +if (p) {
>> +if (!set_predump_signal_perm(p))
>> +error = -EPERM;
>> +else {
>> +error = 0;
>> +p->predump_signal = sig;
>> +}
>> +}
>> +rcu_read_unlock();
>> +return error;
>> +}
> 
> Why? I mean, why do we really want to support the pid != 0 case?

I will remove it. Please see my reply to Jann.
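
A minimal userspace sketch of what the self-only variant could look like,
assuming the pid argument is simply dropped. The struct proc_model type and
the set_predump_signal() name are hypothetical stand-ins for illustration and
are not taken from any revised patch; only the whitelist mirrors
valid_predump_signal() from the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>

/* Hypothetical stand-in for the kernel's per-process signal state. */
struct proc_model {
	int predump_signal;	/* 0 means the notification is disabled */
};

/* Same whitelist as valid_predump_signal() in the posted patch. */
static int valid_predump_signal(int sig)
{
	return sig == SIGCHLD || sig == SIGUSR1 || sig == SIGUSR2;
}

/*
 * Self-only setter: with no pid argument there is no task lookup and
 * no permission check left to get wrong.
 */
static int set_predump_signal(struct proc_model *self, int sig)
{
	/* 0 is valid and disables the feature */
	if (sig && !valid_predump_signal(sig))
		return -EINVAL;

	self->predump_signal = sig;
	return 0;
}
```

With this shape, a rejected signal leaves the previous setting untouched, and
passing 0 turns the notification off again.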

Thanks.  -- Enke

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Greg:

On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
>> Hi, Greg:
>>
>>> Shouldn't there also be a manpage update, and a kselftest added for this
>>> new user/kernel api that is being created?
>>>
>>
>> I will submit a patch for manpage update once the code is accepted.
> 
> Writing a manpage update is key to see if what you are describing
> actually matches the code you have submitted.  You should do both at the
> same time so that they can be reviewed together.

Ok, will do at the same time. But should I submit it as a separate patch?

> 
>> Regarding the kselftest, I am not sure.  Once the prctl() is limited to
>> self (which I will do), the logic would be pretty straightforward. Not
>> sure if the selftest would add much value.
> 
> If you do not have a test for this feature, how do you know it even
> works at all?  How will you know if it breaks in a future kernel
> release?  Have you tested this?  If so, how?

I have test code. I am just not sure whether I should submit it for
inclusion in kselftest.

Thanks.  -- Enke

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Jann:

Thanks a lot for your detailed review. Please see my replies/comments inline.

On 10/13/18 11:27 AM, Jann Horn wrote:
> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen  wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> 
> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> some important differences:
> 
>  - You don't reset the signal on setuid execution.
>  - You permit setting this not just on the current process, but also on others.
> 
> For both of these: Are these differences actually necessary, and if
> so, can you provide a specific rationale? From a security perspective,
> I would very much prefer it if this API had semantics closer to
> PR_SET_PDEATHSIG.

Regarding setting on others, I started with setting for self. But there is
a requirement for supporting the feature for a process manager written in
bash script. That's the reason for allowing the setting on others.

Given the feedback from you and others, I agree that it would be simpler and
more secure to remove the setting on others. We can submit a patch for bash
to support the setting natively.

Regarding the impact of "setuid": the "PR_SET_PREDUMP_SIG" property concerns
whether the application/process has a signal handler set up to receive such
a notification.  If it is set, the "uid" should not matter.
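
For comparison, a short sketch of the existing PR_SET_PDEATHSIG interface that
Jann refers to. It can only be applied to the calling process, and per prctl(2)
the setting is cleared for the child of a fork and when executing a set-user-ID
or set-group-ID binary. The helper name set_and_get_pdeathsig() is made up for
this example; the prctl() options are the real ones.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>

/*
 * Set the parent-death signal of the calling process and read it back.
 * Returns the signal number reported by the kernel, or -1 on error.
 */
static int set_and_get_pdeathsig(int sig)
{
	int cur = 0;

	/* There is no pid argument: the setting always applies to the caller. */
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) == -1)
		return -1;

	/* PR_GET_PDEATHSIG stores the current value through an int pointer. */
	if (prctl(PR_GET_PDEATHSIG, (unsigned long)&cur) == -1)
		return -1;

	return cur;
}
```

Calling set_and_get_pdeathsig(0) afterwards clears the setting again.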

> 
> [...]
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 312b43e..eb4a483 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info)
>> return signr;
>>  }
>>
>> +/*
>> + * Let the parent, if so desired, know about the imminent death of a child
>> + * prior to its coredump.
>> + *
>> + * Locking logic is similar to do_notify_parent_cldstop().
>> + */
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> +   struct sighand_struct *sighand;
>> +   struct task_struct *parent;
>> +   struct kernel_siginfo info;
>> +   unsigned long flags;
>> +   int sig;
>> +
>> +   parent = tsk->real_parent;
>> +   sig = parent->predump_signal;
>> +
>> +   /* Check again with "tasklist_lock" locked by the caller */
>> +   if (!valid_predump_signal(sig))
>> +   return;
>> +
>> +   clear_siginfo(&info);
>> +   info.si_signo = sig;
>> +   if (sig == SIGCHLD)
>> +   info.si_code = CLD_PREDUMP;
>> +
>> +   rcu_read_lock();
>> +   info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
>> +   info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
>> +  task_uid(tsk));
> 
> You're sending a signal from the current namespaces, but with IDs that
> have been mapped into the parent's namespaces? That looks wrong to me.

I am following the example "do_notify_parent_cldstop()" called in the same
routine "get_signal()". If there is a better way, sure I will use it.

> 
>> +   rcu_read_unlock();
>> +
>> +   sighand = parent->sighand;
>> +   spin_lock_irqsave(&sighand->siglock, flags);
>> +   __group_send_sig_info(sig, &info, parent);
>> +   spin_unlock_irqrestore(&sighand->siglock, flags);
>> +}
>> +
>>  bool get_signal(struct ksignal *ksig)
>>  {
>> struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>> current->flags |= PF_SIGNALED;
>>
>> if (sig_kernel_coredump(signr)) {
>> +   /*
>> +* Notify the parent prior to the coredump if the
>> +* parent is interested in such a notification.
>> +*/
>> +   int p_sig = current->real_parent->predump_signal;
> 
> current->real_parent is an __rcu member. I think if you run the sparse
> checker against this patch, it's going to complain. Are you allowed to
> access current->real_parent in this context?

Let me check, and get back to you on this one.

> 
>> +   if (valid_predump_signal(

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Christian:

As I replied to Jann, I will remove the code that does the setting on others
to make the code simpler and more secure.

Thanks.  -- Enke

>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> +const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> +return uid_eq(pcred->uid, cred->euid) ||
>> +   uid_eq(pcred->euid, cred->euid) ||
>> +   capable(CAP_SYS_ADMIN);
> 
> So before proceeding I'd like to discuss at least two points:
> - how does this interact with the dumpability of a process?
> - do we need the capable(CAP_SYS_ADMIN) restriction to init_user_ns?
>   Seems we could make this work per-user-ns just like
>   PRCTL_SET_PDEATHSIG does?
> 
>> +}

Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-15 Thread Enke Chen

Hi, Greg:

> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
> 

I will submit a patch for manpage update once the code is accepted.

Regarding the kselftest, I am not sure.  Once the prctl() is limited to
self (which I will do), the logic would be pretty straightforward. Not
sure if the selftest would add much value.

Thanks.  -- Enke

On 10/12/18 11:40 PM, Greg Kroah-Hartman wrote:
> On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>>
>> Currently there are two ways for a parent process to be notified of a
>> child process's state change. One is to use the POSIX signal, and
>> another is to use the kernel connector module. The specific events and
>> actions are summarized as follows:
>>
>> Process Event    POSIX Signal                Connector-based
>> ----------------------------------------------------------------------
>> ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>                  SIGCHLD / CLD_STOPPED
>>
>> ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
>>                  SIGCHLD / CLD_CONTINUED
>>
>> pre_coredump/    N/A                         proc_coredump_connector()
>> get_signal()
>>
>> post_coredump/   do_notify_parent()          proc_exit_connector()
>> do_exit()        SIGCHLD / exit_signal
>> ----------------------------------------------------------------------
>>
>> As shown in the table, the signal-based pre-coredump notification is not
>> currently available. In some cases using a connector-based notification
>> can be quite complicated (e.g., when a process manager is written in shell
>> scripts and thus is subject to certain inherent limitations), and a
>> signal-based notification would be simpler and better suited.
>>
>> Signed-off-by: Enke Chen 
>> ---
>>  arch/x86/kernel/signal_compat.c    |  2 +-
>>  include/linux/sched.h              |  4 ++
>>  include/linux/signal.h             |  5 +++
>>  include/uapi/asm-generic/siginfo.h |  3 +-
>>  include/uapi/linux/prctl.h         |  4 ++
>>  kernel/fork.c                      |  1 +
>>  kernel/signal.c                    | 51 +
>>  kernel/sys.c                       | 77 ++
>>  8 files changed, 145 insertions(+), 2 deletions(-)
> 
> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
> 
> thanks,
> 
> greg k-h
>

[PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-12 Thread Enke Chen

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.
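
To illustrate the gap described above, here is a small userspace sketch of the
signal-side information that exists today: the parent only learns about the
crash once the child has finished dying, that is, after any coredump has been
written. The helper name wait_for_crashing_child() is made up for this
example, and the child disables its own core file only so that the example
leaves nothing behind.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that aborts, then wait for it. On success, *info holds
 * what the parent is told once the child has fully exited.
 * Returns 0 on success, -1 on error.
 */
static int wait_for_crashing_child(siginfo_t *info)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: avoid leaving a core file behind, then crash. */
		struct rlimit no_core = { 0, 0 };

		setrlimit(RLIMIT_CORE, &no_core);
		abort();
	}

	/* Parent: blocks until the child is dead, dump included. */
	if (waitid(P_PID, (id_t)pid, info, WEXITED) == -1)
		return -1;

	return 0;
}
```

After a successful call, info->si_status carries the fatal signal and
info->si_code is CLD_DUMPED or CLD_KILLED depending on whether a core was
produced. There is no si_code that reports "about to dump", which is the slot
the proposed CLD_PREDUMP is meant to fill.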

Signed-off-by: Enke Chen 
---
 arch/x86/kernel/signal_compat.c    |  2 +-
 include/linux/sched.h              |  4 ++
 include/linux/signal.h             |  5 +++
 include/uapi/asm-generic/siginfo.h |  3 +-
 include/uapi/linux/prctl.h         |  4 ++
 kernel/fork.c                      |  1 +
 kernel/signal.c                    | 51 +
 kernel/sys.c                       | 77 ++
 8 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
 	BUILD_BUG_ON(NSIGSEGV != 7);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
-	BUILD_BUG_ON(NSIGCHLD != 6);
+	BUILD_BUG_ON(NSIGCHLD != 7);
 	BUILD_BUG_ON(NSIGSYS  != 1);
 
 	/* This is part of the ABI and can never change in size: */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 09026ea..cfb9645 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
 	int exit_signal;
 	/* The signal sent when the parent dies: */
 	int pdeath_signal;
+
+	/* The signal sent prior to a child's coredump: */
+	int predump_signal;
+
 	/* JOBCTL_*, siglock protected: */
 	unsigned long   jobctl;
 
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 706a499..7cb976d 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
 	return sig <= _NSIG ? 1 : 0;
 }
 
+static inline int valid_predump_signal(int sig)
+{
+	return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
+}
+
 struct timespec;
 struct pt_regs;
 enum pid_type;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c2..1a47cef 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -267,7 +267,8 @@ struct {\
 #define CLD_TRAPPED	4	/* traced child has trapped */
 #define CLD_STOPPED	5	/* child has stopped */
 #define CLD_CONTINUED	6	/* stopped child has continued */
-#define NSIGCHLD	6
+#define CLD_PREDUMP	7	/* child is about to dump core */
+#define NSIGCHLD	7
 
 /*
  * SIGPOLL (or any other signal without signal specific si_codes) si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
 # define PR_SPEC_DISABLE   (1UL << 2)
 # define PR_SPEC_FORCE_DISABLE (1UL << 3)
 
+/* Whether to

[PATCH] kernel/signal: Signal-based pre-coredump notification

2018-10-12 Thread Enke Chen

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event    POSIX Signal                Connector-based
----------------------------------------------------------------------
ptrace_attach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_STOPPED

ptrace_detach()  do_notify_parent_cldstop()  proc_ptrace_connector()
                 SIGCHLD / CLD_CONTINUED

pre_coredump/    N/A                         proc_coredump_connector()
get_signal()

post_coredump/   do_notify_parent()          proc_exit_connector()
do_exit()        SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.
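
As an illustration only (not part of the patch), a parent process might
consume the notification roughly as sketched below. The sketch assumes
CLD_PREDUMP has the value 7, as in the siginfo.h hunk of this patch; the
names cld_code_name(), on_sigchld(), install_sigchld_handler() and
predump_seen are hypothetical. The prctl(PR_SET_PREDUMP_SIG, ...) call is
left out because the numeric value of that command is not shown in this
excerpt. On a kernel without this patch the handler would simply never
see CLD_PREDUMP.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/*
 * CLD_PREDUMP is proposed by this patch and is not in released headers.
 * The value 7 is taken from the siginfo.h hunk (assumption for this sketch).
 */
#ifndef CLD_PREDUMP
#define CLD_PREDUMP 7
#endif

/* Userspace mirror of the valid_predump_signal() helper added by the patch. */
static int valid_predump_signal(int sig)
{
	return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
}

/* Hypothetical helper: describe the si_code delivered with SIGCHLD. */
static const char *cld_code_name(int si_code)
{
	switch (si_code) {
	case CLD_EXITED:    return "exited";
	case CLD_KILLED:    return "killed";
	case CLD_DUMPED:    return "dumped core";
	case CLD_TRAPPED:   return "trapped";
	case CLD_STOPPED:   return "stopped";
	case CLD_CONTINUED: return "continued";
	case CLD_PREDUMP:   return "about to dump core";
	default:            return "unknown";
	}
}

/* Hypothetical flag: set once a pre-coredump notification has arrived. */
static volatile sig_atomic_t predump_seen;

/* SA_SIGINFO handler: only records the event; the main loop acts on it. */
static void on_sigchld(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	if (info && info->si_code == CLD_PREDUMP)
		predump_seen = 1;
}

/* Install the handler for SIGCHLD; returns 0 on success, -1 on error. */
static int install_sigchld_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigchld;
	sa.sa_flags = SA_SIGINFO | SA_RESTART;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGCHLD, &sa, NULL);
}
```

A process manager would call install_sigchld_handler() once at startup and
check predump_seen in its main loop, starting the re-spawn or fail-over
before the child's exit is reported.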

Signed-off-by: Enke Chen 
---
 arch/x86/kernel/signal_compat.c    |  2 +-
 include/linux/sched.h              |  4 ++
 include/linux/signal.h             |  5 +++
 include/uapi/asm-generic/siginfo.h |  3 +-
 include/uapi/linux/prctl.h         |  4 ++
 kernel/fork.c                      |  1 +
 kernel/signal.c                    | 51 +
 kernel/sys.c                       | 77 ++
 8 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
 	BUILD_BUG_ON(NSIGSEGV != 7);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
-	BUILD_BUG_ON(NSIGCHLD != 6);
+	BUILD_BUG_ON(NSIGCHLD != 7);
 	BUILD_BUG_ON(NSIGSYS  != 1);
 
 	/* This is part of the ABI and can never change in size: */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 09026ea..cfb9645 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
 	int exit_signal;
 	/* The signal sent when the parent dies: */
 	int pdeath_signal;
+
+	/* The signal sent prior to a child's coredump: */
+	int predump_signal;
+
 	/* JOBCTL_*, siglock protected: */
 	unsigned long   jobctl;
 
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 706a499..7cb976d 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
 	return sig <= _NSIG ? 1 : 0;
 }
 
+static inline int valid_predump_signal(int sig)
+{
+	return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
+}
+
 struct timespec;
 struct pt_regs;
 enum pid_type;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c2..1a47cef 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -267,7 +267,8 @@ struct {\
 #define CLD_TRAPPED	4	/* traced child has trapped */
 #define CLD_STOPPED	5	/* child has stopped */
 #define CLD_CONTINUED	6	/* stopped child has continued */
-#define NSIGCHLD	6
+#define CLD_PREDUMP	7	/* child is about to dump core */
+#define NSIGCHLD	7
 
 /*
  * SIGPOLL (or any other signal without signal specific si_codes) si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
 # define PR_SPEC_DISABLE   (1UL << 2)
 # define PR_SPEC_FORCE_DISABLE (1UL << 3)
 
+/* Whether to

Re: [PATCH] FUSE: add the async option for the flush/release operation

2016-08-16 Thread Enke Chen

Hi, Miklos:

Thanks for your reply and explanation. Please see my comments below.

On 8/15/16 2:36 AM, Miklos Szeredi wrote:
> On Wed, Aug 10, 2016 at 6:50 PM, Enke Chen <enkec...@cisco.com> wrote:
>> Hi, Miklos:
>>
>> On 8/9/16 11:52 PM, Miklos Szeredi wrote:
>>> On Wed, Aug 10, 2016 at 5:26 AM, Enke Chen <enkec...@cisco.com> wrote:
>>>> Hi, Miklos:
>>>>
>>>> This patch adds the async option for the flush/release operation in FUSE.
>>>>
>>>> The async flush/release option allows a FUSE-based application to be
>>>> terminated without being blocked in the flush/release operation even in
>>>> the presence of complex external interactions. In addition, the async
>>>> operation can be more efficient when a large number of fuse-based files
>>>> is involved.
>>>>
>>>> ---
>>>> Deadlock Example:
>>>>
>>>> Process A is a multi-threaded application that interacts with Process B,
>>>> a FUSE-server.
>>>>
>>>>
>>>>                UNIX-domain socket
>>>>     App (A)  ----------------------  FUSE-server (B)
>>>>        |                                  |
>>>>        |                                  |
>>>>        |                                  |
>>>>        +----------------------------------+
>>>>                open/flush/release
>>>
>>> Why would the fuse server want to communicate with the app (using
>>> other than the filesystem)?
>>
>> In this particular case, the other communication channel is used to
>> coordinate the allocation (with "open") and de-allocation (with
>> "flush/release") of the shared memory associated with the opened "file".
>>
>> In general an application may have special handling for the "flush/release"
>> operation that involves external interactions with one or more other
>> processes,
> 
> Sure, it can interact with other processes, but *not* with the process
> accessing the filesystem.  There are no end of possible deadlocks that
> way, and it goes straight against the design philosophy of kernel
> interfaces.
> 
> Maybe I'm missing something, but this sure looks like a case of bad
> system design.   You need to give more details to convince me
> otherwise.

On the "system design", I agree, and I certainly prefer simpler external
interactions among processes.  But we do not always know what libraries will
do, or when another interaction will be introduced.

> 
>> and that is where this "async" operation can help.
>>
>> IMO it would be even better if the "async" operation can be made the
>> default so folks do not need to worry about this type of deadlock.  From
>> reading the code, it seems that FUSE does "async" release under certain
>> conditions already.
> 
> Release being async is okay, and is the default for non-fuseblk
> mounts.  For fuseblk it is not the default, because it could cause
> problems:
> 
>   5a18ec176c93 ("fuse: fix hang of single threaded fuseblk filesystem")
> 
> We could make release optionally async for fuseblk.
> 
> Flush being async is not okay, it needs to return an error value.  If
> the filesystem does not want to return an error value, it may omit a
> flush implementation completely.

I was not aware that the release operation is already async by default for
non-fuseblk mounts. That is really what we need in order to break the deadlock
in the example. (I had the "flush" operation there for the sake of completeness,
and not out of necessity.)

The deadlock I described was an old problem from several years ago. I just
re-ran the test with newer kernels (3.14 and 4.7), and confirmed that the
issue is gone with the "release" operation being async.  The deadlock was also
reproduced after changing the release operation from "async" to "sync" in the
fuse module.

So the patch is no longer needed unless we want to modify it to support the
"async release" for the fuseblk.

Thanks again.  -- Enke

Re: [PATCH] FUSE: add the async option for the flush/release operation

2016-08-10 Thread Enke Chen

Hi, Miklos:

On 8/9/16 11:52 PM, Miklos Szeredi wrote:
> On Wed, Aug 10, 2016 at 5:26 AM, Enke Chen <enkec...@cisco.com> wrote:
>> Hi, Miklos:
>>
>> This patch adds the async option for the flush/release operation in FUSE.
>>
>> The async flush/release option allows a FUSE-based application to be
>> terminated without being blocked in the flush/release operation even in the
>> presence of complex external interactions. In addition, the async operation
>> can be more efficient when a large number of fuse-based files is involved.
>>
>> ---
>> Deadlock Example:
>>
>> Process A is a multi-threaded application that interacts with Process B,
>> a FUSE-server.
>>
>>
>>                UNIX-domain socket
>>     App (A)  ----------------------  FUSE-server (B)
>>        |                                  |
>>        |                                  |
>>        |                                  |
>>        +----------------------------------+
>>                open/flush/release
> 
> Why would the fuse server want to communicate with the app (using
> other than the filesystem)?

In this particular case, the other communication channel is used to coordinate
the allocation (with "open") and de-allocation (with "flush/release") of the
shared memory associated with the opened "file".

In general an application may have special handling for the "flush/release"
operation that involves external interactions with one or more other processes,
and that is where this "async" operation can help.

IMO it would be even better if the "async" operation can be made the default so
folks do not need to worry about this type of deadlock.  From reading the
code, it seems that FUSE does "async" release under certain conditions already.

Thanks.  -- Enke
